1 Problem statement
Symmetries play a paramount role in nature and are foundational aspects of both theoretical physics and deep learning. Knowing the symmetries of a physical problem is often the first step towards its solution. In machine learning, models that exploit the symmetries of their data domain are the most successful at their tasks, as exemplified by the success of convolutional neural networks for a variety of problems Goodfellow et al. (2016). Recently, many deep learning papers have explored the concept of symmetry using tools from theoretical physics, e.g. Mallat (2016); Bronstein et al. (2017); Cohen et al. (2018); Higgins et al. (2018).
In most cases, these works assumed that the symmetries of the problem were manifest and built neural architectures that respected those symmetries, in order to learn more efficiently. In some situations, however, the symmetries of the problem are hidden from us, and much of the work is done to uncover those symmetries. A famous example is Kepler’s inference from astronomical data that planetary orbits form ellipses with the sun at one focus. The fact that orbits in an inverse square law of force generically close is a consequence of a subtle symmetry of the problem that gives rise to the conserved Laplace–Runge–Lenz vector Goldstein et al. (2013).
Here, we present a data–driven way to learn such symmetries. While we concentrate here on models of Hamiltonian mechanics, we hope that the tools we develop will inspire research into symmetry learning more generally.
Learning a symmetry means learning a transformation from the original physical variables to a new set of variables in which the symmetry is manifest. Neural models describing bijective mappings are the subject of recent work on normalizing flows Rezende & Mohamed (2015); Dinh et al. (2014; 2016) and RevNets Gomez et al. (2017); Jacobsen et al. (2018). Normalizing flows are usually constructed to have tractable Jacobians. In Hamiltonian mechanics, symplectic (or canonical) transformations have a special role Saletan & José (1998). Such transformations are volume preserving but have further restrictions (see equation 2 below), and so require new network architectures. Hamiltonian time evolution is itself a symplectic transformation, so these methods may be applied to neural variational inference with Hamiltonian Monte Carlo (HMC) Neal (2012), discussed in several recent papers Salimans et al. (2014); Wolf et al. (2016); Levy et al. (2017); Caterini et al. (2018); Hoffman et al. (2019). Note added: Following submission of this work, the closely related preprint Greydanus et al. (2019) appeared.
In this work, we will focus on integrable models, which have nontrivial symmetries, as a test case for symmetry discovery. The organization of the remainder of this paper is as follows. In the next section, we introduce some concepts from classical integrable systems that we will use. Section 3 introduces our new architecture and learning algorithm, while section 4 contains experiments on three integrable models. Finally, section 5 provides a discussion of the results. Supplementary details are contained in the appendices.
2 Classical integrable systems
2.1 Hamiltonian dynamics and canonical transformations
Classical mechanics is the realm of Hamiltonian dynamics Saletan & José (1998). In the simplest case that we address here, motion occurs in a phase space $\mathbb{R}^{2n}$ of positions $q = (q_1, \ldots, q_n)$ and momenta $p = (p_1, \ldots, p_n)$. The dynamics is governed by Hamilton’s equations, which are derived from a Hamiltonian function $H(q, p)$. These read:
$\dot q_i = \frac{\partial H}{\partial p_i}, \qquad \dot p_i = -\frac{\partial H}{\partial q_i}. \qquad (1)$
For example, the harmonic oscillator with unit frequency is described by $H = \frac{1}{2}(q^2 + p^2)$, and the equations of motion $\dot q = p$, $\dot p = -q$ describe circular orbits in the phase plane.
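The harmonic-oscillator dynamics can be checked with a few lines of code (a minimal sketch, not the paper's implementation; the leapfrog integrator used here is a standard symplectic scheme):

```python
import numpy as np

# Integrate the unit-frequency harmonic oscillator H(q, p) = (q^2 + p^2)/2
# with Hamilton's equations q' = p, p' = -q, using leapfrog steps, and
# verify that the orbit stays on a circle in the phase plane.
def leapfrog(q, p, dt, n_steps):
    for _ in range(n_steps):
        p -= 0.5 * dt * q      # half kick: p' = -dH/dq = -q
        q += dt * p            # drift:     q' =  dH/dp =  p
        p -= 0.5 * dt * q      # half kick
    return q, p

q, p = 1.0, 0.0
radii = []
for _ in range(100):
    q, p = leapfrog(q, p, dt=0.01, n_steps=10)
    radii.append(np.hypot(q, p))

# The radius (twice the energy, square-rooted) is conserved up to
# the integrator's discretization error.
assert max(radii) - min(radii) < 1e-3
```

Because the leapfrog update is itself symplectic, the energy error stays bounded instead of drifting, which is why the radius remains nearly constant over many periods.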
The skew-symmetric matrix
$\Omega = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}$
is called a symplectic form, and allows one to define symplectic (or canonical) transformations of phase space, as those maps $\phi$ whose Jacobian matrix $D\phi$ is an element of the linear symplectic group at each point of its domain:
$(D\phi)^\top \Omega \, (D\phi) = \Omega. \qquad (2)$
Given scalar-valued functions $f, g$ on phase space, their Poisson bracket is defined as
$\{f, g\} = \sum_i \left( \frac{\partial f}{\partial q_i} \frac{\partial g}{\partial p_i} - \frac{\partial f}{\partial p_i} \frac{\partial g}{\partial q_i} \right)$
and is a symplectic invariant. With this notation, Hamilton’s equations are $\dot q_i = \{q_i, H\}$, $\dot p_i = \{p_i, H\}$, and time evolution is itself a symplectic transformation Arnol’d (2013).
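The symplectic condition and the canonical bracket can be verified numerically (a sketch: the time-$t$ flow of the harmonic oscillator is a rotation of the $(q, p)$ plane, and any rotation satisfies equation 2 for one degree of freedom):

```python
import numpy as np

# Check that the harmonic-oscillator flow map (a rotation of the phase
# plane) is symplectic: its Jacobian M satisfies M^T Omega M = Omega.
Omega = np.array([[0.0, 1.0], [-1.0, 0.0]])   # symplectic form in (q, p)
t = 0.7
M = np.array([[np.cos(t), np.sin(t)],          # Jacobian of time-t evolution
              [-np.sin(t), np.cos(t)]])
assert np.allclose(M.T @ Omega @ M, Omega)

def poisson(grad_f, grad_g):
    # {f, g} = df/dq dg/dp - df/dp dg/dq, for one degree of freedom
    return grad_f[0] * grad_g[1] - grad_f[1] * grad_g[0]

assert poisson((1.0, 0.0), (0.0, 1.0)) == 1.0  # the canonical bracket {q, p} = 1
```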
2.2 Integrable models
A conserved quantity of a dynamical system is constant in time, and thus Poisson-commutes with the Hamiltonian and constitutes a symmetry of the problem. For example, in the celebrated Kepler problem describing the motion of two planets attracted by a gravitational force which depends only on the distance between the planets, the Hamiltonian commutes with the angular momentum, which generates rotations of the relative coordinate and is a conserved quantity of the dynamics Goldstein et al. (2013).
Integrable systems are those which have a number of mutually commuting and independent integrals of the motion that equals $n$, half the phase space dimension. The Liouville–Arnold theorem states that (compact) motion is confined to tori parametrized by angles $\varphi_i$, and that there exists a symplectic transformation from the original coordinates $(q, p)$ to new coordinates $(\varphi, I)$, where the $I_i$ are called actions and are the conserved quantities of the problem Arnol’d (2013). In the action–angle coordinates, equation 1 therefore reads:
$\dot\varphi_i = \frac{\partial K}{\partial I_i}, \qquad \dot I_i = -\frac{\partial K}{\partial \varphi_i} = 0, \qquad (3)$
where the transformed Hamiltonian
$K(I) = H\big(q(\varphi, I),\, p(\varphi, I)\big) \qquad (4)$
is independent of the angles. Finding explicit action–angle transformations is a challenging task, and while much progress has been made constructing integrable systems from algebraic or geometric principles Babelon et al. (2003), there is no general algorithm to construct higher integrals of the motion given an integrable Hamiltonian. Learning such a transformation is the goal of this work.
We will work with the Cartesian coordinates $x_i = \sqrt{2 I_i}\cos\varphi_i$, $y_i = \sqrt{2 I_i}\sin\varphi_i$, and denote by $\phi$ the symplectic map
$\phi : (x, y) \mapsto (q, p).$
For example, action–angle variables for the harmonic oscillator are the symplectic polar coordinates, so that $\phi$ is the identity, and the only conserved quantity is the energy. In general $\phi^{-1}$ will be such that complex trajectories in the phase space $(q, p)$ get mapped to circles in the $(x_i, y_i)$ planes, where $\varphi_i$ is the angular coordinate and $I_i$ is half the squared radius of the circle.
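For the harmonic oscillator this map can be written down explicitly; a small sketch of the symplectic polar coordinates (a standard convention, stated here as an assumption):

```python
import numpy as np

# Harmonic-oscillator action-angle map in symplectic polar coordinates:
#   q = sqrt(2 I) cos(phi),  p = sqrt(2 I) sin(phi),
# so that I = (q^2 + p^2)/2 is half the squared radius of the orbit.
def to_action_angle(q, p):
    return 0.5 * (q**2 + p**2), np.arctan2(p, q)

def from_action_angle(I, phi):
    r = np.sqrt(2.0 * I)
    return r * np.cos(phi), r * np.sin(phi)

q, p = from_action_angle(2.0, 0.3)
I, phi = to_action_angle(q, p)
assert np.isclose(I, 2.0) and np.isclose(phi, 0.3)
# For the unit-frequency oscillator, I equals the energy H = (q^2 + p^2)/2.
assert np.isclose(I, 0.5 * (q**2 + p**2))
```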
In this work we will learn neural parametrizations of $\phi$ for three paradigmatic integrable models: (1) the Kepler model of two planets interacting through a gravitational force Goldstein et al. (2013); (2) the Neumann model of $N$ oscillators with positions in $\mathbb{R}^N$ constrained to the $(N-1)$–dimensional sphere Babelon et al. (2003); (3) the Calogero–Moser (CM) model of a chain of $N$ particles with an inverse square interaction potential Moser (1976). The Hamiltonians and their conserved quantities are described in appendix B.
3 Deep symplectic flows
We have reformulated the task of learning symmetries in integrable models as that of learning the map $\phi$ which transforms a circular trajectory into the complex trajectory of the original model. We now describe how to parametrize and learn such a transformation.
Here we will adapt recent results on normalizing flows to provide symplectic versions of popular invertible layers such as additive coupling Dinh et al. (2014; 2016) and invertible linear transformations Kingma & Dhariwal (2018), together with a symplectic normalization layer. Our parametrization of $\phi$ is given by stacking blocks of these three layers. We now describe each layer.
3.1 Symplectic layers
3.1.1 Symplectic additive coupling
The additive coupling layer introduced in Dinh et al. (2014) partitions the inputs and outputs into two subsystems $(x_1, x_2) \mapsto (x_1, x_2 + \mathrm{NN}(x_1))$, where the shift function NN is an arbitrary neural network. If we now identify the subsystems as $(q, p)$ respectively, we have the following layer $(q, p) \mapsto (Q, P)$:
$Q = q, \qquad P = p + T(q).$
Symplecticity of the transformation further imposes irrotationality of the shift:
$\frac{\partial T_i}{\partial q_j} = \frac{\partial T_j}{\partial q_i}. \qquad (7)$
This constraint may be handled by setting $T = \nabla V$, where the scalar function $V$ is parametrized by a neural network. This gives the traditional leapfrog update used in HMC.
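A minimal numerical check of why an irrotational shift is symplectic, assuming a hypothetical quadratic potential (so the gradient is linear and the Jacobian can be written in closed form):

```python
import numpy as np

# Symplectic additive coupling (q, p) -> (q, p + T(q)) with T = grad V.
# Here V(q) = q.A.q / 2 for a symmetric A, so T(q) = A q and the map is
# exactly the momentum half of a leapfrog update.
A = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric -> irrotational shift

def coupling(q, p):
    return q, p + A @ q                  # T(q) = grad V(q) = A q

# The Jacobian is [[I, 0], [A, I]]: unit determinant and symplectic,
# because dT_i/dq_j = A_ij is symmetric (equation 7).
J = np.block([[np.eye(2), np.zeros((2, 2))], [A, np.eye(2)]])
Omega = np.block([[np.zeros((2, 2)), np.eye(2)],
                  [-np.eye(2), np.zeros((2, 2))]])
assert np.isclose(np.linalg.det(J), 1.0)
assert np.allclose(J.T @ Omega @ J, Omega)
```

If $A$ were not symmetric the last assertion would fail: volume would still be preserved, but the map would no longer be canonical.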
While conceptually simple, this approach is computationally expensive, since computing gradients of the layer requires differentiating through $\nabla V$, effectively a second backward pass through the network. A cheaper approach is to use a multilayer perceptron with three layers and constrained weight matrices. In appendix A we show that equation 7 is satisfied if
$T(q) = W^\top \big( \eta \odot \sigma(W q + b) \big),$
where $W$ is a weight matrix, $b$ a bias, $\sigma$ an elementwise activation, and $\eta$ a vector of scalings. We call this architecture the irrotational MLP. The analysis can be generalized, but we took this simple architecture for most of our experiments. Geometrically, $W$ embeds $q$ into a higher-dimensional space and $W^\top$ maps the embedding back to the original space after it has been scaled by $\eta$, whose sign controls whether the map is orientation preserving.
3.1.2 Symplectic linear
The additive coupling leaves $q$ unchanged, and we introduce the symplectic linear layer to mix $q$ and $p$, so that deeper additive couplings act on all phase space coordinates. To parametrize a symplectic matrix $S$, we use the pre–Iwasawa decomposition de Gosson (2006):
$S = \begin{pmatrix} I & 0 \\ P & I \end{pmatrix} \begin{pmatrix} L & 0 \\ 0 & L^{-\top} \end{pmatrix} R,$
with $P$ symmetric, $L$ invertible, and $R$ the realification of a unitary matrix $U$.
To parametrize $R$, we note that it is the realification of the unitary $U$, which can be written as a product of Householder reflections Cabrera et al. (2010); Tomczak & Welling (2016), each parametrized by a complex vector $v$:
$H(v) = I - 2\, \frac{v v^\dagger}{v^\dagger v},$
and a diagonal matrix of phases
$D(\theta) = \mathrm{diag}\big( e^{i\theta_1}, \ldots, e^{i\theta_n} \big).$
Note that the complexity of applying both $H(v)$ and $D(\theta)$ to a vector is $O(n)$. To keep the complexity of the whole layer $O(n)$ we take a fixed, small number of Householder reflections and further take $P$ and $L$ diagonal.
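The realification step can be sanity-checked numerically. The sketch below builds a unitary from one Householder reflection and a phase matrix, and verifies that its real $2n \times 2n$ representation is both orthogonal and symplectic:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
v = rng.normal(size=n) + 1j * rng.normal(size=n)
H = np.eye(n) - 2.0 * np.outer(v, v.conj()) / np.vdot(v, v)  # Householder
U = H @ np.diag(np.exp(1j * rng.normal(size=n)))             # phases

# Realification: U = A + iB acting on x + iy becomes the real matrix
# [[A, -B], [B, A]] acting on (x, y).
R = np.block([[U.real, -U.imag], [U.imag, U.real]])
Omega = np.block([[np.zeros((n, n)), np.eye(n)],
                  [-np.eye(n), np.zeros((n, n))]])
assert np.allclose(R.T @ R, np.eye(2 * n))    # orthogonal
assert np.allclose(R.T @ Omega @ R, Omega)    # symplectic
```

This reflects the classical fact that the realification of $U(n)$ sits exactly in the intersection of the orthogonal and symplectic groups.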
3.1.3 Zero center
The zero center layer is defined by the transformation:
$x \mapsto x - \mu + \beta,$
where $x = (q, p)$ is the phase space point, $\mu$ is the mean, given during training as the batch mean and during testing by a weighted moving average accumulated during training, as in the batch normalization layer – see Dinh et al. (2016) for its usage in normalizing flows – and $\beta$ is a learnable offset. The zero center layer normalizes its input, which allows one to train deeper architectures, and is a restricted version of batch normalization compatible with symplecticity. (The full version of batch normalization indeed scales by the variance of each feature and does not preserve areas.)
3.2 Learning algorithm
According to the discussion in section 2, the map $\phi$ is determined by requiring that the original trajectory $(q(t), p(t))$ is mapped by $\phi^{-1}$ to circles. If the trajectory is sampled at time steps $t_1, \ldots, t_T$, such a $\phi$ minimizes the following loss, which encourages the distances from the origin of neighbouring points to be the same:
$\mathcal{L}(\phi) = \sum_{k=1}^{T-1} \sum_{i=1}^{n} \big( \rho_i(t_{k+1}) - \rho_i(t_k) \big)^2, \qquad (14)$
where $\rho_i(t)$ is the distance from the origin of the transformed point $\phi^{-1}(q(t), p(t))$ in the $i$-th plane.
A non-invertible or non-volume preserving network could minimize equation 14 trivially by collapsing the trajectories to zero or very small volumes in the transformed phase space: the symplecticity of the transformation is essential.
We therefore consider a learning algorithm that takes as input a batch of trajectories and minimizes the loss above averaged over the batch. In practice, we compute the trajectories by solving the original equations of motion with a Runge–Kutta solver, and we perform stochastic gradient descent parameter updates using the Adam optimizer. We randomly shuffle the points along the trajectories at every epoch, which ensures that we compare distant points in equation 14, so that all deviations from a circular shape are penalized in the same way.
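A toy version of this objective can be written in a few lines. The sketch below assumes a particular form of the circularity loss (comparing each transformed point to a randomly shuffled partner, as the shuffling step suggests); it is an illustration, not the paper's implementation:

```python
import numpy as np

def circle_loss(Q, P):
    # Q, P: arrays of shape (T, n) -- one transformed trajectory of T points.
    # Points of a trajectory should sit at a common distance from the origin
    # in each (Q_i, P_i) plane; penalize radius differences between pairs.
    r2 = 0.5 * (Q**2 + P**2)                 # candidate actions, shape (T, n)
    perm = np.random.permutation(len(r2))    # compare to random partners
    return np.mean((r2 - r2[perm])**2)

# A perfectly circular trajectory gives (numerically) zero loss.
theta = np.linspace(0.0, 2 * np.pi, 50)[:, None]
Q, P = np.cos(theta), np.sin(theta)
assert np.isclose(circle_loss(Q, P), 0.0)
```

Note that a non-symplectic network could drive this loss to zero by collapsing all points toward the origin, which is exactly the failure mode the symplectic constraint rules out.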
4 Experiments
We refer to appendix C for details of network architecture and other training hyperparameters.
We first discuss the representation capacity of deep symplectic flows. Figure 3 shows the pull-back of the trajectories under a network $\phi$ composed of the blocks defined in section 3.1. The parameters in $\phi$ are learned by running the algorithm of section 3.2 to convergence, feeding it a single trajectory sampled at discrete time steps. This shows that our model and algorithm can learn the action–angle map for both periodic (Kepler and Calogero-Moser) and quasi-periodic (Neumann) models.
We next investigate the generalization of our learning algorithm. By this we mean how well the learned symplectic flow can map unseen trajectories to circles. We present here results for the Neumann model. We consider a batch of trajectories, one for each radius of the sphere on which the positions of the oscillators move. Table 1 reports the loss of equation 14 evaluated over these trajectories for a symplectic flow learned using only a subset of them as training data. While the points in the training set have the smallest loss values compared to neighbouring radii, the fact that the loss is of the same order across this range of trajectories indicates that the model generalizes beyond the training points. To further substantiate this claim, we show in figure 4 the pull-back of the trajectories under $\phi$ for both the trajectories seen and unseen by the training algorithm.
The map $\phi$ thus learned can be used as a generative process for the physical system trajectories, as illustrated in figure 2. Interestingly, this allows one to visualize the points in phase space that correspond to a given Liouville–Arnold torus. Further, by varying the circles’ radii one can interpret the effect of the learned symmetries on the system under consideration.
5 Discussion
The learning algorithm discussed so far relies on being able to solve the equations of motion. This was done here by numerical integration, which adds questions of convergence and stability of the ODE solver on top of those of the learning algorithm. We remark that there are two possible ways to improve this. The first is to exploit the integrability of the models to solve the motion analytically, which is however not a trivial matter and typically requires uncovering a Lax pair formulation of the problem Babelon et al. (2003). The second is to use a different learning algorithm. For example, one could minimize a loss that encourages the transformed Hamiltonian to be angle-independent (recall equation 3), or one could minimize the Kullback–Leibler (KL) divergence between the canonical density associated with the transformed Hamiltonian and that of a base distribution. Both alternatives are analyzed in appendix D.
While we have concentrated here on integrable models, we expect our methods to be applicable to find almost conserved quantities in models close to integrable ones, such as the celebrated Fermi–Pasta–Ulam–Tsingou chains Gallavotti (2007). We thus expect the deep learning approach to classical mechanics presented here to be of practical relevance for solving physical problems, for example by finding integration schemes with smaller discretization errors.
- Arnol’d (2013) Arnol’d, V. I. Mathematical methods of classical mechanics, volume 60. Springer Science & Business Media, 2013.
- Babelon et al. (2003) Babelon, O., Bernard, D., Talon, M., Press, C. U., Landshoff, P., Nelson, D., Sciama, D., and Weinberg, S. Introduction to Classical Integrable Systems. Cambridge Monographs on Mathem. Cambridge University Press, 2003. ISBN 9780521822671. URL https://books.google.co.uk/books?id=c6JN6Gp4RBQC.
- Bronstein et al. (2017) Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- Cabrera et al. (2010) Cabrera, R., Strohecker, T., and Rabitz, H. The canonical coset decomposition of unitary matrices through Householder transformations. Journal of Mathematical Physics, 51:082101–082101, Aug 2010. doi: 10.1063/1.3466798.
- Caterini et al. (2018) Caterini, A. L., Doucet, A., and Sejdinovic, D. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, pp. 8167–8177, 2018.
- Cohen et al. (2018) Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018.
- de Gosson (2006) de Gosson, M. Symplectic Geometry and Quantum Mechanics. Operator Theory: Advances and Applications. Birkhäuser Basel, 2006. ISBN 9783764375751. URL https://books.google.co.uk/books?id=q9SHRvay75IC.
- Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear Independent Components Estimation. ArXiv e-prints, October 2014.
- Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. ArXiv e-prints, May 2016.
- Gallavotti (2007) Gallavotti, G. The Fermi-Pasta-Ulam problem: a status report, volume 728. Springer, 2007.
- Goldstein et al. (2013) Goldstein, H., Poole, C., and Safko, J. Classical Mechanics. Pearson, 2013. ISBN 9781292026558. URL https://books.google.nl/books?id=bhVnngEACAAJ.
- Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224, 2017.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. Adaptive computation and machine learning. MIT Press, 2016. ISBN 9780262035613. URL https://books.google.co.uk/books?id=Np9SDQAAQBAJ.
- Greydanus et al. (2019) Greydanus, S., Dzamba, M., and Yosinski, J. Hamiltonian neural networks. arXiv preprint arXiv:1906.01563, 2019.
- Higgins et al. (2018) Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
- Hoffman et al. (2019) Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. Neutra-lizing bad geometry in hamiltonian monte carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.
- Jacobsen et al. (2018) Jacobsen, J.-H., Smeulders, A., and Oyallon, E. i-revnet: Deep invertible networks. arXiv preprint arXiv:1802.07088, 2018.
- Jaynes (1957) Jaynes, E. T. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.
- Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. ArXiv e-prints, July 2018.
- Levy et al. (2017) Levy, D., Hoffman, M. D., and Sohl-Dickstein, J. Generalizing hamiltonian monte carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
- Mallat (2016) Mallat, S. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150203, 2016.
- Moser (1976) Moser, J. Three integrable hamiltonian systems connected with isospectral deformations. In Surveys in Applied Mathematics, pp. 235–258. Elsevier, 1976.
- Neal (2012) Neal, R. M. MCMC using Hamiltonian dynamics. ArXiv e-prints, June 2012.
- Perelomov (1990) Perelomov, A. M. Integrable systems of classical mechanics and Lie algebras. Birkhäuser, 1990.
- Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pp. 1530–1538. JMLR. org, 2015.
- Saletan & José (1998) Saletan, J. and José, J. Classical dynamics: A contemporary approach, 1998.
- Salimans et al. (2014) Salimans, T., Kingma, D. P., and Welling, M. Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. ArXiv e-prints, October 2014.
- Tomczak & Welling (2016) Tomczak, J. M. and Welling, M. Improving variational auto-encoders using householder flow. CoRR, abs/1611.09630, 2016. URL http://arxiv.org/abs/1611.09630.
- Wolf et al. (2016) Wolf, C., Karl, M., and van der Smagt, P. Variational inference with hamiltonian monte carlo. arXiv preprint arXiv:1609.08203, 2016.
Appendix A Symplecticity of irrotational MLP
Consider a three-layer perceptron
$T(q) = W_2\, a, \qquad a = \sigma(W_1 q + b),$
where $W_1, W_2$, $b$, $\sigma$ and $a$ are respectively the weights, bias, activation functions and activations of the layers. The constraint of equation 7 then takes the form of the symmetry of the Jacobian
$\frac{\partial T}{\partial q} = W_2\, \mathrm{diag}\big(\sigma'(W_1 q + b)\big)\, W_1,$
which holds for all $q$ if $W_2 = W_1^\top \mathrm{diag}(\eta)$ for a vector of scalings $\eta$, since then $\partial T / \partial q = W_1^\top \mathrm{diag}\big(\eta \odot \sigma'\big) W_1$ is symmetric. This gives the irrotational MLP $T(q) = W_1^\top\big(\eta \odot \sigma(W_1 q + b)\big)$ of section 3.1.1.
Appendix B Hamiltonians and conserved quantities
We provide details of the three integrable models studied.
The Hamiltonian of the Kepler model is
$H = \frac{|p|^2}{2} - \frac{1}{|q|},$
written in terms of the relative coordinate $q \in \mathbb{R}^3$ and its conjugate momentum $p$, in units where the reduced mass and the gravitational coupling are one.
As well as the Hamiltonian and the three components of the angular momentum $L = q \times p$, the Kepler model has an additional conserved quantity called the Laplace–Runge–Lenz vector Goldstein et al. (2013)
$A = p \times L - \frac{q}{|q|}.$
This gives a total of seven conserved quantities. However, a one dimensional trajectory in a six dimensional phase space can have at most five conserved quantities that specify the trajectory. Thus there must be two relations between these quantities. They are
$A \cdot L = 0, \qquad |A|^2 = 1 + 2 H |L|^2.$
The existence of five independent conserved quantities, two more than the three required for integrability, means that the trajectories form closed curves rather than filling the Liouville–Arnold tori. This situation is called superintegrability.
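These conservation laws are easy to confirm numerically. The sketch below (with unit masses and coupling, an assumption matching the conventions above) integrates a Kepler orbit with leapfrog and checks that $L$ and $A$ stay constant:

```python
import numpy as np

def force(q):
    # inverse-square attraction toward the origin (relative coordinate)
    return -q / np.linalg.norm(q)**3

def lrl(q, p):
    # Laplace-Runge-Lenz vector A = p x L - q/|q|
    L = np.cross(q, p)
    return np.cross(p, L) - q / np.linalg.norm(q)

q = np.array([1.0, 0.0, 0.0])
p = np.array([0.0, 0.9, 0.0])      # bound, eccentric orbit
L0 = np.cross(q, p)
A0 = lrl(q, p)
dt = 1e-3
for _ in range(20000):             # leapfrog steps over several orbits
    p = p + 0.5 * dt * force(q)
    q = q + dt * p
    p = p + 0.5 * dt * force(q)

# Angular momentum is conserved exactly by leapfrog for a central force;
# the LRL vector is conserved up to the O(dt^2) discretization error.
assert np.allclose(np.cross(q, p), L0, atol=1e-9)
assert np.allclose(lrl(q, p), A0, atol=1e-3)
```

That $A$ barely precesses over many orbits is the numerical signature of the closed, elliptical orbit; $|A|$ here equals the orbit's eccentricity.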
The Neumann Hamiltonian is
$H = \frac{1}{4} \sum_{k \neq l} J_{kl}^2 + \frac{1}{2} \sum_k a_k x_k^2, \qquad J_{kl} = x_k p_l - x_l p_k,$
with distinct frequencies $a_k$. An independent set of constants of motion are Babelon et al. (2003)
$F_k = x_k^2 + \sum_{l \neq k} \frac{J_{kl}^2}{a_k - a_l},$
for all $a_k$ different. The property $\{F_k, F_l\} = 0$ is easily checked using the canonical Poisson brackets. The Hamiltonian can be expressed as the linear combination
$H = \frac{1}{2} \sum_k a_k F_k,$
while the conserved sum $\sum_k F_k = \sum_k x_k^2$, showing that motion is confined to the sphere $\sum_k x_k^2 = 1$.
The Calogero–Moser model is given by
$H = \frac{1}{2} \sum_j p_j^2 + g^2 \sum_{j < k} \frac{1}{(x_j - x_k)^2}.$
The discussion of integrals of motion is facilitated by presenting the equations of motion in matrix form Perelomov (1990). Introducing the matrices
$L_{jk} = p_j\, \delta_{jk} + (1 - \delta_{jk})\, \frac{i g}{x_j - x_k}, \qquad M_{jk} = \delta_{jk} \sum_{l \neq j} \frac{i g}{(x_j - x_l)^2} - (1 - \delta_{jk})\, \frac{i g}{(x_j - x_k)^2},$
the equations of motion are equivalent to the Lax equation
$\dot L = [L, M].$
From this point it can be shown that the quantities
$I_k = \mathrm{tr}\, L^k$
are all conserved.
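The conservation of the Lax traces can be checked numerically. A sketch for a hypothetical 3-particle chain with coupling $g = 1$ (assuming the standard Lax matrix $L_{jk} = p_j\delta_{jk} + (1-\delta_{jk})\,ig/(x_j - x_k)$):

```python
import numpy as np

g = 1.0

def force(x):
    # F_i = 2 g^2 sum_{k != i} 1/(x_i - x_k)^3, from the inverse-square potential
    d = x[:, None] - x[None, :]
    np.fill_diagonal(d, 1.0)          # placeholder; masked out below
    f = 1.0 / d**3
    np.fill_diagonal(f, 0.0)          # no self-interaction
    return 2.0 * g**2 * f.sum(axis=1)

def lax(x, p):
    d = x[:, None] - x[None, :]
    np.fill_diagonal(d, 1.0)          # placeholder; overwritten below
    L = 1j * g / d
    np.fill_diagonal(L, p)
    return L

x = np.array([-1.0, 0.1, 1.2])
p = np.array([0.3, -0.1, -0.2])
L0 = lax(x, p)
t2 = np.trace(L0 @ L0).real           # tr L^2 equals twice the energy
t3 = np.trace(L0 @ L0 @ L0).real
dt = 1e-4
for _ in range(5000):                 # leapfrog integration of the chain
    p = p + 0.5 * dt * force(x)
    x = x + dt * p
    p = p + 0.5 * dt * force(x)
L1 = lax(x, p)
assert np.isclose(np.trace(L1 @ L1).real, t2, atol=1e-5)
assert np.isclose(np.trace(L1 @ L1 @ L1).real, t3, atol=1e-5)
```

Because $L$ is Hermitian, the traces are real, and the Lax equation implies they are constant along any trajectory.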
Appendix C Details of experiments
We present here details of the experiments of section 4.
In figures 2 and 3 we fixed the frequencies of the Neumann model and the couplings of the Calogero-Moser model. The network architecture used is composed of blocks of zero-center, linear symplectic and symplectic additive coupling layers of section 3.1, applied in this order. The linear symplectic layer has an even number of Householder reflections whose reflection vectors are initialized to be identical, so that their product is the identity. The irrotational MLP used in the additive coupling layer (see appendix A) has a fixed hidden dimension and activation function.
The data reported in table 1 and figure 4 was produced by sampling solutions of the equations of motion at evenly spaced points in time. Training was done with the Adam optimizer using mini-batches. We trained for a fixed number of epochs using a piecewise-constant learning rate schedule.
Appendix D Alternative algorithms
D.1 Minimizing the angle dependence of the transformed Hamiltonian
The algorithm of section 3.2 used numerical solutions of the equations of motion for a batch of initial conditions. This can be expensive for large systems and also introduces numerical errors and questions of convergence of the ODE solver. We here present an alternative procedure. Recall that the transformed Hamiltonian $K$ of equation 4 satisfies $\partial K / \partial \varphi_i = 0$. We could therefore find $\phi$ by minimizing the expectation of the norm of $\nabla_\varphi K$. However, the natural expectation is with respect to the canonical density of the model, which is itself unknown. We propose to replace the canonical density with the exponential density
$\rho(\varphi, I) = \frac{1}{(2\pi)^n} \prod_{i=1}^{n} e^{-I_i},$
where the numerical prefactor comes from the uniform density of the angles. This choice is motivated by the fact that integrable models are expected to be described by the generalized Gibbs ensemble when the system is large Jaynes (1957), which corresponds to an exponential distribution for the actions. Therefore, we propose the following alternative loss:
$\mathcal{L}' = \mathbb{E}_{\rho}\big[\, \| \nabla_\varphi K \|^2 \,\big],$
where we estimate the expectation over minibatches.
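A toy Monte Carlo estimate of this loss can be written directly, here for a hypothetical angle-dependent $K(\varphi, I) = I + \epsilon\, I \cos\varphi$ (an illustrative example, not one of the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(eps, n_samples=100_000):
    # Sample (phi, I) from the proposal: phi uniform on [0, 2*pi),
    # I exponential, and estimate E[ |dK/dphi|^2 ] by Monte Carlo.
    phi = rng.uniform(0.0, 2 * np.pi, n_samples)
    I = rng.exponential(1.0, n_samples)
    dK_dphi = -eps * I * np.sin(phi)     # analytic angle derivative of K
    return np.mean(dK_dphi**2)

# The loss vanishes exactly when K is angle-independent (eps = 0)
# and grows with the strength of the angle dependence.
assert loss(0.0) == 0.0
assert loss(0.1) < loss(0.5)
```

In the actual algorithm $K = H \circ \phi$ for the learned flow, so the angle derivative would be obtained by automatic differentiation rather than analytically.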
For illustration, we present some results using this algorithm for the Neumann model. For simplicity, we trained the model by sampling uniformly in the angles but fixing a single value of the actions, effectively replacing the exponential measure with a Dirac measure, which physically amounts to restricting sampling to a single Liouville–Arnold torus. The model parameters and network architecture are similar to those of appendix C. To test the algorithm, we solve the Neumann equations of motion for an initial condition corresponding to the value of the actions used for training, and check that the inverse map $\phi^{-1}$ is able to trivialize the trajectory by mapping it to a circle. We stress that, unlike in section 4, here the trajectories are not inputs to the training algorithm, which is fully unsupervised. The result is presented in figure 5 and validates the procedure.
While not relying on numerical solutions of the equations of motion is appealing, we noticed that this algorithm takes longer to converge than that of section 3.2. This is partly due to the presence of a derivative in the loss, which increases the computational cost.
D.2 Normalizing flows and KL minimization
The algorithm of section D.1 relies on the exponential ansatz for the sampling measure, which, lacking rigorous results about its validity, may be suboptimal in practice. Also, while one learns the transformed Hamiltonian $K$ directly, one does not know how to sample efficiently from it. Both issues can be resolved if we take a learnable density given by the pushforward of a base distribution under a map $f_\theta$, where $f_\theta$ is a normalizing flow Dinh et al. (2016). The base distribution can be taken to be the exponential distribution.
In this formulation, a natural loss to consider is the following KL divergence:
$\mathrm{KL}\big( \rho_\theta \,\big\|\, e^{-K}/Z \big),$
where $e^{-K}/Z$ is the canonical density associated with the transformed Hamiltonian $K$. We leave the numerical study of the minimization of this loss for future work.