Lagrangian Neural Networks

03/10/2020 ∙ by Miles Cranmer, et al. ∙ 9

Accurate models of the world are built upon notions of its underlying symmetries. In physics, these symmetries correspond to conservation laws, such as for energy and momentum. Yet even though neural network models see increasing use in the physical sciences, they struggle to learn these symmetries. In this paper, we propose Lagrangian Neural Networks (LNNs), which can parameterize arbitrary Lagrangians using neural networks. In contrast to models that learn Hamiltonians, LNNs do not require canonical coordinates, and thus perform well in situations where canonical momenta are unknown or difficult to compute. Unlike previous approaches, our method does not restrict the functional form of learned energies and will produce energy-conserving models for a variety of tasks. We test our approach on a double pendulum and a relativistic particle, demonstrating energy conservation where a baseline approach incurs dissipation and modeling relativity without canonical coordinates where a Hamiltonian approach fails. Finally, we show how this model can be applied to graphs and continuous systems using a Lagrangian Graph Network, and demonstrate it on the 1D wave equation.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


Lagrangian Neural Networks

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Neural networks excel at finding patterns in noisy and high-dimensional data. They can perform well on tasks such as image classification

(Krizhevsky et al., 2012), language translation (Sutskever et al., 2014), and game playing (Silver et al., 2017)

. Part of their success is rooted in "deep priors" which include hard-coded translation invariances (e.g., convolutional filters), clever architectural choices (e.g., self-attention layers), and well-conditioned optimization landscapes (e.g., batch normalization). Yet, in spite of their ability to compete with humans on challenging tasks, these models lack many basic intuitions about the dynamics of the physical world. Whereas a human can quickly deduce that a ball thrown upward will return to their hand at roughly the same velocity, a neural network may never grasp this abstraction, even after seeing thousands of examples

(Bakhtin et al., 2019).

The basic problem with neural network models is that they struggle to learn basic symmetries and conservation laws. One solution to this problem is to design neural networks that can learn arbitrary conservation laws. This was the core motivation behind Hamiltonian Neural Networks by Greydanus et al. (2019) and Hamiltonian Generative Networks by Toth et al. (2019). These are both types of differential equations modeled by neural networks, or “Neural ODEs,” which were studied extensively in Chen et al. (2018).

Yet while these models were able to learn effective conservation laws, the Hamiltonian formalism requires that the coordinates of the system be “canonical.” In order to be canonical, the input coordinates to the model should obey a strict set of rules given by the Poisson bracket relations:


where indicates a time derivative and the Lagrangian. The problem is that many datasets have dynamical coordinates that do not satisfy this constraint: is not simply mass in many cases. A promising alternative is to learn the Lagrangian of the system instead. Like Hamiltonians, Lagrangians enforce conservation of total energy, but unlike Hamiltonians, they can do so using arbitrary coordinates.

In this paper, we show how to learn Lagrangians using neural networks. We demonstrate that they can model complex physical systems which Hamiltonian Neural Networks (HNNs) fail to model while outperforming a baseline neural network in energy conservation. We discuss our work in the context of “Deep Lagrangian Networks,” (DeLaNs) described in Lutter et al. (2019), a closely related method for learning Lagrangians. Unlike our work, DeLaNs are built for continuous control applications and only models rigid body dynamics. Our model is more general in that it does not restrict the functional form of the Lagrangian. We also show how our model can be applied to continuous systems and graphs using a Lagrangian Graph Network in section 5.


Neural net Neural ODE HNN DeLaN LNN (ours)
Can model dynamical systems
Learns differential equations
Learns exact conservation laws
Learns from arbitrary coords.
Learns arbitrary Lagrangians
Table 1: An overview of neural network based models for physical dynamics. Lagrangian Neural Networks combine many desirable properties. DeLaNs explicitly restrict the learned kinetic energy, and HNNs implicitly do as well by requiring a definition of the canonical momenta.

2 Theory

Lagrangians. The Lagrangian formalism models a classical physics system with coordinates that begin in one state and end up in another state . There are many paths that these coordinates might take as they pass from to , and we associate each of these paths with a scalar value called “the action.” Lagrangian mechanics tells us that if we let the action be the functional:


where is kinetic energy and is potential energy, then there is only one path that the physical system will take, and that path is the stationary (e.g., minimum) value of . In order to enforce the requirement that be stationary, i.e., , we define the Lagrangian to be , and derive a constraint equation called the Euler-Lagrange equation which describes the path of the system:


Euler-Lagrange with a parametric Lagrangian. Physicists traditionally write down an analytical expression for and then symbolically expand the Euler-Lagrange equation into a system of differential equations. However, since

is now a black box, we must resign ourselves to the fact that there can be no analytical expansion of the Euler-Lagrange equation. Fortunately, we can still derive a numerical expression for the dynamics of the system. We begin by rewriting the Euler-Lagrange equation in vectorized form as



. Then we can use the chain rule to expand the time derivative

through the gradient of the Lagrangian, giving us two new terms, one with a and another with a :


Here, these products of operators are matrices, e.g., . Now we can use a matrix inverse to solve for :


For a given set of coordinates , we now have a method for calculating from a black box Lagrangian, which we can integrate to find the dynamics of the system. In the same manner as Greydanus et al. (2019)

, we can also write a loss function in terms of the discrepancy between

and .

3 Related Work

Physics priors. A variety of previous works have sought to endow neural networks with physics-motivated inductive biases. These include work for domain-specific problems in molecular dynamics (Rupp et al., 2012), quantum mechanics (Schütt et al., 2017), or robotics (Lutter et al., 2019). Other are more general, such as Interaction Networks (Battaglia et al., 2016), which models physical interactions as message passing on a graph.

Learning invariant quantities. Recent work by Greydanus et al. (2019), Toth et al. (2019), and Chen et al. (2019) built on previous approaches of endowing neural networks with physical priors by demonstrating how to learn invariant quantities by approximating a Hamiltonian with a neural network. In this paper, we follow the same approach as Greydanus et al. (2019), but with the objective of learning a Lagrangian rather than a Hamiltonian so not to restrict the learned kinetic energy.

DeLaN. A closely related previous work is “Deep Lagrangian Networks”, or DeLaN (Lutter et al., 2019), in which the authors show how to learn specific types of Lagrangian systems. They assume that the kinetic energy is an inner product of the velocity: , where is a -dependent positive definite matrix. This approach works well for rigid body dynamics, which includes many systems encountered in robotics. However, many systems do not have this kinetic energy, including, for example, a charged particle in a magnetic field, and a fast-moving object in special relativity.

4 Methods

Solving Euler-Lagrange with JAX. Efficiently implementing equation 6 represents a formidable technical challenge. In order to train this forward model, we need to compute the inverse Hessian

(we use the pseudoinverse to avoid potential singular matrices) of a neural network and then perform backpropagation. The matrix inverse scales as

with the number of coordinates . Perhaps surprisingly, though, we can implement this operation in a few lines of JAX (Bradbury et al., 2018) (see Appendix). We publish our code on GitHub111 .

Training details. For both baseline, HNN, and LNN models, we used a four-layer neural network model with 500 hidden units, a decaying learning rate starting at , and a batch size of 32. We initialize our model using a novel initialization scheme described in appendix C, which was optimized for LNNs.

Activation functions.

Since we take the Hessian of a LNN, the second-order derivative of the activation function is important. For example, the ReLU nonlinearity is a poor choice because its second-order derivative is zero. In order to find a better activation function, we performed a hyperparameter search over ReLU

(squared), ReLU, tanh, sigmoid, and softplus activations on the double pendulum problem. We found that softplus performed best and thus used it for all experiments.

5 Experiments

Double pendulum. The first task we considered was a dataset of simulated double pendulum trajectories. We set the masses and lengths to and learned the instantaneous accelerations over 600,000 random initial conditions. We found that the LNN and the baseline converged to similar final losses of and , respectively. The more significant difference was in energy conservation; the LNN almost exactly conserved the true energy over time, whereas the baseline did not. Averaging over 40 random initial conditions with 100 time steps each, the mean energy discrepancy between the true total energy and predicted was 8% and 0.4% of the max potential energy for the baseline and LNN models respectively.


(a) Error in angle


(b) Error in energy
Figure 2: Results on the double pendulum task. In (a), we see that the LNN and baseline model perform similarly in modelling the dynamics of the pendulum over short time periods. However, if we plot out the energy over a very long time period in (b) we can see that the LNN model conserves the total energy of the system significantly better than the baseline.

Relativistic particle in a uniform potential. The second task was a relativistic particle in a uniform potential. For a particle with a mass of in a potential and with , special relativity gives the Lagrangian . The canonical momenta of this system are , which means that a Hamiltonian Neural Network will fail if given simple observables like and . The DeLaN model will also struggle since it assumes that is second order in . To verify these predictions, we trained HNN and LNN models on systems with random initial conditions and values of . Figure 3 shows that the HNN fails without canonical coordinates whereas the LNN can work without this extra a priori knowledge, and learns the system as accurately as an HNN trained on canonical coordinates.

(a) HNN, arbitrary coords.
(b) HNN, canonical coords.
(c) LNN, arbitrary coords.
Figure 3: Results on the relativistic particle task. In (a) an HNN model fails to learn dynamics from non-canonical coordinates. In (b) the HNN succeeds when given canonical coordinates. Finally in (c) the LNN learns accurate dynamics even from non-canonical coordinates.

Wave equation with Lagrangian Graph Networks. Equation 6 is applicable to many different types of systems, including those with disconnected coordinates such as graph-like and grid-like systems. To model this, we now learn a Lagrangian density which is summed to form the full Lagrangian. This is similar to the Hamiltonian Graph Network of Sanchez-Gonzalez et al. (2019), or more specifically, the Flattened Hamiltonian Graph Network of Cranmer et al. (2020). For simplicity, we focus on 1D grids with locally-dependent dynamics (i.e., nodes are connected to their two adjacent nodes).

A continuous material can be described in terms of quantities , such as the displacement of a guitar string, at each gridpoint . One way to model this type of system is to treat the quantities on adjacent gridpoints as independent coordinates, and sum the regular LNN equation over all connected groups of coordinates. For gridpoints, the total Lagrangian is then:


where is the set of indices connected to . For a 1D grid where only the adjacent gridpoints affect the dynamics of the center gridpoint, this is . Again, we can solve dynamics by plugging the Lagrangian into eq. 6, now with as the coordinates:


where, e.g., . Note that the Hessian matrix is sparse with non-zero entries at “neighbor of neighbor” positions in the graph, which can make it much more efficient to calculate and invert. For example, in 1D it can be calculated with only 5 forward over backwards auto-differentiation passes and can be inverted in linear time.

We can think of this as a regular LNN where, instead of calculating the Lagrangian directly from a fixed set of coordinates, we are now accumulating a Lagrangian density over groups of coordinates. For a different connectivity, such as a graph network, or to approximate higher-order spatial derivatives for a continuous material, one would select based on the adjacency matrix for the graph. This model is a type of Graph Neural Network (Scarselli et al., 2008). The Lagrangian density itself, , we model as an MLP. Since we write our models in JAX, we can easily vectorize this forward model over the grid.

We consider the 1D wave equation, with representing the wave displacement: . For wave speed , the equation describing its dynamics can be written as: . The Lagrangian for this differential equation is: For the LNN to learn this, the MLP will need to learn to approximate a finite difference operator: for grid spacing (or any translation + scaling of this equation). We simulate the wave equation in a box with periodic boundary conditions, and learn the dynamics with this Lagrangian Graph Network, shown in fig. 4. The Lagrangian Graph Network models the wave equation accurately and almost exactly conserves energy integrated across the material.

Figure 4: Results on the 1D wave equation task, using a Lagrangian Graph Network. Here, the LNN models the Lagrangian density over three adjacent gridpoints along the wave, and this density is summed. Here, the waves make approximately three and a half passes across the 100 grid points over the time plotted. A movie can be viewed by clicking on the plot or by going to

6 Conclusion

We have introduced a new class of neural networks, Lagrangian Neural Networks, which can learn arbitrary Lagrangians. In contrast to models that learn Hamiltonians, these models do not require canonical coordinates and thus perform well in situations where canonical momenta are unknown or difficult to compute. To evaluate our model, we showed that it could effectively conserve total energy on a complex physical system: the double pendulum. We showed that our model could learn non-trivial canonical momenta on a task where Hamiltonian learning struggles. Finally, we demonstrated a graph version of the model with the 1D wave equation.

Acknowledgments Miles Cranmer would like to thank Jeremy Goodman for several discussions and validation of the theory behind the Lagrangian approach, and Adam Burrows and Oliver Philcox for very helpful comments on a draft of the paper.


Appendix A Solving Euler-Lagrange with Jax

Here we will present a simple JAX implementation of Equation 6. Assume that lagrangian is a differentiable function that takes three vectors as input and outputs a scalar. Meanwhile, q_t is a vector of the velocities () of q () and q_tt contains the accelerations (). The vector m represents non-dynamical parameters.

q_tt = (
 jax.numpy.linalg.pinv(jax.hessian(lagrangian, 1)(q, q_t, m)) @ (
    jax.grad(lagrangian, 0)(q, q_t, m)
  - jax.jacobian(jax.jacobian(lagrangian, 1), 0)(q, q_t, m) @ q_t

When this is called in a loss function with the LNN parameters as input: loss(params, …), one can write jax.grad(loss, 0)(params, …), to get the gradient.

Appendix B Example of Lagrangian Forward Model

To demonstrate how this may work if one has learned the exact function for , let us study an example of a ball falling in a gravity along the direction :


where is the local scalar gravitational field, and is the mass of the ball. To obtain the dynamics with our forward model, we calculate the required derivatives which gives us:


Thus, we find


which simply says that the particle accelerates downwards along , and moves at constant velocity along .

Appendix C Initialization

Regular optimization techniques such as Kaiming (He et al., 2015) and Xavier (Glorot & Bengio, 2010)

initialization are optimized so that in a regular neural network, the gradients of the output with respect to each parameter will have a mean of zero and standard deviation of one. Since the unusual optimization objective of a Lagrangian Neural Network is very nonlinear with respect to its parameters, we found that classical initialization schemes were insufficient.

To find a better initialization scheme, we conducted an empirical optimization of the KL-divergence of the gradient of each parameter with respect to a univariate Gaussian. We repeated this over a variety of neural network depths and widths and fit an empirical formula to our results.

To do this, we ran 2500 optimization steps with different initialization variances on each layer for an MLP of fixed depth and width. Biases were always initialized to zero. We recorded the optimized

values over 200 random hyperparameter settings with the number of hidden nodes between 50 and 300 and the number of hidden layers between one and three. Then, we used eureqa (Schmidt & Lipson, 2009) to find fit an equation using symbolic regression that predicted the optimal initialization variance as a function of the hyperparameters:


This model was optimized for 2 input coordinates and 2 input coordinate velocities which were sampled from univariate Gaussians.

During training, the hidden weight matrices had dimensions and each weight matrix was sampled from . A 100-node 4-layer model would have weight matrices with shapes and each one had initializations sampled from respectively.