 # Learning and Interpreting Potentials for Classical Hamiltonian Systems

We consider the problem of learning an interpretable potential energy function from a Hamiltonian system's trajectories. We address this problem for classical, separable Hamiltonian systems. Our approach first constructs a neural network model of the potential and then applies an equation discovery technique to extract from the neural potential a closed-form algebraic expression. We demonstrate this approach for several systems, including oscillators, a central force problem, and a problem of two charged particles in a classical Coulomb potential. Through these test problems, we show close agreement between learned neural potentials, the interpreted potentials we obtain after training, and the ground truth. In particular, for the central force problem, we show that our approach learns the correct effective potential, a reduced-order model of the system.


## 1 Introduction

As a cornerstone of classical physics, Hamiltonian systems arise in numerous settings in engineering and the physical sciences. Common examples include coupled oscillators, systems of particles/masses subject to classical electrostatic or gravitational forces, and rigid bodies. For integer $d \geq 1$, let $q(t)$ and $p(t)$ denote, respectively, the position and momentum of the system at time $t$. Let $T$ and $V$ denote kinetic and potential energy, respectively. Our focus here is on classical, separable systems that arise from the Hamiltonian

$$H(p, q) = T(p) + V(q). \tag{1}$$

In this paper, we consider the problem of learning or identifying the potential energy $V$ from data measured at a discrete set of times. We assume the kinetic energy $T$ is known. To motivate this problem, consider a system of interacting particles in three-dimensional space. Suppose that we are truly interested in a reduced set of variables, e.g., the position and momentum of just one of the particles. The direct approach is to integrate numerically the full system of differential equations for the Hamiltonian (1) and then use the full solution to compute the reduced quantities of interest. While such an approach yields numerical answers, typically it does not explain how the reduced-order system evolves dynamically in time. If we suspect that the reduced system itself satisfies a Hamiltonian system, we can search for a potential that yields an accurate, reduced-order model. If this potential is interpretable, we can use it to explain the reduced system's dynamics. Here we mean interpretability in the sense of traditional models in the physical sciences, which are written as algebraic expressions, not as numerical algorithms. We can also use the reduced-order model to simulate the quantities of interest directly, with computational savings that depend on the degree of dimensional reduction.

There is a rapidly growing literature on machine learning of potential energies in computational/physical chemistry, e.g., [3, 9, 10, 2, 1]. As in these studies, the present work uses neural networks to parameterize the unknown potential. A key difference is that, in the present work, we apply additional methods to interpret the learned neural potential. There is also a burgeoning, recent literature on learning interpretable dynamical systems from time series, e.g., [5, 4, 6, 12, 8], to cite but a few. We repurpose one such method, SINDy (sparse identification of nonlinear dynamics) [5], to convert the learned neural potential into a closed-form algebraic expression that is as interpretable as classical models. SINDy is only one possible method for accomplishing this conversion; we hope that the results described here lead to further investigation in this area.

## 2 Approach

Assume $T(p) = \tfrac{1}{2} p^{T} M^{-1} p$, where $M$ is a diagonal mass matrix. Then, from (1), we can write Hamilton's equations:

$$\dot{q} = M^{-1} p, \tag{2a}$$
$$\dot{p} = -\nabla V(q). \tag{2b}$$

Let the training data consist of a set of $R$ trajectories; the $j$-th such trajectory is $\{(q^j_i, p^j_i)\}_{i=0}^{N}$. Here $(q^j_i, p^j_i)$ denotes a measurement of $(q(t), p(t))$ at time $t_i = ih$ for the $j$-th initial condition. We choose this equispaced temporal grid for simplicity; this choice is not essential. Because we treat the kinetic energy as known, we assume that the training data consists of (possibly noisy) measurements of a system that satisfies (2a). We now posit a model $V(q; \theta)$ that depends on a set of parameters $\theta$. For instance, if we model $V$ using a neural network, $\theta$ stands for the collection of all network weights and biases. Then we use (2b) to form an empirical risk loss

$$\mathcal{L}(\theta) = \frac{1}{RN} \sum_{j=1}^{R} \sum_{i=0}^{N-1} \left\| \frac{p^j_{i+1} - p^j_i}{h} + \nabla_q V(q^j_i; \theta) \right\|^2. \tag{3}$$

Let $t_N = Nh$ denote the final time in our grid. Note that (3) approximates the expected mean-squared error of a random trajectory assumed to satisfy (2b).
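As a concrete illustration, here is a minimal NumPy sketch of the loss (3) in one dimension, using a hypothetical polynomial model in place of the neural network (the function name and coefficient layout are our own illustration, not the paper's code):

```python
import numpy as np

def empirical_risk(theta, q, p, h):
    """Loss (3) for d = 1: finite-difference of p plus the model's gradient.

    q, p : arrays of shape (R, N+1), R trajectories at N+1 equispaced times.
    theta: coefficients of an illustrative polynomial model
           V(q; theta) = sum_k theta[k] * q**k, so that
           dV/dq = sum_k k * theta[k] * q**(k-1).
    """
    grad_V = sum(k * c * q[:, :-1] ** (k - 1)
                 for k, c in enumerate(theta) if k > 0)
    # residual of Hamilton's equation (2b): (p_{i+1} - p_i)/h + dV/dq(q_i)
    resid = (p[:, 1:] - p[:, :-1]) / h + grad_V
    R, N = resid.shape
    return np.sum(resid ** 2) / (R * N)
```

For an exact harmonic oscillator trajectory with $V(q) = q^2/2$, i.e., theta = [0, 0, 0.5], the residual vanishes up to the $O(h)$ error of the forward difference, so the loss is small; a wrong coefficient produces a large loss.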

We model $V(q; \theta)$ using a dense, feedforward neural network. Because we train with multiple trajectories, the input layer takes data in the form of two tensors, one for the positions and one for the momenta. The network then transposes and flattens the data. Thus begins the potential energy function part of the network (referred to in what follows as the neural potential), which takes a $d$-dimensional vector as input and produces a scalar as output. Between the neural potential's $d$-unit input layer and $1$-unit output layer, we have a number of hidden layers, typically chosen to all have the same number of units. As these architectural details differ by example, we give them below.

Note that the loss (3) involves the gradient of $V$ with respect to the input $q$. We use automatic differentiation to compute this gradient. More specifically, in our IPython/Jupyter notebooks (linked below), we use the batch_jacobian method in TensorFlow. This is easy to implement, fast, and accurate up to machine precision.

The trained network gives us a neural potential $\hat{V}$. To interpret $\hat{V}$, we apply the SINDy method [5]. We now offer a capsule summary of this technique. Suppose we have a grid of $m$ points in $\mathbb{R}^d$. We evaluate $\hat{V}$ on the grid, resulting in a vector of values that we denote by $V$. We also evaluate on the grid a library of $J$ candidate functions $\xi_j$; each such evaluation results in a vector that we take to be the $j$-th column of a matrix $\Xi$. Examples of candidate functions include powers of the input such as $x^2$ or $x^{-1}$ and, in higher dimensions, products of coordinates such as $x_1 x_2$. Each candidate function is simply a scalar-valued function on $\mathbb{R}^d$.

Equipped with the vector $V$ and the matrix $\Xi$, we solve the regression problem

$$V = \Xi \beta + \epsilon \tag{4}$$

for the coefficient vector $\beta$ using an iteratively thresholded least-squares algorithm. The algorithm has one constant hyperparameter, the threshold $\lambda > 0$. The algorithm is then succinctly described as follows: (i) estimate $\beta$ using ordinary least squares, and then (ii) reset to zero all components of $\beta$ whose magnitudes are less than the threshold $\lambda$. Once components of $\beta$ are reset to zero, they stay frozen at zero. We then repeat steps (i) and (ii) until $\beta$ stabilizes to its converged value.
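A compact NumPy sketch of steps (i)–(ii); the function name and argument layout are our own, not those of the sindyr package:

```python
import numpy as np

def stlsq(Xi, V, lam, max_iter=20):
    """Iteratively thresholded least squares (the SINDy regression step).

    Xi  : (m, J) matrix of candidate functions evaluated on the grid.
    V   : (m,) vector of neural-potential values on the same grid.
    lam : threshold; components with |beta_j| < lam are frozen at zero.
    """
    # (i) initial ordinary least squares fit
    beta = np.linalg.lstsq(Xi, V, rcond=None)[0]
    for _ in range(max_iter):
        # (ii) threshold: small coefficients are reset to (and stay at) zero
        small = np.abs(beta) < lam
        beta[small] = 0.0
        active = ~small
        if not active.any():
            break
        # (i) re-fit least squares on the surviving candidate functions
        beta[active] = np.linalg.lstsq(Xi[:, active], V, rcond=None)[0]
    return beta
```

On noiseless samples of $V(q) = 1 + q^2/2$ with the library $\{1, q, q^2, q^3\}$ and a moderate threshold, the iteration zeroes the $q$ and $q^3$ columns and recovers the two active coefficients.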

As shown recently [13], this algorithm converges in a finite number of steps to an approximate minimizer of $\|V - \Xi\beta\|_2^2 + \lambda^2 \|\beta\|_0$. Here $\|\beta\|_0$ denotes the number of nonzero entries of $\beta$. Hence, increasing the parameter $\lambda$ leads to a more sparse set of coefficients $\beta$. Once we have fit the regression model in this way, we obtain an interpretable model of $\hat{V}$, specifically:

$$\hat{V}(x) = \sum_{j=1}^{J} \beta_j \xi_j(x) + \epsilon. \tag{5}$$

If $\beta$ is highly sparse, most of the coefficients $\beta_j$ will be zero. Suppose that the candidate functions are well-known functions such as positive or negative powers of the coordinates of the input $x$. In this case, the right-hand side of (5) will be a relatively short algebraic expression that is as interpretable as most potential energy functions routinely encountered in classical physics. The norm of $\epsilon$ here captures the error in this sparse approximation of $\hat{V}$. In general, one chooses $\lambda$ to balance the sparsity of $\beta$ against the quality of the approximation.

## 3 Tests

We now describe a series of increasingly complex tests that demonstrate the proposed method. For each such model, we use either exact solutions or fine-scale numerical integration to create a corpus of time series measurements. Using the time series, we train a neural potential energy model, which we then interpret using SINDy. We use NumPy/SciPy or Mathematica for all data generation, TensorFlow for all neural network model development/training, and the sindyr package [7] in R to interpret the neural potential. In what follows, the mass matrix $M$ in (2) is the identity unless specified otherwise. In all cases, we train the neural network using gradient descent. We are committed to releasing all code/data at https://github.com/hbhat4000/learningpotentials.

### 3.1 Simple Harmonic Oscillator

The first model we consider is the simple harmonic oscillator ($d = 1$) with Hamiltonian

$$H(q, p) = \frac{p^2}{2} + \frac{q^2}{2}. \tag{6}$$

Exact trajectories consist of circles centered at the origin in $(q, p)$ space. For training data, we use a collection of such circles, each passing through a randomly chosen initial condition. We include the initial steps of each trajectory, recorded at a fixed time step, in the training set. Here our goal is to check how closely the neural potential can track the true potential $V(q) = q^2/2$. We take the neural potential model to have two hidden layers.

Figure 1: For the simple harmonic oscillator (6), after adjusting a constant bias, the neural potential $\hat{V}(q)$ closely matches the true potential $V(q) = q^2/2$.
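The circular trajectories used here can be generated in closed form; a small sketch (our own helper, not the paper's code):

```python
import numpy as np

def sho_trajectory(q0, p0, h, N):
    """Exact solution of (6): q(t) = q0 cos t + p0 sin t,
    p(t) = p0 cos t - q0 sin t, sampled at N+1 times with step h."""
    t = h * np.arange(N + 1)
    q = q0 * np.cos(t) + p0 * np.sin(t)
    p = p0 * np.cos(t) - q0 * np.sin(t)
    return q, p
```

Each trajectory traces a circle of radius $\sqrt{q_0^2 + p_0^2}$ in $(q, p)$ space, so the Hamiltonian (6) is constant along it.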

In Figure 1, we plot both the trained neural potential (in red) and the true potential (in black). When plotting $\hat{V}$, we have subtracted a constant bias (the minimum obtained value of $\hat{V}$) so that the curve reaches a minimum value of zero. Note that this constant bias is completely unimportant for the physics, as only the gradient $\nabla V$ appears in Hamilton's equations (2) and the loss (3). However, the constant bias does lead us to include an intercept in the regression model (4), i.e., to include the constant function $1$ in the set of candidate functions for SINDy. We follow this practice in all uses of SINDy below.

We apply SINDy to the learned potential with polynomial candidate functions. In the following test, and in fact throughout this paper, we start with a large threshold $\lambda$ and tune it downward until the error between the neural network potential and the SINDy-computed approximation drops below a small tolerance. We find that the estimated system is

$$\hat{V}(q) \approx \beta_0 + \beta_2 q^2 \tag{7}$$

with fitted values of $\beta_0$ and $\beta_2$, the latter close to the true coefficient $1/2$. We see that $\hat{V}$ closely tracks the true potential up to the constant bias term, which can be ignored.

Figure 2: For the double well potential (8), the neural potential $\hat{V}(q)$ trained on $T_1$ closely matches the true potential $V(q)$. This training set includes one high-energy trajectory that visits both wells. In red, we plot $\hat{V}(q)$ for $q \in T_1$; in green, we plot $\hat{V}(q)$ for $q \in [-1, 3] \setminus T_1$. Potentials were adjusted by a constant bias so that they both have minimum values equal to zero.

Figure 3: For the double well potential (8), we train neural potentials $\hat{V}(q)$ using, in turn, the training sets $T_2$ (left) and $T_3$ (right). We plot in red $\hat{V}(q)$ only for the values of $q$ covered by the respective training sets; in green, we extrapolate $\hat{V}(q)$ to values of $q$ that are not in the respective training sets. Since $T_2$ includes two trajectories, one from each well, the neural potential captures and extrapolates well to both wells. Conversely, because $T_3$ only includes trajectories that stay in one well, the neural potential completely misses one well. Potentials were adjusted by a constant bias so that they all have minimum values equal to zero.

### 3.2 Double Well

Let us consider a particle in a double well potential ($d = 1$):

$$V(q) = q^2 (q - 2)^2 - (q - 1)^2. \tag{8}$$

We take the kinetic energy to be $T(p) = p^2/2$. We now use explicit Runge-Kutta integration in Mathematica to form three training sets:

• Training set $T_1$ includes trajectories with random initial conditions chosen uniformly, one of which has sufficiently high energy to visit both wells.

• Training set $T_2$ consists of two trajectories, each of which starts and stays in an opposing well. The two initial conditions are symmetric across $q = 1$, the symmetry axis of $V$.

• Training set $T_3$ has only trajectories that stay in the left well.

For each trajectory, we record points at a fixed time step. We take the neural potential model to have two hidden layers. We train a separate neural potential on each training set $T_k$.

We seek to understand how the choice of training set affects the ability of the neural potential to track the true potential (8). We plot and discuss the results in Figures 2 and 3. Overall, the neural potentials trained using $T_1$ and $T_2$ match the truth closely, both on the training set and extrapolated to the rest of the interval $[-1, 3]$. Clearly, the neural potential trained using $T_3$ only captures one well and does not extrapolate correctly to the rest of the domain.

Let $\hat{V}_k$ denote the neural potential trained on $T_k$. We now apply SINDy to the output of each $\hat{V}_k$ only on its respective training set $T_k$, with polynomial candidate functions. For reference, the ground truth can be written as $V(q) = -1 + 2q + 3q^2 - 4q^3 + q^4$. Adjusting $\lambda$ downward as described above, we find the following algebraic expressions:

$$\hat{V}_1(q) \approx -8.138 + 2.0008\,q + 3.0009\,q^2 - 4.0009\,q^3 + 1.0001\,q^4$$
$$\hat{V}_2(q) \approx -6.061 + 2.0032\,q + 3.0054\,q^2 - 4.0148\,q^3 + 0.9748\,q^4$$
$$\hat{V}_3(q) \approx -6.165 + 1.9909\,q + 2.9991\,q^2 - 3.9886\,q^3 + 0.9955\,q^4$$

As the constant terms are irrelevant, all three models agree closely with the ground truth. The agreement between $V$ and the algebraic forms of $\hat{V}_1$ and $\hat{V}_2$ was expected. We find it somewhat surprising that SINDy, when applied to the output of $\hat{V}_3$ on its training set $T_3$, yields a quartic polynomial with two wells.
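The expanded form of the ground truth quoted above can be verified directly:

$$V(q) = q^2(q-2)^2 - (q-1)^2 = q^2(q^2 - 4q + 4) - (q^2 - 2q + 1) = q^4 - 4q^3 + 3q^2 + 2q - 1,$$

whose non-constant coefficients $(2, 3, -4, 1)$ are exactly the values that SINDy recovers above.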

### 3.3 Central Force Problem

We consider a central force problem for one particle with Hamiltonian

$$H(q, p) = \frac{\|p\|^2}{2} + \|q\|^{-1} + (10 - \|q\|)^{-1}. \tag{9}$$

The norm here is the standard Euclidean norm. Using explicit Runge-Kutta integration in Mathematica, we generate a trajectory with a random initial condition. Along this trajectory, we compute $r = \|q\|$ as well as $\dot{r}$, saved at equispaced points. We then search for a reduced-order ($d = 1$) model with Hamiltonian

$$H(r, \dot{r}) = \frac{\dot{r}^2}{2} + \tilde{V}(r), \tag{10}$$

where $\tilde{V}$ is a neural potential. We take the neural potential model to have two hidden layers. We first train using exponential linear unit (ELU) activations. We then initialize the neural network using the learned weights/biases and retrain using softplus activations $\phi(x) = \log(1 + e^x)$; this activation was chosen to enable series expansions of $\hat{V}$, as described in greater detail below. Prior to retraining, we also change the network by adding an exponential function to the output layer; we incorporate this function to better model the steep gradients in the potential near $r = 0$ and $r = 10$. We carry out the training in two stages because training directly with softplus activations and exponential output failed.

Figure 4: For the central force problem (10), after adjusting a constant bias, the neural potential $\hat{V}(r)$ closely matches the effective potential $V_{\text{eff}}(r)$.

For this problem, classical physics gives us an effective potential

$$V_{\text{eff}}(r) = r^{-1} + (10 - r)^{-1} + \frac{\ell^2}{2 r^2}, \tag{11}$$

where the angular momentum $\ell$ is a conserved quantity determined from the initial condition. In Figure 4, we plot both the trained neural potential (in red) and the effective potential (in black). After adjusting for the constant bias term, we find close agreement.
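For completeness, (11) follows from conservation of angular momentum in a central potential. Writing the motion in polar coordinates,

$$\|p\|^2 = \dot{r}^2 + r^2 \dot{\theta}^2, \qquad \ell = r^2 \dot{\theta} = \text{const},$$

so that the Hamiltonian (9) reduces to

$$H = \frac{\dot{r}^2}{2} + \frac{\ell^2}{2 r^2} + r^{-1} + (10 - r)^{-1} = \frac{\dot{r}^2}{2} + V_{\text{eff}}(r).$$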

We then exported the weight and bias matrices to Mathematica, forming the neural potential model

$$\hat{V}(r) = W_3\, \phi\big(W_2\, \phi(W_1 r + b_1) + b_2\big) + b_3. \tag{12}$$

Unlike ELU, the softplus activation is amenable to series expansion via symbolic computation. In particular, since the effective potential (11) is a rational function, we explored Padé expansions of $\hat{V}$. These attempts were unsuccessful in the sense that we did not obtain models of $\hat{V}$ that are any more interpretable than the compositional form of (12).

Turning to SINDy, we formed a library of candidate functions

$$\{1,\ r^{-1},\ r^{-2},\ r^{-3},\ (10 - r)^{-1},\ (10 - r)^{-2},\ (10 - r)^{-3}\}.$$

Adjusting $\lambda$ in the same manner described above, we find that the estimated model is

$$\hat{V}(r) \approx \beta_0 + \beta_1 r^{-1} + \beta_2 r^{-2} + \beta_4 (10 - r)^{-1}. \tag{13}$$

From the form of $V_{\text{eff}}$ given above, the fitted coefficients $\beta_1$ and $\beta_4$ are both close to the ground truth values of $1$, and $\beta_2$ approximates $\ell^2/2$ for the trajectory on which the system was trained. This demonstrates a successful application of SINDy to interpret the neural potential as a rational function; this interpretation of $\hat{V}$ is itself close to $V_{\text{eff}}$.

### 3.4 Charged Particles in Coulomb Potential

We now consider two oppositely charged particles ($d = 6$) subject to the classical Coulomb electrostatic potential. We take the mass matrix $M$ to be the identity, so the kinetic energy is $T(p) = \|p\|^2/2$. If we partition $q = (q_1, q_2)$, where $q_i \in \mathbb{R}^3$ is the position of the $i$-th particle, then the potential is

$$V(q) = -\frac{1}{4\pi} \frac{1}{\|q_1 - q_2\|}. \tag{14}$$

Here we apply the Störmer-Verlet algorithm, a symplectic method, to generate trajectories. Each trajectory starts with random initial conditions drawn from a standard normal distribution. For this problem, our goal is to use the data to recover $V$. We train two different neural potential models with increasing levels of prior domain knowledge:

1. We first set up the neural network's input layer to compute the difference $q_1 - q_2$; the neural potential then transforms this three-dimensional input into a scalar output. Using a subset of the trajectories, we train incrementally: we first train on an initial segment of points from each training trajectory, then continue training on subsequent segments in turn. Once the training loss was observed to be sufficiently small, we halted training.

Figure 5: Here we plot both training (left) and test (right) results for the Coulomb problem (14). For both plots, we have subtracted a constant bias, the maximum value of the neural potential on the data set in question. These results are for a neural potential $\hat{V}$ that is a function of the difference $q_1 - q_2$ between the two charged particles' positions; for each $q$ in the training and test sets, we plot $\hat{V}(q_1 - q_2)$ versus $r = \|q_1 - q_2\|$. We also plot the true potential (14) versus $r$. Both training and test plots show reasonable agreement between the neural potential and the ground truth.

In Figure 5, we plot both training (left) and test (right) results. The training results are plotted using initial segments of the trajectories used for training, while the test results are plotted using initial segments of the held-out trajectories. For both plots, we subtracted the maximum computed value of $\hat{V}$ (on each respective data set). In each plot, we plot $\hat{V}(q_1 - q_2)$, on all points in the training and test sets, versus $r = \|q_1 - q_2\|$.

Overall, we see reasonable agreement between the neural potential and the ground truth. Note that the neural network is essentially tasked with discovering that it should compute the inverse of the norm of $q_1 - q_2$. We suspect that this function may be somewhat difficult to represent using a composition of activation functions and linear transformations as in (12). Despite training for a large number of steps, there is noticeable variation in neural potential values for large $r$.

We now apply SINDy to $\hat{V}$ (on the training set) using a library of inverse powers of $r$, together with an intercept. Adjusting $\lambda$ as described above, we find the approximation

$$\hat{V}(r) \approx \beta_0 + \beta_1 r^{-1} \tag{15}$$

with fitted values of $\beta_0$ and $\beta_1$. For comparison, the ground truth coefficient of $r^{-1}$ is $-1/(4\pi) \approx -0.0796$.

2. We then rearchitect the network to include a layer that takes the input $q$ and computes the norm of the difference, $r = \|q_1 - q_2\|$; the rest of the neural potential is then a scalar function of this scalar input. Compared with the first model, we train this network for fewer steps, at a larger learning rate, and on fewer trajectories.

Figure 6: Here we plot both training (left) and test (right) results for the Coulomb problem (14). For both plots, we have subtracted a constant bias, the maximum value of the neural potential on the data set in question. These results are for a neural potential $\hat{V}$ that is a function of the distance $r = \|q_1 - q_2\|$ between the two charged particles; for each $q$ in the training and test sets, we plot $\hat{V}(r)$ versus $r$. We also plot the true potential (14) versus $r$. Both training and test plots show excellent agreement between the neural potential and the ground truth.

In Figure 6, we plot both training (left) and test (right) results. The training results are plotted using initial segments of the trajectories used for training, while the test results are plotted using a completely different set of held-out trajectories. For both plots, we subtracted the minimum computed value of $\hat{V}$ (on each respective data set). In each plot, we compute $\hat{V}$ on all points in the training and test sets, and then plot these values versus $r$.

To generate an interpretable version of $\hat{V}$ (on the training set), we apply SINDy with the same library of inverse powers of $r$. Adjusting $\lambda$ as described above, we find the approximation

$$\hat{V}(r) \approx \beta_0 + \beta_1 r^{-1} \tag{16}$$

with fitted values of $\beta_0$ and $\beta_1$. This computed value of $\beta_1$ is closer to the ground truth value of $-1/(4\pi)$ than that of the earlier approximation (15).

Incorporating prior knowledge that the potential should depend only on $r$ dramatically improves the quality of the learned potential. Essentially, we have eliminated the need for the neural network to learn the norm function. We outperform the results from Figure 5 using a less complex network, trained for fewer steps at a larger learning rate. Comparing with Figure 5, we see that Figure 6 features reduced variation in $\hat{V}$ for large $r$, and improved test set results as well.
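The Störmer-Verlet scheme used to generate the Coulomb training data can be sketched in its kick-drift-kick form for a separable Hamiltonian with unit masses. This is a generic sketch under our own naming, checked here on a harmonic test potential rather than the Coulomb potential (14) itself:

```python
import numpy as np

def stormer_verlet(q, p, grad_V, h, steps):
    """Symplectic Störmer-Verlet (leapfrog) integration of (2):
    half-kick in p, full drift in q, half-kick in p (M = identity)."""
    qs, ps = [q.copy()], [p.copy()]
    for _ in range(steps):
        p = p - 0.5 * h * grad_V(q)   # half step of (2b)
        q = q + h * p                 # full step of (2a)
        p = p - 0.5 * h * grad_V(q)   # second half step of (2b)
        qs.append(q.copy())
        ps.append(p.copy())
    return np.array(qs), np.array(ps)
```

For the Coulomb problem one would pass in the gradient of (14). As a quick sanity check, on the harmonic oscillator the scheme keeps the energy within $O(h^2)$ of its initial value over long times, which is why a symplectic method is preferred for generating Hamiltonian training data.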

## 4 Conclusion

We conclude that, for the examples we have explored, our approach does lead to accurate potentials that can themselves be approximated closely by interpretable, closed-form algebraic expressions. In ongoing/future work, we plan to apply the techniques described here to high-dimensional systems for which reduced-order (i.e., effective) potentials are unknown. We also seek to extend our method to quantum Hamiltonian systems. While we have focused here on clean data from known models, we are also interested in learning potentials from noisy time series. We expect that by adapting the method of [11], we will be able to simultaneously filter the data and estimate an interpretable neural potential.

## References

•  Artrith, N., Urban, A.: An implementation of artificial neural-network potentials for atomistic materials simulations: Performance for TiO2. Computational Materials Science 114, 135–150 (Mar 2016). https://doi.org/10.1016/j.commatsci.2015.11.047
•  Behler, J.: Perspective: Machine learning potentials for atomistic simulations. The Journal of Chemical Physics 145(17), 170901 (2016). https://doi.org/10.1063/1.4966192
•  Behler, J., Parrinello, M.: Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (Apr 2007). https://doi.org/10.1103/PhysRevLett.98.146401
•  Bhat, H.S., Madushani, R.W.M.A.: Nonparametric adjoint-based inference for stochastic differential equations. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). pp. 798–807 (2016). https://doi.org/10.1109/DSAA.2016.69
•  Brunton, S.L., Proctor, J.L., Kutz, J.N.: Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113(15), 3932–3937 (2016)
•  Dale, R., Bhat, H.S.: Equations of mind: Data science for inferring nonlinear dynamics of socio-cognitive systems. Cognitive Systems Research 52, 275–290 (2018)
•  Dale, R., Bhat, H.S.: sindyr: Sparse Identification of Nonlinear Dynamics (2018), https://CRAN.R-project.org/package=sindyr, R package version 0.2.1
•  Duncker, L., Bohner, G., Boussard, J., Sahani, M.: Learning interpretable continuous-time models of latent stochastic dynamical systems. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 1726–1734. PMLR, Long Beach, California, USA (09–15 Jun 2019), http://proceedings.mlr.press/v97/duncker19a.html
•  Hansen, K., Biegler, F., Ramakrishnan, R., Pronobis, W., von Lilienfeld, O.A., Muller, K.R., Tkatchenko, A.: Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space. The Journal of Physical Chemistry Letters 6(12), 2326–2331 (2015). https://doi.org/10.1021/acs.jpclett.5b00831
•  Ramakrishnan, R., Hartmann, M., Tapavicza, E., von Lilienfeld, O.A.: Electronic spectra from TDDFT and machine learning in chemical space. The Journal of Chemical Physics 143(8), 084111 (2015). https://doi.org/10.1063/1.4928757
•  Raziperchikolaei, R., Bhat, H.S.: A block coordinate descent proximal method for simultaneous filtering and parameter estimation. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 5380–5388. PMLR, Long Beach, California, USA (09–15 Jun 2019), http://proceedings.mlr.press/v97/raziperchikolaei19a.html
•  Sahoo, S., Lampert, C., Martius, G.: Learning equations for extrapolation and control. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4442–4450. PMLR, Stockholmsmässan, Stockholm Sweden (10–15 Jul 2018), http://proceedings.mlr.press/v80/sahoo18a.html
•  Zhang, L., Schaeffer, H.: On the Convergence of the SINDy Algorithm. arXiv e-prints arXiv:1805.06445 (May 2018)