 # Simulating Molecular Dynamics with Large Timesteps using Recurrent Neural Networks

Molecular dynamics simulations rely on numerical integrators such as Verlet to solve the Newton's equations of motion. Using a sufficiently small timestep to avoid discretization errors, Verlet integrators generate a trajectory of particle positions as solutions to the equations of motions. We introduce an integrator based on recurrent neural networks that is trained on trajectories generated using Verlet integrator and learns to propagate the dynamics of particles with timestep up to 4000 times larger compared to the Verlet timestep. We demonstrate significant net speedup of up to 32000 for few-particle (1 - 16) 3D systems and over a variety of force fields.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Supporting Information

### i.1 Numerical Integration

In MD simulations, one needs to follow the motion of particles interacting with potential energy . Using , the forces on particles can be derived via , and the motion of particles gets determined via Newton’s equation ( is the gradient operator). There are many numerical integration methods that can integrate these equations. Among these, the velocity Verlet integrator is a very popular choice and we have used this integrator to generate the ground truth for the experiments described in this paper. The velocity Verlet integrator  is given by the following equations:

 →x(t+Δ)=→x(t)+Δ→v(t)+Δ22m→f(t), (4)
 →v(t+Δ)=→v(t)+Δ2m(→f(t)+→f(t+Δ)). (5)

updates the current position at time to the future position after timestep  using the current velocity and the force at time . Here we have adopted the notation for a single particle of mass ; generalization to many particles is straightforward. Similarly, the integrator updates the current velocity at time to the future velocity after timestep  using the force at time and the force at time . Note that after the position update in Equation 4, the forces will need to be re-evaulated to update the velocities in Equation 5. With as the current position and as the current velocity, the propagation is continued further.

Equations 4 and 5 suggest that the velocity Verlet integrator requires the current position () and velocity coordinates () in order to evolve the dynamics forward by . Further, the integrator also uses the force term and mass to process the update. It is possible to infer using the information encoded in the stream of position data and velocity data such that the time evolution can be done with just a sequence of positions and velocities as input. Similar to ordinary verlet, we illustrate this with a 1-dimensional example of a particle experiencing simple harmonic motion governed by the force . It can be shown that the particle position can be accurately evolved to via the function that uses a sequence of 2 positions, 2 velocities and (the information about and is hidden in the input sequence). Similarly, It can be shown that the particle velocity can be accurately evolved to via the function that uses a sequence of 3 positions and 2 velocities. This idea can be extended to a general system where one could simulate the dynamics (time evolution) via the operator function : that takes a sequence of positions, velocities, and to propagate the particles forward in time.

### i.2 LSTM Networks

There are several architectures of LSTM units. A common architecture is composed of a cell (the memory part of the LSTM unit) and three “regulators”, usually called gates, that regulate the flow of information inside the LSTM unit. An input gate () controls how much new information is added from the present input () and past hidden state () to our present cell state (). A forget gate () decides what is removed or retained and carried forward to the current cell state () from the previous cell state (). An output gate () decides what to output as the current hidden state () from the current cell state (). The math formulation of LSTM is given as:

 ft =σg(Wfxt+Ufht−1+bf) it =σg(Wixt+Uiht−1+bi) ot =σg(Woxt+Uoht−1+bo) ~ct =σh(Wcxt+Ucht−1+bc) ct =ft∘ct−1+it∘~ct ht =ot∘σh(ct). (6)

Here, is the input vector to the LSTM unit, is the forget gate’s activation vector, is the input gate’s activation vector, is the output gate’s activation vector, is the hidden state vector also known as the output vector of the LSTM unit, is the cell state vector, and is the Hadamard product operator. , and

are the weight matrices and bias vector parameters which need to be learned during training.

and

represent sigmoid function and hyperbolic tangent functions respectively. The superscripts

and refer to the number of input features and the number of hidden units respectively.

### i.3 Data Generation, Preparation, and Preprocessing

Prior domain experience and backward elimination using the adjusted R squared is used for selecting the significant input parameters for creating the training data set. This process determines which inputs are important to change the desired output. The process begins with all possible inputs and eliminates them one by one, monitoring the adjusted R squared value to determine which are important. Dropping of an input that produces a sharp spike in the R squared value is indicative of the high importance of the input parameter. This process is used to select the inputs parameters for the datasets described under this section. Unlike traditional deep neural networks where you would map the physical inputs to outputs generally distinct from inputs, the inputs and outputs for the RNN approach (integrator ) are both trajectory data as shown in Figure 6. Traditional MD simulation ran with the velocity Verlet algorithm will have multi feature time series data (e.g., position, velocity vectors) separated by in time. To train the integrator , we prepare the inputs and target data as explained in Figure 6. Figure 6: Data processing for training and testing of integrator R

Since integrator will operate with different timestep (e.g., ), we filter the time series data with a filtering factor defined as to make the data separated with . Next, we do windowing through the time series data with the window length of (the RNN model sequence length), and frame overlap of to generate a sequence of length as input and the future configuration as the output (target). We do this repeatedly until the time series is ended and then move to the next choice for and a new associated filtering factor. We do the same process for all simulation data and append all the input sequences and target vectors as shown in Figure 6. In the end, all the input sequences are expected to form a 3D matrix (number of samples feature size () ) and the associated output vectors form a 2D matrix (number of samples ). A min-max normalization filter is applied to normalize the input data and randomly shuffle them (along the axis of the number of samples) to generate the data for training and testing . The entire data set is separated into training and testing sets using a ratio of 0.8:0.2. Figure 7: Potential energies associated with the 1D experiments. Solid, dotted, dashed, and dash-dotted lines represent SHO, LJ, double well, and rugged potential respectively.

We now describe the datasets associated with systems characterized by different interaction potential energies and physical attributes used in the experiments for training and testing . The potential energy functions are shown in Figure 7. For the 1D systems, the timestep (for velocity Verlet integrator) and the total time were chosen to be and . For the 3D many particle systems, and .

#### Simple harmonic oscillator (SHO)

For this system, the potential energy is given by

 U=12kx2, (7)

and the associated force is

 F=−kx. (8)

The dataset is generated by varying three input parameters: mass of the particle , spring constant , and initial position of the particle . The parameter sweep generated a data set of 500 simulations, each having 50,000 position and velocity values.

#### 1D Lennard-Jones (LJ) system

For this system, the potential energy is given by

 U(x)=4ϵ((σx)12−(σx)6), (9)

and the associated force is

 F=48ϵx((σx)12−0.5(σx)6). (10)

The dataset is generated by varying two input parameters: mass of the particle and the initial position of the particle . The parameter sweep generated a data set of 500 simulations, each having 50,000 position and velocity values.

#### Particle in a double well (DW)

For this system, the potential energy is given by

 U=14x4−12x2, (11)

and the associated force is

 F=x−x3. (12)

The dataset is generated by varying two input parameters: mass of the particle and the initial position of the particle . The parameter sweep generated a data set of 500 simulations, each having 50,000 position and velocity values.

#### Particle in a rugged potential

For this system, the potential energy is given by

 U(x)=x4−x3−16x2+4x+4850+sin(30(x+5))5, (13)

and the associated force is

 F=−4x3+3x2+32x−450−6cos(30(x+5)). (14)

The dataset is generated by varying two input parameters: mass of the particle and the initial position of the particle . The parameter sweep generated a data set of 640 simulations, each having 64,000 position and velocity values.

#### Many particles interacting with LJ potential

For this 3D system, the interaction potential energy between any two particles is given by the LJ potential:

 U(r)=4ϵ((σr)12−(σr)6), (15)

and the associated force is

 →F=48ϵr2((σr)12−0.5(σr)6)→r. (16)

We prepared two different types of simulation boxes to generate the datasets: cubic box with periodic boundary conditions (PBC) and spherical box with reflective boundary. In each box, we performed simulations with and particles. Each of these simulations were created as a separated dataset that yielded 6 different datasets to train and test . The dataset is generated by varying two input parameters: mass of the particle and the initial position of the particles with each Cartesian coordinate chosen between and . Initial velocities were chosen to be zero. The parameter sweep generated a dataset of 5000 simulations for each of the aforementioned cases (or 30,000 simulations in total).

### i.4 Feature Extraction and Regression

In the interest of brevity, we only present the details of feature extraction and regression associated with the integrator implemented for the most complex case of 16 particles interacting with LJ potential in 3D with PBC. An RNN with two LSTM layers and a dense layer (Figure 1) is implemented in TensorFlow for regression of the prediction for the trajectories of 16 particles (position and velocity vectors). Outputs of the LSTM layers are wrapped with the function. No wrapping functions are used in the final dense layer of the model as integrator was trained for regression. The weights and biases of all the layers are trained with an error backpropagation algorithm, implemented via a stochastic gradient descent procedure. The size of the hidden layers was chosen depending on the model complexity and data dimensions. By performing a grid search, hyper-parameters such as the number of units per each of the two layers (, ) and final dense layer (), batch size (

), and the number of epochs are optimized to 32, 32, 96, 256, and 2500 respectively. Here we have not done a thorough hyper-parameter optimization but have identified reasonable values for key choices – the architecture and the size of the network, and the length of time series.

takes a dimensional vector as input, where and refer to batch and feature size respectively. For the 3D system considered above, . Figure 8: Dynamics of a particle in a 1D LJ potential (m=1,x0=2.5) captured by tracking the position x vs. time t (left column), phase space plot of velocity v vs. x (center column), and the deviation in the total energy (right column). Results are shown for the dynamics produced by V and R. Top row is the dynamics from t=0 to t=13, and bottom row is the dynamics from t=987 to t=1000. Black lines are the ground truth results obtained using V with Δ=0.001. Black color circles, squares and triangles represent the results obtained using V with timestep 10Δ,50Δ, and 100Δ. Blue circles, orange squares, green triangles and red pentagons represent the results obtained using R with ΔR=100Δ,400Δ,1000Δ, and 4000Δ respectively.

Adam optimizer is used to optimize the error backpropagation. The learning rate of Adam optimizer and the dropout rate in the dropout layer is set to 0.0005 and 0.15 respectively to prevent overfitting. Both learning and dropout rates were selected using a trial-and-error process. The weights in the hidden layers and in the output layer are initialized for better convergence using a Xavier normal distribution at the beginning. Xavier normal distribution is characterized with

mean and variance, where and are input and output sizes of the hidden layers, respectively Glorot and Bengio (2010)

. The L2 error (mean square loss) between target and predicted trajectories is used for error calculation. LSTM implementation, training, and testing are programmed using scikit-learn, Keras, and TensorFlow ML libraries

Chollet et al. (2015); Buitinck et al. (2013); Abadi et al. (2016). Scikit-learn is used for grid search and scaling, Keras is used to save and load models, and TensorFlow is used to create and train the integrator .

### i.5 Additional Experiments and Empirical Results

#### Phase diagram for LJ

This experiment focuses on testing  to predict the time evolution of the 1D LJ system (with particle of mass and initial position ). Training dataset comprising of trajectories (positions and velocities) up to simulated using  with are used to train . Figure 8 shows the period (top row) from to and the period (bottom row) from to of the simulation. The period is towards the end of the simulation. For either time periods characterizing the oscillations, the positions predicted by  remain close to the numerically exact result with . On the other hand, the trajectory simulated using  shows errors rising with for timesteps 10, 50, 100 (all ). Similarly, velocities produced by  exhibit very small deviations from the ground truth result as evidenced by the phase space plots ( vs. ) in Figure 8. On the other hand, the phase space plot generated using trajectories evolved with  deviated from the ground truth. Finally, the total energy produced by  tracks the ground truth result for up to 1000 and , in stark contrast with results using  as the integrator. Figure 9: Average error δr (log scale) in position updates (trajectory) vs. time t incurred in a simulation of a N=16 particle system with LJ interaction potentials (ϵ=2 and m=1) in 3D. Initial positions are randomly selected and velocities are initialized to zero. All results are compared with the ground truth result obtained using V with Δ=0.001. Left plot shows δr for the time evolution up to t=1 million (log scale) for V using different dt=10Δ,40Δ,100Δ values. Right plot shows δr for the same time evolution using R with different ΔR=100Δ, 400Δ, 1000Δ, 4000Δ values.

#### Many particle experiments

Figure 9 shows the total error associated with the positions of many particles as a function of time for evolution driven by  and . The errors are computed using:

 δr=1NN∑i=1∣∣→ri,GT−→ri,R∣∣ (17)

where the formula is shown for . The simulated system is particles interacting in 3D with LJ potential under PBC. The ground truth (GT) results for the trajectories are generated using  with . Note that the training set consisted of dynamics up to and we intentionally kept the random initial configuration tested here outside the training and testing datasets. We find that the trajectory error associated with the positions predicted by  for  to  is up to . On the other hand, for the system evolved using  with timestep increases to at around , and for rises to very large values , rendering the time evolution of the system meaningless.