^{†}

^{†}footnotetext: Pritzker School of Molecular Engineering, University of Chicago, Chicago, USA. E-mail: andrewferguson@uchicago.edu

^{†}

^{†}footnotetext: Department of Physics, University of Illinois at Urbana-Champaign, Urbana, USA.

## 1 Introduction

Molecular dynamics (MD) simulates the dynamical evolution of molecular systems by numerically integrating the classical equations of motion ^{frenkel2002understanding}. Modern computer hardware ^{stone2010gpu, Shaw2014} and efficient and scalable simulation algorithms ^{phillips2005scalable, chow2008desmond, glaser2015strong, plimpton1993fast} have enabled the simulation of billion ^{abraham2002work, abraham2002brittle} and trillion-atom systems ^{tchipev2019twetris}. Advancing the barrier in time scale has proven far more challenging. Stability of the numerical integration requires time steps on the order of femtoseconds commensurate with the fastest atomic motions ^{elber2016perspective}, which limits simulations to microseconds on commodity processors ^{elber2016perspective} and milliseconds on special purpose hardware ^{Shaw2014}. Enhanced sampling techniques apply accelerating biases and analytical corrections to recover thermodynamic averages ^{torrie1977nonphysical, mcdonald1967machine, abrams2014enhanced, miao2016unconstrained, sidky2020machine} but – except in special cases and the limit of small bias ^{chodera2011dynamical, donati2018girsanov} – no analogous approaches exist to recover unbiased dynamical trajectories from biased simulations.

The MD algorithm propagates a molecular configuration at time to via transition densities ^{noe2018machine, Fernandez2020}

. Assuming ergodicity, the probability density over microstates converges to the stationary distribution as

. Breaking the time scale barrier requires a surrogate model for that can be more efficiently evaluated and with larger time steps than MD. Accurately approximating this propagator in the high-dimensional -atom configurational space is intractable. In general, for sufficiently large there is an emergent low-dimensional simplicity that admits accurate modeling of the dynamics by a low-dimensional propagator within a latent space . The relation between MD and latent space dynamics can be represented as^{noe2018machine, Fernandez2020},

(1) | ||||||

This scheme defines three learning problems ^{noe2018machine}: (i) encoding of molecular configurations to the latent space , (ii) propagation of the latent space dynamics according to transition densities , and (iii) decoding (or generating) of molecular configurations from the latent space ^{noe2018machine}.

Markov state models (MSM) ^{husic2018markov, pande2010everything, Prinz2011, bowman2013introduction, Sidky2019, Wehmeyer2019, Mardt2018, wu2017variational} and the equation-free approach of of Kevrekidis and co-workers ^{kevrekidis2003equation, kevrekidis2004equation, kevrekidis2009equation, Mori1965, Zwanzig1973, zwanzig2001nonequilibrium, risken2012fokker} respectively employ configurational and dynamical coarse graining to parameterize low-dimensional propagators, but both methods lack molecular decoders.
Recently, numerous deep learning approaches have been proposed to learn , , and

from MD trajectories, including time-lagged autoencoders

^{Wehmeyer2018}, time-lagged variational autoencoders

^{Hernandez2017}, and time-lagged autoencoders with propagators

^{Lusch2018}. Training these networks requires a time-lagged reconstruction term within the loss, which can cause the network to fail to approximate the true slow modes

^{Chen2019}. Further, time-lagged autoencoders and time-lagged variational autoencoders do not learn valid propagators

^{noe2018machine}, and the inherent stochasticity of MD appears to frustrate learning of the propagator and decoder in time-lagged autoencoders with propagators

^{noe2018machine}. Deep generative MSMs (DeepGenMSM) simultaneously learn a fuzzy encoding to metastable states and “landing probabilities” to decode molecular configurations

^{Wu2018}. The method computes a proper propagator and generatively decodes novel molecular structures, but – as with all MSM-based approaches – it configurationally discretizes the latent space and relies on the definition of long-lived metastable states.

In this work, we propose molecular latent space simulators (LSS) as a means to train kinetic models over limited MD simulation data that are capable of producing novel all-atom molecular trajectories at orders of magnitude lower cost. The LSS can be conceived as means to augment conventional MD by distilling a kinetic model from training data, efficiently generating continuous all-atom trajectories, and computing high-accuracy estimates of any all-atom structural, thermodynamic, or kinetic observable.

The LSS is based on three deep learning networks independently trained to (i) learn an encoding into a latent space of slow variables using state-free reversible VAMPnets (SRV) ^{Chen2019a}, (ii) learn a propagator to evolve the system dynamics within this latent space using mixture density networks (MDN) ^{Bishop1994, bishop2006pattern}, and (iii) learn a decoding from the latent space to molecular configurations using a conditional Wasserstein generative adversarial network (cWGAN) ^{Gulrajani2017}. Separation of the learning problems in this manner makes training and deployment of the LSS modular and simple. The stochastic nature of the MDN propagator means that the trained kinetic model generates novel trajectories and does not simply recapitulate copies of the training data. The approach is distinguished from MSM-based approaches in that it requires no discretization into metastable states ^{Fernandez2020, Wu2018}. The continuous formulation of the propagator in the slow latent space shares commonalities with the equation-free approach ^{kevrekidis2003equation, kevrekidis2009equation}, but we eschew parameterizing a stochastic differential equation in favor of a simple and efficient deep learning approach, and also equip our simulator with a generative molecular decoder.

## 2 Methods

A schematic diagram of the LSS and the three deep networks of which it is comprised is presented in Fig. 1. We describe each of the three component networks in turn and then describe LSS training and deployment.

### 2.1 Encoder: State-free Reversible VAMPnets

The transfer operator at a lag time

is the propagator of probability distributions over microstates with respect to the equilibrium density

under transition densities^{koltai2018optimal, klus2018data}. For sufficiently large the dynamics may be approximated as Markovian so is time homogenous,

(2) |

In equilibrium systems obeying detailed balance , is identical to the Koopman operator, self-adjoint with respect to , and possesses a complete orthonormal set of eigenfunctions

with real eigenvalues

^{Noe2013, Nuske2014, klus2018data, Chen2019a, wu2020variational},

(3) |

The pair (=,=) corresponds to the equilibrium distribution at and the remainder to a hierarchy of increasingly quicker relaxing processes with implied time scales ^{Chen2019a}. The evolution of after applications of is expressed in this basis as,

(4) |

The variational approach to conformational dynamics (VAC) defines a variational principle to approximate these eigenfunctions as within a basis by solving for optimal expansion coefficients ^{Noe2013, Nuske2014, Chen2019a}. SRVs ^{Chen2019a} – themselves based on VAMPnets, a deep learning-based method for MSM construction ^{Mardt2018}, and closely related to extended dynamic mode decomposition with dictionary learning ^{li2017extended} – employ deep canonical correlation analysis (DCCA) ^{Andrew2013} to learn both the optimal expansion coefficients and

optimal basis functions as nonlinear transformations of the (featurized) molecular coordinates. This is achieved by training twin-lobed deep neural networks to minimize a VAMP-r loss function

^{Mardt2018}. SRVs trained over MD trajectories furnish an encoding (Eqn. 1) into a -dimensional latent space spanned by , where is determined by a gap in the eigenvalue spectrum. This spectral encoding into the leading modes of neglects fast processes with implied timescales (Eqn. 4) and is an optimal parameterization of the system for a low-dimensional long-time propagator

^{noe2018machine}.

### 2.2 Propagator: Mixture Density Networks

At sufficiently large the latent space = supports an autonomous dynamical system in the leading modes of . We train MDNs to learn transition densities from MD trajectories projected in the latent space. MDNs combine deep neural networks with mixture density models to overcome poor performance of standard networks in learning multimodal distributions ^{Bishop1994, bishop2006pattern}. Transition densities are approximated as a linear combination of kernels,

(5) |

where we choose to be -dimensional Gaussians. The -dependent Gaussian means

, and linear mixing coefficients are learned by a deep feedforward neural network that minimizes the loss function , where indexes pairs of time-lagged training data observations. The normalization =1 is enforced by softmax activations and the bounded using sigmoid activations.The trained MDN defines the latent space propagator (Eqn. 1) and we sample transition densities to advance the system in time (Fig. 1). Propagation is conducted entirely within the latent space and does not require recurrent decoding and encoding to the molecular representations that can lead to accumulation of errors and numerical instability ^{noe2018machine, Pathak2018}. The transition densities are learned from the statistics transitions in the training data and new trajectories are generated by sampling from these transition densities. These new trajectories therefore represent novel dynamical pathways over the latent space and are not simply recapitulations or approximate copies of those in the training data. Successful MDN training is contingent on the low-dimensional and Markovian nature of the latent space dynamics at large discovered by the SRVs.

### 2.3 Decoder: Conditional Wasserstein GAN

Generative adversarial networks are a leading neural network architecture for generative modeling ^{Goodfellow2014}. We employ a cWGAN ^{Arjovsky2017, Gulrajani2017} to decode from the latent space to molecular configurations by performing adversarial training between a generator that outputs molecular configurations from inputs and a critic that evaluates the quality of a molecular configuration . The networks are jointly trained to minimize a loss function based on the Wasserstein (Earth Mover’s) distance,

(6) |

where is the distribution over molecular configurations observed in the MD training trajectory and is a family of -Lipschitz functions enforced through a gradient penalty ^{Arjovsky2017, Gulrajani2017}. To generate molecular configurations consistent with particular states in the latent space we pass as a conditioning variable to and ^{Mirza2014} and drive the generator with -dimensional Gaussian noise . The noise enables to generate multiple molecular configurations consistent with each latent space location. We train the cWGAN over pairs by encoding each frame of the MD training trajectory into the latent space using the SRV. The trained cWGAN decoder (Eqn. 1) generates molecular configurations from the latent space trajectory produced by the propagator (Fig. 1).

## 3 Results and Discussion

### 3.1 4-well potential

We validate the LSS in an application to a 1D four-well potential ^{Prinz2011} for which analytical solutions are available. In this simple 1D system we construct the propagator directly in , so encoding and decoding are unnecessary and this test validates that the MDN can learn transition densities to accurately reproduce the system thermodynamics and kinetics. We generate a time step Brownian dynamics trajectory in a dimensionless gauge with diffusivity ==1000 and a time step =0.001 ^{beauchamp2011msmbuilder2}. A MDN was trained using Adam ^{kingma2014adam} with early stopping over the scaled trajectory at a lag time of =100, with

=8 Gaussian kernels, and two hidden layers of 100 neurons with ReLU activations

^{goodfellow2016deep}. The trained MDN was used to generate a step trajectory of the same length as the training data. Analytical transition densities were computed by partitioning the domain into 100 equal bins and defining the probability of moving from bin to bin as for , where is the potential at the center of bin and normalizes the total transition probability of bin

^{Chen2019a}.

The Brownian dynamics and synthetic LSS stationary distributions are in quantitative agreement with the analytical solution for the stationary density (Fig. 2a) and show very similar kinetic behaviors in their transitions between the four metastable wells (Fig. 2b). This agreement is due to the excellent correspondence between the analytical and learned transition densities (Fig. 2c,d).

### 3.2 Trp-cage miniprotein

We train our LSS over the 208 s all-atom simulation of the 20-residue TC10b K8A mutant of the Trp-cage mini-protein performed by D.E. Shaw Research ^{Lindorff-Larsen2011}. Generation of these MD trajectories would require 2.5 days (2 million CPU-h) on the special purpose Anton-2 supercomputer or 6 months on a commodity GPU card ^{Shaw2014}.

The SRV encoder was trained over a featurization the trajectory employing backbone and sidechain torsions and C pairwise distances as informative and roto-translationally invariant descriptors ^{Sidky2019}. We trained a SRV with two hidden layers with 100 neurons,

activations, and batch normalization using Adam

^{kingma2014adam}with a batch size of 200,000, learning rate of 0.01, and early stopping based on the validation VAMP-2 score

^{Chen2019a, Sidky2019}. A lag-time of =20 ns was chosen based on convergence of the transfer operator eigenvalues, and a =3-dimensional latent space encoding based on a gap in the eigenvalue spectrum. The MDN propagator was trained over the latent space projection of the MD trajectory at a lag time of =20 ns using Adam

^{kingma2014adam}with early stopping, =24 Gaussian kernels, and two hidden layers of 100 neurons with ReLU activations. The cWGAN decoder comprised a generator and critic with three hidden layers of 200 neurons with Swish

^{ramachandran2017swish}activations and a

=50-dimensional noise vector. The training loss stabilized after 52 epochs. The cWGAN is trained to generate the Trp-cage C

backbone by roto-translationally aligning MD training configurations to a reference structure. Training of the full LSS pipeline required 1 GPU-h on a NVIDIA GeForce GTX 1080 Ti GPU core.The trained LSS was used to produce 100208 s synthetic trajectories each requiring 5 s on a single NVIDIA GeForce GTX 1080 Ti GPU core. The LSS trajectories comprise the same total number of frames as the 208 s all-atom trajectory but contain 1070 folding/unfolding transitions compared to just 12 in the training data and were generated at six orders of magnitude lower cost. This observation illuminates the crux of the value of the approach: the LSS learns a kinetic model over limited MD training data and is then used to generate vastly longer novel all-atom trajectories that enable the observation of states and events that are only sparsely sampled in the training data. We now validate the thermodynamic, structural, and kinetic predictions of the LSS.

Thermodynamics. The free energy profiles projected into the slowest latent space coordinate = show excellent correspondence between the MD and LSS (Fig. 3). The free energy of the folded (0) and unfolded (0.9) basins and transition barrier are in quantitative agreement with a root mean squared error between the aligned profiles of 0.91 . The LSS profiles contain 10-fold lower statistical uncertainties than the MD over the same number of frames due to the 100-fold longer LSS data set enabled by their exceedingly low computational cost.

Structures. The MD and LSS molecular structures within the folded basin () and metastable transition state () possess a relative -RMSD of 0.29 nm and 0.37 nm, respectively. Relative to the Trp-cage native state (PDB ID: 2JOF), the MD and LSS folded configurations possess a

-RMSD of 0.20 nm and 0.28 nm, respectively. The mean and standard deviation of the time-averaged radii of gyration (

) for the MD (0.870.16) nm and LSS (0.870.13) nm trajectories are indistinguishable with standard errors computed by five-fold block averaging. These results demonstrate that the LSS molecular structures are in excellent accord with MD.Kinetics. We compare the MD and LSS kinetics through the autocorrelation times corresponding to the relaxation time scales associated with the =3 leading kinetic processes. All three time scales are in excellent agreement and again the LSS uncertainties are approximately 10-fold lower than the MD (Table 1).

Timescale | MD (s) | LSS (s) |
---|---|---|

3.00 0.61 | 2.89 0.12 | |

0.54 0.37 | 0.43 0.04 | |

0.45 0.12 | 0.42 0.01 |

We then employ time-lagged independent component analysis (TICA)

^{perez2013identification, noe2013variational, nuske2014variational, noe2015kinetic, noe2016commute, perez2016hierarchical, schwantes2013improvements, klus2018data}to determine whether the LSS trajectory possesses the same slow (linear) subspace as the MD. We featurize the trajectories with pairwise C distances and perform TICA at a lag time of =20 ns. Projection of the free energy surfaces into the leading three MD TICA coordinates show that the leading kinetic variance in the MD data is quite accurately reproduced by the LSS (Fig. 4). The only substantive disagreement is absence in the LSS projection of a small high-free energy metastable state at (TIC1 , TIC3 ) corresponding to configurations with Pro18 dihedral angles . These configurations are only transiently occupied due to rare Pro18 dihedral flips that occur only twice during the 208 s MD trajectory and are not contained in the =3-dimensional latent space.

## 4 Conclusions

We have presented LSS as a method to learn efficient kinetic models by training three state-of-the-art deep learning networks over MD training data and then using the trained model to generate novel all-atom trajectories at six orders of magnitude lower cost. The spirit of the approach is similar to MSM-based and equation-free approaches that use limited MD training data to parameterize highly-efficient kinetic models that can then be used to generate dynamical trajectories over vastly longer time scales than are possible with conventional MD. In contrast to these approaches, the absence of any discretization of the configurational space and provisioning with a molecular decoder enables the LSS to produce continuous all-atom molecular trajectories. Importantly, the probabilistic and generative nature of the approach means that the generated molecular trajectories are novel and not simply a reproduction of the training data, and the statistics of these trajectories accurately reproduce the structural, thermodynamic and kinetic properties of the molecular system.

The dramatic reduction in the cost of trajectory generation opens a host of valuable possibilities: vastly improved sampling of configurational space and dynamical transitions enable estimates of thermodynamic averages and kinetic rates with exceedingly low statistical uncertainties; parameterization of kinetic models with modest training data enable the production of ultra-long trajectories on commodity computing hardware; representation of the kinetic model as the parameters of a trio of deep networks enables efficient sharing of a “simulator in a box” that can then be used for rapid on-demand trajectory generation. The properties of the trained kinetic model – the dimensionality of the slow latent space, the structural correspondence of the slow modes, and the transition probabilities of the propagator – also provide fundamental insight and understanding of the physical properties of the molecular system.

As with all data-driven approaches, the primary deficiency of the LSS approach is that the resulting kinetic models are not necessarily transferable to other conditions or systems and are subject to systematic errors due to approximations in the molecular force fields and incomplete sampling of the relevant configurational space in the training data. The latter issue means that although the generated LSS trajectories are – similar to MSM-based and equation-free approaches – largely interpolative. The stochastic nature of the MDN propagator and generative nature of the cWGAN generator means that we may anticipate local extrapolations beyond the exact training configurations

^{Wu2018}. There is no expectation, however, that the trained model will discover new metastable states or kinetic transitions, and certainly not do so with the correct thermodynamic weights or dynamical time scales.

The present work has demonstrated LSS in a data-rich training regime where the MD training data comprehensively samples configurational space. The next step is to establish an adaptive sampling paradigm – similar to that in MSM construction ^{pande2010everything} and some enhanced sampling techniques ^{chen2018collective, chiavazzo2017intrinsic, preto2014fast, zheng2013rapid} – to enable its application in a data-poor regime. The adaptive sampling approach interrogates the kinetic model to identify under-sampled states and transitions that contribute most to uncertainties in the model predictions (i.e., “known unknowns”) and initializes new MD simulations to collect additional training data in these regions. This interleaving of MD training data collection and model retraining can dramatically reduce the required quantity of training data ^{pande2010everything}. Moreover, new simulations initialized in under-sampled regions may also occasionally be expected to transition into new configurational states not present in the initial training data (i.e., “unknown unknowns”) ^{preto2014fast}. Iterating this process until convergence can expand the range of the trained kinetic model to encompass the relevant configurational space and minimize the cost of training data collection.

Finally, we also envisage applications of the LSS approach to other fields of dynamical modeling where stiff or multi-scale systems of ordinary or partial differential equations, or the presence of activated processes or rare events introduces a separation of time scales between the integration time step and events of interest. For example, there may be profitable adaptations of the approach for dynamical modeling in such fields as cosmology, ecology, immunology, epidemiology, and climatology.

## Conflicts of interest

A.L.F. is a consultant of Evozyne and a co-author of US Provisional Patents 62/853,919 and 62/900,420 and International Patent Application PCT/US2020/035206.

## Acknowledgements

This work was supported by MICCoM (Midwest Center for Computational Materials), as part of the Computational Materials Science Program funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division and the National Science Foundation under Grant No. CHE-1841805. H.S. acknowledges support from the National Science Foundation Molecular Software Sciences Institute (MolSSI) Software Fellows program (Grant No. ACI-1547580) ^{krylov2018perspective, wilkins2018nsf}. We are grateful to D.E. Shaw Research for sharing the Trp-cage simulation trajectories.

Comments

There are no comments yet.