Machine learning for molecular simulation

by   Frank Noé, et al.

Machine learning (ML) is transforming all areas of science. The complex and time-consuming calculations in molecular simulations are particularly suitable for a machine learning revolution and have already been profoundly impacted by the application of existing ML methods. Here we review recent ML methods for molecular simulation, with particular focus on (deep) neural networks for the prediction of quantum-mechanical energies and forces, coarse-grained molecular dynamics, the extraction of free energy surfaces and kinetics and generative network approaches to sample molecular equilibrium structures and compute thermodynamics. To explain these methods and illustrate open methodological problems, we review some important principles of molecular physics and describe how they can be incorporated into machine learning structures. Finally, we identify and describe a list of open challenges for the interface between ML and molecular simulation.



There are no comments yet.


page 4

page 10

page 13

page 14

page 15

page 16

page 18


Machine learning enables long time scale molecular photodynamics simulations

Photo-induced processes are fundamental in nature, but accurate simulati...

Machine Learning for Molecular Dynamics on Long Timescales

Molecular Dynamics (MD) simulation is widely used to analyze the propert...

MoleculeNet: A Benchmark for Molecular Machine Learning

Molecular machine learning has been maturing rapidly over the last few y...

Deep Spatial Learning with Molecular Vibration

Machine learning over-fitting caused by data scarcity greatly limits the...

Super-resolution in Molecular Dynamics Trajectory Reconstruction with Bi-Directional Neural Networks

Molecular dynamics simulations are a cornerstone in science, allowing to...

Adversarial-residual-coarse-graining: Applying machine learning theory to systematic molecular coarse-graining

We utilize connections between molecular coarse-graining (CG) approaches...

A Statistical Perspective on the Challenges in Molecular Microbial Biology

High throughput sequencing (HTS)-based technology enables identifying an...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In 1929 Paul Dirac stated that:

“The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble. It therefore becomes desirable that approximate practical methods of applying quantum mechanics should be developed, which can lead to an explanation of the main features of complex atomic systems without too much computation.” [1]

Ninety years later, this quote is still state of the art. However, in the last decade, new tools from the rapidly developing field of machine learning (ML) have started to make significant impact on the development of approximate methods for complex atomic systems, bypassing the direct solution of “equations much too complicated to be soluble”.

ML aims at extracting complex patterns and relationships from large data sets, to predict specific properties of the data. A classical application of machine learning is to the problem of image classification where descriptive labels need to be associated to images that are presented in terms of sets of pixels. The “machine” is trained on a large number of examples and “learns” how to classify new images. The underlying idea is that there exists a complex relationship between the input (the pixels) and the output (the labels) that is unknown in its explicit form but can be inferred by a suitable algorithm. Clearly, such an operating principle can be very useful in the description of atomic and molecular systems as well. We know that atomistic configurations dictate the chemical properties, and the machine can learn to associate the latter to the former without solving first principle equations, if presented with enough examples.

Although different machine learning tools are available and have been applied to molecular simulation (e.g., kernel methods [2]

), here we mostly focus on the use of neural networks, now often synonymously used with the term “deep learning”. We assume the reader has basic knowledge of machine learning and we refer to the literature for an introduction to statistical learning theory

[3, 4] and deep learning [5, 6].

One of the first applications of machine learning in Chemistry has been to extract classical potential energy surfaces from quantum mechanical (QM) calculations, in order to efficiently perform molecular dynamics (MD) simulations that can incorporate quantum effects. The seminal work of Behler and Parrinello in this direction [7] has opened the way to a now rapidly advancing area of research [8, 9, 10, 11, 12, 13, 14, 15]. In addition to atomistic force fields, it has been recently shown that, in the same spirit, effective molecular models at resolution coarser than atomistic can be designed by ML [16, 17, 18]. Analysis and simulation of MD trajectories has also been affected by ML, for instance for the definition of optimal reaction coordinates [19, 20, 21, 22, 23, 24]

, the estimate of free energy surfaces

[25, 26, 27, 22], the construction of Markov State Models [21, 23, 28], and for enhancing MD sampling by learning bias potentials [29, 30, 31, 32, 33]

or selecting starting configurations by active learning

[34, 35, 36]. Finally, ML can be used to generate samples from the equilibrium distribution of a molecular system without performing MD altogether, as proposed in the recently introduced Boltzmann Generators [37]. A selection of these topics will be reviewed and discussed in the following.

All these different aspects of molecular simulation have evolved independently so far. For instance, ML-generated force-fields have mostly been developed and applied on small molecules and ordered solids, while the analysis of MD trajectories is mostly relevant for the simulation of large flexible molecules, like proteins. In order to really revolutionize the field, these tools and methods need to evolve, become more scalable and transferable, and converge into a complete pipeline for the simulation and analysis of molecular systems. There are still some significant challenges towards this goal, as we will discuss in the following, but considering the rapid progress of the last few years, we envision that in the near future machine learning will have transformed the way molecular systems are studied in silico.

This review focuses on physics-based ML approaches for molecular simulation. ML is also having a big impact in other areas of Chemistry without involving a physics-based model, for example by directly attempting to make predictions of physicochemical or pharmaceutical properties [38, 39, 40, 41], or to designing materials and molecules with certain desirable properties using generative learning [42, 43, 44, 45]. We refer the interested reader to other recent reviews on the subject [46, 47].

The review is organized as follows. We start by describing the most important machine learning problems and principles for molecular simulation (Sec. 2). A crucial aspect of the application of ML in molecular simulations is to incorporate physical constraints and we will discuss this for the most commonly used physical symmetries and invariances for molecular systems (Sec. 3). We will then provide examples of specific ML methods and applications for molecular simulation tasks, focusing on deep learning and the neural network architectures involved (Sec. 4). We conclude by outlining open challenges and pointing out possible approaches to their solution (Sec. 5).

2 Machine Learning Problems for Molecular Simulation

In this section we discuss how several of the open challenges in the simulation of molecular systems can be formulated as machine learning problems, and the recent efforts to address them.

2.1 Potential energy surfaces

Molecular dynamics (MD) and Markov Chain Monte Carlo (MCMC) simulations employing classical force fields within the Born-Oppenheimer approximation constitute the cornerstone of contemporary atomistic modeling in chemistry, biology, and materials science. These methods perform importance sampling, i.e. in the long run, they sample states

from the molecular system’s equilibrium distribution which has the general form:


The reduced potential contains terms depending on the thermodynamic constraints (e.g. fixed temperature, pressure, etc.). In the canonical ensemble (fixed number of particles, volume and temperature), where is the thermal energy at temperature .

However, the predictive power of these simulations is only as good as the underlying potential energy surface (PES). Hence, predictive simulations of properties and functions of molecular systems require an accurate description of the global PES, , where indicates the nuclear Cartesian coordinates. All many-body interactions between electrons are encoded in the function. Although could be obtained on the fly using explicit ab initio calculations, more efficient approaches that can access long time scales are required to understand relevant phenomena in large molecular systems. A plethora of classical mechanistic approximations to exist, in which the parameters are typically fitted to a small set of ab initio calculations or experimental data [48, 49, 50, 51]. Unfortunately, these classical approximations often suffer from the lack of transferability and can yield accurate results only close to the conditions (geometries) they have been fitted to.

Alternatively, sophisticated ML approaches that can accurately reproduce the global potential energy surface (PES) for elemental materials [7, 52, 9, 53, 14, 54] and small molecules [55, 56, 11, 10, 57, 54, 58] have been recently developed (see Fig. 1). Such methods learn a model of the PES, , where the parameters

are optimized either by energy matching and/or force matching. In energy matching, an ML model is trained to minimize the loss function:


where are energy values obtained by QM calculations at specific configurations (see Fig. 1). For force matching, we compute the QM forces at specified configurations


and minimize the loss function:


The existing ML PES models are based either on non-linear kernel learning [9, 53, 57, 10] or (deep) neural networks [52, 14, 58]. Specific neural network architectures will be discussed in Sec. 4.1 and 4.2. Both approaches have advantages and limitations. The advantage of the kernel methods is their convex nature yielding a unique solution, whose behavior can be controlled outside of the training data. Neural networks are non-convex ML models and often harder to interpret and generalize outside of the training manifold.

Figure 1: Constructing force field models with ML: a) Reference geometries are sampled from a MD trajectory that is sufficiently long to ensure an optimal coverage of the configuration space. b) For a small subset of geometries, energy and force labels are then computed at a high level of theory to form the training, validation and test datasets. A globally consistent atom-atom assignment across the whole training set enables the identification and reconstructive exploitation of relevant spatial and temporal physical symmetries of the MD. c) The resulting ML model of the PES is finally used to speed up sampling intensive path-integral MD simulations at the accuracy of the reference electronic structure method [10].

2.2 Free energy surfaces

Given the Cartesian coordinates of a molecule with atoms, , we define the collective coordinates by the encoding:


where and is a small number. In machine learning terms, the coordinates define a latent space. In molecular simulations, the collective coordinates are often defined to describe the slowest processes of the system [59]. In general, the mapping (or encoding) can be highly non-linear. If the energy function associated with the atomistic representation is known, an important quantity to compute, for instance to connect with experimental measurements, is the free energy of the system as a function of the collective variables. The free energy is defined as:


where is the marginal distribution on of the equilibrium distribution given by Eq. (1):


The integral in Eq. (7) is in practice impossible to compute analytically for high dimensional systems, and several methods have been developed for its numerical estimation by enhancing the sampling of the equilibrium distribution in molecular dynamics simulations of the system.

The definition of the free energy can also be formulated as a learning problem: the aim is to optimize the parameters of a free energy function model such that Eqs. (6-7) are satisfied to a good approximation. Fulfilling these equations is usually referred to as enforcing thermodynamic consistency.

Using methods that estimate the free energy (or its gradient ) at a given set of points in the collective variables space (, one can use the free energy loss:

or the free energy gradient loss:

with these free energy estimates in order to reconstruct the entire surface . Both kernel regression [25] and neural networks [27] have been used to this purpose. Machine learning has also been used in combination with enhanced sampling methods to learn the free energy surface on-the-fly [26, 60, 61, 22, 62, 63, 64, 65].

An alternative way to learn (6) is by using force matching [66, 67]. It has been demonstrated [68] that, given the forces of the atomistic system collected along a molecular dynamic trajectory, , , the gradient of a free energy model that best satisfies Eq. (6) also minimizes the force matching loss:


where the term is the local mean force:

that is, the projection of the atomistic force on the collective variable space through the mapping .

In practice, the estimator (8) is very noisy: because of the dimensionality reduction from to , multiple realizations of the projected force can be associated to the same value of the collective coordinates and the minimum of the loss function (8

) can not go to zero. By invoking statistical estimator theory it can be shown that this loss can be broken down into a bias, variance and noise terms


2.3 Coarse-graining

The use of coarse-grained models of complex molecular systems, such as proteins, presents an attractive alternative to atomistic models that are very expensive to simulate [69, 67].

The design of a coarse-grained model for a system with atoms into a reduced representation with effective “beads” starts by the definition of a mapping similar to Eq. (5) where now are the coordinates of the coarse-grained beads. In this case, the mapping is usually a linear function, as in general the beads are defined as a subset or a linear combination of sets of atoms in the original system, in a way that allows to keep some information about the geometry of the molecule. For instance, in protein systems, a coarse-grained mapping can be defined by replacing all the atoms in a residue by a bead centered on the corresponding atom. There is at present no general theory to define the optimal coarse-graining mapping for a specific system. A few methods have been proposed in this direction, by optimizing the mapping to preserve certain properties of the original system. Examples include the definition of a system’s dynamical assembly units [70]

or the use of an autoencoder to minimize the reconstruction error


Once the coarse-graining mapping is given, several strategies exist to define the model energy function, either to match experimental observables (top-down), or to reproduce specific properties of the atomistic system (bottom-up) [67]. If one wants to define a coarse-grained model that is thermodynamically consistent with the original model (Eqs. 6-7), then, the definition of the energy function associated with a coarse-grained molecular model can be seen as a special case of free energy learning discussed in the previous section [66]. The effective energy of the coarse-grained model is the free energy defined by Eq. (6) where are now the coarse variables. Therefore, once the coarse-graining map is defined, the definition of the effective potential can also be seen as a learning problem. In particular, the force matching loss function given by Eq. (8) can be used to train the effective energy of the model from the atomistic forces. In the case of a linear mapping with a crisp assignment of atoms into beads, a matrix exists such that and the expression for the loss, Eq. (8), reduces to:


where for each bead , is the sum of the atomistic forces of all the atoms mapping to that bead. This loss function has been used to design coarse grained force fields for different systems both with kernel methods [16] and deep neural networks [17, 18].

2.4 Kinetics

Kinetics are the slow part of the dynamics. Due to the stochastic components in the MD integrator, for any trajectory emerging from a configuration at time

, there is a probability distribution of finding the molecule in configuration

at a later time :


We can express the transition density (10) as the action of the Markov propagator in continuous-space, and by its spectral decomposition [72, 73]:


The spectral decomposition can be read as follows: The evolution of the probability density can be approximated as the superposition of functions . A second set of functions, , is required in order to compute the amplitudes of these functions.

In general, Eq. (11

) is a singular value decomposition with left and right singular functions

and true singular values [73]. The approximation then is a low-rank decomposition in which the small singular values are discarded. For the special case that dynamics are in thermal equilibrium, Eq. (21) holds, and Eq. (11

) is an eigenvalue decomposition with the choices:


Hence Eq. (11

) simplifies: we only need one set of functions, the eigenfunctions

. The true eigenvalues are real-valued and decay exponentially in time with the characteristic relaxation times that are directly linked to kinetic experimental observables [74, 75]. The approximation in Eq. (11) is due to truncating all terms with relaxation times shorter than .

A quite established approach is to learn molecular kinetics (Eq. 11) from a given trajectory dataset. In order to obtain a low-dimensional model of the molecular kinetics that is easy to interpret and analyze, this usually involves two steps: (i) finding a low-dimensional latent space representation of the collective variables, , using the encoder , and (ii) learning a dynamical propagator in that space:


A common approach in MD, but also in other fields such as dynamical systems and fluid mechanics, is to seek a pair , such that

is a small matrix that propagates state vectors in a Markovian (memoryless) fashion

[76, 77, 78, 79, 74, 80, 81, 82, 83, 84, 73]. This is motivated by the spectral decomposition of dynamics (Eq. 11): If is large enough to filter fast processes, a few functions are sufficient to describe the kinetics of the system, and if maps to the space spanned by these functions, can be a linear, Markovian model.

If no specific constraints are imposed on , the minimum regression error of , the variational approach of Markov processes (VAMP) [85, 73], or maximum likelihood will all lead to the same estimator [86]:


using the latent space covariance matrices


performs a one-hot-encoding that indicates which “state” the system is in, then the pair

is called Markov state model (MSM [76, 77, 78, 79, 74, 80]), and is a matrix of conditional probabilities to be in a state at time given that the system was in a state at time .

Learning the embedding is more difficult than , as optimizing by maximum likelihood or minimal regression error in latent space leads to a collapse of to trivial, uninteresting solutions [86]. This problem can be avoided by following a variational approach to approximating the leading terms of the spectral decomposition (10) [85, 73]. The Variational Approach for Conformation dynamics (VAC) [85] states that for dynamics obeying detailed balance (21), the eigenvalues of a propagator matrix via any encoding are, in the statistical limit, lower bounds of the true . The VAMP variational principle is more general, as it does not require detailed balance (21), and applies to the singular values :

As VAMP is the more general principle, we can use it to define the loss function for estimating molecular kinetics models:


If dynamics obey detailed balance, we can use and plug in a symmetric estimate for [86, 23].

2.5 Sampling and Thermodynamics

MD time steps are on the order of one femtosecond ( s), while configuration changes governing molecular function are often rare events that may take to s. Even when evaluating the potential and the forces it generates is fast, simulating single protein folding and unfolding events by direct MD simulation may take years to centuries on a supercomputer. To avoid this sampling problem, machine learning methods can be employed to learn generating equilibrium samples from more efficiently, or even by generating statistically independent samples in “one shot” (Sec. 2.5).

Learning to sample probability distributions is the realm of generative learning [87, 88, 89, 90]. In the past few years, directed generative networks, such as variational Autoencoders (VAEs) [88], generative adversarial networks (GANs) [89], and flows [91, 90], have had a particular surge of interest. Such networks are trained to transform samples from an easy-to-sample probability distribution, such as Gaussian noise, into realistic samples of the objects of interest. These methods have been used to draw photorealistic images [92, 93], generate speech or music audio [94] and generate chemical structures to design molecules or materials with certain properties [42, 43, 44, 45].

If the aim is to learn a probability distribution from sampling data (density estimation) or learn to efficiently sample from a given probability distribution (Boltzmann generation [37]), one typically faces the challenge of matching the probability distribution of the trained model with a reference.

Matching probability distributions can be achieved by minimizing probability distances. The most commonly used probability distance is the Kullback-Leibler (KL) divergence, also called relative entropy between two distributions and :


In Boltzmann generation [37], we aim to efficiently sample the equilibrium distribution of a many-body system defined by its energy function (Eq. 1). We can learn the parameters of a neural network that generates samples from the distribution by making its generated distribution similar to the target equilibrium distribution. Choosing and in Eq. (16), we can minimize the with the energy loss:


As the network samples from an energy surface (Eq. 1), this loss is performing energy matching, similar as in Sec. 2.1. In order to evaluate the loss (17), we need not only to be able to generate samples from the network, but also to be able to compute for every sample . See Sec. 4.5 for an example.

The reverse case is density estimation. Suppose we have simulation data , and we want to train the probability distribution to resemble the data distribution. We choose as the data distribution and in Eq. (16), and exploit that is a constant that does not depend on the parameters . Then we can minimize the loss:


This loss is the negative likelihood that the model generates the observed sample, hence minimizing it corresponds to a maximum likelihood approach. Likelihood maximization is abundantly used in machine learning, in this review we will discuss it in Sec. 4.5.

3 Incorporating Physics into Machine Learning

3.1 Why incorporate physics?

Compared to classical ML problems such as image classification, we have a decisive advantage when working with molecular problems: we know a lot of physical principles that restrict the possible predictions of our machine to the physically meaningful ones.

Let us start with a simple example to illustrate this. We are interested in predicting the potential energy and the atom-wise forces of the diatomic molecule with positions in vacuum (without external forces). Independent of the details of our learning algorithm, physics tells us that a few general rules must hold:

  1. The energy is invariant when translating or rotating the molecule. We can thus arbitrarily choose the positions , and the energy becomes a function of the interatomic distance only: .

  2. Energy conservation: the energy and the force of the molecule are related by . Now we can compute the components of the force as: .

  3. Indistinguishability of identical particles: The energy is unchanged if we exchange the labels “1” and “2” of the identical oxygen atoms.

In machine learning, there are two principal approaches when dealing with such invariances or symmetries: (i) Data augmentation, and (ii) Building the invariances into the ML model.

3.2 Data augmentation

Data augmentation means we learn invariances “by heart” by artificially generating more training data and applying the known invariances to it. For example, given a training data point for positions and energy/force labels, , we can augment this data point with translation invariance of energy and force by adding more training data with random displacements . Data augmentation makes the ML model more robust and will help to predicting the invariances approximately. It is an important ML tool, as it is easy to do, while for certain invariances it is conceptually difficult or computationally expensive to hard-wire them into the ML model.

However, data augmentation is statistically inefficient, as additional training data are needed, and inaccurate because a network that does not have translation invariance hard-wired into it will never predict that the energy is exactly constant when translating the molecule. This inaccuracy may lead to unphysical, and potentially catastrophic predictions when such energy model is inserted into an MD integrator.

3.3 Building physical constraints into the ML model

The more accurate, statistical efficient and also more elegant approach to incorporating physical constraints is to directly build them into the ML model. Doing so involves accounting for two related aspects that are both essential to make the learning problem efficient: (i) equivariances: the ML model should have the same invariances and symmetries as the modeled physics problem, (ii) parameter sharing: whenever there is an invariance or symmetry, this should be reflected by sharing model parameters in different parts of the network.

3.4 Invariance and Equivariance

If we can hard-wire the physical symmetries into the ML structure, we can reduce the dimensionality of the problem. In the example of the molecule above, the energy and force are one- and six-dimensional functions defined for each point in 6-dimensional configuration space. However, when we use the invariances described above we only have to learn a one-dimensional energy function over a one-dimensional space, and can compute the full force field from it. The learning problem has become much simpler because we now learn only in the manifold of physically meaningful solutions (Fig. 2a).

An important concept related to invariance is equivariance. A function is equivariant if it transforms the same way as its argument. For example, the force is defined by the negative gradient of the energy, . If we rotate the molecule by applying the rotation , the energy is invariant, but the force is equivariant as it rotates in the same way as (Fig. 2b):

Equivariances are closely linked to convolutions in machine learning. For example, standard convolutions are translation invariant: Each convolution kernel is a feature detector that is applied to each pixel neighborhood [95]. When that feature is present in the image, the convolved image will show a signal in the corresponding position (Fig. 2c). When translating the input, the convolved image translates in the same way (Fig. 2c).

Figure 2: Physical constraints a) Physical constraints define a manifold of physically valid solutions for a given ML model class, e.g. a given neural network architecture: only certain combinations of parameters will obey these physical constraints. Using data augmentation, the network can learn “by heart” to make almost physically valid predictions. Directly building physical constraints into the ML model is more data-efficient and accurate, as then every prediction is physically meaningful. b) Invariance and equivariance: Upon rotating the beads, the spring energy is invariant, but the forces of the beads are equivariant, i.e. they rotate in the same way. c) Translational equivariance in convolutional layers: convolving a translated image with a filter results in a translated feature map. Modified from

We will briefly review below common invariances/equivariances useful for applications to molecular simulations, and discuss strategies to incorporate them in an ML algorithm.

3.4.1 Rototranslational invariance/equivariance

Physical quantities that only depend on the interactions of the atoms within a molecule should be invariant with respect to translation and rotation . Examples include potential and free energies of a molecule without an external field:

The force is equivariant to rotation, but invariant to translation (Fig. 2b, Sec. 3.4):

Rototranslational invariance can be achieved by transforming the configuration into roto-translationally invariant features, such as intramolecular distances or angles. Equivariance of the force can then be achieved by computing the force explicitly by a network layer that computes the derivatives with respect to , as it is done, e.g., in SchNet [96] and CGnet [18] (Fig. 5a). Note that periodic systems, such as crystals and explicit-solvent boxes have translational but not rotational invariances/equivariances.

3.4.2 Permutational invariance/equivariance.

Physical quantities, such as quantum mechanical energies, are invariant if we exchange the labels of identical atoms, e.g., Carbons. As the number of possible permutations increases exponentially with the number of identical particles, trying to learn permutation invariance by data augmentation is hopeless. Permutation invariance can be built into the ML model by choosing a functional form to compute the quantity of interest that is permutation invariant – see [97] for the general conditions. Following pioneering work of [98, 99, 100, 101], a specific choice that is common for networks that compute extensive quantities such as potential energies is to model it as a sum


where is the contribution of the energy by the th atom in its chemical environment [7, 11, 58, 96]. In order to account for the multi-body character of quantum mechanics, must generally also be a multi-body function. Eq. (19) can be implemented by using separate networks for the individual contributions and adding up their results (Fig. 3). The force resulting from Eq. (19) is automatically permutation equivariant.

Classical MD force fields define bonded interactions by assigning atoms to nodes in a bond graph, and thus exchanging of individual atom pairs no longer preserves the energy. However, the energy is still invariant to the exchange of identical molecules, e.g., solvent, and is important to take that into account for coarse-graining [18] and generating samples from the equilibrium density [37].

A simple alternative to building permutation invariance into the ML function is to map all training and test data to a reference permutation. This can be done efficiently by so-called bipartite graph matching methods such as the Hungarian method [102], that are frequently used in recent learning models [103, 10, 37].

3.4.3 Energy conservation

Closed physical systems are energy-conserving, which implies that the force field is defined by the gradient of the energy (Eq. 3). When learning potential energy, it can be a great advantage to use force information, because each -atom configuration is associated with only one energy but forces, which may result in superior data efficiency when force labels are used during learning [57, 10]

. If we use supervised learning for coarse-graining with thermodynamic consistency, we can only use forces, as no labels for the free energies are available (Sec.


In these examples, we have labeled training data . Using a network that directly predicts the forces would not guarantee that Eq. (3) holds. Therefore, when Eq. (3) holds, it is common to learn an energy and then computing the force in the network using a gradient layer [96, 18] (Fig. 5a). An alternative to ensure Eq. (3) is gradient-domain machine learning [57].

3.4.4 Probability conservation and stochasticity

In statistical mechanics (both equilibrium and non-equilibrium), we are interested in the probability of events. An important principle is thus probability conservation, i.e. that the sum over all events is 1. A common approach to encode the probability of classes or events with a neural network is to use a SoftMax output layer, where the activation of each neuron can be defined as


where runs over all neurons in the layer. In this representation, can be seen as a vector of energies giving rise to the Boltzmann probabilities . are nonnegative because of the exponential functions and sum up to .

In Markov state models (MSMs) of molecular kinetics, we would like to obtain a Markov transition matrix which is stochastic, i.e. for all elements and for all . If the encoder uses a SoftMax, then the estimator (14) will result in a transition matrix that conserves probability ( for all ). SoftMax can be exploited in VAMPnets to simultaneously learn an embedding from configurations to metastable states, and a Markov transition matrix (Sec. 4.4) [21].

3.4.5 Detailed balance

Detailed balance connects thermodynamics and dynamics. In a dynamical system that evolves purely in thermal equilibrium, i.e. without applying external forces, the equilibrium distribution and the transition probability are related by the following symmetry:


Thus, the unconditional probabilities of forward- and backward-trajectories are equal. Enforcing Eq. (21) in kinetic ML models ensures the spectral decomposition (11) is real-valued, which is useful for many analyses.

In order to learn a kinetic model that obeys detailed balance, other estimators than Eq. (14) must be used, that typically enforce the unconditional transition probabilities to be symmetric. See [86] for details.

3.5 Parameter sharing and convolutions

A decisive advance in the performance of neural networks for classical computer vision problems such as text or digit recognition came with going from fully connected “dense” networks to convolutional neural networks (CNN)

[95]. Convolution layers are not only equivariant, and thus help with the detection of an object independent of its location (Fig. 2c), the real efficiency of CNNs is due to parameter sharing.

In a classical dense network, all neurons of neighboring layers are connected and have independent parameters stored in a weight matrix

. This leads to a problem with high-dimensional data. If we were to process relatively small images, say

pixels, and associate each pixel with a neuron, a single dense neural network layer will have parameters in . This would not only be demanding in terms of memory and computing time, a network with so many independent parameters would likely overfit and not be able to generalize to unknown data.

Convolutions massively reduce the number of independent parameters. A convolutional layer with a filter acting on a one-dimensional signal computes, before applying bias and nonlinearities,


In terms of an image, convolution applies the same filter to every pixel neighborhood, in other words, all pixel transformations share the same parameters. Additionally, the filters are usually much smaller than the signal (often for images), thus each convolution only has few parameters. For molecules, we extend this idea to continuous convolutions that use particle positions instead of pixels (Sec. 4.2).

Besides the sheer reduction of parameters, parameter sharing, e.g. via convolution layers, is the keystone of transferability across chemical space. In convolutions of a greyscale image, we apply the same filters to every pixel, which implies translational equivariance, but also means “the same rules apply to all pixels”. If we have multiple color channels, we have different channels in the filters as well, so same color channels behave the same. In molecules we can apply the same idea to particle species. When learning energies from QM data, for example, every chemical element should be treated the same, i.e. use the same convolution filters in order to sense its chemical environment. This treatment gives us a “building block” principle that allows us to train on one set of molecules and make predictions for new molecules.

4 Deep Learning Architectures for Molecular Simulation

In this section, we present specific methods and neural network architectures that have been proposed to tackle the machine learning problems discussed in Sec. 2.

4.1 Behler-Parrinello, ANI, Deep Potential net

Behler-Parrinello networks are one of the first applications of machine learning in the molecular sciences [7]. They aim at learning and predicting potential energy surfaces from QM data and combine all of the relevant physical symmetries and parameter sharing for this problem (Sec. 3).

First, the molecular coordinates are mapped to roto-translationally invariant features for each atom (Fig. 3a). In this step, the distances of neighboring atoms of a certain type, as well as the angles between two neighbors of certain types are mapped to a fixed set of correlation functions that describes the chemical environment of atom . These features are the input to a dense neural network which outputs one number, the energy of atom in its environment. By design of the input feature functions, this energy is roto-translationally invariant. Parameters are shared between equivalent atoms, e.g., all carbon atoms are using the same network parameters to compute their atomic energies, but since the chemical environments will differ, their energies will differ. In a second step, the atomic energies are summed over all atoms of the molecule (Fig. 3b). This second step, combined with parameter sharing, achieves permutation invariance, as explained in Sec. 3.4.2. Transferability is achieved due to parameter sharing, but also because the summation principle allows to grow or shrink the network to molecules of any size, including sizes that were never seen in the training data.

A related approach is Deep-Potential net [58]

, where each atom is treated in a local coordinate frame that has the rotation an translation degrees of freedom removed.

Behler-Parrinello networks are traditionally trained by energy matching (Sec. 2.1) [7, 52], but can be trained with force matching if a gradient layer is added to compute the conservative force (Sec. 2.1, Eq. 3). The Behler-Parrinello method has been further developed in the ANI network, e.g., by extending to more advanced functions involving two neighbors [11].While Behler-Parrinello networks have mainly been used to make predictions of the same molecular system in order to run MD simulations unaffordable by direct ab-initio QM MD [52], ANI has been trained on DFT and coupled-cluster data across a large chemical space [11, 104, 12].

Figure 3: Behler-Parrinello networks for learning quantum energies. a) Behler-Parrinello network computing the atomic energy of a single atom . The system coordinates are mapped to roto-translationally invariant features describing the chemical environment of atom . b) A molecular system, e.g., , is composed by employing a network copy for each atom. The parameters are shared between networks for same elements. The total energy is given by the sum of atomic energies, introducing permutation invariance.

4.2 Deep Tensor Neural Nets, SchNet and continuous convolutions

One of the first deep learning architectures to learn to represent molecules or materials is the family of Deep Tensor Neural Networks (DTNN)

[14], with its recent addition SchNet [54, 105]. While in kernel-based learning methods [106, 2] chemical compounds are compared in terms of pre-specified kernel functions [8, 107, 108, 109], DTNN and its extension SchNet learn a multi-scale representation of the properties of molecules or materials from large data sets.

DTNNs were inspired by the language processing approach word-to-vec [110], where the role of a word within its grammatical or semantic context are learned and encoded in a parameter vector. Likewise, DTNNs learn a representation vector for each atom within its chemical environment (Fig. 4b left). DTNN’s tensor construction algorithm then iteratively learns higher-order representations by first interacting with all pairwise neighbors, e.g. extracting information implemented in the bond structure (Fig. 4b middle). By stacking such interaction layers deep, DTNNs can represent the structure and statistics of multi-body interactions in deeper layers. As DTNNs are end-to-end trained to predict certain quantum mechanical quantities, such as potential energies, they learn the representation that is relevant for the task of predicting these quantities [111].

SchNet [54] uses a deep convolutional neural network (CNN)[112, 113]. Classically, CNNs were developed for computer vision using pixelated images, and hence use discrete convolution filters. However, the particle positions of molecules cannot be discretized on a grid as quantum mechanical properties such as the energy are highly sensitive to small position changes, such as the stretching of a covalent bond. For this reason, SchNet introduced continuous convolutions [54, 96], which are represented by filter-generating neural networks that map the roto-translationally invariant interatomic distances to the filter values used in the convolution (Fig. 4b right).

DTNN and SchNet have both reached highly competitive prediction quality both across chemical compound space and across configuration space in order to simulate molecular dynamics. In addition to their prediction quality, their scalability to large data sets [105] and their ability to extract novel chemical insights by means of their learnt representation make the DTNN family an increasingly popular research tool.

Figure 4: SchNet, a continuous convolution framework: a) As molecular structures cannot be well discretized on a grids, SchNet generalizes the ConvNets approach to continuous convolutions between particles. b) SchNet architecture: the input, consisting of atom types (chemical elements ) and positions is processed through several layers to produce atom-wise energies that are summed to a total energy (left). The most important layer is the interaction layer in which atoms interact via continuous convolution functions (middle). Continuous convolutions are generated by dense neural networks that operate on the interatomic distances, ensuring roto-translational invariance of the energy.

4.3 Coarse-graining: CGnets

Figure 5: CGnet. a) The CGnet neural network architecture can be used to design a coarse-grained model by force-matching, by using the loss function of Eq. (9). b) Application of CGnet to the coarse-graining of the small protein Chignolin, from the fully atomistic and solved model to a -only model of 10 beads. c) Results of CGnet for Chignolin. i) The free energy of the original atomistic model, as a function of the two reaction coordinates. State a is the folded state, b is the unfolded state, and c is a misfolded state. ii) The free energy resulting from a coarse-grained model where only two-body terms are included in the energy function. State a’, b’, and c’ correspond to states a, b, and c in i). iii) The free energy resulting from CGnet. iv) The same free energies as in i), ii), and iii) but as a function of only one reaction coordinate. Figures adapted from [18].

As mentioned in section 2.3, machine learning has been used to define coarse-grained models for molecular systems. Both kernel methods [16] and deep neural networks [17, 18] have been designed to minimize the force-matching loss, Eq. (9), for given coarse-graining mappings for specific systems.

In both cases, it has been shown that the incorporation of physical constraints is crucial to the success of the model. The training data are obtained by means of atomistic molecular dynamic simulations and regions of the configurational space that are physically forbidden, such as configurations with broken covalent bonds or overlapping atoms, are not sampled and not included in the training. Without additional constraints, the machine cannot make predictions far away from the training data, and will thus not reliably predict that the energy should diverge when approaching physically forbidden regions.

Excluding high-energy states such as broken bonds or colliding atoms is different from enforcing physical symmetries as described in Sec. 3.4. Rather, it is about enforcing the correct asymptotic behavior of the energy when going towards an unphysical limit. CGnets proposed to achieve this by learning the difference to a simple prior energy that was defined to have the correct asymptotic behavior [18] (Fig. 5a). The exact form of this prior energy is not essential for success, as the CGnet can correct the prior energy where training data is available. In [18], the prior energy consisted of harmonic terms for bonds and angles of coarse-grained particles whose equilibrium values and force constants were obtained with Boltzmann inversion, as well as excluded volume terms in the form of were is the inter-particle distance and are hyper-parameters.

As in Behler-Parrinello networks and SchNet, CGnet predicts a roto-translationally invariant energy as the first layer transforms the Cartesian coordinates into internal coordinates such as distances and angles (Fig. 5a). Furthermore, CGnet predicts a conservative and rotation-equivariant force field as the gradient of the total free energy with respect to input configuration is computed self-consistently by the network (see Fig. 5a). The network is trained by minimizing the force matching loss of this prediction (Eq. 8).

Fig. 5b-c shows an application of CGnets to the coarse-graining of the mini-protein Chignolin, in which all solvent molecules are coarse-grained away and the atoms of each residue are mapped to the corresponding atom (Fig. 5b). MD simulations performed with the force field predicted by the CGnet predicts a free energy surface that is quantitatively similar to the free energy surface of the all-atom simulations, and resolves the same metastable states (folded, unfolded and misfolded). In contrast, a spline-based coarse-grained model where only two-body terms are included in the energy function cannot reproduce the all-atom free energy surface, and does not even predict that folded and unfolded are separated metastable states. These results clearly illustrate the importance of multi-body interactions in the coarse-grained energy, for example surface or volume terms that can describe implicit solvation. While the spline model can be dramatically improved by adding suitable terms to list of features, this is not necessary when using a deep neural network which automatically learns the required multi-body terms. Similar conclusions can be obtained by using Gaussian Approximation Potentials as the machine learning model to capture multi-body terms in coarse-grained energy functions [16].

4.4 Kinetics: VAMPnets

VAMPnets [21] were introduced to replace the complicated and error-prone approach of constructing Markov state models by (i) searching for optimal features, (ii) combining them into a low-dimensional representation , e.g., via TICA [114], (iii) clustering , (iv) estimating the transition matrix , and (v) coarse-graining . VAMPnets uses instead a single end-to-end learning approach in which all of these steps are replaced by a deep neural network. This is possible because with the VAC and VAMP variational principles (Sec. 2.4, [85, 73]), loss functions are available that are suitable to train the embedding and the propagator simultaneously (see Sec. 2.4, Eq. 13).

VAMPnets contain two network lobes representing the embedding . These networks transform the molecular configurations found at a time delay along the simulation trajectories (Fig. 6a). VAMPnets can be trained by minimizing the VAMP loss, Eq. (15), which is meaningful for both dynamics with and without detailed balance [73]. VAMPnets may, in general, use two distinct network lobes to encode the spectral representation of the left and right singular functions (which is important for non-stationary dynamics [115, 116]). EDMD with dictionary learning [117] uses a similar architecture as VAMPnets, but is optimized by minimizing the regression error in latent space. In order to avoid collapsing to trivial embeddings such as constant functions (see Sec. 2.4) a suitable regularization must be employed [117].

While hyper-parameter selection can be performed by minimizing the variational loss (15) on a validation set [118, 119, 120, 121], it is important to test the performance of a kinetic model on timescales beyond the training timescale . We can use the Chapman-Kolmogorov equation to test how well the learnt model predicts longer times:


A common way to implement this test is to compare the leading eigenvalues of the left and right hand sides [77, 122, 80].

In [21], parameters were shared between the VAMPnets nodes, and thus a unique embedding is learned. When detailed balance is enforced while computing (Eq. 21), the loss function automatically becomes a VAC score [85]. In this case, the embedding encodes the space of the dominant Markov operator eigenfunctions [21]. This feature was extensively studied in state-free reversible VAMPnets [23].

In order to obtain a propagator that can be interpreted as a Markov state model, [21]

chose to use a SoftMax layer as an output layer, thus transforming the spectral representation to a soft indicator function similar to spectral clustering methods such as PCCA+

[123, 124]. As a result, the propagator computed by Eq. (14) conserves probability and is almost a transition matrix (Sec. 3.4.4), although it may have some negative elements with small absolute values.

The results described in [21] (see, e.g., Fig. 6) were competitive with and sometimes surpassed the state-of-the-art handcrafted MSM analysis pipeline. Given the rapid improvements of training efficiency and accuracy of deep neural networks seen in a broad range of disciplines, we expect end-to-end learning approaches such as VAMPnets to dominate the field eventually.

In [28], a deep generative Markov State Model (DeepGenMSM) was proposed that, in addition to the encoder and the propagator learns a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories to predict the system evolution from a defined starting state and propose new configurations. The DeepGenMSM was demonstrated to provide accurate estimates of the long-time kinetics and generate valid distributions for small molecular dynamics benchmark systems.

Figure 6: VAMPnet and application to alanine dipeptide. a) a VAMPnet [21] includes an encoder which transforms each molecular configuration to a latent space of “slow reaction coordinates” , and is trained on pairs sampled from the MD simulation using the VAMP score [73]. If the encoder performs a classification, the dynamical propagator is a Markov state model. b) Structure of alanine dipeptide. The backbone torsion angles and , describe the slow kinetics of conformation changes, but Cartesian coordinates of heavy atoms are used as VAMPnet inputs here. c) Classification of the VAMPnet encoder of MD frames to metastable states in the and space. Color corresponds to activation of the respective output neuron. d) equilibrium probabilities of Markov states and transition probabilities given by . e) Chapman–Kolmogorov test comparing long-time predictions of the VAMPnet model estimated at with estimates at longer lag times. Figure modified from [21].

4.5 Sampling/Thermodynamics: Boltzmann Generators

Boltzmann Generators were introduced in [37] to learn to sample equilibrium distributions, Eq. (1). In contrast to standard generative learning, a Boltzmann Generator does not attempt to learn the probability density from data, but is trained to efficiently sample using the dimensionless energy as an input. A Boltzmann Generator consists of two parts:

  1. A generative model that is trained to propose samples from a probability distribution which is “similar” to , and that allows us to evaluate (up to a constant) for every .

  2. A reweighting procedure that takes proposals from and generates unbiased samples from ().

Boltzmann Generators use a trainable generative network which maps latent space samples

from a simple prior, e.g., a Gaussian normal distribution, to samples

. Training is done by combining the energy-based training using the KL divergence, Eq. (17), and maximum likelihood, Eq. (18).

For both, training and reweighting, we need to be able to compute the probability of generating a configuration . This can be achieved by the change-of-variables equation if is an invertible transformation (Fig. 7a) [91, 90]. Such invertible networks are called flows due to the analogy of the transformed probability density with a fluid [90, 125]. In [37], the non-volume preserving transformations RealNVP were employed [125], but the development of more powerful invertible network architectures is an active field of research [126, 93, 127]. By stacking multiple invertible “blocks”, a deep invertible neural network is obtained that can encode a complex transformation of variables.

Fig. 7c-h illustrate the Boltzmann Generator on a condensed-matter model system which contains a bistable dimer in a box densely filled with repulsive solvent particles (Fig. 7

c,d). Opening or closing the dimer is a rare event, and also involves the collective rearrangement of the solvent particles due to the high particle density. Using short MD simulation in the open and closed states as an initialization, the Boltzmann Generator can be trained to sample open, closed and the previously unseen transition states by generating Gaussian random variables in latent space (Fig.

7e) and feeding them through the transformation . Such samples have realistic structures and close to equilibrium energies (Fig. 7f). By employing reweighting, free energy differences can be computed (Fig. 7g). There is a direct relationship between the temperature of the canonical ensemble and the variance of the latent-space Gaussian of the Boltzmann Generator [37]. This allows us to learn to generate thermodynamics, such as the temperature-dependent free energy profiles, using a single Boltzmann Generator (Fig. 7

g). Finally, as the latent space concentrates configurations of equilibrium probability around the origin, Boltzmann Generators can be used to generate physically realistic reaction pathways by performing linear interpolations in latent space.

Figure 7: Boltzmann Generators: a) A Boltzmann Generator is trained by minimizing the difference between its generated distribution and the desired Boltzmann distribution. Generation proceeds by drawing “latent” space samples from a simple prior distribution (e.g., Gaussian) and transforming them to configurations via an invertible Flow: a deep neural network and its inverse, . To compute thermodynamics, such as configurational free energies, the samples must be reweighted to the Boltzmann distribution. b) Repulsive particle system with bistable dimer. Closed (blue) and open (red) configurations from MD simulations (input data). c) Distribution of MD simulation data in latent space coordinates after training Boltzmann Generator. d) Potential energy distribution from MD (grey) and Boltzmann generator for closed (blue), open (red) and transition configurations (yellow). Insets show one-shot Boltzmann Generator samples. e) Free energy differences as a function of dimer distance and relative temperature sampled with Boltzmann generators (green bullets) and umbrella sampling (black lines). Figure modified from [37].

5 Discussion

Despite rapid advances in the field of Machine Learning for Molecular Simulation, there are still significant open problems that need to be addressed, in all the areas discussed above.

5.1 Accuracy and efficiency in quantum-chemical energies and forces

In order to be practically useful, a ML model for both PES and atomic forces is needed that: i) can yield accuracy of kcal/mol for the energy per functional group and about kcal/mol/Å for the force per atom; ii) is not much more expensive to evaluate than classical force fields, iii) scales to large molecules such as proteins; and iv) is transferable to different covalent and non-covalent environments. Such universal model does not exist yet.

Crucial steps towards i-ii) have been recently taken by symmetrized gradient-domain machine learning (sGDML), a kernel-based approach to constructing molecular force fields [57, 10, 128, 129]. Currently, sGDML already enables MD simulations with electrons and nuclei treated at essentially exact quantum-mechanical level for molecules with up to 20-30 atoms.

Network-based approaches, such as Schnet, ANI etc, are better suited to iii-iv), as they break down the energy in local interactions of atoms with their environment, thus enabling a “building block” principle that is by design better scalable to molecules of different size and transferable across chemical space. However, these approaches do not reach the high accuracy in configuration space that sGDML does. Combining high accuracy in configuration and chemical space remains an active research topic.

5.2 Long-ranged interactions

The vast majority of approaches to make ML inference on molecular structures are based on local chemical information. Current neural networks for modeling molecular energies use the summation principle (e.g., Eq. 19) in order to sum up local energies of atom with its neighbors. While multi-body and long-ranged energies can be obtained by stacking multiple layers [14, 54, 96] – the working principle of deep convolution networks [112] – there are fundamental physical limits of this approach: Long-ranged interactions such as electrostatics cannot be cut off.

For classical point-charge models, long-ranged electrostatics methods have been developed, such as the Ewald summation method for periodic systems [130]. One option is to combine short-ranged ML models with such long-ranged electrostatics methods. In order to avoid double counting interactions, one must also predict atomic charges, which is an active field of research [131, 132]. An alternative option, and currently unexplored territory, is to develop neural network structures for particle interactions that can compute long-ranged interactions by design.

In addition to electrostatics, van der Waals (vdW) dispersion interactions can also have a substantial long-range character, i.e. they can extend to separations of tens of nanometers or more in large molecular and nanoscale systems [133, 134, 135]. Developing ML models that correctly treat the quantum-mechanical many-body nature of vdW interactions remains a difficult challenge to overcome [15].

5.3 Quantum Kinetics

With the availability of chemically transferable ML models that have quantum-chemical accuracy, the next open problem is to sample metastable states and long timescale kinetics. Although available ML models for predicting QM energies and forces are still significantly slower than force fields, the vast array of enhanced sampling methods and kinetic models (Sec. 2.4,4.4) will likely allow us to explore kinetics of quantum chemical systems on timescales of microseconds and beyond. A plethora of new physical insights that we cannot access with current MD force fields awaits us there. For example, what is the role of protonation dynamics in mediating protein folding or function?

5.4 Transferability of coarse-grained models

An outstanding question in the design of coarse-grained models is that of transferability across chemical space. Bottom-up coarse-grained models are useful in practice if they can be parametrized on small molecules and then used to predict the dynamics of systems much larger than what is possible to simulate with atomistic resolution. It is not clear to what extent transferability of coarse-grained models can be achieved, and how that depends on the coarse-graining mapping [136, 137]. Compared to the manual design of few-body free energy functionals, machine-learned free energies can help with transferability, as they are able to learn the important multi-body effects, e.g., to model neglected solvent molecules implicitly (Sec. 4.3) [16, 17, 18].

It is natural to consider Behler-Parrinello type networks or SchNet as a starting point for modeling transferable coarse-grained energies, but their application is nontrivial: it is a priori unclear what the interacting particles are in the coarse-grained model and how to define their “types”, as they are no longer given by the chemical element. Furthermore, these networks assume permutation invariance between identical particles, while classical MD force fields do not have permutation invariance of atoms within the same molecule. Therefore, particle network structures that can handle bonding graphs need to be developed.

5.5 Kinetics of coarse-grained models

While coarse-grained MD models may perform well in reproducing the thermodynamics of the atomistic system, they may fail in reproducing the kinetics. Existing approaches include adding fictitious particles [138], or training the coarse-grained model with spectral matching [139]. There is indication that the kinetics can be approximately up to a global scaling factor in barrier-crossing problems when the barriers are well approximated [140], which could be achieved by identifying the slow reaction coordinates [59], and assigning more weight to the transition state in force matching or relative entropy minimization. This area of research is still underdeveloped.

5.6 Transferable prediction of intensive properties

Extensive properties such as potential energies can be well predicted across chemical space, as they can be conceptually broken down as a sum of parts that can be learnt separately. This is not possible with intensive properties such as spectra or kinetics, and for this reason the prediction of such properties is, as yet, far behind.

5.7 Equivariant generative networks with parameter sharing

Generative networks, such as Boltzmann Generators (Sec. 4.5) have been demonstrated to be able to generate physically realistic one-shot samples of model systems and proteins in implicit solvent [37]. In order to scale to larger systems, important steps are to build the invariances of the energy, such as the exchange of identical solvent particles, into the transformation, and to include parameter sharing (Sec. 3.5), such that we can go beyond just sampling the probability density of one given system with energy and instead generalize from a dataset of examples of one class of molecules, e.g. solvated proteins. To this end, equivariant networks with parameter sharing need to be developed for generative learning, which are, to date, not available.

5.8 Explainable AI

Recently, the increasing popularity of explainable AI methods (see e.g. [141, 142, 143, 144]) have allowed us to gain insight into the inner workings of deep learning algorithms. In this manner, it has become possible to extract how a problem is solved by the deep model. This allows for example to detect so-called ‘clever Hans’ solutions [143], i.e. nonsensical solutions relying on artifactual or nonphysical aspects in data. Combined with networks that learn a representation such as DTNN/Schnet [14, 96] and VAMPnets [21], these inspection methods may provide scientific insights into the mechanisms that give rise to the predicted physicochemical quantity, and thus fuel the development of new theories.


We gratefully acknowledge funding from European Commission (ERC CoG 772230 “ScaleCell” to F.N. and ERC-CoG grant BeStMo to A.T.), Deutsche Forschungsgemeinschaft (CRC1114/A04 to F.N., EXC 2046/1, Project ID 390685689 to K.-R.M., GRK2433 DAEDALUS to F.N. and K.-R.M.), the MATH+ Berlin Mathematics research center (AA1-8 to F.H., EF1-2 to F.N. and K.-R.M.), Einstein Foundation Berlin (Einstein Visiting Fellowship to C.C.), the National Science Foundation (grants CHE-1265929, CHE-1740990, CHE-1900374, and PHY-1427654 to C.C.), the Welch Foundation (grant C-1570 to C.C.), the Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017-0-00451, No. 2017-0-01779 to K.-R.M.), and the German Ministry for Education and Research (BMBF) (Grants 01IS14013A-E, 01GQ1115 and 01GQ0850 to K.-R. M.). All authors thank Stefan Chmiela and Kristof Schütt for help with Figures 1 and 5.


  • [1] P. A. M. Dirac. Quantum mechanics of many-electron systems. Proc. Royal Soc. A, 123(792):714–733, 1929.
  • [2] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw., 12:181–201, 2001.
  • [3] V. N. Vapnik. An overview of statistical learning theory. IEEE Trans. Neur. Net., 10, 1999.
  • [4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [5] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
  • [6] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
  • [7] J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett., 98:146401, 2007.
  • [8] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. Von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301, 2012.
  • [9] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett., 104:136403, 2010.
  • [10] S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun., 9:3887, 2018.
  • [11] J. S. Smith, O. Isayev, and A. E. Roitberg. Ani-1: an extensible neural network potential with dft accuracy at force field computational cost. Chem. Sci., 8:3192–3203, 2017.
  • [12] J. S. Smith, B. T. Nebgen, R. Zubatyuk, N. Lubbers, C. Devereux, K. Barros, S. Tretiak, O. Isayev, and A. Roitberg.

    Outsmarting quantum chemistry through transfer learning., 2018.
  • [13] F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman, K. Burke, and K.-R. Müller. Bypassing the kohn-sham equations with machine learning. Nat. Commun., 8:872, 2017.
  • [14] K. T. Schütt, F. Arbabzadah, S. Chmiela, K.-R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nat. Commun., 8:13890, 2017.
  • [15] T. Bereau, R. A. DiStasio Jr., A. Tkatchenko, and O. A. von Lilienfeld. Non-covalent interactions across organic and biological subsets of chemical space: Physics-based potentials parametrized from machine learning. J. Chem. Phys., 148:241706, 2018.
  • [16] S. T. John and G. Csányi. Many-body coarse-grained interactions using gaussian approximation potentials. J. Phys. Chem. B, 121:10934–10949, 2017.
  • [17] L. Zhang, J. Han, H. Wang, R. Car, and W. E. DeePCG: constructing coarse-grained models via deep neural networks. J. Chem. Phys., 149:034101, 2018.
  • [18] J. Wang, S. Olsson, C. Wehmeyer, A. Pérez, N. E. Charron, G. De Fabritiis, F. Noé, and C. Clementi. Machine learning of coarse-grained molecular dynamics force fields. ACS Cent. Sci., 5:755–767, 2019.
  • [19] C. Wehmeyer and F. Noé. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. J. Chem. Phys., 148:241703, 2018.
  • [20] C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, and V. S. Pande. Variational encoding of complex dynamics. Phys. Rev. E, 97:062412, 2018.
  • [21] A. Mardt, L. Pasquali, H. Wu, and F. Noé. VAMPnets: Deep learning of molecular kinetics. Nat. Commun., 9:5, 2018.
  • [22] J. M. L. Ribeiro, P. Bravo, Y. Wang, and P. Tiwary. Reweighted autoencoded variational bayes for enhanced sampling (rave). J. Chem. Phys., 149:072301, 2018.
  • [23] W. Chen, H. Sidky, and A. L. Ferguson. Nonlinear discovery of slow molecular modes using state-free reversible vampnets. J. Chem. Phys., 150:214114, 2019.
  • [24] H. Jung, R. Covino, and G. Hummer. Artificial intelligence assists discovery of reaction coordinates and mechanisms from molecular dynamics simulations. arXiv:1901.04595, 2019.
  • [25] T. Stecher, N. Bernstein, and G. Csányi. Free energy surface reconstruction from umbrella samples using gaussian process regression. J. Chem. Theory Comput., 10:4079–4097, 2014.
  • [26] L. Mones, N. Bernstein, and G. Csányi. Exploration, sampling, and reconstruction of free energy surfaces with gaussian process regression. J. Chem. Theory Comput., 12:5100–5110, 2016.
  • [27] E. Schneider, L. Dai, R. Q. Topper, C. Drechsel-Grau, and M. E. Tuckerman. Stochastic neural network approach for learning high-dimensional free energy surfaces. Phys. Rev. Lett., 119(150601), 2017.
  • [28] H. Wu, A. Mardt, L. Pasquali, and F. Noé. Deep generative markov state models. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Adv. Neural Inf. Process. Syst. 31, pages 3975–3984. Curran Associates, Inc., 2018.
  • [29] O. Valsson and M. Parrinello. Variational approach to enhanced sampling and free energy calculations. Phys. Rev. Lett., 113:090601, 2014.
  • [30] L. Bonati, Y.-Y. Zhang, and M. Parrinello. Neural networks based variationally enhanced sampling. arXiv:1904.01305, 116:17641–17647, 2019.
  • [31] J. Zhang, Y. I. Yang, and F. Noé. Targeted adversarial learning optimized sampling. J. Phys. Chem. Lett., 10:5791–5797, 2019.
  • [32] J. McCarty and M. Parrinello. A variational conformational dynamics approach to the selection of collective variables in metadynamics. J. Chem. Phys., 147:204109, 2017.
  • [33] M. M. Sultan and V. S. Pande. tICA-metadynamics: Accelerating metadynamics by using kinetically selected collective variables. J. Chem. Theory Comput., 13:2440–2447, 2017.
  • [34] S. Doerr and G. De Fabritiis. On-the-fly learning and sampling of ligand binding by high-throughput molecular simulations. J. Chem. Theory Comput., 10:2064–2069, 2014.
  • [35] M. I. Zimmerman and G. R. Bowman. FAST conformational searches by balancing exploration/exploitation trade-offs. J. Chem. Theory Comput., 11:5747–5757, 2015.
  • [36] N. Plattner, S. Doerr, G. De Fabritiis, and F. Noé. Protein-protein association and binding mechanism resolved in atomic detail. Nat. Chem., 9:1005–1011, 2017.
  • [37] Frank Noé, Simon Olsson, Jonas Köhler, and Hao Wu. Boltzmann generators - sampling equilibrium states of many-body systems with deep learning. Science, 365:eaaw1147, 2019.
  • [38] J. Jiménez, M. Skalic, G. Martinez-Rosell, and G. De Fabritiis. Kdeep: Protein–ligand absolute binding affinity prediction via 3d-convolutional neural networks. J. Chem. Inf. Model., 58:287–296, 2018.
  • [39] M. Skalic, A. Varela-Rial, J. Jiménez, G. Martínez-Rosell, and G. De Fabritiis. Ligvoxel: inpainting binding pockets using 3d-convolutional neural networks. Bioinformatics, 35:243–250, 2018.
  • [40] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K Leswing, and V. Pande. Moleculenet: a benchmark for molecular machine learning. Chem. Sci., 9:513–530, 2018.
  • [41] E. N. Feinberg, D. Sur, Z. Wu, B. E. Husic, H. Mai, Y. Li, S. Sun, J. Yang, B. Ramsundar, and V. S. Pande. Potentialnet for molecular property prediction. ACS Cent. Sci., 4:1520–1530, 2018.
  • [42] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D Hirzel, R. P Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci., 4:268–276, 2018.
  • [43] M. Popova, O. Isayev, and A. Tropsha.

    Deep reinforcement learning for de novo drug design.

    Sci. Adv., 4:eaap7885, 2018.
  • [44] R. Winter, F. Montanari, A. Steffen, H. Briem, F. Noé, and D. A. Clevert. Efficient multi-objective molecular optimization in a continuous latent space. Chem. Sci., 10:8016–8024, 2019.
  • [45] R. Winter, F. Montanari, F. Noé, and D.-A. Clevert. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci., 10:1692–1701, 2019.
  • [46] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh. Machine learning for molecular and materials science. Nature, 559:547–555, 2018.
  • [47] B. Sanchez-Lengeling and A. Aspuru-Guzik. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361:360–365, 2018.
  • [48] Kresten Lindorff-Larsen, Paul Maragakis, Stefano Piana, Michael P. Eastwood, Ron O. Dror, and David E. Shaw. Systematic validation of protein force fields against experimental data. PLoS One, 7:e32131, 2012.
  • [49] P. S. Nerenberg, B. Jo, C. So, A. Tripathy, and T. Head-Gordon. Optimizing solute-water van der waals interactions to reproduce solvation free energies. J. Phys. Chem. B, 116:4524–4534, 2012.
  • [50] J. Huang, S. Rauscher, G. Nawrocki, T. Ran, M. Feig, B. L. de Groot, H. Grubmüller, and A. D MacKerell. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat. Methods, 14:71–73, 2016.
  • [51] P. Robustelli, S. Piana, and D. E. Shaw. Developing a molecular dynamics force field for both folded and disordered protein states. Proc. Natl. Acad. Sci. USA, 115:E4758–E4766, 2018.
  • [52] J. Behler. Perspective: Machine learning potentials for atomistic simulations. J. Chem. Phys., 145:170901, 2016.
  • [53] Z. Li, J. R. Kermode, and A. De Vita. Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces. Phys. Rev. Lett., 114:96405, 2015.
  • [54] K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller. SchNet - a deep learning architecture for molecules and materials. J. Chem. Phys., 148:241722, 2018.
  • [55] M. Gastegger, J. Behler, and P. Marquetand. Machine learning molecular dynamics for the simulation of infrared spectra. Chem. Sci., 8:6924–6935, 2017.
  • [56] P. O. Dral, A. Owens, S. N. Yurchenko, and W. Thiel. Structure-based sampling and self-correcting machine learning for accurate calculations of potential energy surfaces and vibrational levels. J. Chem. Phys., 146:244108, 2017.
  • [57] S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller. Machine learning of accurate energy-conserving molecular force fields. Sci. Adv., 3:e1603015, 2017.
  • [58] J. Han, L. Zhang, R. Car, and W. E. Deep potential: a general representation of a many-body potential energy surface. Phys. Rev. Lett., 120:143001, 2018.
  • [59] F. Noé and C. Clementi. Collective variables for the study of long-time kinetics from molecular trajectories: theory and methods. Curr. Opin. Struc. Biol., 43:141–147, 2017.
  • [60] R. Galvelis and Y. Sugita. Neural network and nearest neighbor algorithms for enhancing sampling of molecular dynamics. J. Chem. Theory Comput., 13:2489–2500, 2017.
  • [61] H. Sidky and J. K. Whitmer. Learning free energy landscapes using artificial neural networks. J. Chem. Phys., 148:104111, 2018.
  • [62] J. M. L. Ribeiro and P. Tiwary. Toward achieving efficient and accurate ligand-protein unbinding with deep learning and molecular dynamics through RAVE. J. Chem. Theory Comput., 15:708–719, 2018.
  • [63] W. Chen and A. L. Ferguson. Molecular enhanced sampling with autoencoders: On-the-fly collective variable discovery and accelerated free energy landscape exploration. J. Comput. Chem., 39:2079–2102, 2018.
  • [64] M. M. Sultan, H. K. Wayment-Steele, and V. S. Pande. Transferable neural networks for enhanced sampling of protein dynamics. J. Chem. Theory Comput., 14:1887–1894, 2018.
  • [65] A. Z. Guo, E. Sevgen, H. Sidky, J. K. Whitmer, J. A. Hubbell, and J. J. de Pablo. Adaptive enhanced sampling by force-biasing using neural networks. J. Chem. Phys., 148:134108, 2018.
  • [66] W. G. Noid, Jhih-Wei Chu, Gary S. Ayton, Vinod Krishna, Sergei Izvekov, Gregory A. Voth, Avisek Das, and Hans C. Andersen. The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models. J. Chem. Phys., 128:244114, 2008.
  • [67] W. G. Noid. Perspective: Coarse-grained models for biomolecular systems. J. Chem. Phys., 139:090901, 2013.
  • [68] G. Ciccotti, T. Lelièvre, and E. Vanden-Eijnden. Projection of diffusions on submanifolds: Application to mean force computation. Commun. Pure Appl. Math., 61:371–408, 2008.
  • [69] C. Clementi. Coarse-grained models of protein folding: Toy-models or predictive tools? Curr. Opin. Struct. Biol., 18:10–15, 2008.
  • [70] L. Boninsegna, R. Banisch, and C. Clementi. A data-driven perspective on the hierarchical assembly of molecular structures. J. Chem. Theory Comput., 14:453–460, 2018.
  • [71] W. Wang and R. Gómez-Bombarelli. Variational coarse-graining for molecular dynamics. arXiv:1812.02706, 2018.
  • [72] M. Sarich, F. Noé, and C. Schütte. On the approximation quality of markov state models. Multiscale Model. Simul., 8:1154–1177, 2010.
  • [73] H. Wu and F. Noé. Variational approach for learning markov processes from time series data. arXiv:1707.04659, 2017.
  • [74] N. V. Buchete and G. Hummer. Coarse Master Equations for Peptide Folding Dynamics. J. Phys. Chem. B, 112:6057–6069, 2008.
  • [75] F. Noé, S. Doose, I. Daidone, M. Löllmann, J. D. Chodera, M. Sauer, and J. C. Smith. Dynamical fingerprints for probing individual relaxation processes in biomolecular dynamics with simulations and kinetic experiments. Proc. Natl. Acad. Sci. USA, 108:4822–4827, 2011.
  • [76] C. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard. A Direct Approach to Conformational Dynamics based on Hybrid Monte Carlo. J. Comput. Phys., 151:146–168, 1999.
  • [77] W. C. Swope, J. W. Pitera, and F. Suits. Describing protein folding kinetics by molecular dynamics simulations: 1. Theory. J. Phys. Chem. B, 108:6571–6581, 2004.
  • [78] F. Noé, I. Horenko, C. Schütte, and J. C. Smith. Hierarchical Analysis of Conformational Dynamics in Biomolecules: Transition Networks of Metastable States. J. Chem. Phys., 126:155102, 2007.
  • [79] J. D. Chodera, K. A. Dill, N. Singhal, V. S. Pande, W. C. Swope, and J. W. Pitera.

    Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics.

    J. Chem. Phys., 126:155101, 2007.
  • [80] J.-H. Prinz, H. Wu, M. Sarich, B. G. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134:174105, 2011.
  • [81] I. Mezić. Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynam., 41:309–325, 2005.
  • [82] P. J. Schmid and J. Sesterhenn. Dynamic mode decomposition of numerical and experimental data. In 61st Annual Meeting of the APS Division of Fluid Dynamics. American Physical Society, 2008.
  • [83] J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz. On dynamic mode decomposition: Theory and applications. J. Comput. Dyn., 1:391–421, 2014.
  • [84] H. Wu, F. Nüske, F. Paul, S. Klus, P. Koltai, and F. Noé. Variational koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations. J. Chem. Phys., 146:154104, 2017.
  • [85] F. Noé and F. Nüske. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Model. Simul., 11:635–655, 2013.
  • [86] Frank Noé. Machine learning for molecular dynamics on long timescales. arXiv:1812.07669, 2018.
  • [87] P. Smolensky. Chapter 6: Information processing. In David E. Rumelhart and James L. McLelland, editors, Dynamical Systems: Foundations of Harmony Theory, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 194–281. MIT Press., 1986.
  • [88] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), arXiv:1312.6114, 2014.
  • [89] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and J. Bengio. Generative adversarial networks. In NIPS’14 Proceedings of the 27th International Conference on Neural Information Processing Systems, volume 2, pages 2672–2680, MA, USA, 2014. MIT Press Cambridge.
  • [90] L. Dinh, D. Krueger, and Y. Bengio. Nice: Nonlinear independent components estimation. arXiv:1410.8516, 2015.
  • [91] E. G. Tabak and E. Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Commun. Math. Sci., 8:217–233, 2010.
  • [92] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. Proceedings of the 6th International Conference on Learning Representations (ICLR), arXiv:1710.10196, 2018.
  • [93] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. NIPS, 2018.
  • [94] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3918–3926, Stockholmsmässan, Stockholm Sweden, 2018. PMLR.
  • [95] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278–2324, 1998.
  • [96] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. In Adv. Neural Inf. Process. Syst., pages 991–1001, 2017.
  • [97] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Adv. Neural Inf. Process. Syst., pages 3391–3401. Curran Associates, Inc., 2017.
  • [98] F. Ercolessi, E. Tosatti, and M. Parrinello. Au (100) surface reconstruction. Phys. Rev. Lett., 57:719, 1986.
  • [99] J. Tersoff. New empirical model for the structural properties of silicon. Phys. Rev. Lett., 56:632, 1986.
  • [100] J. Ferrante, J. R. Smith, and J. H. Rose. Diatomic molecules and metallic adhesion, cohesion, and chemisorption: A single binding-energy relation. Phys. Rev. Lett., 50:1385, 1983.
  • [101] G. C. Abell. Empirical chemical pseudopotential theory of molecular and metallic bonding. Phys. Rev. B, 31:6184, 1985.
  • [102] H. W. Kuhn. The hungarian method for the assignment problem. Nav. Res. Logist. Quart., 2:83–97, 1955.
  • [103] F. Reinhard and H. Grubmüller. Estimation of absolute solvent and solvation shell entropies via permutation reduction. J. Chem. Phys., 126:014102, 2007.
  • [104] J. S. Smith, O. Isayev, and A. E. Roitberg. Ani-1, a data set of 20 million calculated off-equilibrium conformations for organic molecules. Sci. Data, 4:170193, 2017.
  • [105] K. T. Schütt, P. Kessel, M. Gastegger, K. A. Nicoli, A. Tkatchenko, and K.-R. Müller. Schnetpack: A deep learning toolbox for atomistic systems. J. Chem. Theory Comput., 15:448–455, 2019.
  • [106] V. Vapnik. The nature of statistical learning theory. Springer, 1995.
  • [107] K. Hansen, G. Montavon, F. Biegler, S. Fazli, M. Rupp, M. Scheffler, O. Anatole von Lilienfeld, A. Tkatchenko, and K.-R. Müller. Assessment and validation of machine learning methods for predicting molecular atomization energies. J. Chem. Theory Comput., 9:3404–3419, 2013.
  • [108] K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. Von Lilienfeld, K.-R. Müller, and A. Tkatchenko. Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space. J. Phys. Chem. Lett., 6:2326–2331, 2015.
  • [109] A. P. Bartók, R. Kondor, and G. Csányi. On representing chemical environments. Phys. Rev. B, 87:184115, 2013.
  • [110] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Adv. Neural Inf. Process. Syst., pages 3111–3119, 2013.
  • [111] M. L. Braun, J. M. Buhmann, and K. R. Müller. On relevant dimensions in kernel feature spaces. J. Mach. Learn. Res., 9:1875–1906, 2008.
  • [112] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1:541–551, 1989.
  • [113] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In J. B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524, pages 9–53. Springer LNCS, 1998.
  • [114] G. Perez-Hernandez, F. Paul, T. Giorgino, G. D Fabritiis, and Frank Noé. Identification of slow molecular order parameters for markov model construction. J. Chem. Phys., 139:015102, 2013.
  • [115] P. Koltai, G. Ciccotti, and Ch. Schütte. On metastability and markov state models for non-stationary molecular dynamics. J. Chem. Phys., 145:174103, 2016.
  • [116] P. Koltai, H. Wu, F. Noé, and C. Schütte. Optimal data-driven estimation of generalized markov state models for non-equilibrium dynamics. Computation, 6:22, 2018.
  • [117] Q. Li, F. Dietrich, E. M. Bollt, and I. G. Kevrekidis. Extended dynamic mode decomposition with dictionary learning: A data-driven adaptive spectral decomposition of the koopman operator. Chaos, 27:103111, 2017.
  • [118] F. Bießmann, F. C. Meinecke, A. Gretton, A. Rauch, G. Rainer, N. K. Logothetis, and K.-R. Müller. Temporal kernel cca and its application in multimodal neuronal data analysis. Machine Learning, 79:5–27, 2010.
  • [119] R. T. McGibbon and V. S. Pande. Variational cross-validation of slow dynamical modes in molecular kinetics. J. Chem. Phys., 142:124105, 2015.
  • [120] B. E. Husic, R. T. McGibbon, M. M. Sultan, and V. S. Pande. Optimized parameter selection reveals trends in markov state models for protein folding. J. Chem. Phys., 145:194103, 2016.
  • [121] M. K. Scherer, B. E. Husic, M. Hoffmann, F. Paul, H. Wu, and F. Noé. Variational selection of features for molecular kinetics. J. Chem. Phys., 150:194108, 2019.
  • [122] F. Noé, C. Schütte, E. Vanden-Eijnden, L. Reich, and T. R. Weikl. Constructing the full ensemble of folding pathways from short off-equilibrium simulations. Proc. Natl. Acad. Sci. USA, 106:19011–19016, 2009.
  • [123] P. Deuflhard and M. Weber.

    Robust perron cluster analysis in conformation dynamics.

    In M. Dellnitz, S. Kirkland, M. Neumann, and C. Schütte, editors, Linear Algebra Appl., volume 398C, pages 161–184. Elsevier, New York, 2005.
  • [124] S. Röblitz and M. Weber. Fuzzy spectral clustering by PCCA+: application to Markov state models and data classification. Adv. Data Anal. Classif., 7:147–179, 2013.
  • [125] S. Bengio L. Dinh, J. Sohl-Dickstein. Density estimation using real nvp. arXiv:1605.08803, 2016.
  • [126] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
  • [127] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv:1810.01367, 2018.
  • [128] H. E. Sauceda, S. Chmiela, I. Poltavsky, K.-R. Müller, and A. Tkatchenko. Molecular force fields with gradient-domain machine learning: Construction and application to dynamics of small molecules with coupled cluster forces. J. Chem. Phys, 150:114102, 2019.
  • [129] S. Chmiela, H. E. Sauceda, I. Poltavsky, K.-R. Müller, and A. Tkatchenko. sGDML: Constructing accurate and data efficient molecular force fields using machine learning. Comp. Phys. Comm., 240:38, 2019.
  • [130] T. Darden, L. Perera, L. Li, and L. Pedersen. New tricks for modelers from the crystallography toolkit: the particle mesh Ewald algorithm and its use in nucleic acid simulations. Structure, 7:55–60, 1999.
  • [131] O. T. Unke and M. Meuwli.

    Physnet: A neural network for predicting energies, forces, dipole moments and partial charges.

    J. Chem. Theory Comput., 15:3678–3693, 2019.
  • [132] B. Nebgen, N. Lubbers, J. S. Smith, A. E. Sifain, A. Lokhov, O. Isayev, A. E. Roitberg, K. Barros, and S. Tretiak. Transferable dynamic molecular charge assignment using deep neural networks. J. Chem. Theo. Comput., 14:4687–4698, 2018.
  • [133] A. Ambrosetti, N. Ferri, R. A. DiStasio Jr., and A. Tkatchenko. Wavelike charge density fluctuations and van der waals interactions at the nanoscale. Science, 351:1171–1176, 2016.
  • [134] J. Hermann, A. R. DiStasio Jr., and A. Tkatchenko. First-principles models for van der waals interactions in molecules and materials: Concepts, theory, and applications. Chem. Rev., 117:4714–4758, 2017.
  • [135] M. Stoehr aand T. Van Voorhis and A. Tkatchenko. Theory and practice of modeling van der waals interactions in electronic-structure calculations. Chem. Soc. Rev., 48:4118–4154, 2019.
  • [136] J. W. Mullinax and W. G. Noid. Extended ensemble approach for deriving transferable coarse-grained potentials. J. Chem. Phys., 131:104110, 2009.
  • [137] I. F. Thorpe, D. P. Goldenberg, and G. A. Voth. Exploration of transferability in multiscale coarse-grained peptide models. J. Phys. Chem. B, 115(41):11911–11926, 2011.
  • [138] A. Davtyan, G. A. Voth, and H. C. Andersen. Dynamic force matching: Construction of dynamic coarse-grained models with realistic short time dynamics and accurate long time dynamics. J. Chem. Phys., 145:224107, 2016.
  • [139] F. Nüske, L. Boninsegna, and C. Clementi. Coarse-graining molecular systems by spectral matching. J. Chem. Phys., 151:044116, 2019.
  • [140] T. Bereau and J. F. Rudzinski. Accurate structure-based coarse graining leads to consistent barrier-crossing dynamics. Phys. Rev. Lett., 121:256002, 2018.
  • [141] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10:e0130140, 2015.
  • [142] G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digit. Signal Process., 73:1–15, 2018.
  • [143] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller. Unmasking clever hans predictors and assessing what machines really learn. Nat. Commun., 10:1096, 2019.
  • [144] W. Samek, G. Montavon, A. Vedaldi, L.-K. Hansen, and K.-R. Müller. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. volume 11700 of

    Lecture Notes Artificial Intelligence

    . Springer, 2019.