1 Introduction
In 1929, Paul Dirac stated:
“The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble. It therefore becomes desirable that approximate practical methods of applying quantum mechanics should be developed, which can lead to an explanation of the main features of complex atomic systems without too much computation.” [1]
Ninety years later, this quote is still state of the art. However, in the last decade, new tools from the rapidly developing field of machine learning (ML) have started to make a significant impact on the development of approximate methods for complex atomic systems, bypassing the direct solution of “equations much too complicated to be soluble”.
ML aims at extracting complex patterns and relationships from large data sets in order to predict specific properties of the data. A classical application of machine learning is the problem of image classification, where descriptive labels need to be associated with images that are presented as sets of pixels. The “machine” is trained on a large number of examples and “learns” how to classify new images. The underlying idea is that there exists a complex relationship between the input (the pixels) and the output (the labels) that is unknown in its explicit form but can be inferred by a suitable algorithm. Such an operating principle can clearly be very useful in the description of atomic and molecular systems as well. We know that atomistic configurations dictate the chemical properties, and the machine can learn to associate the latter with the former without solving first-principles equations, provided it is presented with enough examples.
Although different machine learning tools are available and have been applied to molecular simulation (e.g., kernel methods [2]), here we mostly focus on the use of neural networks, now often used synonymously with the term “deep learning”. We assume the reader has basic knowledge of machine learning and we refer to the literature for introductions to statistical learning theory [3, 4] and deep learning [5, 6].

One of the first applications of machine learning in chemistry has been to extract classical potential energy surfaces from quantum mechanical (QM) calculations, in order to efficiently perform molecular dynamics (MD) simulations that can incorporate quantum effects. The seminal work of Behler and Parrinello in this direction [7] has opened the way to a now rapidly advancing area of research [8, 9, 10, 11, 12, 13, 14, 15]. In addition to atomistic force fields, it has recently been shown that, in the same spirit, effective molecular models at resolutions coarser than atomistic can be designed by ML [16, 17, 18]. The analysis and simulation of MD trajectories have also benefited from ML, for instance for the definition of optimal reaction coordinates [19, 20, 21, 22, 23, 24], the estimation of free energy surfaces [25, 26, 27, 22], the construction of Markov State Models [21, 23, 28], the enhancement of MD sampling by learning bias potentials [29, 30, 31, 32, 33], or the selection of starting configurations by active learning [34, 35, 36]. Finally, ML can be used to generate samples from the equilibrium distribution of a molecular system without performing MD at all, as proposed in the recently introduced Boltzmann Generators [37]. A selection of these topics will be reviewed and discussed in the following.

All these different aspects of molecular simulation have evolved independently so far. For instance, ML-generated force fields have mostly been developed and applied to small molecules and ordered solids, while the analysis of MD trajectories is mostly relevant for the simulation of large flexible molecules, like proteins. In order to really revolutionize the field, these tools and methods need to evolve, become more scalable and transferable, and converge into a complete pipeline for the simulation and analysis of molecular systems. There are still significant challenges towards this goal, as we will discuss in the following, but considering the rapid progress of the last few years, we envision that in the near future machine learning will have transformed the way molecular systems are studied in silico.
This review focuses on physics-based ML approaches for molecular simulation. ML is also having a big impact in other areas of chemistry without involving a physics-based model, for example by directly attempting to predict physicochemical or pharmaceutical properties [38, 39, 40, 41], or to design materials and molecules with certain desirable properties using generative learning [42, 43, 44, 45]. We refer the interested reader to other recent reviews on the subject [46, 47].
The review is organized as follows. We start by describing the most important machine learning problems and principles for molecular simulation (Sec. 2). A crucial aspect of the application of ML in molecular simulations is to incorporate physical constraints and we will discuss this for the most commonly used physical symmetries and invariances for molecular systems (Sec. 3). We will then provide examples of specific ML methods and applications for molecular simulation tasks, focusing on deep learning and the neural network architectures involved (Sec. 4). We conclude by outlining open challenges and pointing out possible approaches to their solution (Sec. 5).
2 Machine Learning Problems for Molecular Simulation
In this section we discuss how several of the open challenges in the simulation of molecular systems can be formulated as machine learning problems, and the recent efforts to address them.
2.1 Potential energy surfaces
Molecular dynamics (MD) and Markov chain Monte Carlo (MCMC) simulations employing classical force fields within the Born-Oppenheimer approximation constitute the cornerstone of contemporary atomistic modeling in chemistry, biology, and materials science. These methods perform importance sampling, i.e., in the long run they sample states $x$ from the molecular system’s equilibrium distribution, which has the general form:

$\mu(x) \propto e^{-u(x)}.$  (1)

The reduced potential $u(x)$ contains terms depending on the thermodynamic constraints (e.g., fixed temperature, pressure, etc.). In the canonical ensemble (fixed number of particles, volume, and temperature), $u(x) = U(x)/(k_B T)$, where $U(x)$ is the potential energy and $k_B T$ is the thermal energy at temperature $T$.
However, the predictive power of these simulations is only as good as the underlying potential energy surface (PES). Hence, predictive simulations of properties and functions of molecular systems require an accurate description of the global PES, $U(x)$, where $x$ indicates the nuclear Cartesian coordinates. All many-body interactions between electrons are encoded in the function $U(x)$. Although $U(x)$ could be obtained on the fly using explicit ab initio calculations, more efficient approaches that can access long time scales are required to understand relevant phenomena in large molecular systems. A plethora of classical mechanistic approximations to $U(x)$ exist, in which the parameters are typically fitted to a small set of ab initio calculations or experimental data [48, 49, 50, 51]. Unfortunately, these classical approximations often suffer from a lack of transferability and can yield accurate results only close to the conditions (geometries) they have been fitted to.
Alternatively, sophisticated ML approaches that can accurately reproduce the global potential energy surface (PES) for elemental materials [7, 52, 9, 53, 14, 54] and small molecules [55, 56, 11, 10, 57, 54, 58] have been developed recently (see Fig. 1). Such methods learn a model of the PES, $U(x; \theta)$, where the parameters $\theta$
are optimized either by energy matching and/or by force matching. In energy matching, an ML model is trained to minimize the loss function:

$L(\theta) = \sum_i \left\| U(x_i; \theta) - U_i \right\|^2,$  (2)

where the $U_i$ are energy values obtained by QM calculations at specific configurations $x_i$ (see Fig. 1). For force matching, we compute the QM forces at the specified configurations,

$f_i = -\nabla U(x_i),$  (3)

and minimize the loss function:

$L(\theta) = \sum_i \left\| f_i + \nabla U(x_i; \theta) \right\|^2.$  (4)
The existing ML PES models are based either on nonlinear kernel learning [9, 53, 57, 10] or on (deep) neural networks [52, 14, 58]. Specific neural network architectures will be discussed in Secs. 4.1 and 4.2. Both approaches have advantages and limitations. The advantage of kernel methods is their convex nature, yielding a unique solution whose behavior can be controlled outside of the training data. Neural networks are non-convex ML models and are often harder to interpret and to generalize outside of the training manifold.
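As a toy illustration of the two loss functions above, the following sketch fits a polynomial model (a stand-in for a kernel or neural-network PES model; the one-dimensional double-well "reference" data are invented for the example) and evaluates both the energy matching loss of Eq. (2) and the force matching loss of Eq. (4):

```python
import numpy as np

# Toy illustration of energy and force matching (Eqs. 2-4) on a 1D
# double-well "PES". The polynomial model is a stand-in for a kernel
# or neural-network model; no specific published method is implied.
x = np.linspace(-1.5, 1.5, 50)                 # "configurations"
E_ref = x**4 - x**2                            # reference "QM" energies
F_ref = -(4 * x**3 - 2 * x)                    # reference forces f = -dU/dx

# Energy matching: least-squares fit of polynomial coefficients theta
theta = np.polyfit(x, E_ref, deg=4)
E_model = np.polyval(theta, x)
F_model = -np.polyval(np.polyder(theta), x)    # force from the model gradient

loss_energy = np.mean((E_model - E_ref) ** 2)  # Eq. (2), per point
loss_force = np.mean((F_model - F_ref) ** 2)   # Eq. (4), per point
print(loss_energy, loss_force)
```

Because the model class here contains the reference function exactly, both losses can be driven to nearly zero; for real QM data one would instead monitor them on held-out configurations.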
2.2 Free energy surfaces
Given the Cartesian coordinates of a molecule with $N$ atoms, $x \in \mathbb{R}^{3N}$, we define the collective coordinates $y$ by the encoding:

$y = E(x),$  (5)

where $y \in \mathbb{R}^m$ and $m < 3N$ is a small number. In machine learning terms, the coordinates $y$ define a latent space. In molecular simulations, the collective coordinates are often defined to describe the slowest processes of the system [59]. In general, the mapping (or encoding) $E$ can be highly nonlinear. If the energy function associated with the atomistic representation is known, an important quantity to compute, for instance to connect with experimental measurements, is the free energy of the system as a function of the collective variables. The free energy is defined as:

$F(y) = -k_B T \ln p(y),$  (6)

where $p(y)$ is the marginal distribution on $y$ of the equilibrium distribution $\mu(x)$ given by Eq. (1):

$p(y) = \int \mu(x)\, \delta\left(y - E(x)\right)\, dx.$  (7)
The integral in Eq. (7) is in practice impossible to compute analytically for high-dimensional systems, and several methods have been developed for its numerical estimation by enhancing the sampling of the equilibrium distribution in molecular dynamics simulations of the system.
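On a one-dimensional collective variable, Eqs. (6–7) can be illustrated directly by replacing the integral with a histogram estimate of the marginal density. The sketch below uses synthetic Gaussian samples, so the exact free energy is harmonic and known; all numbers are illustrative:

```python
import numpy as np

# Minimal sketch of Eqs. (6-7): estimate the free energy F(y) along a
# 1D collective variable from samples of the equilibrium distribution,
# using a histogram as a (crude) estimator of the marginal p(y).
rng = np.random.default_rng(1)
kT = 1.0
y_samples = rng.normal(0.0, 1.0, size=200000)  # toy samples, p(y) ~ N(0,1)

hist, edges = np.histogram(y_samples, bins=41, range=(-2, 2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
F = -kT * np.log(hist)                         # Eq. (6), up to a constant
F -= F.min()                                   # fix the arbitrary offset

# For a Gaussian marginal the exact free energy is harmonic: F(y) = y^2 / 2
F_exact = 0.5 * centers**2
print(np.max(np.abs(F - F_exact)))
```

The remaining deviation is the statistical and binning error of the histogram estimator; the enhanced-sampling and regression methods discussed in the text exist precisely because such direct estimates become infeasible in higher dimensions.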
The definition of the free energy can also be formulated as a learning problem: the aim is to optimize the parameters $\theta$ of a free energy model $F(y; \theta)$ such that Eqs. (6–7) are satisfied to a good approximation. Fulfilling these equations is usually referred to as enforcing thermodynamic consistency.
Using methods that estimate the free energy $F_i$ (or its gradient $g_i = \nabla F(y_i)$) at a given set of points $\{y_i\}$ in the collective variable space, one can use the free energy loss:

$L(\theta) = \sum_i \left\| F(y_i; \theta) - F_i \right\|^2,$

or the free energy gradient loss:

$L(\theta) = \sum_i \left\| \nabla F(y_i; \theta) - g_i \right\|^2,$

with these free energy estimates in order to reconstruct the entire surface $F(y)$. Both kernel regression [25] and neural networks [27] have been used for this purpose. Machine learning has also been used in combination with enhanced sampling methods to learn the free energy surface on the fly [26, 60, 61, 22, 62, 63, 64, 65].
An alternative way to learn the free energy (6) is by using force matching [66, 67]. It has been demonstrated [68] that, given the forces $f(x_t)$, $t = 1, \dots, T$, of the atomistic system collected along a molecular dynamics trajectory, the gradient of a free energy model $F(y; \theta)$ that best satisfies Eq. (6) also minimizes the force matching loss:

$L(\theta) = \frac{1}{T} \sum_{t=1}^{T} \left\| \xi(x_t) + \nabla F\left(E(x_t); \theta\right) \right\|^2,$  (8)

where the term $\xi(x)$ is the local mean force, that is, the projection of the atomistic force on the collective variable space through the mapping $E$.
In practice, the estimator (8) is very noisy: because of the dimensionality reduction from $x$ to $y$, multiple realizations of the projected force can be associated with the same value of the collective coordinates, and the minimum of the loss function (8) cannot go to zero. By invoking statistical estimator theory, it can be shown that this loss can be decomposed into bias, variance, and noise terms [18].
2.3 Coarse-graining
The use of coarse-grained models of complex molecular systems, such as proteins, presents an attractive alternative to atomistic models that are very expensive to simulate [69, 67].
The design of a coarse-grained model for a system with $N$ atoms into a reduced representation with $n$ effective “beads” starts with the definition of a mapping similar to Eq. (5), where now $y$ are the coordinates of the coarse-grained beads. In this case, the mapping is usually a linear function, as in general the beads are defined as a subset or a linear combination of sets of atoms in the original system, in a way that allows to keep some information about the geometry of the molecule. For instance, in protein systems, a coarse-grained mapping can be defined by replacing all the atoms in a residue by a bead centered on the corresponding Cα atom. There is at present no general theory to define the optimal coarse-graining mapping for a specific system. A few methods have been proposed in this direction, by optimizing the mapping to preserve certain properties of the original system. Examples include the definition of a system’s dynamical assembly units [70] or the use of an autoencoder to minimize the reconstruction error [71].

Once the coarse-graining mapping is given, several strategies exist to define the model energy function, either to match experimental observables (top-down) or to reproduce specific properties of the atomistic system (bottom-up) [67]. If one wants to define a coarse-grained model that is thermodynamically consistent with the original model (Eqs. 6–7), then the definition of the energy function associated with a coarse-grained molecular model can be seen as a special case of the free energy learning discussed in the previous section [66]. The effective energy of the coarse-grained model is the free energy defined by Eq. (6), where $y$ are now the coarse variables. Therefore, once the coarse-graining map is defined, the definition of the effective potential can also be seen as a learning problem. In particular, the force matching loss function given by Eq. (8) can be used to train the effective energy of the model from the atomistic forces. In the case of a linear mapping with a crisp assignment of atoms into beads, a matrix $M$ exists such that $y = Mx$, and the expression for the loss, Eq. (8), reduces to:
$L(\theta) = \frac{1}{T} \sum_{t=1}^{T} \left\| F(x_t) + \nabla U_{CG}\left(M x_t; \theta\right) \right\|^2,$  (9)

where, for each bead $I$, the component $F_I(x)$ is the sum of the atomistic forces of all the atoms mapping to that bead. This loss function has been used to design coarse-grained force fields for different systems, both with kernel methods [16] and with deep neural networks [17, 18].
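The force-matching loss of Eq. (9) can be sketched in a few lines. In the toy example below, a crisp linear map assigns atoms to beads, bead forces are the sums of the atomistic forces of their atoms, and a one-parameter harmonic "CG potential" is fitted to them; all coordinates, forces, and the model form are invented for illustration:

```python
import numpy as np

# Toy sketch of the coarse-graining force-matching loss, Eq. (9).
# A crisp map assigns 6 "atoms" to 2 beads; the bead force is the sum
# of the atomistic forces of its atoms; the CG model is a one-parameter
# harmonic potential U(y; k) = k y^2 / 2, so -dU/dy = -k y.
rng = np.random.default_rng(2)
n_frames = 500
assign = np.array([0, 0, 0, 1, 1, 1])          # crisp atom -> bead map

x = rng.normal(size=(n_frames, 6))             # 1D toy coordinates
f = -2.0 * x + 0.1 * rng.normal(size=x.shape)  # noisy atomistic forces

# CG coordinates: bead = mean of its atoms; CG forces: sum of atom forces
y = np.stack([x[:, assign == b].mean(axis=1) for b in range(2)], axis=1)
F = np.stack([f[:, assign == b].sum(axis=1) for b in range(2)], axis=1)

def loss(k):                                   # Eq. (9) with grad U = k y
    return np.mean((F + k * y) ** 2)

# Closed-form least-squares optimum for this one-parameter model
k_opt = -np.sum(F * y) / np.sum(y * y)
print(k_opt, loss(k_opt), loss(0.0))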
2.4 Kinetics
Kinetics are the slow part of the dynamics. Due to the stochastic components in the MD integrator, for any trajectory emerging from a configuration $x_t$ at time $t$, there is a probability distribution of finding the molecule in configuration $x_{t+\tau}$ at a later time $t+\tau$:

$x_{t+\tau} \sim p_\tau\left(x_t \to x_{t+\tau}\right).$  (10)
We can express the transition density (10) as the action of the Markov propagator in continuous space, and approximate it by its spectral decomposition [72, 73]:

$p_\tau\left(x_t \to x_{t+\tau}\right) \approx \sum_{k=1}^{n} \sigma_k\, \phi_k(x_t)\, \psi_k(x_{t+\tau}).$  (11)

The spectral decomposition can be read as follows: the evolution of the probability density can be approximated as the superposition of functions $\psi_k$. A second set of functions, $\phi_k$, is required in order to compute the amplitudes of these functions.
In general, Eq. (11) is a singular value decomposition with left and right singular functions $\phi_k$ and $\psi_k$ and true singular values $\sigma_k$ [73]. The approximation then is a low-rank decomposition in which the small singular values are discarded. For the special case that the dynamics are in thermal equilibrium, Eq. (21) holds, and Eq. (11) is an eigenvalue decomposition with the choices:

$\sigma_k = \lambda_k(\tau), \qquad \phi_k = \psi_k.$  (12)

Hence Eq. (11) simplifies: we only need one set of functions, the eigenfunctions $\psi_k$. The true eigenvalues $\lambda_k(\tau)$ are real-valued and decay exponentially in time, $\lambda_k(\tau) = e^{-\tau/t_k}$, with the characteristic relaxation times $t_k$ that are directly linked to kinetic experimental observables [74, 75]. The approximation in Eq. (11) is due to truncating all terms with relaxation times shorter than those of the $n$ retained processes.

A quite established approach is to learn molecular kinetics (Eq. 11) from a given trajectory dataset. In order to obtain a low-dimensional model of the molecular kinetics that is easy to interpret and analyze, this usually involves two steps: (i) finding a low-dimensional latent space representation of the collective variables, $y_t = E(x_t)$, using the encoder $E$, and (ii) learning a dynamical propagator $P(\tau)$ in that space:
$\mathbb{E}\left[y_{t+\tau}^\top\right] \approx \mathbb{E}\left[y_t^\top\right] P(\tau).$  (13)
A common approach in MD, but also in other fields such as dynamical systems and fluid mechanics, is to seek a pair $(E, P(\tau))$ such that $P(\tau)$ is a small matrix that propagates the latent state vectors in a Markovian (memoryless) fashion [76, 77, 78, 79, 74, 80, 81, 82, 83, 84, 73]. This is motivated by the spectral decomposition of the dynamics (Eq. 11): if $\tau$ is large enough to filter out the fast processes, a few functions are sufficient to describe the kinetics of the system, and if $E$ maps to the space spanned by these functions, $P(\tau)$ can be a linear, Markovian model.

If no specific constraints are imposed on $P(\tau)$, the minimum regression error of Eq. (13), the variational approach of Markov processes (VAMP) [85, 73], and maximum likelihood all lead to the same estimator [86]:

$P(\tau) = C_{00}^{-1} C_{0\tau},$  (14)

using the latent space covariance matrices

$C_{00} = \mathbb{E}\left[y_t y_t^\top\right], \qquad C_{0\tau} = \mathbb{E}\left[y_t y_{t+\tau}^\top\right].$

If $E$ performs a one-hot encoding that indicates which “state” the system is in, then the pair $(E, P(\tau))$ is called a Markov state model (MSM [76, 77, 78, 79, 74, 80]), and $P(\tau)$ is a matrix of conditional probabilities to be in a state $j$ at time $t+\tau$ given that the system was in a state $i$ at time $t$.

Learning the embedding $E$ is more difficult than learning $P(\tau)$, as optimizing $E$ by maximum likelihood or minimal regression error in the latent space leads to a collapse of $E$ onto trivial, uninteresting solutions [86]. This problem can be avoided by following a variational approach to approximating the leading terms of the spectral decomposition (11) [85, 73]. The Variational Approach for Conformation dynamics (VAC) [85] states that, for dynamics obeying detailed balance (21), the eigenvalues $\hat{\lambda}_i$ of a propagator matrix $P(\tau)$ estimated via any encoding $E$ are, in the statistical limit, lower bounds of the true eigenvalues $\lambda_i$. The VAMP variational principle is more general, as it does not require detailed balance (21), and applies to the singular values:

$\hat{\sigma}_i \le \sigma_i.$

As VAMP is the more general principle, we can use it to define the loss function for estimating molecular kinetics models, e.g., in its VAMP-2 form:

$L(\theta) = -\sum_{i=1}^{n} \hat{\sigma}_i^2.$  (15)
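For the MSM special case, the estimator of Eq. (14) reduces to counting transitions between discrete states. The toy sketch below simulates a two-state Markov chain with an invented ground-truth transition matrix and recovers it from the one-hot encoded trajectory:

```python
import numpy as np

# Minimal Markov state model estimate, Eq. (14): a one-hot encoding of a
# discrete trajectory gives covariance matrices whose product
# C00^{-1} C0t is the matrix of conditional transition probabilities.
rng = np.random.default_rng(3)
P_true = np.array([[0.9, 0.1],
                   [0.2, 0.8]])                # invented ground truth

# Simulate a two-state Markov chain at lag time tau = 1 step
traj = [0]
for _ in range(50000):
    traj.append(rng.choice(2, p=P_true[traj[-1]]))
traj = np.array(traj)

chi = np.eye(2)[traj]                          # one-hot encoding y_t
y0, yt = chi[:-1], chi[1:]
C00 = y0.T @ y0 / len(y0)                      # E[y_t y_t^T]
C0t = y0.T @ yt / len(y0)                      # E[y_t y_{t+tau}^T]
P_est = np.linalg.inv(C00) @ C0t               # Eq. (14)
print(P_est)
```

Note that the estimated matrix is automatically stochastic (rows sum to one), which is the probability-conservation property discussed in Sec. 3.4.4.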
2.5 Sampling and Thermodynamics
MD time steps are on the order of one femtosecond ($10^{-15}$ s), while configuration changes governing molecular function are often rare events that may be many orders of magnitude slower. Even when evaluating the potential and the forces it generates is fast, simulating single protein folding and unfolding events by direct MD simulation may take years to centuries on a supercomputer. To avoid this sampling problem, machine learning methods can be employed to learn to generate equilibrium samples from $\mu(x)$ more efficiently, or even to generate statistically independent samples in “one shot” (Sec. 4.5).
Learning to sample probability distributions is the realm of generative learning [87, 88, 89, 90]. In the past few years, directed generative networks, such as variational autoencoders (VAEs) [88], generative adversarial networks (GANs) [89], and flows [91, 90], have seen a particular surge of interest. Such networks are trained to transform samples from an easy-to-sample probability distribution, such as Gaussian noise, into realistic samples of the objects of interest. These methods have been used to draw photorealistic images [92, 93], to generate speech or music audio [94], and to generate chemical structures in order to design molecules or materials with certain properties [42, 43, 44, 45].
If the aim is to learn a probability distribution from sampling data (density estimation) or learn to efficiently sample from a given probability distribution (Boltzmann generation [37]), one typically faces the challenge of matching the probability distribution of the trained model with a reference.
Matching probability distributions can be achieved by minimizing probability distances. The most commonly used probability distance is the Kullback-Leibler (KL) divergence, also called the relative entropy, between two distributions $q$ and $p$:

$KL(q \| p) = \mathbb{E}_{x \sim q}\left[\log q(x) - \log p(x)\right].$  (16)
In Boltzmann generation [37], we aim to efficiently sample the equilibrium distribution $\mu(x)$ of a many-body system defined by its energy function (Eq. 1). We can learn the parameters $\theta$ of a neural network that generates samples $x$ from a distribution $q_\theta(x)$ by making this generated distribution similar to the target equilibrium distribution. Choosing $q = q_\theta$ and $p = \mu$ in Eq. (16), we can minimize the KL divergence with the energy loss:

$L(\theta) = \mathbb{E}_{x \sim q_\theta}\left[\log q_\theta(x) + u(x)\right],$  (17)

which equals $KL(q_\theta \| \mu)$ up to an additive constant. As the network samples from an energy surface (Eq. 1), this loss is performing energy matching, similar to Sec. 2.1. In order to evaluate the loss (17), we not only need to be able to generate samples $x$ from the network, but also to be able to compute the probability $q_\theta(x)$ for every sample $x$. See Sec. 4.5 for an example.
The reverse case is density estimation. Suppose we have simulation data $\{x_i\}$, and we want to train the model distribution $p_\theta(x)$ to resemble the data distribution. We choose $q$ as the data distribution and $p = p_\theta$ in Eq. (16), and exploit the fact that $\mathbb{E}_{x \sim q}\left[\log q(x)\right]$ is a constant that does not depend on the parameters $\theta$. Then we can minimize the loss:

$L(\theta) = -\mathbb{E}_{x \sim q}\left[\log p_\theta(x)\right] \approx -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i).$  (18)

This loss is the negative log-likelihood that the model generates the observed samples, hence minimizing it corresponds to a maximum likelihood approach. Likelihood maximization is used abundantly in machine learning; in this review we discuss it in Sec. 4.5.
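A minimal instance of the negative log-likelihood loss of Eq. (18) is fitting the two parameters of a Gaussian model density to synthetic data; for this model the maximum-likelihood solution is available in closed form, so we can check that it beats an arbitrary initial guess:

```python
import numpy as np

# Sketch of density estimation via the negative log-likelihood loss,
# Eq. (18), with a Gaussian model p_theta(x) = N(mu, sigma^2). The data
# are synthetic; the true parameters (1.5, 0.7) are invented for the demo.
rng = np.random.default_rng(4)
data = rng.normal(1.5, 0.7, size=10000)

def nll(mu, sigma):                            # Eq. (18) for the Gaussian model
    z = (data - mu) / sigma
    return np.mean(0.5 * z**2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

mu_hat, sigma_hat = data.mean(), data.std()    # closed-form maximum likelihood
print(nll(mu_hat, sigma_hat), nll(0.0, 1.0))
```

For flexible models such as flows, the same loss is minimized by gradient descent instead of in closed form, but the objective is identical.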
3 Incorporating Physics into Machine Learning
3.1 Why incorporate physics?
Compared to classical ML problems such as image classification, we have a decisive advantage when working with molecular problems: we know a lot of physical principles that restrict the possible predictions of our machine to the physically meaningful ones.
Let us start with a simple example to illustrate this. We are interested in predicting the potential energy $U$ and the atom-wise forces of the diatomic oxygen molecule O$_2$, with positions $x_1, x_2$, in vacuum (without external forces). Independent of the details of our learning algorithm, physics tells us that a few general rules must hold:

The energy is invariant when translating or rotating the molecule. We can thus arbitrarily choose the absolute position and orientation of the molecule, and the energy becomes a function of the interatomic distance $d = \|x_2 - x_1\|$ only: $U(x_1, x_2) = U(d)$.

Energy conservation: the energy and the force of the molecule are related by $f = -\nabla U$. We can thus compute all six components of the force from the one-dimensional energy $U(d)$ via the chain rule: $f_i = -\frac{\partial U}{\partial d}\frac{\partial d}{\partial x_i}$.

Indistinguishability of identical particles: The energy is unchanged if we exchange the labels “1” and “2” of the identical oxygen atoms.
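These rules can be checked numerically. The sketch below uses a Morse potential with arbitrary placeholder parameters (not fitted O2 values) as the one-dimensional energy $U(d)$ and recovers the Cartesian forces by the chain rule:

```python
import numpy as np

# Numerical check of the rules above for a diatomic: the energy depends
# only on the interatomic distance d, and all six force components follow
# from the 1D derivative dU/dd via the chain rule f = -(dU/dd) (dd/dx).
# Morse parameters below are arbitrary placeholders, not fitted O2 values.
D, a, d0 = 1.0, 1.5, 1.2

def energy(x1, x2):
    d = np.linalg.norm(x2 - x1)
    return D * (1.0 - np.exp(-a * (d - d0))) ** 2   # U(d) only

def forces(x1, x2):
    r = x2 - x1
    d = np.linalg.norm(r)
    dU_dd = 2.0 * D * a * (1.0 - np.exp(-a * (d - d0))) * np.exp(-a * (d - d0))
    f2 = -dU_dd * r / d                             # f2 = -dU/dx2
    return -f2, f2                                  # Newton's third law

x1, x2 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.5, 0.2])
f1, f2 = forces(x1, x2)
t = np.array([3.0, -1.0, 2.0])                      # rigid translation
print(energy(x1, x2), energy(x1 + t, x2 + t))       # translation invariance
```

Note that the forces of the two atoms are exactly opposite, so the net force on the molecule vanishes, as it must in the absence of external forces.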
In machine learning, there are two principal approaches when dealing with such invariances or symmetries: (i) Data augmentation, and (ii) Building the invariances into the ML model.
3.2 Data augmentation
Data augmentation means we learn invariances “by heart” by artificially generating more training data, applying the known invariances to it. For example, given a training data point consisting of positions and energy/force labels, $(x, U, f)$, we can augment this data point with the translation invariance of energy and force by adding more training data points $(x + \Delta, U, f)$ with random displacements $\Delta$ applied to the whole molecule. Data augmentation makes the ML model more robust and helps it to predict the invariances approximately. It is an important ML tool, as it is easy to do, while for certain invariances it is conceptually difficult or computationally expensive to hardwire them into the ML model.
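The augmentation step described above amounts to a few lines of bookkeeping; the dataset shapes and displacement scale below are arbitrary choices for illustration:

```python
import numpy as np

# Sketch of data augmentation for translation invariance: each training
# point (x, E, f) is duplicated with random rigid displacements of the
# whole molecule, leaving the energy and force labels unchanged.
rng = np.random.default_rng(5)
x = rng.normal(size=(2, 3))                    # one configuration, 2 atoms in 3D
E_label = -1.3                                 # invented energy label
f_label = rng.normal(size=(2, 3))              # invented force labels

augmented = []
for _ in range(10):
    shift = rng.normal(scale=0.5, size=3)      # same shift for every atom
    augmented.append((x + shift, E_label, f_label))

print(len(augmented))
```

The labels are copied unchanged because translation leaves both energy and forces invariant; for rotations, the force labels would instead have to be rotated along with the positions (equivariance, Sec. 3.4).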
However, data augmentation is statistically inefficient, as additional training data are needed, and inaccurate, because a network that does not have translation invariance hardwired into it will never predict that the energy is exactly constant when translating the molecule. This inaccuracy may lead to unphysical, and potentially catastrophic, predictions when such an energy model is inserted into an MD integrator.
3.3 Building physical constraints into the ML model
The more accurate, statistically efficient, and also more elegant approach to incorporating physical constraints is to build them directly into the ML model. Doing so involves accounting for two related aspects that are both essential to make the learning problem efficient: (i) equivariances: the ML model should have the same invariances and symmetries as the modeled physics problem; (ii) parameter sharing: whenever there is an invariance or symmetry, this should be reflected by sharing model parameters in different parts of the network.
3.4 Invariance and Equivariance
If we can hardwire the physical symmetries into the ML structure, we can reduce the dimensionality of the problem. In the O$_2$ example above, the energy and force are one- and six-dimensional functions, respectively, defined for each point in the six-dimensional configuration space. However, when we use the invariances described above, we only have to learn a one-dimensional energy function over a one-dimensional space, and can compute the full force field from it. The learning problem has become much simpler because we now learn only in the manifold of physically meaningful solutions (Fig. 2a).
An important concept related to invariance is equivariance. A function is equivariant if it transforms in the same way as its argument. For example, the force is defined by the negative gradient of the energy, $f(x) = -\nabla U(x)$. If we rotate the molecule by applying a rotation $R$, the energy is invariant, but the force is equivariant, as it rotates in the same way as $x$ (Fig. 2b):

$f(Rx) = R\, f(x).$
Equivariances are closely linked to convolutions in machine learning. For example, standard convolutions are translation equivariant: each convolution kernel is a feature detector that is applied to each pixel neighborhood [95]. When that feature is present in the image, the convolved image will show a signal in the corresponding position (Fig. 2c). When translating the input, the convolved image translates in the same way (Fig. 2c).
We will briefly review below common invariances/equivariances useful for applications to molecular simulations, and discuss strategies to incorporate them in an ML algorithm.
3.4.1 Rototranslational invariance/equivariance
Physical quantities that only depend on the interactions of the atoms within a molecule should be invariant with respect to translations $t$ and rotations $R$. Examples include the potential and free energies of a molecule without an external field:

$U(Rx + t) = U(x).$

The force is equivariant to rotation, but invariant to translation (Fig. 2b, Sec. 3.4):

$f(Rx + t) = R\, f(x).$
Rototranslational invariance can be achieved by transforming the configuration $x$ into rototranslationally invariant features, such as intramolecular distances or angles. Equivariance of the force can then be achieved by computing the force explicitly in a network layer that computes the derivatives with respect to $x$, as is done, e.g., in SchNet [96] and CGnet [18] (Fig. 5a). Note that periodic systems, such as crystals and explicit-solvent boxes, have translational but not rotational invariances/equivariances.
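The feature-based route to rototranslational invariance can be demonstrated in a few lines: an arbitrary placeholder "model" that only sees the intramolecular distances is exactly invariant under a random rotation plus translation. The energy function below is invented; it stands in for a trained network acting on the features:

```python
import numpy as np

# Sketch of Sec. 3.4.1: an energy model whose input consists only of the
# intramolecular distances is invariant under any rotation R and
# translation t by construction. The "model" is an arbitrary smooth
# function of the distances, a placeholder for a trained network.
rng = np.random.default_rng(6)

def distances(x):                              # rototranslation-invariant features
    diff = x[:, None, :] - x[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    return d[np.triu_indices(len(x), k=1)]     # unique pair distances

def energy(x):                                 # placeholder model on the features
    return np.sum(np.exp(-distances(x)))

x = rng.normal(size=(4, 3))                    # 4 atoms in 3D
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
t = rng.normal(size=3)                         # random translation
print(energy(x), energy(x @ Q.T + t))
```

No augmentation is needed here: the invariance holds exactly, for any parameters the placeholder model might have.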
3.4.2 Permutational invariance/equivariance.
Physical quantities, such as quantum mechanical energies, are invariant if we exchange the labels of identical atoms, e.g., carbon atoms. As the number of possible permutations increases exponentially with the number of identical particles, trying to learn permutation invariance by data augmentation is hopeless. Permutation invariance can be built into the ML model by choosing a functional form to compute the quantity of interest that is permutation invariant; see [97] for the general conditions. Following the pioneering work of [98, 99, 100, 101], a specific choice that is common for networks that compute extensive quantities, such as potential energies, is to model them as a sum

$U(x) = \sum_{i=1}^{N} U_i(x),$  (19)

where $U_i$ is the contribution to the energy by the $i$-th atom in its chemical environment [7, 11, 58, 96]. In order to account for the multi-body character of quantum mechanics, $U_i$ must generally also be a multi-body function. Eq. (19) can be implemented by using separate networks for the individual contributions and adding up their results (Fig. 3). The force resulting from Eq. (19) is automatically permutation equivariant.
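The summation principle of Eq. (19) can be sketched directly: the total energy is a sum of per-atom contributions computed from each atom's (invariant) environment, so exchanging two identical atoms cannot change the result. The per-atom function below is an invented placeholder for a trained network:

```python
import numpy as np

# Sketch of Eq. (19): total energy as a sum of per-atom contributions
# U_i, each computed from atom i's chemical environment (here, simply
# the distances to all other atoms). The per-atom "network" is an
# arbitrary placeholder function, not a trained model.
rng = np.random.default_rng(7)

def atomic_energy(i, x):                       # U_i: atom i in its environment
    d = np.linalg.norm(x - x[i], axis=1)
    d = np.delete(d, i)                        # drop the zero self-distance
    return np.sum(np.exp(-d))                  # placeholder for a trained net

def total_energy(x):                           # Eq. (19): sum over atoms
    return sum(atomic_energy(i, x) for i in range(len(x)))

x = rng.normal(size=(5, 3))
x_swapped = x[[1, 0, 2, 3, 4]]                 # exchange labels of atoms 0 and 1
print(total_energy(x), total_energy(x_swapped))
```

The per-atom energies themselves are permutation equivariant (they follow the relabeling), while their sum is permutation invariant.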
Classical MD force fields define bonded interactions by assigning atoms to nodes in a bond graph, and thus exchanging individual atom pairs no longer preserves the energy. However, the energy is still invariant to the exchange of identical molecules, e.g., solvent molecules, and it is important to take this into account for coarse-graining [18] and for generating samples from the equilibrium density [37].
A simple alternative to building permutation invariance into the ML function is to map all training and test data to a reference permutation. This can be done efficiently by so-called bipartite graph matching methods, such as the Hungarian method [102], which are frequently used in recent learning models [103, 10, 37].
3.4.3 Energy conservation
Closed physical systems are energy-conserving, which implies that the force field is defined by the gradient of the energy (Eq. 3). When learning the potential energy, it can be a great advantage to use force information, because each atomic configuration is associated with only one energy label but $3N$ force labels, which may result in superior data efficiency when force labels are used during learning [57, 10]. If we use supervised learning for coarse-graining with thermodynamic consistency, we can only use forces, as no labels for the free energies are available (Sec. 2.3).

In these examples, we have force labels as training data. Using a network that directly predicts the forces would not guarantee that Eq. (3) holds. Therefore, when Eq. (3) holds, it is common to learn an energy and then compute the force in the network using a gradient layer [96, 18] (Fig. 5a). An alternative way to ensure Eq. (3) is gradient-domain machine learning [57].
3.4.4 Probability conservation and stochasticity
In statistical mechanics (both equilibrium and nonequilibrium), we are interested in the probability of events. An important principle is thus probability conservation, i.e., that the probabilities of all events sum to 1. A common approach to encode the probability of classes or events with a neural network is to use a SoftMax output layer, where the activation of each neuron $i$ can be defined as

$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}},$  (20)

where $j$ runs over all neurons in the layer. In this representation, $-x$ can be seen as a vector of energies giving rise to the Boltzmann probabilities $y$. The $y_i$ are nonnegative because of the exponential functions and sum up to 1.
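A numerically stable implementation of the SoftMax layer of Eq. (20) takes only a few lines; the input values below are arbitrary:

```python
import numpy as np

# SoftMax layer, Eq. (20): exponentials normalized to sum to one, so the
# outputs can be read as Boltzmann probabilities of the energies -x_i.
def softmax(x):
    e = np.exp(x - x.max())                    # shift for numerical stability
    return e / e.sum()                         # nonnegative, sums to 1

y = softmax(np.array([2.0, 1.0, -0.5]))
print(y, y.sum())
```

Subtracting the maximum before exponentiating leaves the result unchanged (numerator and denominator are scaled by the same factor) but prevents overflow for large inputs.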
In Markov state models (MSMs) of molecular kinetics, we would like to obtain a Markov transition matrix $P(\tau)$ which is stochastic, i.e., $p_{ij} \ge 0$ for all elements and $\sum_j p_{ij} = 1$ for all $i$. If the encoder $E$ uses a SoftMax output layer, then the estimator (14) will result in a transition matrix that conserves probability ($\sum_j p_{ij} = 1$ for all $i$). SoftMax can be exploited in VAMPnets to simultaneously learn an embedding from configurations to metastable states and a Markov transition matrix (Sec. 4.4) [21].
3.4.5 Detailed balance
Detailed balance connects thermodynamics and dynamics. In a dynamical system that evolves purely in thermal equilibrium, i.e., without applying external forces, the equilibrium distribution $\mu(x)$ and the transition probability $p_\tau(x \to y)$ are related by the following symmetry:

$\mu(x)\, p_\tau(x \to y) = \mu(y)\, p_\tau(y \to x).$  (21)
3.5 Parameter sharing and convolutions
A decisive advance in the performance of neural networks for classical computer vision problems, such as text or digit recognition, came with going from fully connected “dense” networks to convolutional neural networks (CNNs) [95]. Convolution layers are not only equivariant, and thus help with the detection of an object independent of its location (Fig. 2c); the real efficiency of CNNs is due to parameter sharing.

In a classical dense network, all neurons of neighboring layers are connected and have independent parameters stored in a weight matrix $W$. This leads to a problem with high-dimensional data. If we were to process even relatively small images and associate each pixel with a neuron, a single dense neural network layer would have a number of parameters in $W$ that scales as the square of the number of pixels. This would not only be demanding in terms of memory and computing time; a network with so many independent parameters would likely overfit and not be able to generalize to unknown data.

Convolutions massively reduce the number of independent parameters. A convolutional layer with a filter $w$ acting on a one-dimensional signal $x$ computes, before applying bias and nonlinearities,
$y_i = \sum_j w_j\, x_{i+j}.$  (22)
In terms of an image, a convolution applies the same filter to every pixel neighborhood; in other words, all pixel transformations share the same parameters. Additionally, the filters are usually much smaller than the signal (often only a few pixels wide for images), so each convolution has only few parameters. For molecules, we extend this idea to continuous convolutions that use particle positions instead of pixels (Sec. 4.2).
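Both properties, parameter sharing and translation equivariance, can be seen directly in a few lines implementing Eq. (22) for a one-dimensional signal; the signal and filter values are arbitrary:

```python
import numpy as np

# Sketch of Eq. (22): a 1D convolution applies the same small filter w at
# every position (parameter sharing); shifting the input shifts the
# output in the same way (translation equivariance).
def conv1d(x, w):
    k = len(w)
    return np.array([np.dot(w, x[i:i + k]) for i in range(len(x) - k + 1)])

x = np.zeros(20)
x[5] = 1.0                                     # a "feature" at position 5
w = np.array([1.0, 2.0, 1.0])                  # one 3-tap filter, shared everywhere

y = conv1d(x, w)
y_shifted = conv1d(np.roll(x, 4), w)           # shift the input by 4 positions
print(np.argmax(y), np.argmax(y_shifted))
```

However large the signal, the layer has only three parameters here; a dense layer mapping 20 inputs to 18 outputs would need 360.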
Besides the sheer reduction of parameters, parameter sharing, e.g. via convolution layers, is the keystone of transferability across chemical space. In convolutions of a greyscale image, we apply the same filters to every pixel, which implies translational equivariance, but also means "the same rules apply to all pixels". If we have multiple color channels, we have different channels in the filters as well, so the same color channels are treated the same way. In molecules we can apply the same idea to particle species. When learning energies from QM data, for example, every instance of a given chemical element should be treated the same, i.e. use the same convolution filters to sense its chemical environment. This treatment gives us a "building block" principle that allows us to train on one set of molecules and make predictions for new molecules.
4 Deep Learning Architectures for Molecular Simulation
In this section, we present specific methods and neural network architectures that have been proposed to tackle the machine learning problems discussed in Sec. 2.
4.1 Behler-Parrinello, ANI, Deep Potential net
Behler-Parrinello networks are among the first applications of machine learning in the molecular sciences [7]. They aim at learning and predicting potential energy surfaces from QM data and combine all of the relevant physical symmetries and parameter sharing for this problem (Sec. 3).
First, the molecular coordinates are mapped to roto-translationally invariant features for each atom $i$ (Fig. 3a). In this step, the distances of neighboring atoms of a certain type, as well as the angles between two neighbors of certain types, are mapped to a fixed set of correlation functions that describes the chemical environment of atom $i$. These features are the input to a dense neural network which outputs one number, the energy $E_i$ of atom $i$ in its environment. By design of the input feature functions, this energy is roto-translationally invariant. Parameters are shared between equivalent atoms, e.g., all carbon atoms use the same network parameters to compute their atomic energies, but since their chemical environments differ, their energies will differ. In a second step, the atomic energies are summed over all atoms of the molecule, $E = \sum_i E_i$ (Fig. 3b). This second step, combined with parameter sharing, achieves permutation invariance, as explained in Sec. 3.4.2. Transferability is achieved due to parameter sharing, but also because the summation principle allows the network to grow or shrink to molecules of any size, including sizes that were never seen in the training data.
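The two-step construction can be sketched as follows. This is a toy stand-in, not the actual Behler-Parrinello architecture: sorted interatomic distances replace proper symmetry functions, and the shared atomic network has random, untrained weights:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for symmetry functions: a roto-translationally invariant
# descriptor of atom i (here simply its sorted distances to all other atoms).
def descriptors(pos):
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, 1:]           # drop the zero self-distance

# One small dense "atomic network", shared by all atoms of the same element
# (a single element here for simplicity; weights are illustrative).
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2 = rng.normal(size=8)

def atomic_energy(g):
    return float(np.tanh(g @ W1 + b1) @ W2)    # E_i for one atom

def total_energy(pos):
    return sum(atomic_energy(g) for g in descriptors(pos))  # E = sum_i E_i

pos = rng.normal(size=(5, 3))                  # 5 atoms in 3-D

# Permutation invariance: reordering identical atoms leaves E unchanged.
perm = rng.permutation(5)
assert np.isclose(total_energy(pos), total_energy(pos[perm]))

# Translational invariance: shifting all atoms leaves E unchanged.
assert np.isclose(total_energy(pos), total_energy(pos + 1.5))
```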
A related approach is Deep Potential net [58], where each atom is treated in a local coordinate frame that has the rotation and translation degrees of freedom removed.
Behler-Parrinello networks are traditionally trained by energy matching (Sec. 2.1) [7, 52], but can be trained with force matching if a gradient layer is added to compute the conservative force (Sec. 2.1, Eq. 3). The Behler-Parrinello method has been further developed in the ANI network, e.g., by extending to more advanced functions involving two neighbors [11]. While Behler-Parrinello networks have mainly been used to make predictions of the same molecular system in order to run MD simulations unaffordable by direct ab initio QM MD [52], ANI has been trained on DFT and coupled-cluster data across a large chemical space [11, 104, 12].
4.2 Deep Tensor Neural Nets, SchNet and continuous convolutions
One of the first deep learning architectures to learn to represent molecules or materials is the family of Deep Tensor Neural Networks (DTNN) [14], with its recent addition SchNet [54, 105]. While in kernel-based learning methods [106, 2] chemical compounds are compared in terms of pre-specified kernel functions [8, 107, 108, 109], DTNN and its extension SchNet learn a multi-scale representation of the properties of molecules or materials from large data sets. DTNNs were inspired by the language processing approach word2vec [110], where the role of a word within its grammatical or semantic context is learned and encoded in a parameter vector. Likewise, DTNNs learn a representation vector for each atom within its chemical environment (Fig. 4b left). DTNN's tensor construction algorithm then iteratively learns higher-order representations by first interacting with all pairwise neighbors, e.g. extracting information contained in the bond structure (Fig. 4b middle). By stacking such interaction layers deep, DTNNs can represent the structure and statistics of multi-body interactions in deeper layers. As DTNNs are trained end-to-end to predict certain quantum mechanical quantities, such as potential energies, they learn the representation that is relevant for the task of predicting these quantities [111].
SchNet [54] uses a deep convolutional neural network (CNN) [112, 113]. Classically, CNNs were developed for computer vision using pixelated images, and hence use discrete convolution filters. However, the particle positions of molecules cannot be discretized on a grid, as quantum mechanical properties such as the energy are highly sensitive to small position changes, such as the stretching of a covalent bond. For this reason, SchNet introduced continuous convolutions [54, 96], which are represented by filter-generating neural networks that map the roto-translationally invariant interatomic distances to the filter values used in the convolution (Fig. 4b right).
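A minimal sketch of a continuous convolution (a toy radial-basis filter-generating model with random weights, not SchNet's actual layers): since the generated filters depend only on interatomic distances, the output is invariant under rigid rotations and translations of the structure.

```python
import numpy as np

rng = np.random.default_rng(4)

# Filter-generating network: maps an interatomic distance to filter values.
# Here a linear map on Gaussian radial basis functions (illustrative).
centers = np.linspace(0.0, 5.0, 16)
Wf = rng.normal(size=(16, 8))                  # 8 feature channels

def filter_net(r):
    rbf = np.exp(-4.0 * (r - centers) ** 2)    # smooth in r: no grid needed
    return rbf @ Wf                            # filter values W(r_ij)

def continuous_conv(pos, features):
    """x_i' = sum_{j != i} x_j * W(||r_i - r_j||) (element-wise product)."""
    out = np.zeros_like(features)
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i != j:
                r = np.linalg.norm(pos[i] - pos[j])
                out[i] += features[j] * filter_net(r)
    return out

pos = rng.normal(size=(6, 3))
x = rng.normal(size=(6, 8))

# Filters depend only on distances, so the output is invariant under a
# rigid rotation plus translation of the whole structure.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
assert np.allclose(continuous_conv(pos, x), continuous_conv(pos @ R.T + 2.0, x))
```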
DTNN and SchNet have both reached highly competitive prediction quality, both across chemical compound space and across configuration space, making it possible to simulate molecular dynamics. In addition to their prediction quality, their scalability to large data sets [105] and their ability to extract novel chemical insights by means of their learnt representation make the DTNN family an increasingly popular research tool.
4.3 Coarse-graining: CGnets
As mentioned in Sec. 2.3, machine learning has been used to define coarse-grained models for molecular systems. Both kernel methods [16] and deep neural networks [17, 18] have been designed to minimize the force-matching loss, Eq. (9), for given coarse-graining mappings for specific systems.
In both cases, it has been shown that the incorporation of physical constraints is crucial to the success of the model. The training data are obtained by means of atomistic molecular dynamics simulations, and regions of the configurational space that are physically forbidden, such as configurations with broken covalent bonds or overlapping atoms, are not sampled and not included in the training. Without additional constraints, the machine cannot make predictions far away from the training data, and will thus not reliably predict that the energy should diverge when approaching physically forbidden regions.
Excluding high-energy states such as broken bonds or colliding atoms is different from enforcing physical symmetries as described in Sec. 3.4. Rather, it is about enforcing the correct asymptotic behavior of the energy when going towards an unphysical limit. CGnets proposed to achieve this by learning the difference to a simple prior energy that was defined to have the correct asymptotic behavior [18] (Fig. 5a). The exact form of this prior energy is not essential for success, as the CGnet can correct the prior energy where training data is available. In [18], the prior energy consisted of harmonic terms for bonds and angles of coarse-grained particles, whose equilibrium values and force constants were obtained with Boltzmann inversion, as well as excluded volume terms of the form $(\sigma/r)^c$, where $r$ is the interparticle distance and $\sigma$, $c$ are hyperparameters.
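Schematically, the decomposition looks as follows; the bounded toy function stands in for the trained network, and all hyperparameter values are merely illustrative:

```python
import numpy as np

# Illustrative CGnet-style energy decomposition: a prior with the correct
# asymptotics plus a learnable correction (here a fixed toy function).
def prior_energy(r, r0=1.0, k=100.0, sigma=0.5, c=6):
    harmonic = 0.5 * k * (r - r0) ** 2      # bonded term (Boltzmann-inverted)
    excluded = (sigma / r) ** c             # excluded volume, diverges as r -> 0
    return harmonic + excluded

def network_correction(r):
    return np.sin(3.0 * r)                  # stands in for the trained net;
                                            # bounded where data exists

def cg_energy(r):
    return prior_energy(r) + network_correction(r)

# The total energy inherits the prior's divergence at unphysical distances,
# even though the network term is bounded there.
assert cg_energy(0.01) > 1e6
assert abs(network_correction(0.01)) <= 1.0
```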
As in Behler-Parrinello networks and SchNet, CGnet predicts a roto-translationally invariant energy, as the first layer transforms the Cartesian coordinates into internal coordinates such as distances and angles (Fig. 5a). Furthermore, CGnet predicts a conservative and rotation-equivariant force field, as the gradient of the total free energy with respect to the input configuration is computed self-consistently by the network (see Fig. 5a). The network is trained by minimizing the force-matching loss of this prediction (Eq. 8).
Fig. 5b,c shows an application of CGnets to the coarse-graining of the mini-protein Chignolin, in which all solvent molecules are coarse-grained away and the atoms of each residue are mapped to the corresponding Cα atom (Fig. 5b). MD simulations performed with the force field predicted by the CGnet yield a free energy surface that is quantitatively similar to the free energy surface of the all-atom simulations, and resolve the same metastable states (folded, unfolded, and misfolded). In contrast, a spline-based coarse-grained model where only two-body terms are included in the energy function cannot reproduce the all-atom free energy surface, and does not even predict that the folded and unfolded states are separated metastable states. These results clearly illustrate the importance of multi-body interactions in the coarse-grained energy, for example surface or volume terms that can describe implicit solvation. While the spline model can be dramatically improved by adding suitable terms to the list of features, this is not necessary when using a deep neural network, which automatically learns the required multi-body terms. Similar conclusions can be obtained by using Gaussian Approximation Potentials as the machine learning model to capture multi-body terms in coarse-grained energy functions [16].
4.4 Kinetics: VAMPnets
VAMPnets [21] were introduced to replace the complicated and error-prone approach of constructing Markov state models by (i) searching for optimal features, (ii) combining them into a low-dimensional representation $\mathbf{y}$, e.g., via TICA [114], (iii) clustering $\mathbf{y}$, (iv) estimating the transition matrix $\mathbf{P}(\tau)$, and (v) coarse-graining $\mathbf{P}(\tau)$. VAMPnets instead use a single end-to-end learning approach in which all of these steps are replaced by a deep neural network. This is possible because, with the VAC and VAMP variational principles (Sec. 2.4, [85, 73]), loss functions are available that are suitable to train the embedding and the propagator simultaneously (see Sec. 2.4, Eq. 13).
VAMPnets contain two network lobes representing the embedding $E$. These networks transform the molecular configurations $\mathbf{x}_t$ and $\mathbf{x}_{t+\tau}$ found at a time delay $\tau$ along the simulation trajectories (Fig. 6a). VAMPnets can be trained by minimizing the VAMP loss, Eq. (15), which is meaningful for dynamics both with and without detailed balance [73]. VAMPnets may, in general, use two distinct network lobes to encode the spectral representation of the left and right singular functions (which is important for non-stationary dynamics [115, 116]). EDMD with dictionary learning [117] uses a similar architecture as VAMPnets, but is optimized by minimizing the regression error in latent space. In order to avoid collapsing to trivial embeddings such as constant functions (see Sec. 2.4), a suitable regularization must be employed [117].
While hyperparameter selection can be performed by minimizing the variational loss (15) on a validation set [118, 119, 120, 121], it is important to test the performance of a kinetic model on timescales beyond the training timescale $\tau$. We can use the Chapman-Kolmogorov equation to test how well the learnt model predicts longer times:

$$\mathbf{P}(n\tau) \approx \mathbf{P}(\tau)^n, \quad n = 2, 3, \ldots \qquad (23)$$
A common way to implement this test is to compare the leading eigenvalues of the left- and right-hand sides [77, 122, 80].
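For a simple simulated Markov chain, the test can be sketched like this (count-based estimators at lags $\tau = 1$ and $2\tau$; the "model" here is just the estimated lag-1 transition matrix rather than a neural network):

```python
import numpy as np

rng = np.random.default_rng(5)

# A 3-state metastable transition matrix used to generate a trajectory.
P_true = np.array([[0.9, 0.1, 0.0],
                   [0.1, 0.8, 0.1],
                   [0.0, 0.1, 0.9]])
cum = np.cumsum(P_true, axis=1)

T = 200_000
traj = np.zeros(T, dtype=int)
u = rng.random(T)
for t in range(1, T):
    traj[t] = np.searchsorted(cum[traj[t - 1]], u[t])  # sample next state

def estimate_P(traj, lag):
    """Maximum-likelihood transition matrix from transition counts at a lag."""
    C = np.zeros((3, 3))
    np.add.at(C, (traj[:-lag], traj[lag:]), 1.0)
    return C / C.sum(axis=1, keepdims=True)

# Chapman-Kolmogorov test: P(2*tau) should match P(tau)^2, compared here
# via the sorted eigenvalue spectra of both sides.
P1, P2 = estimate_P(traj, 1), estimate_P(traj, 2)
ev_pred = np.sort(np.linalg.eigvals(P1 @ P1).real)
ev_est = np.sort(np.linalg.eigvals(P2).real)
assert np.allclose(ev_pred, ev_est, atol=0.02)
```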
In [21], parameters were shared between the two VAMPnet lobes, and thus a unique embedding is learned. When detailed balance is enforced while computing $\mathbf{P}(\tau)$ (Eq. 21), the loss function automatically becomes a VAC score [85]. In this case, the embedding encodes the space of the dominant Markov operator eigenfunctions [21]. This feature was extensively studied in state-free reversible VAMPnets [23].
In order to obtain a propagator that can be interpreted as a Markov state model, [21] chose to use a SoftMax layer as an output layer, thus transforming the spectral representation to a soft indicator function, similar to spectral clustering methods such as PCCA+ [123, 124]. As a result, the propagator computed by Eq. (14) conserves probability and is almost a transition matrix (Sec. 3.4.4), although it may have some negative elements with small absolute values. The results described in [21] (see, e.g., Fig. 6) were competitive with, and sometimes surpassed, the state-of-the-art hand-crafted MSM analysis pipeline. Given the rapid improvements of training efficiency and accuracy of deep neural networks seen in a broad range of disciplines, we expect end-to-end learning approaches such as VAMPnets to dominate the field eventually.
In [28], a deep generative Markov State Model (DeepGenMSM) was proposed that, in addition to the encoder and the propagator, learns a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories that predict the system evolution from a defined starting state and propose new configurations. The DeepGenMSM was demonstrated to provide accurate estimates of the long-time kinetics and to generate valid distributions for small molecular dynamics benchmark systems.
4.5 Sampling/Thermodynamics: Boltzmann Generators
Boltzmann Generators were introduced in [37] to learn to sample equilibrium distributions, Eq. (1). In contrast to standard generative learning, a Boltzmann Generator does not attempt to learn the probability density from data, but is trained to efficiently sample the Boltzmann distribution $\mu(\mathbf{x}) \propto e^{-u(\mathbf{x})}$ using the dimensionless energy $u(\mathbf{x})$ as an input. A Boltzmann Generator consists of two parts:

1. A generative model that is trained to propose samples $\mathbf{x}$ from a probability distribution $p_X(\mathbf{x})$ which is "similar" to $\mu(\mathbf{x})$, and that allows us to evaluate $p_X(\mathbf{x})$ (up to a constant) for every $\mathbf{x}$.

2. A reweighting procedure that takes proposals from $p_X(\mathbf{x})$ and generates unbiased samples from $\mu(\mathbf{x}) \propto e^{-u(\mathbf{x})}$.
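The reweighting step can be illustrated in one dimension. In this sketch a broad Gaussian stands in for the trained generator and a double-well energy is an arbitrary example; importance weights $w \propto e^{-u(x)}/p_X(x)$ turn biased proposal statistics into equilibrium estimates:

```python
import numpy as np

rng = np.random.default_rng(6)

def u(x):
    """Dimensionless double-well energy with minima at x = +-1."""
    return 4.0 * (x ** 2 - 1.0) ** 2

# (1) Proposal p_X we can both sample and evaluate: a broad Gaussian
# standing in for the generator.
sigma = 1.5
x = rng.normal(0.0, sigma, size=200_000)
log_px = -0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# (2) Reweighting: w ~ exp(-u(x)) / p_X(x), stabilized in log space.
log_w = -u(x) - log_px
w = np.exp(log_w - log_w.max())

# The equilibrium density concentrates near the wells at +-1, so the
# reweighted second moment is close to 1, while the raw proposal gives
# roughly sigma^2 = 2.25.
x2_eq = np.sum(w * x ** 2) / np.sum(w)
x2_prop = np.mean(x ** 2)
assert 0.9 < x2_eq < 1.2
assert x2_prop > 2.0
```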
Boltzmann Generators use a trainable generative network $F_{ZX}$ which maps latent space samples $\mathbf{z}$ from a simple prior, e.g., a Gaussian normal distribution, to samples $\mathbf{x} = F_{ZX}(\mathbf{z})$. Training is done by combining energy-based training using the KL divergence, Eq. (17), and maximum likelihood, Eq. (18). For both training and reweighting, we need to be able to compute the probability $p_X(\mathbf{x})$ of generating a configuration $\mathbf{x}$. This can be achieved by the change-of-variables equation if $F_{ZX}$ is an invertible transformation (Fig. 7a) [91, 90]. Such invertible networks are called flows due to the analogy of the transformed probability density with a fluid [90, 125]. In [37], the non-volume-preserving transformations of RealNVP were employed [125], but the development of more powerful invertible network architectures is an active field of research [126, 93, 127]. By stacking multiple invertible "blocks", a deep invertible neural network is obtained that can encode a complex transformation of variables.
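The following is a minimal sketch of one such invertible block, in the affine-coupling style of RealNVP (random, untrained parameters; the two-way split and the tiny "networks" are illustrative): the block is exactly invertible, and the log-Jacobian-determinant needed for the change of variables comes out for free.

```python
import numpy as np

rng = np.random.default_rng(7)

# One affine coupling block: the first half of the variables passes through
# unchanged and parametrizes an invertible affine map of the second half.
Ws = rng.normal(size=(2, 2)) * 0.3     # "scale network" weights (toy)
Wt = rng.normal(size=(2, 2)) * 0.3     # "translation network" weights (toy)

def forward(z):
    z1, z2 = z[:2], z[2:]
    s, t = np.tanh(z1 @ Ws), z1 @ Wt
    x = np.concatenate([z1, z2 * np.exp(s) + t])
    log_det = s.sum()                  # log |det(dx/dz)| of the triangular Jacobian
    return x, log_det

def inverse(x):
    x1, x2 = x[:2], x[2:]
    s, t = np.tanh(x1 @ Ws), x1 @ Wt   # x1 == z1, so s and t are recomputed exactly
    return np.concatenate([x1, (x2 - t) * np.exp(-s)])

z = rng.normal(size=4)
x, log_det = forward(z)

# Exact invertibility: the inverse recovers z.
assert np.allclose(inverse(x), z)

# Change of variables: p_X(x) = p_Z(z) * exp(-log_det), cheap to evaluate
# because the Jacobian is triangular.
assert np.isfinite(log_det)
```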
Fig. 7c-h illustrates the Boltzmann Generator on a condensed-matter model system which contains a bistable dimer in a box densely filled with repulsive solvent particles (Fig. 7c,d). Opening or closing the dimer is a rare event, and also involves the collective rearrangement of the solvent particles due to the high particle density. Using short MD simulations in the open and closed states as an initialization, the Boltzmann Generator can be trained to sample open, closed, and previously unseen transition states by generating Gaussian random variables in latent space (Fig. 7e) and feeding them through the transformation $F_{ZX}$. Such samples have realistic structures and close-to-equilibrium energies (Fig. 7f). By employing reweighting, free energy differences can be computed (Fig. 7g). There is a direct relationship between the temperature of the canonical ensemble and the variance of the latent-space Gaussian of the Boltzmann Generator [37]. This allows us to learn to generate thermodynamics, such as temperature-dependent free energy profiles, using a single Boltzmann Generator (Fig. 7g). Finally, as the latent space concentrates high-probability equilibrium configurations around the origin, Boltzmann Generators can be used to generate physically realistic reaction pathways by performing linear interpolations in latent space.
5 Discussion
Despite rapid advances in the field of Machine Learning for Molecular Simulation, there are still significant open problems that need to be addressed, in all the areas discussed above.
5.1 Accuracy and efficiency in quantum-chemical energies and forces
In order to be practically useful, a ML model for both PES and atomic forces is needed that: i) can reach chemical accuracy (on the order of 1 kcal/mol) for the energy per functional group, and a correspondingly small error for the force per atom (in kcal/mol/Å); ii) is not much more expensive to evaluate than classical force fields; iii) scales to large molecules such as proteins; and iv) is transferable to different covalent and non-covalent environments. Such a universal model does not exist yet.
Crucial steps towards i)-ii) have recently been taken by symmetrized gradient-domain machine learning (sGDML), a kernel-based approach to constructing molecular force fields [57, 10, 128, 129]. Currently, sGDML already enables MD simulations with electrons and nuclei treated at an essentially exact quantum-mechanical level for molecules with up to 20-30 atoms.
Network-based approaches, such as SchNet, ANI, etc., are better suited to iii)-iv), as they break down the energy into local interactions of atoms with their environment, thus enabling a "building block" principle that is by design more scalable to molecules of different size and transferable across chemical space. However, these approaches do not reach the high accuracy in configuration space that sGDML does. Combining high accuracy in configuration and chemical space remains an active research topic.
5.2 Longranged interactions
The vast majority of approaches to make ML inference on molecular structures are based on local chemical information. Current neural networks for modeling molecular energies use the summation principle (e.g., Eq. 19) to sum up local interaction energies of each atom with its neighbors. While multi-body and longer-ranged energies can be obtained by stacking multiple layers [14, 54, 96] (the working principle of deep convolution networks [112]), there are fundamental physical limits to this approach: long-ranged interactions such as electrostatics cannot be cut off.
For classical point-charge models, long-ranged electrostatics methods have been developed, such as the Ewald summation method for periodic systems [130]. One option is to combine short-ranged ML models with such long-ranged electrostatics methods. In order to avoid double counting interactions, one must then also predict atomic charges, which is an active field of research [131, 132]. An alternative option, and currently unexplored territory, is to develop neural network structures for particle interactions that can compute long-ranged interactions by design.
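The first option can be sketched as follows. This is a toy illustration: a short-ranged cutoff function stands in for the ML model, charges are random, and the bare open-boundary Coulomb sum replaces the Ewald summation that a periodic system would require.

```python
import numpy as np

rng = np.random.default_rng(8)

CUTOFF = 3.0

def short_range(r):
    """Toy stand-in for the learned short-ranged energy: strictly local."""
    if r >= CUTOFF:
        return 0.0
    return np.exp(-r) * (1.0 - r / CUTOFF) ** 2   # smooth at the cutoff

def total_energy(pos, q):
    """Hybrid energy: learned short-ranged term + explicit Coulomb term
    from (predicted) atomic charges q."""
    E = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r = np.linalg.norm(pos[i] - pos[j])
            E += short_range(r) + q[i] * q[j] / r
    return E

pos = rng.normal(size=(8, 3)) * 4.0
q = rng.normal(size=8)
assert np.isfinite(total_energy(pos, q))

# The electrostatic term still couples atoms far beyond the ML cutoff.
far = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
assert short_range(10.0) == 0.0
assert abs(total_energy(far, np.array([1.0, -1.0]))) > 0.0
```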
In addition to electrostatics, van der Waals (vdW) dispersion interactions can also have a substantial long-range character, i.e. they can extend to separations of tens of nanometers or more in large molecular and nanoscale systems [133, 134, 135]. Developing ML models that correctly treat the quantum-mechanical many-body nature of vdW interactions remains a difficult challenge to overcome [15].
5.3 Quantum Kinetics
With the availability of chemically transferable ML models that have quantum-chemical accuracy, the next open problem is to sample metastable states and long-timescale kinetics. Although available ML models for predicting QM energies and forces are still significantly slower than force fields, the vast array of enhanced sampling methods and kinetic models (Secs. 2.4, 4.4) will likely allow us to explore the kinetics of quantum chemical systems on timescales of microseconds and beyond. A plethora of new physical insights that we cannot access with current MD force fields awaits us there. For example, what is the role of protonation dynamics in mediating protein folding or function?
5.4 Transferability of coarsegrained models
An outstanding question in the design of coarse-grained models is that of transferability across chemical space. Bottom-up coarse-grained models are useful in practice if they can be parametrized on small molecules and then used to predict the dynamics of systems much larger than what is possible to simulate with atomistic resolution. It is not clear to what extent transferability of coarse-grained models can be achieved, and how that depends on the coarse-graining mapping [136, 137]. Compared to the manual design of few-body free energy functionals, machine-learned free energies can help with transferability, as they are able to learn the important multi-body effects, e.g., to model neglected solvent molecules implicitly (Sec. 4.3) [16, 17, 18].
It is natural to consider Behler-Parrinello type networks or SchNet as a starting point for modeling transferable coarse-grained energies, but their application is non-trivial: it is a priori unclear what the interacting particles are in the coarse-grained model and how to define their "types", as they are no longer given by the chemical element. Furthermore, these networks assume permutation invariance between identical particles, while classical MD force fields do not have permutation invariance of atoms within the same molecule. Therefore, particle network structures that can handle bonding graphs need to be developed.
5.5 Kinetics of coarsegrained models
While coarse-grained MD models may perform well in reproducing the thermodynamics of the atomistic system, they may fail in reproducing the kinetics. Existing approaches to address this include adding fictitious particles [138] or training the coarse-grained model with spectral matching [139]. There is indication that the kinetics can be approximately correct up to a global scaling factor in barrier-crossing problems when the barriers are well approximated [140], which could be achieved by identifying the slow reaction coordinates [59] and assigning more weight to the transition state in force matching or relative entropy minimization. This area of research is still underdeveloped.
5.6 Transferable prediction of intensive properties
Extensive properties such as potential energies can be well predicted across chemical space, as they can be conceptually broken down as a sum of parts that can be learnt separately. This is not possible with intensive properties such as spectra or kinetics, and for this reason the prediction of such properties is, as yet, far behind.
5.7 Equivariant generative networks with parameter sharing
Generative networks such as Boltzmann Generators (Sec. 4.5) have been demonstrated to generate physically realistic one-shot samples of model systems and proteins in implicit solvent [37]. In order to scale to larger systems, important steps are to build the invariances of the energy, such as the exchange of identical solvent particles, into the transformation, and to include parameter sharing (Sec. 3.5), such that we can go beyond sampling the probability density of one given system with energy $u(\mathbf{x})$ and instead generalize from a dataset of examples of one class of molecules, e.g. solvated proteins. To this end, equivariant networks with parameter sharing need to be developed for generative learning; these are, to date, not available.
5.8 Explainable AI
Recently, the increasing popularity of explainable AI methods (see e.g. [141, 142, 143, 144]) has allowed us to gain insight into the inner workings of deep learning algorithms. In this manner, it has become possible to extract how a problem is solved by the deep model. This allows, for example, the detection of so-called 'clever Hans' solutions [143], i.e. nonsensical solutions relying on artifactual or non-physical aspects of the data. Combined with networks that learn a representation, such as DTNN/SchNet [14, 96] and VAMPnets [21], these inspection methods may provide scientific insights into the mechanisms that give rise to the predicted physicochemical quantity, and thus fuel the development of new theories.
Acknowledgements
We gratefully acknowledge funding from the European Commission (ERC CoG 772230 "ScaleCell" to F.N. and ERC-CoG grant BeStMo to A.T.), Deutsche Forschungsgemeinschaft (CRC 1114/A04 to F.N., EXC 2046/1, Project ID 390685689 to K.R.M., GRK 2433 DAEDALUS to F.N. and K.R.M.), the MATH+ Berlin Mathematics research center (AA18 to F.H., EF12 to F.N. and K.R.M.), Einstein Foundation Berlin (Einstein Visiting Fellowship to C.C.), the National Science Foundation (grants CHE-1265929, CHE-1740990, CHE-1900374, and PHY-1427654 to C.C.), the Welch Foundation (grant C-1570 to C.C.), the Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017000451, No. 2017001779 to K.R.M.), and the German Ministry for Education and Research (BMBF) (Grants 01IS14013AE, 01GQ1115 and 01GQ0850 to K.R.M.). All authors thank Stefan Chmiela and Kristof Schütt for help with Figures 1 and 5.
References
 [1] P. A. M. Dirac. Quantum mechanics of many-electron systems. Proc. Royal Soc. A, 123(792):714–733, 1929.
 [2] K.R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw., 12:181–201, 2001.
 [3] V. N. Vapnik. An overview of statistical learning theory. IEEE Trans. Neur. Net., 10, 1999.
 [4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 [5] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
 [6] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
 [7] J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett., 98:146401, 2007.
 [8] M. Rupp, A. Tkatchenko, K.R. Müller, and O. A. Von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301, 2012.
 [9] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett., 104:136403, 2010.
 [10] S. Chmiela, H. E. Sauceda, K.R. Müller, and A. Tkatchenko. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun., 9:3887, 2018.
 [11] J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci., 8:3192–3203, 2017.
 [12] J. S. Smith, B. T. Nebgen, R. Zubatyuk, N. Lubbers, C. Devereux, K. Barros, S. Tretiak, O. Isayev, and A. Roitberg. Outsmarting quantum chemistry through transfer learning. https://doi.org/10.26434/chemrxiv.6744440.v1, 2018.
 [13] F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman, K. Burke, and K.R. Müller. Bypassing the Kohn-Sham equations with machine learning. Nat. Commun., 8:872, 2017.
 [14] K. T. Schütt, F. Arbabzadah, S. Chmiela, K.R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nat. Commun., 8:13890, 2017.
 [15] T. Bereau, R. A. DiStasio Jr., A. Tkatchenko, and O. A. von Lilienfeld. Non-covalent interactions across organic and biological subsets of chemical space: Physics-based potentials parametrized from machine learning. J. Chem. Phys., 148:241706, 2018.
 [16] S. T. John and G. Csányi. Many-body coarse-grained interactions using Gaussian approximation potentials. J. Phys. Chem. B, 121:10934–10949, 2017.
 [17] L. Zhang, J. Han, H. Wang, R. Car, and W. E. DeePCG: constructing coarse-grained models via deep neural networks. J. Chem. Phys., 149:034101, 2018.
 [18] J. Wang, S. Olsson, C. Wehmeyer, A. Pérez, N. E. Charron, G. De Fabritiis, F. Noé, and C. Clementi. Machine learning of coarse-grained molecular dynamics force fields. ACS Cent. Sci., 5:755–767, 2019.
 [19] C. Wehmeyer and F. Noé. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. J. Chem. Phys., 148:241703, 2018.
 [20] C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, and V. S. Pande. Variational encoding of complex dynamics. Phys. Rev. E, 97:062412, 2018.
 [21] A. Mardt, L. Pasquali, H. Wu, and F. Noé. VAMPnets: Deep learning of molecular kinetics. Nat. Commun., 9:5, 2018.
 [22] J. M. L. Ribeiro, P. Bravo, Y. Wang, and P. Tiwary. Reweighted autoencoded variational Bayes for enhanced sampling (RAVE). J. Chem. Phys., 149:072301, 2018.
 [23] W. Chen, H. Sidky, and A. L. Ferguson. Nonlinear discovery of slow molecular modes using state-free reversible VAMPnets. J. Chem. Phys., 150:214114, 2019.
 [24] H. Jung, R. Covino, and G. Hummer. Artificial intelligence assists discovery of reaction coordinates and mechanisms from molecular dynamics simulations. arXiv:1901.04595, 2019.
 [25] T. Stecher, N. Bernstein, and G. Csányi. Free energy surface reconstruction from umbrella samples using Gaussian process regression. J. Chem. Theory Comput., 10:4079–4097, 2014.
 [26] L. Mones, N. Bernstein, and G. Csányi. Exploration, sampling, and reconstruction of free energy surfaces with Gaussian process regression. J. Chem. Theory Comput., 12:5100–5110, 2016.
 [27] E. Schneider, L. Dai, R. Q. Topper, C. Drechsel-Grau, and M. E. Tuckerman. Stochastic neural network approach for learning high-dimensional free energy surfaces. Phys. Rev. Lett., 119:150601, 2017.
 [28] H. Wu, A. Mardt, L. Pasquali, and F. Noé. Deep generative Markov state models. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Adv. Neural Inf. Process. Syst. 31, pages 3975–3984. Curran Associates, Inc., 2018.
 [29] O. Valsson and M. Parrinello. Variational approach to enhanced sampling and free energy calculations. Phys. Rev. Lett., 113:090601, 2014.
 [30] L. Bonati, Y.-Y. Zhang, and M. Parrinello. Neural networks-based variationally enhanced sampling. Proc. Natl. Acad. Sci. USA, 116:17641–17647, 2019.
 [31] J. Zhang, Y. I. Yang, and F. Noé. Targeted adversarial learning optimized sampling. J. Phys. Chem. Lett., 10:5791–5797, 2019.
 [32] J. McCarty and M. Parrinello. A variational conformational dynamics approach to the selection of collective variables in metadynamics. J. Chem. Phys., 147:204109, 2017.
 [33] M. M. Sultan and V. S. Pande. tICA-metadynamics: Accelerating metadynamics by using kinetically selected collective variables. J. Chem. Theory Comput., 13:2440–2447, 2017.
 [34] S. Doerr and G. De Fabritiis. On-the-fly learning and sampling of ligand binding by high-throughput molecular simulations. J. Chem. Theory Comput., 10:2064–2069, 2014.
 [35] M. I. Zimmerman and G. R. Bowman. FAST conformational searches by balancing exploration/exploitation tradeoffs. J. Chem. Theory Comput., 11:5747–5757, 2015.
 [36] N. Plattner, S. Doerr, G. De Fabritiis, and F. Noé. Protein-protein association and binding mechanism resolved in atomic detail. Nat. Chem., 9:1005–1011, 2017.
 [37] F. Noé, S. Olsson, J. Köhler, and H. Wu. Boltzmann generators: sampling equilibrium states of many-body systems with deep learning. Science, 365:eaaw1147, 2019.
 [38] J. Jiménez, M. Skalic, G. Martinez-Rosell, and G. De Fabritiis. KDEEP: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. J. Chem. Inf. Model., 58:287–296, 2018.
 [39] M. Skalic, A. Varela-Rial, J. Jiménez, G. Martínez-Rosell, and G. De Fabritiis. LigVoxel: inpainting binding pockets using 3D-convolutional neural networks. Bioinformatics, 35:243–250, 2018.
 [40] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci., 9:513–530, 2018.
 [41] E. N. Feinberg, D. Sur, Z. Wu, B. E. Husic, H. Mai, Y. Li, S. Sun, J. Yang, B. Ramsundar, and V. S. Pande. PotentialNet for molecular property prediction. ACS Cent. Sci., 4:1520–1530, 2018.
 [42] R. GómezBombarelli, J. N. Wei, D. Duvenaud, J. M. HernándezLobato, B. SánchezLengeling, D. Sheberla, J. AguileraIparraguirre, T. D Hirzel, R. P Adams, and A. AspuruGuzik. Automatic chemical design using a datadriven continuous representation of molecules. ACS Cent. Sci., 4:268–276, 2018.
 [43] M. Popova, O. Isayev, and A. Tropsha. Deep reinforcement learning for de novo drug design. Sci. Adv., 4:eaap7885, 2018.
 [44] R. Winter, F. Montanari, A. Steffen, H. Briem, F. Noé, and D.-A. Clevert. Efficient multi-objective molecular optimization in a continuous latent space. Chem. Sci., 10:8016–8024, 2019.
 [45] R. Winter, F. Montanari, F. Noé, and D.-A. Clevert. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem. Sci., 10:1692–1701, 2019.
 [46] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh. Machine learning for molecular and materials science. Nature, 559:547–555, 2018.
 [47] B. Sanchez-Lengeling and A. Aspuru-Guzik. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361:360–365, 2018.
 [48] K. Lindorff-Larsen, P. Maragakis, S. Piana, M. P. Eastwood, R. O. Dror, and D. E. Shaw. Systematic validation of protein force fields against experimental data. PLoS One, 7:e32131, 2012.
 [49] P. S. Nerenberg, B. Jo, C. So, A. Tripathy, and T. Head-Gordon. Optimizing solute-water van der Waals interactions to reproduce solvation free energies. J. Phys. Chem. B, 116:4524–4534, 2012.
 [50] J. Huang, S. Rauscher, G. Nawrocki, T. Ran, M. Feig, B. L. de Groot, H. Grubmüller, and A. D. MacKerell. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat. Methods, 14:71–73, 2016.
 [51] P. Robustelli, S. Piana, and D. E. Shaw. Developing a molecular dynamics force field for both folded and disordered protein states. Proc. Natl. Acad. Sci. USA, 115:E4758–E4766, 2018.
 [52] J. Behler. Perspective: Machine learning potentials for atomistic simulations. J. Chem. Phys., 145:170901, 2016.
 [53] Z. Li, J. R. Kermode, and A. De Vita. Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces. Phys. Rev. Lett., 114:096405, 2015.
 [54] K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller. SchNet: a deep learning architecture for molecules and materials. J. Chem. Phys., 148:241722, 2018.
 [55] M. Gastegger, J. Behler, and P. Marquetand. Machine learning molecular dynamics for the simulation of infrared spectra. Chem. Sci., 8:6924–6935, 2017.
 [56] P. O. Dral, A. Owens, S. N. Yurchenko, and W. Thiel. Structure-based sampling and self-correcting machine learning for accurate calculations of potential energy surfaces and vibrational levels. J. Chem. Phys., 146:244108, 2017.
 [57] S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller. Machine learning of accurate energy-conserving molecular force fields. Sci. Adv., 3:e1603015, 2017.
 [58] J. Han, L. Zhang, R. Car, and W. E. Deep potential: a general representation of a many-body potential energy surface. Phys. Rev. Lett., 120:143001, 2018.
 [59] F. Noé and C. Clementi. Collective variables for the study of long-time kinetics from molecular trajectories: theory and methods. Curr. Opin. Struct. Biol., 43:141–147, 2017.
 [60] R. Galvelis and Y. Sugita. Neural network and nearest neighbor algorithms for enhancing sampling of molecular dynamics. J. Chem. Theory Comput., 13:2489–2500, 2017.
 [61] H. Sidky and J. K. Whitmer. Learning free energy landscapes using artificial neural networks. J. Chem. Phys., 148:104111, 2018.
 [62] J. M. L. Ribeiro and P. Tiwary. Toward achieving efficient and accurate ligand-protein unbinding with deep learning and molecular dynamics through RAVE. J. Chem. Theory Comput., 15:708–719, 2018.
 [63] W. Chen and A. L. Ferguson. Molecular enhanced sampling with autoencoders: On-the-fly collective variable discovery and accelerated free energy landscape exploration. J. Comput. Chem., 39:2079–2102, 2018.
 [64] M. M. Sultan, H. K. Wayment-Steele, and V. S. Pande. Transferable neural networks for enhanced sampling of protein dynamics. J. Chem. Theory Comput., 14:1887–1894, 2018.
 [65] A. Z. Guo, E. Sevgen, H. Sidky, J. K. Whitmer, J. A. Hubbell, and J. J. de Pablo. Adaptive enhanced sampling by force-biasing using neural networks. J. Chem. Phys., 148:134108, 2018.
 [66] W. G. Noid, J.-W. Chu, G. S. Ayton, V. Krishna, S. Izvekov, G. A. Voth, A. Das, and H. C. Andersen. The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models. J. Chem. Phys., 128:244114, 2008.
 [67] W. G. Noid. Perspective: Coarse-grained models for biomolecular systems. J. Chem. Phys., 139:090901, 2013.
 [68] G. Ciccotti, T. Lelièvre, and E. Vanden-Eijnden. Projection of diffusions on submanifolds: Application to mean force computation. Commun. Pure Appl. Math., 61:371–408, 2008.
 [69] C. Clementi. Coarse-grained models of protein folding: Toy models or predictive tools? Curr. Opin. Struct. Biol., 18:10–15, 2008.
 [70] L. Boninsegna, R. Banisch, and C. Clementi. A data-driven perspective on the hierarchical assembly of molecular structures. J. Chem. Theory Comput., 14:453–460, 2018.
 [71] W. Wang and R. Gómez-Bombarelli. Variational coarse-graining for molecular dynamics. arXiv:1812.02706, 2018.
 [72] M. Sarich, F. Noé, and C. Schütte. On the approximation quality of Markov state models. Multiscale Model. Simul., 8:1154–1177, 2010.
 [73] H. Wu and F. Noé. Variational approach for learning Markov processes from time series data. arXiv:1707.04659, 2017.
 [74] N. V. Buchete and G. Hummer. Coarse Master Equations for Peptide Folding Dynamics. J. Phys. Chem. B, 112:6057–6069, 2008.
 [75] F. Noé, S. Doose, I. Daidone, M. Löllmann, J. D. Chodera, M. Sauer, and J. C. Smith. Dynamical fingerprints for probing individual relaxation processes in biomolecular dynamics with simulations and kinetic experiments. Proc. Natl. Acad. Sci. USA, 108:4822–4827, 2011.
 [76] C. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard. A Direct Approach to Conformational Dynamics based on Hybrid Monte Carlo. J. Comput. Phys., 151:146–168, 1999.
 [77] W. C. Swope, J. W. Pitera, and F. Suits. Describing protein folding kinetics by molecular dynamics simulations: 1. Theory. J. Phys. Chem. B, 108:6571–6581, 2004.
 [78] F. Noé, I. Horenko, C. Schütte, and J. C. Smith. Hierarchical Analysis of Conformational Dynamics in Biomolecules: Transition Networks of Metastable States. J. Chem. Phys., 126:155102, 2007.
 [79] J. D. Chodera, K. A. Dill, N. Singhal, V. S. Pande, W. C. Swope, and J. W. Pitera. Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J. Chem. Phys., 126:155101, 2007.
 [80] J.-H. Prinz, H. Wu, M. Sarich, B. G. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134:174105, 2011.
 [81] I. Mezić. Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynam., 41:309–325, 2005.
 [82] P. J. Schmid and J. Sesterhenn. Dynamic mode decomposition of numerical and experimental data. In 61st Annual Meeting of the APS Division of Fluid Dynamics. American Physical Society, 2008.
 [83] J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz. On dynamic mode decomposition: Theory and applications. J. Comput. Dyn., 1:391–421, 2014.
 [84] H. Wu, F. Nüske, F. Paul, S. Klus, P. Koltai, and F. Noé. Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations. J. Chem. Phys., 146:154104, 2017.
 [85] F. Noé and F. Nüske. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Model. Simul., 11:635–655, 2013.
 [86] F. Noé. Machine learning for molecular dynamics on long timescales. arXiv:1812.07669, 2018.
 [87] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 194–281. MIT Press, 1986.
 [88] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), arXiv:1312.6114, 2014.
 [89] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and J. Bengio. Generative adversarial networks. In NIPS'14 Proceedings of the 27th International Conference on Neural Information Processing Systems, volume 2, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
 [90] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. arXiv:1410.8516, 2015.
 [91] E. G. Tabak and E. Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Commun. Math. Sci., 8:217–233, 2010.
 [92] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. Proceedings of the 6th International Conference on Learning Representations (ICLR), arXiv:1710.10196, 2018.
 [93] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. NIPS, 2018.
 [94] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3918–3926, Stockholm, Sweden, 2018. PMLR.
 [95] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278–2324, 1998.
 [96] K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K.-R. Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Adv. Neural Inf. Process. Syst., pages 991–1001, 2017.
 [97] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Adv. Neural Inf. Process. Syst., pages 3391–3401. Curran Associates, Inc., 2017.
 [98] F. Ercolessi, E. Tosatti, and M. Parrinello. Au (100) surface reconstruction. Phys. Rev. Lett., 57:719, 1986.
 [99] J. Tersoff. New empirical model for the structural properties of silicon. Phys. Rev. Lett., 56:632, 1986.
 [100] J. Ferrante, J. R. Smith, and J. H. Rose. Diatomic molecules and metallic adhesion, cohesion, and chemisorption: A single bindingenergy relation. Phys. Rev. Lett., 50:1385, 1983.
 [101] G. C. Abell. Empirical chemical pseudopotential theory of molecular and metallic bonding. Phys. Rev. B, 31:6184, 1985.
 [102] H. W. Kuhn. The Hungarian method for the assignment problem. Nav. Res. Logist. Quart., 2:83–97, 1955.
 [103] F. Reinhard and H. Grubmüller. Estimation of absolute solvent and solvation shell entropies via permutation reduction. J. Chem. Phys., 126:014102, 2007.
 [104] J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1, a data set of 20 million calculated off-equilibrium conformations for organic molecules. Sci. Data, 4:170193, 2017.
 [105] K. T. Schütt, P. Kessel, M. Gastegger, K. A. Nicoli, A. Tkatchenko, and K.-R. Müller. SchNetPack: A deep learning toolbox for atomistic systems. J. Chem. Theory Comput., 15:448–455, 2019.
 [106] V. Vapnik. The nature of statistical learning theory. Springer, 1995.
 [107] K. Hansen, G. Montavon, F. Biegler, S. Fazli, M. Rupp, M. Scheffler, O. A. von Lilienfeld, A. Tkatchenko, and K.-R. Müller. Assessment and validation of machine learning methods for predicting molecular atomization energies. J. Chem. Theory Comput., 9:3404–3419, 2013.
 [108] K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko. Machine learning predictions of molecular properties: Accurate many-body potentials and non-locality in chemical space. J. Phys. Chem. Lett., 6:2326–2331, 2015.
 [109] A. P. Bartók, R. Kondor, and G. Csányi. On representing chemical environments. Phys. Rev. B, 87:184115, 2013.
 [110] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Adv. Neural Inf. Process. Syst., pages 3111–3119, 2013.
 [111] M. L. Braun, J. M. Buhmann, and K.-R. Müller. On relevant dimensions in kernel feature spaces. J. Mach. Learn. Res., 9:1875–1906, 2008.
 [112] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1:541–551, 1989.
 [113] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In G. B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524, pages 9–53. Springer LNCS, 1998.
 [114] G. Perez-Hernandez, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys., 139:015102, 2013.
 [115] P. Koltai, G. Ciccotti, and C. Schütte. On metastability and Markov state models for non-stationary molecular dynamics. J. Chem. Phys., 145:174103, 2016.
 [116] P. Koltai, H. Wu, F. Noé, and C. Schütte. Optimal data-driven estimation of generalized Markov state models for non-equilibrium dynamics. Computation, 6:22, 2018.
 [117] Q. Li, F. Dietrich, E. M. Bollt, and I. G. Kevrekidis. Extended dynamic mode decomposition with dictionary learning: A data-driven adaptive spectral decomposition of the Koopman operator. Chaos, 27:103111, 2017.
 [118] F. Bießmann, F. C. Meinecke, A. Gretton, A. Rauch, G. Rainer, N. K. Logothetis, and K.-R. Müller. Temporal kernel CCA and its application in multimodal neuronal data analysis. Machine Learning, 79:5–27, 2010.
 [119] R. T. McGibbon and V. S. Pande. Variational cross-validation of slow dynamical modes in molecular kinetics. J. Chem. Phys., 142:124105, 2015.
 [120] B. E. Husic, R. T. McGibbon, M. M. Sultan, and V. S. Pande. Optimized parameter selection reveals trends in Markov state models for protein folding. J. Chem. Phys., 145:194103, 2016.
 [121] M. K. Scherer, B. E. Husic, M. Hoffmann, F. Paul, H. Wu, and F. Noé. Variational selection of features for molecular kinetics. J. Chem. Phys., 150:194108, 2019.
 [122] F. Noé, C. Schütte, E. Vanden-Eijnden, L. Reich, and T. R. Weikl. Constructing the full ensemble of folding pathways from short off-equilibrium simulations. Proc. Natl. Acad. Sci. USA, 106:19011–19016, 2009.
 [123] P. Deuflhard and M. Weber. Robust Perron cluster analysis in conformation dynamics. In M. Dellnitz, S. Kirkland, M. Neumann, and C. Schütte, editors, Linear Algebra Appl., volume 398C, pages 161–184. Elsevier, New York, 2005.
 [124] S. Röblitz and M. Weber. Fuzzy spectral clustering by PCCA+: application to Markov state models and data classification. Adv. Data Anal. Classif., 7:147–179, 2013.
 [125] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv:1605.08803, 2016.
 [126] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
 [127] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv:1810.01367, 2018.
 [128] H. E. Sauceda, S. Chmiela, I. Poltavsky, K.-R. Müller, and A. Tkatchenko. Molecular force fields with gradient-domain machine learning: Construction and application to dynamics of small molecules with coupled cluster forces. J. Chem. Phys., 150:114102, 2019.
 [129] S. Chmiela, H. E. Sauceda, I. Poltavsky, K.-R. Müller, and A. Tkatchenko. sGDML: Constructing accurate and data efficient molecular force fields using machine learning. Comp. Phys. Comm., 240:38, 2019.
 [130] T. Darden, L. Perera, L. Li, and L. Pedersen. New tricks for modelers from the crystallography toolkit: the particle mesh Ewald algorithm and its use in nucleic acid simulations. Structure, 7:55–60, 1999.
 [131] O. T. Unke and M. Meuwly. PhysNet: A neural network for predicting energies, forces, dipole moments and partial charges. J. Chem. Theory Comput., 15:3678–3693, 2019.
 [132] B. Nebgen, N. Lubbers, J. S. Smith, A. E. Sifain, A. Lokhov, O. Isayev, A. E. Roitberg, K. Barros, and S. Tretiak. Transferable dynamic molecular charge assignment using deep neural networks. J. Chem. Theory Comput., 14:4687–4698, 2018.
 [133] A. Ambrosetti, N. Ferri, R. A. DiStasio Jr., and A. Tkatchenko. Wave-like charge density fluctuations and van der Waals interactions at the nanoscale. Science, 351:1171–1176, 2016.
 [134] J. Hermann, R. A. DiStasio Jr., and A. Tkatchenko. First-principles models for van der Waals interactions in molecules and materials: Concepts, theory, and applications. Chem. Rev., 117:4714–4758, 2017.
 [135] M. Stöhr, T. Van Voorhis, and A. Tkatchenko. Theory and practice of modeling van der Waals interactions in electronic-structure calculations. Chem. Soc. Rev., 48:4118–4154, 2019.
 [136] J. W. Mullinax and W. G. Noid. Extended ensemble approach for deriving transferable coarse-grained potentials. J. Chem. Phys., 131:104110, 2009.
 [137] I. F. Thorpe, D. P. Goldenberg, and G. A. Voth. Exploration of transferability in multiscale coarse-grained peptide models. J. Phys. Chem. B, 115:11911–11926, 2011.
 [138] A. Davtyan, G. A. Voth, and H. C. Andersen. Dynamic force matching: Construction of dynamic coarse-grained models with realistic short time dynamics and accurate long time dynamics. J. Chem. Phys., 145:224107, 2016.
 [139] F. Nüske, L. Boninsegna, and C. Clementi. Coarse-graining molecular systems by spectral matching. J. Chem. Phys., 151:044116, 2019.
 [140] T. Bereau and J. F. Rudzinski. Accurate structure-based coarse graining leads to consistent barrier-crossing dynamics. Phys. Rev. Lett., 121:256002, 2018.
 [141] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One, 10:e0130140, 2015.
 [142] G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digit. Signal Process., 73:1–15, 2018.
 [143] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun., 10:1096, 2019.
 [144] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K.-R. Müller, editors. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Artificial Intelligence. Springer, 2019.