I Introduction
Molecular dynamics (MD) simulations have become an important tool to characterize the microscopic behavior of chemical systems. Recent advances in hardware and software allow significant extensions of the simulation timescales to study biologically relevant processes (LindorffLarsenEtAl_Science11_AntonFolding; BuchEtAl_JCIM10_GPUgrid; ShirtsPande_Science2000_FoldingAtHome). For example, we can now characterize the configurational changes, folding, and binding behavior of small to intermediate-sized proteins through MD on the timescale of milliseconds to seconds (DrorEtAl_PNAS11_DrugBindingGPCR; ShuklaPande_NatCommun14_SrcKinase; PlattnerNoe_NatComm15_TrypsinPlasticity; PlattnerEtAl_NatChem17_BarBar; PaulEtAl_PNAS17_Mdm2PMI; NoeAnnRev2020). However, the computational cost of evaluating the potential energy prevents this approach from scaling up to significantly larger systems and/or longer timescales. Therefore, multiple ways have been proposed to speed up atomistic simulations, such as advanced sampling methods (e.g., umbrella sampling (Frenke2001un; Torrie1977; Johannes2011), parallel tempering (Swendsen1986; Neal1996)) or adaptive sampling (Rajamani_Proteins24_1775; PretoClementi_PCCP14_AdaptiveSampling; BowmanEnsignPande_JCTC2010_AdaptiveSampling). An alternative approach is to reduce the dimensionality of the system by coarse-graining (CG) (ClementiCOSB; Davtyan2012; Izvekov2005; Marrink2004; MllerPlathe2002; Noid2013). The fact that macromolecules usually exhibit robust collective behavior suggests that not every single degree of freedom is per se essential in determining the important macromolecular processes over long timescales. Furthermore, a CG representation of the system simplifies the model and allows for a more straightforward physicochemical interpretation of large-scale conformational changes such as protein folding or protein-protein binding (ClementiCOSB).

Once the mapping from the atomistic to the CG representation is defined, a fundamental challenge is the definition of an effective potential in reduced coordinates, such that the essential physical properties of the system under consideration are retained. The choice of the relevant properties crucially dictates the definition of the CG model.
Following a top-down approach, the CG procedure is driven by the objective to reproduce macroscopic properties, such as structural information or experimentally measured observables (Nielsen2003; MatysiakClementi_JMB04_Perturbation; MatysiakClementi_JMB06_Perturbation; Marrink2004; Davtyan2012; Chen2018). In a bottom-up approach, on the other hand, an effective potential is designed to reproduce a selection of properties of an atomistic model, for instance the probability distribution in a suitable space and the corresponding metastable states (Lyubartsev1995; MllerPlathe2002; Praprotnik2008; Izvekov2005; Wang2009; Shell2008).

In the past several years, machine learning (ML) techniques have been increasingly applied in molecular simulation (noe2019science; NOE202077; NoeAnnRev2020; Schutt2019). Bottom-up CG methods have also started to leverage the advances in ML, to define classical atomistic potentials or force fields from quantum chemical calculations (BehlerParrinello_PRL07_NeuralNetwork; Bartok2010; Rupp2012; Bartok2013; Smith2017; Bartok2017; Schuett2017; Smith2018; Schuett2018; Grisafi2018; Imbalzano2018; Nguyen2018; ZhangHan2018; ZhangHan2018_PRL; BereauEtAl_JCP18_Physical; ChmielaEtAl_SciAdv17_EnergyConserving; ChmielaEtAl_NatComm18_TowardExact), to learn kinetic models (MardtEtAl_VAMPnets; WuEtAl_NIPS18_DeepGenMSM; WehmeyerNoe_TAE; HernandezPande_VariationalDynamicsEncoder; RibeiroTiwary_JCP18_RAVE), or to design effective CG potentials from atomistic simulations (John2017; ZhangHan2018_CG; wang2019). In this context, we have recently shown that a deep neural network (NN) can be used in combination with the well-established "force matching" approach (Noid2008) to define a coarse-grained implicit-water potential that is able to reproduce the correct folding/unfolding process of a small protein from atomistic simulations in explicit water (wang2019).
In the force matching approach, the effective energy function of the CG model is optimized variationally, by finding the CG force field that minimizes the difference with the instantaneous atomistic forces projected onto the CG coordinates. As there are multiple atomistic configurations consistent with each CG configuration, this estimator is very noisy, and the approach requires a large amount of training data. It has thus been restricted to parametric models like NNs, as the computational complexity of nonparametric models is directly linked to the training set size. Here we propose a method to overcome this limitation via bootstrap aggregation in combination with a nonparametric, kernel-based regressor.
In particular, we use the Gradient-Domain Machine Learning (GDML) approach (ChmielaEtAl_SciAdv17_EnergyConserving; CHMIELA201938). In the application to quantum data, GDML is able to use a small number of example points (usually less than a few thousand) to build an accurate force field for a specific molecule. Because of the degeneracy of the CG mapping, the amount of training data required to reconstruct a coarse-grained force field is much larger, and the fact that memory requirements scale quadratically with the data set size prevents a direct application of GDML to the definition of CG models.
To solve this problem, we pursue a hierarchical ensemble learning approach in which the full training set is first divided into smaller batches, on which separate models are trained independently. A second GDML layer is then applied to the mean prediction of this ensemble, providing the second model with a consistent set of inputs and outputs. We show that GDML with ensemble learning can be efficiently used for the coarse-graining of molecular systems.
The structure of the paper is as follows. In the section "Theory and Methods", we briefly review the principle of force matching that we use for coarse-graining, as well as the mathematical underpinnings of the kernel ridge regression used in the GDML method. We then describe the idea of ensemble learning and explain how it solves the problem associated with the large number of training points required by force matching. In the "Results" section, we demonstrate that a GDML approach trained with ensemble learning performs well on the coarse-graining of a small molecular system, alanine dipeptide simulated in water, as it produces the same free energy surface as obtained in the all-atom simulations. As already demonstrated for the NN approach (wang2019), the key to the success of GDML-based coarse-graining is that it naturally captures the nonlinearities and multibody effects arising from the renormalization of degrees of freedom that underlies coarse-graining.
II Theory and Methods
II.1 Coarse-Graining with Thermodynamic Consistency
Although the definition of a coarse-graining mapping scheme is per se an interesting problem (BoninsegnaBanish2018; sinitskiy2012optimal; Noid2013; WangBombarelli2019), here we start by assuming that a mapping is given. The all-atom system we want to coarse-grain consists of $N$ atoms, and its configurations are represented by a $3N$-dimensional vector $\mathbf{r} \in \mathbb{R}^{3N}$. The lower dimensional CG representation of the system is given by the mapping:

$$\mathbf{x} = \xi(\mathbf{r}) \in \mathbb{R}^{3n}, \qquad (1)$$

where $n < N$ is the number of CG beads. The CG mapping function $\xi$ is assumed to be linear, i.e. there exists a coarse-graining matrix $\Xi \in \mathbb{R}^{3n \times 3N}$ that maps the all-atom space to the CG space: $\xi(\mathbf{r}) = \Xi \mathbf{r}$.
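Since the mapping is a fixed linear operator, it can be written explicitly as a sparse matrix. The sketch below (function name and bead indices are ours, chosen for illustration) builds a "slicing" matrix $\Xi$ that simply selects a subset of atoms:

```python
import numpy as np

def build_slice_mapping(n_atoms, cg_atom_indices):
    """Build a linear CG mapping matrix Xi that selects a subset of atoms.

    Xi has shape (3*n_beads, 3*n_atoms), so that x = Xi @ r maps a flattened
    all-atom configuration r (length 3*n_atoms) to CG coordinates x."""
    n_beads = len(cg_atom_indices)
    Xi = np.zeros((3 * n_beads, 3 * n_atoms))
    for bead, atom in enumerate(cg_atom_indices):
        for d in range(3):  # x, y, z components of each site
            Xi[3 * bead + d, 3 * atom + d] = 1.0
    return Xi

# Illustrative example: keep 6 atoms of a 22-atom molecule (indices are
# hypothetical, not the ones used in this work).
Xi = build_slice_mapping(22, [0, 4, 6, 8, 14, 18])
r = np.random.randn(3 * 22)   # a random all-atom configuration
x = Xi @ r                    # its 6-bead CG representation
assert x.shape == (18,)
```

More general linear mappings (e.g. mapping groups of atoms to their center of mass) would put fractional weights instead of ones in each row.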
The definition of a CG model requires an effective potential $U(\mathbf{x}; \theta)$ in the CG space, where $\theta$ are the optimization parameters. The potential can then be used to generate an MD trajectory with a dynamical model. Parameterizations are available in varying degrees of sophistication, ranging from classical force fields with fixed functional forms to ML approaches with a strong physical basis.
One popular bottom-up method for building a CG model is to require thermodynamic consistency, that is, to design a CG potential such that its equilibrium distribution matches the one of the all-atom model. In practice, this means that an optimal CG potential satisfies the condition:

$$U(\mathbf{x}; \theta) = -k_B T \ln p^{CG}(\mathbf{x}) + \text{const}, \qquad (2)$$

where $k_B$ is the Boltzmann constant, $T$ is the temperature, and the probability density distribution $p^{CG}(\mathbf{x})$ in the CG space is given by the equilibrium distribution of the all-atom model mapped to the CG coordinates:

$$p^{CG}(\mathbf{x}) \propto \int \mu(\mathbf{r})\, \delta\big(\mathbf{x} - \xi(\mathbf{r})\big)\, d\mathbf{r}, \qquad (3)$$

where $\mu(\mathbf{r}) \propto e^{-V(\mathbf{r})/k_B T}$ is the Boltzmann weight associated with the atomistic energy $V(\mathbf{r})$.
Different methods have been proposed to construct a CG potential that satisfies Eq. 2, notably the relative entropy method (Shell2008) and the force-matching method (Izvekov2005; Noid2008). In this work we demonstrate how the molecular CG potential can be learned by combining the idea of force matching with the GDML kernel method.
II.2 Force Matching
It can be shown that the potential $U(\mathbf{x}; \theta)$ satisfies thermodynamic consistency if the associated CG forces minimize the mean square error (Izvekov2005; Noid2008):

$$\chi^2(\theta) = \left\langle \left\| \mathbf{F}(\mathbf{r}) + \nabla_{\mathbf{x}} U\big(\xi(\mathbf{r}); \theta\big) \right\|^2 \right\rangle_{\mathbf{r}}, \qquad (4)$$

where $\mathbf{F}(\mathbf{r})$ denotes the instantaneous all-atom forces projected onto the CG space, and $\langle \cdot \rangle_{\mathbf{r}}$ is the weighted average over the equilibrium distribution of the atomistic model, i.e., $\mu(\mathbf{r}) \propto e^{-V(\mathbf{r})/k_B T}$.
The effective CG force field that minimizes $\chi^2$ corresponds to the mean force (Noid2008):

$$-\nabla_{\mathbf{x}} U(\mathbf{x}) = \left\langle \mathbf{F}(\mathbf{r}) \right\rangle_{\mathbf{r} \mid \xi(\mathbf{r}) = \mathbf{x}}, \qquad (5)$$

where $\langle \cdot \rangle_{\mathbf{r} \mid \xi(\mathbf{r}) = \mathbf{x}}$ indicates the equilibrium average over the atomistic configurations $\mathbf{r}$ constrained to the CG coordinates $\mathbf{x}$, i.e. over the ensemble of all atomistic configurations that map to the same CG configuration. For this reason, the optimal CG potential is also called the potential of mean force (PMF).
Following statistical estimator theory (Vapnik_IEEE99_StatisticalLearningTheory), it can also be shown (wang2019) that the error (Eq. 4) decomposes into two terms:

$$\chi^2(\theta) = \chi^2_{\mathrm{PMF}}(\theta) + \chi^2_{\mathrm{noise}}, \qquad (6)$$

where

$$\chi^2_{\mathrm{PMF}}(\theta) = \left\langle \left\| \nabla_{\mathbf{x}} U\big(\xi(\mathbf{r}); \theta\big) - \nabla_{\mathbf{x}} U^{\mathrm{PMF}}\big(\xi(\mathbf{r})\big) \right\|^2 \right\rangle_{\mathbf{r}}, \qquad
\chi^2_{\mathrm{noise}} = \left\langle \left\| \mathbf{F}(\mathbf{r}) + \nabla_{\mathbf{x}} U^{\mathrm{PMF}}\big(\xi(\mathbf{r})\big) \right\|^2 \right\rangle_{\mathbf{r}}. \qquad (7)$$

While the PMF error term $\chi^2_{\mathrm{PMF}}$ depends on the definition of the CG potential and can in principle be reduced to zero, the noise term $\chi^2_{\mathrm{noise}}$ does not depend on the CG potential: it is solely associated with the decrease in the number of degrees of freedom in the CG mapping, and it is in general larger than zero. The force matching estimator of Eq. 4 is thus intrinsically very noisy.
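This decomposition can be checked numerically in a toy setting: if many atomistic force labels correspond to a single CG input, the constant prediction that minimizes the force-matching MSE is their mean, and the residual error at the optimum equals the label variance, i.e. the noise term. A minimal sketch (the Gaussian label distribution is our stand-in for the true conditional force distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one fixed CG configuration; many atomistic configurations map
# to it, each carrying a different instantaneous projected force F.
F = rng.normal(loc=2.0, scale=1.0, size=100_000)  # noisy force labels

def chi2(f):
    """Force-matching MSE for a constant force prediction f at this CG point."""
    return np.mean((F - f) ** 2)

f_mean = F.mean()  # empirical mean force (the PMF gradient estimate)

# The mean minimizes the MSE; any other prediction pays an extra penalty.
assert chi2(f_mean) <= chi2(f_mean + 0.5)
# The residual chi2 at the optimum is the irreducible noise term: Var(F).
assert abs(chi2(f_mean) - F.var()) < 1e-8
```

In other words, even a perfect model cannot drive the loss of Eq. 4 below $\chi^2_{\mathrm{noise}}$.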
II.3 GDML
In previous work (wang2019), we introduced CGnet to minimize the error in Eq. 4 using a neural network to parametrize the CG forces, and demonstrated that the CGnet approach successfully recovers optimal CG potentials. A large training dataset enables CGnet to resolve the ambiguity in the coarse-grained force labels by converging to the respective mean forces. Here, we explore the Gradient-Domain Machine Learning (GDML) approach (ChmielaEtAl_SciAdv17_EnergyConserving; ChmielaEtAl_NatComm18_TowardExact) as an alternative.
GDML has been used to obtain an accurate reconstruction of flexible molecular force fields from small reference datasets of high-level ab initio calculations (ChmielaEtAl_NatComm18_TowardExact; CHMIELA201938; ChmielaEtAl_SciAdv17_EnergyConserving). Unlike traditional classical force fields, this approach imposes no hypothesized interaction pattern for the nuclei and is thus unhindered in modeling complex physical phenomena. Instead, GDML imposes energy conservation as an inductive bias, a fundamental property of closed classical and quantum mechanical systems that does not limit generalization. This makes highly data-efficient reconstruction possible without sacrificing generality.
The key idea is to use a Gaussian process ($\mathcal{GP}$) to model the force field $\mathbf{f}_F$ as a transformation of an unknown potential energy surface, such that

$$\hat{\mathbf{f}}_F(\mathbf{x}) \sim \mathcal{GP}\!\left[ -\nabla \mu_E(\mathbf{x}),\; \nabla_{\mathbf{x}}\, k_E(\mathbf{x}, \mathbf{x}')\, \nabla_{\mathbf{x}'}^{\top} \right]. \qquad (8)$$

Here, $\mu_E$ and $k_E$ are the mean and covariance functions of the corresponding energy predictor, respectively.
To help disambiguate physically equivalent inputs, the Cartesian geometries are represented by a descriptor $D$ with entries:

$$D_{ij} = \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^{-1} \quad \text{for } i > j, \qquad (9)$$

which introduces roto-translational invariance. Accordingly, the posterior mean of the GDML model takes the form
$$\hat{\mathbf{f}}_F(\mathbf{x}) = \sum_{i=1}^{M} \mathbf{J}_D(\mathbf{x})^{\top} \left[ \nabla_D \nabla_{D_i}^{\top}\, k\big(D(\mathbf{x}), D(\mathbf{x}_i)\big) \right] \boldsymbol{\alpha}_i, \qquad (10)$$

where $\mathbf{J}_D$ is the Jacobian of the descriptor and the $\boldsymbol{\alpha}_i$ are the trained model coefficients (see Supplementary Information for details). Due to linearity, the corresponding expression for the energy predictor can be obtained via (analytic) integration. GDML uses a Matérn kernel with restricted differentiability ($\nu = 5/2$) to construct the force field kernel function:

$$k(d) = \left( 1 + \frac{\sqrt{5}\, d}{\sigma} + \frac{5 d^2}{3 \sigma^2} \right) \exp\!\left( -\frac{\sqrt{5}\, d}{\sigma} \right), \qquad d = \left\| D(\mathbf{x}) - D(\mathbf{x}') \right\|. \qquad (11)$$
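As an illustration, both the inverse-distance descriptor of Eq. 9 and a Matérn-type kernel can be sketched in a few lines (function names are ours; we assume the common $\nu = 5/2$ form of the Matérn kernel, which is exactly twice differentiable):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverse_distance_descriptor(R):
    """GDML-style descriptor: inverse pairwise distances 1/||r_i - r_j||
    for i > j; invariant under global rotations and translations."""
    n = len(R)
    return np.array([1.0 / np.linalg.norm(R[i] - R[j])
                     for i in range(n) for j in range(i)])

def matern52(d, sigma):
    """Matern kernel with nu = 5/2, as a function of the distance d."""
    s = np.sqrt(5.0) * d / sigma
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

R = rng.standard_normal((6, 3))         # a 6-bead CG configuration
D = inverse_distance_descriptor(R)
assert D.shape == (15,)                 # 6*5/2 pairs
# The descriptor is unchanged by a global translation of the molecule:
assert np.allclose(D, inverse_distance_descriptor(R + 1.0))

# Kernel value between two configurations, on the descriptor distance:
D2 = inverse_distance_descriptor(R + 0.1 * rng.standard_normal((6, 3)))
k = matern52(np.linalg.norm(D - D2), sigma=1.0)
assert 0.0 < k <= 1.0                   # k(x, x) = 1, decays with distance
```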
We use this kernel because empirical evidence indicates that kernels with limited smoothness yield better predictors, even if the prediction target is infinitely differentiable. It is generally assumed that overly smooth priors are detrimental to data efficiency, as the associated hypothesis space is harder to constrain with a finite number of (potentially noisy) training examples (rasmussen2004gaussian)
. The differentiability of functions is directly linked to the rate of decay of their spectral density at high frequencies, which has been shown to play a critical role in spatial interpolation
(stein1999).

II.4 Ensemble Learning
Ensemble learning is a general and widely used machine learning technique to increase the predictive performance of a trained model by combining multiple submodels (Opitz1999; Polikar; Rokach2010; Breiman1996; geman1992neural; Hansen1990; Schapire1990; Kuncheva2003; Bauer1999). In this work we use the idea at the basis of a particular ensemble learning method, called bootstrap aggregation in the machine learning literature (Breiman1996), summarized in the following paragraphs and in Algorithm 1. This method makes it possible to train a GDML approach over millions of data points, a task that would otherwise be impossible.
In general, we generate a finite set of alternative GDML reconstructions from randomly drawn subsets of the full MD trajectory and average them to generate an estimate for the "expected" force prediction at each point. The variability in the individual training sets promotes flexibility in the structure across all models in the ensemble and enables us to capture the variability in the dataset. We can then compute the expected value for each input by simply taking the mean of the ensemble.
Suppose we have a large data set $D = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ for training that contains $N$ samples of input-output pairs. We would like to train a predictive model $f$ such that $f(\mathbf{x}) \approx \mathbf{y}$, using the data $D$. Instead of training a single model using all $N$ data points from $D$, we first randomly sample $K$ batches $D_1, \dots, D_K$, where each batch contains $M$ points. Usually $N$ is too large to efficiently train a single model, but it is possible to train submodels $f_k$ on the different batches if $M \ll N$. After training on all the batches, the final predictive model is obtained as the average of all the submodels:

$$\bar{f}(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K} f_k(\mathbf{x}). \qquad (12)$$
This enables us to generate consistent labels for a held-out subset of the trajectory, which then serves as the basis for another GDML reconstruction.
We demonstrate how bootstrap aggregation works on a simple example, where we learn an effective curve to fit a one-dimensional data set. As shown in Fig. 1a, 600 raw points are uniformly sampled on an interval, and the value of each point is assigned according to a ground-truth curve plus random noise. These 600 points serve as the noisy training set. Instead of learning the curve using all 600 points at once, we bootstrap-sample 100 batches from the full data set, where each batch contains only 20 points. We use a sixth-order polynomial function to fit the 20 points in each batch, and the 100 fitted curves are shown in blue in Fig. 1a. While each of these 100 blue curves oscillates around the mean and overfits the data, the mean of the 100 predictors (red curve) is smooth and agrees with the ground truth (green curve) quite well.

We use the idea of ensemble learning to apply GDML to CG problems, as a 2-layer procedure. Instead of training one single GDML model using all the data, which usually exceeds the upper memory limit of GDML, we train $K$ models using $K$ data batches, where each batch contains only $M = 1000$ points. Since 1000 points is far below the GDML limit, each GDML model is easy to train. After obtaining all $K$ GDML models $f_1, \dots, f_K$, we use them to predict the forces for any given CG configuration $\mathbf{x}$, where the $k$th model predicts $f_k(\mathbf{x})$. The mean force (CG force) for a configuration is then the average over all the models: $\bar{f}(\mathbf{x}) = \frac{1}{K} \sum_k f_k(\mathbf{x})$. This average force prediction could be directly used in the CG molecular simulation, but the resulting model would be inefficient, since for each single configuration the forces would need to be evaluated for all $K$ models to obtain an average CG force.
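The one-dimensional example above can be reproduced in a few lines of code. The interval, ground-truth curve, and noise level below are stand-ins of our choosing, since the exact values appear only in Fig. 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the figure's interval and curve:
# 600 points on [-1, 1], y = sin(3x) plus Gaussian noise.
x_all = rng.uniform(-1.0, 1.0, 600)
y_all = np.sin(3.0 * x_all) + rng.normal(0.0, 0.3, 600)

# Bootstrap 100 batches of 20 points; fit a 6th-order polynomial to each.
models = []
for _ in range(100):
    idx = rng.choice(600, size=20, replace=False)
    models.append(np.polynomial.Polynomial.fit(x_all[idx], y_all[idx], deg=6))

x_test = np.linspace(-1.0, 1.0, 200)
pred_mean = np.mean([m(x_test) for m in models], axis=0)  # ensemble average

# Each individual batch model overfits its 20 noisy points; the averaged
# predictor is much smoother and tracks the ground truth closely.
err_mean = np.mean((pred_mean - np.sin(3.0 * x_test)) ** 2)
err_single = np.mean((models[0](x_test) - np.sin(3.0 * x_test)) ** 2)
```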
This low evaluation efficiency motivates a 2-layer procedure to speed up the evaluation of the mean force prediction. We generate a new batch of data $D'$ which contains $M'$ points (a few thousand in this work). For all CG configurations in $D'$, we use the $K$ predictors to evaluate their forces and compute the mean forces. This produces a new data set in which the configurations are associated with the corresponding mean forces. Constructing $D'$ is fast because the mean forces are computed only for $M'$ points.
Once the mean force set $D'$ is obtained, we can train a single final model $f'$ on the entire $D'$. Since $D'$ contains the mean forces, the final model also predicts the mean forces for the CG configurations. $f'$ is easy to train due to the small size of $D'$ ($M'$ is far below the GDML limit), and the force evaluation for the final model is much more efficient than evaluating all $K$ models $f_1, \dots, f_K$. The general procedure of the 2-layer scheme is illustrated in Algorithm 1.
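The 2-layer procedure can be sketched end-to-end on a 1D toy problem, with a plain Gaussian kernel ridge regressor standing in for GDML (the actual model uses the force kernel of Eqs. 8-11; all names and hyperparameters below are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_krr(X, y, sigma=0.3, lam=1e-2):
    """Gaussian kernel ridge regression in closed form (stand-in for GDML)."""
    K = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * sigma**2))
    coef = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Xq: np.exp(
        -((Xq[:, None] - X[None, :]) ** 2) / (2 * sigma**2)) @ coef

# Noisy 1D training data, mimicking the degenerate CG force labels.
X = rng.uniform(-2, 2, 20_000)
y = np.sin(X) + rng.normal(0.0, 1.0, 20_000)

# Layer 1: K sub-models, each trained on a small batch of M points.
K_batches, M = 20, 1000
submodels = [fit_krr(X[k * M:(k + 1) * M], y[k * M:(k + 1) * M])
             for k in range(K_batches)]

# Average the sub-model predictions on a held-out set D' -> de-noised labels.
X_prime = rng.uniform(-2, 2, 2000)
y_prime = np.mean([m(X_prime) for m in submodels], axis=0)

# Layer 2: one final, cheap-to-evaluate model trained on (X_prime, y_prime).
final_model = fit_krr(X_prime, y_prime)
```

Evaluating `final_model` costs one kernel expansion over 2000 points instead of twenty expansions over 1000 points each.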
The hyperparameters that control the performance of the final model are the two kernel sizes $\sigma_1$ and $\sigma_2$, one for each layer (see Eq. 11). Another hyperparameter is the regularization coefficient $\lambda$ of the ridge term, which is set to the standard value used in the original GDML paper (ChmielaEtAl_SciAdv17_EnergyConserving). We conduct a 2D cross-validation search to determine $\sigma_1$ and $\sigma_2$; the algorithm for the cross-validation of the ensemble learning GDML is shown in Algorithm 2.
II.5 Stratified Sampling
Another crucial factor that impacts the overall performance of our machine learning model is the distribution of the training data. As our training data are obtained from extensive MD simulations, they are distributed according to the Boltzmann distribution in the molecular configuration space. If a small batch of data is randomly sampled from the whole data set, the large majority of the data will reside in low free energy regions, while data in high free energy regions, such as transition barriers, are underrepresented. Fig. 2a shows that, in the case of alanine dipeptide, most of the data in a small batch of randomly selected points are located in the free energy minima on the left side of the dihedral angle space. If batches from this biased distribution are used in the ensemble learning, the errors in predicting the PMF in high free energy regions would be very large, because the models are not trained efficiently in these sparse data regions.
In order to solve this issue, we sample the data for the batches uniformly in the dihedral angle space of alanine dipeptide, as shown in Fig. 2b. In this way, all relevant regions in the configurational space are equally represented in the training set, including transition states. The advantage of this strategic sampling is illustrated in more detail in section III.
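A minimal version of this stratified selection might look as follows: bin the trajectory on a grid in the dihedral space and draw evenly across the occupied bins, rather than proportionally to the Boltzmann-weighted trajectory distribution (the function name, grid resolution, and toy trajectory are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def stratified_batch(phi, psi, batch_size, n_bins=10):
    """Draw frame indices approximately uniform in (phi, psi) space."""
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)[1:-1]  # interior bin edges
    cell = np.digitize(phi, edges) * n_bins + np.digitize(psi, edges)
    occupied = np.unique(cell)
    per_cell = max(1, batch_size // len(occupied))
    picked = []
    for c in occupied:  # draw (up to) the same number of frames per cell
        members = np.flatnonzero(cell == c)
        picked.extend(rng.choice(members, size=min(per_cell, len(members)),
                                 replace=False))
    return np.array(picked[:batch_size])

# Toy trajectory: one heavily populated free energy basin plus a thin
# Boltzmann tail spread over the rest of the dihedral space.
phi = np.concatenate([rng.normal(-1.5, 0.2, 9000),
                      rng.uniform(-np.pi, np.pi, 1000)])
psi = np.concatenate([rng.normal(1.0, 0.2, 9000),
                      rng.uniform(-np.pi, np.pi, 1000)])
batch = stratified_batch(phi, psi, batch_size=1000)
```

In the drawn batch, the sparse regions are represented far beyond their 10% share of the trajectory, which is exactly the rebalancing the text describes.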
II.6 Simulating the CG-GDML Model
After training the 2-layer GDML model, we use overdamped Langevin dynamics to generate a trajectory and sample the CG potential $U(\mathbf{x})$:

$$\mathbf{x}_{t+\tau} = \mathbf{x}_t - \frac{\tau D}{k_B T} \nabla_{\mathbf{x}} U(\mathbf{x}_t) + \sqrt{2 \tau D}\, \boldsymbol{\eta}_t, \qquad (13)$$

where $\mathbf{x}_t$ ($\mathbf{x}_{t+\tau}$) is the CG configuration at time $t$ ($t+\tau$), $\tau$ is the time step, $D$ is the diffusion constant, and $\boldsymbol{\eta}_t$ is a vector of independent Gaussian random variables with zero mean and identity covariance matrix (Wiener process). To sample the trained potential more efficiently, we generate 100 independent trajectories in parallel, with initial configurations randomly sampled from the original data set.
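A sketch of such an Euler-Maruyama integrator, in reduced units ($D = k_B T = 1$, our choice for illustration) and checked on a harmonic potential, where the sampled variance must approach the Boltzmann value $k_B T$:

```python
import numpy as np

rng = np.random.default_rng(4)

def overdamped_langevin(force, x0, n_steps, dt=5e-3, D=1.0, kT=1.0):
    """Integrate x_{t+dt} = x_t + dt*D/kT * F(x_t) + sqrt(2*D*dt) * eta_t."""
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps, x.size))
    for t in range(n_steps):
        x = (x + dt * D / kT * force(x)
             + np.sqrt(2.0 * D * dt) * rng.standard_normal(x.size))
        traj[t] = x
    return traj

# Sanity check on U(x) = x^2/2, so F(x) = -x: after equilibration the
# sampled variance should be close to kT = 1.
traj = overdamped_langevin(lambda x: -x, x0=[0.0], n_steps=100_000)
```

In the actual CG simulation, `force` would be the (negative) gradient of the trained 2-layer GDML potential plus the prior forces.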
II.7 Including Physical Constraints
When overdamped Langevin dynamics (Eq. 13) is used to generate a trajectory with a CG potential trained on a finite dataset, one undesired situation may happen: since the dynamics is stochastic, there is a chance that the simulated CG trajectory diffuses away from the domain of the data used in the training, generating unphysical configurations. For example, the stretching of a bond too far away from the equilibrium distance is associated with a very high energy cost and is never observed in a simulation with a force field at finite temperature. In a simulation with a machine-learned CG potential, there is no mechanism preventing such unphysical bond stretching. Similarly to what we proposed in our recent work (wang2019), this problem can be solved by including a prior potential energy incorporating physical prior knowledge on the system:

$$U(\mathbf{x}; \theta) = U_{\mathrm{GDML}}(\mathbf{x}; \theta) + U_{\mathrm{prior}}(\mathbf{x}),$$

where $U_{\mathrm{prior}}$ has harmonic terms modeling bond and angle stretching, with parameters extracted from the training data by Boltzmann inversion. $U_{\mathrm{GDML}}$ is the difference between the total CG potential and $U_{\mathrm{prior}}$. The forces obey a similar relation:

$$\mathbf{F}(\mathbf{x}; \theta) = \mathbf{F}_{\mathrm{GDML}}(\mathbf{x}; \theta) + \mathbf{F}_{\mathrm{prior}}(\mathbf{x}),$$

so the loss function of the model becomes:

$$\chi^2(\theta) = \left\langle \left\| \mathbf{F}(\mathbf{r}) - \mathbf{F}_{\mathrm{prior}}\big(\xi(\mathbf{r})\big) - \mathbf{F}_{\mathrm{GDML}}\big(\xi(\mathbf{r}); \theta\big) \right\|^2 \right\rangle_{\mathbf{r}}. \qquad (14)$$
Differently from what was done in the neural network model (wang2019), the prior potential is not added directly to the trained model: the prior forces are first evaluated and subtracted from the all-atom forces, and the GDML model is trained on this force difference. Once the model is trained, the total energy (and forces) is obtained by adding the prior energy (and forces) back to the output of the trained model.
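For a harmonic bond term, Boltzmann inversion reduces to matching the first two moments of the bond-length distribution: the equilibrium length is the sample mean and the spring constant is $k = k_B T / \mathrm{Var}(r)$. A sketch in reduced units, with hypothetical sample statistics of our choosing:

```python
import numpy as np

rng = np.random.default_rng(5)

def harmonic_prior_from_data(bond_lengths, kT=1.0):
    """Fit U_prior(r) = 0.5*k*(r - r0)^2 by direct Boltzmann inversion:
    for a Gaussian marginal, r0 is the mean and k = kT / Var(r)."""
    r0 = bond_lengths.mean()
    k = kT / bond_lengths.var()
    return r0, k

# Hypothetical bond-length samples fluctuating around 1.5 (reduced units).
samples = rng.normal(1.5, 0.05, 100_000)
r0, k = harmonic_prior_from_data(samples)

def prior_force(r):
    # Prior force, subtracted from the projected atomistic force labels
    # before GDML training and added back at evaluation time.
    return -k * (r - r0)
```

Angle terms would be handled the same way on the angle marginals; the true prior of this work also includes them.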
III Results
We illustrate the results of the approach discussed above on a simple molecular system, namely the coarse-graining of the alanine dipeptide molecule from the atomistic model in explicit water into a 6-bead CG model. The all-atom model of alanine dipeptide consists of 22 atoms and 651 water molecules, for a total of a few thousand degrees of freedom. As illustrated in Fig. 3, for the CG representation we select the 5 central backbone atoms of the molecule, with an additional 6th atom to break the symmetry and differentiate right- and left-handed representations. The overall pipeline for the coarse-graining and the training procedure discussed below is also summarized in Fig. 3.
We compute the free energy of alanine dipeptide as a function of the two dihedral angles $\phi$ and $\psi$, where $\phi$ is defined by atoms 1, 2, 3, 5, and $\psi$ by atoms 2, 3, 5, 6 (see Fig. 3). As shown in Fig. 4a, there are six metastable states in the free energy landscape of the all-atom model of alanine dipeptide. Fig. 4b shows that the final 2-layer GDML CG model correctly reproduces this free energy landscape: the free energy obtained from the trajectories (generated by numerical integration of Eq. 13) of the CG model also exhibits six minima, with depths close to those of the corresponding minima in the all-atom model. Representative configurations from the six metastable states are shown in Fig. 4a for the all-atom (CPK representation) and the CG model (thick bond representation). Moreover, as shown in the Supplementary Information, the bond and angle distributions from the CG simulation are also consistent with the all-atom simulation.
The GDML model shown in Fig. 4b is optimized based on the minimum cross-validation error over a 2-dimensional grid spanned by the parameters $\sigma_1$ and $\sigma_2$, the kernel widths of the 1st- and 2nd-layer models. We select the values giving the smallest cross-validation error; details on the cross-validation search can be found in the Supplementary Information.
Fig. 4c reports the free energy landscape corresponding to a CG model obtained with a 2-layer GDML where the selection of the data for the submodels is performed according to the Boltzmann distribution (that is, uniform sampling along the MD trajectory) instead of the stratified sampling scheme discussed above (uniform sampling in the $(\phi, \psi)$ space). While the free energy around the region of the deepest free energy minima is quite accurate, the sparsely populated metastable state (indicated as state 3 in Fig. 4a) is completely missing in Fig. 4c, because of the scarcity of training points in this region.
As a comparison, Fig. 4d shows the results when a single-layer GDML model is trained on only 2000 points. Although this model identifies the general location of the metastable states, the free energy landscape is significantly distorted with respect to the all-atom one. This poor reconstruction performance is due to the limited size of the training set, which is not extensive enough to enable a stable estimate of the expected forces for the reduced representation of the input. We also trained a single-layer GDML model on 5000 points. As shown in Fig. 4e, the free energy of this model presents a slight improvement with respect to Fig. 4d because of the increased number of training points. However, the overall quality is still low compared to the atomistic model. We expect the reconstructed free energy to improve further if the model were trained on much more data, but this is hindered by the memory requirements: about 160 GB of memory is needed to train a model with 5000 points, which is almost at the upper limit of our computational resources.
To quantify the performance of the different approaches, we compute the mean square error (MSE) of the free energy of the different CG models with respect to the atomistic model (Fig. 4a, Table 1, see (wang2019) for details). As expected, the 2-layer GDML model has the smallest free energy MSE when trained with all 1000 batches. The single-layer GDML gives the largest free energy MSE, both when trained with 2000 points and when trained with 5000 points. If no stratified sampling is used, the free energy MSE is intermediate, and most of it is due to the discrepancies in the high free energy regions.
As a baseline, we also compute the free energy MSE obtained by a CG model designed by means of a neural network, CGnet (wang2019). Previously, we applied CGnet to alanine dipeptide, but that model was based on a 5-atom CG scheme and included two dihedral angles as input features to break the symmetry. To make the CGnet model consistent with the CG scheme used in this work, we modified it to contain 6 atoms (as in the GDML model), with no dihedral angle features (only distances are used as input). This CGnet model is trained with the same number of points as the GDML model (i.e., 1,000,000 points from 1000 batches). The resulting CGnet free energy MSE is slightly larger than that of the 2-layer GDML model. This result shows that the accuracy of a kernel approach can indeed be comparable to or even better than a neural network approach on the same system.
We have also investigated the effect of the batch number (or, equivalently, the training set size). We computed the cross-validation error for training sets ranging from 100 to 1000 batches for the GDML model, or equivalently from 100 k to 1000 k points for CGnet. Fig. 5a shows that as the batch number increases, the cross-validation error for the GDML model drops quickly and reaches convergence. The cross-validation error for CGnet is significantly larger than for the GDML model when the number of batches (or, equivalently, the training set size) is small. When the batch number is larger than 200, the cross-validation error for CGnet becomes smaller than for GDML. Similarly, if we compare the free energy MSEs, as shown in Fig. 5b, the free energy constructed by GDML with a small training set is significantly better than the corresponding free energy constructed by CGnet, while with a large training set the MSEs are comparable to each other. Typical free energy profiles are shown in Fig. 6a-d, and the corresponding MSE values are listed in Table 1. These results show that, given enough data, the 2-layer GDML model and CGnet perform similarly well. However, the 2-layer GDML model is more data efficient and better extrapolates the force prediction to unsampled configurations, thus outperforming CGnet for small training sets.
Table 1. Free energy MSE of the different CG models as a function of training set size (100 k and 1000 k points for CGnet and the 2-layer GDML models; 2 k and 5 k points for the 1-layer GDML model).

Model                      | 100 k | 1000 k | 2 k | 5 k
CGnet                      |       |        |  -  |  -
2-layer GDML               |       |        |  -  |  -
Boltz. Samp. 2-layer GDML  |       |        |  -  |  -
1-layer GDML               |   -   |   -    |     |
IV Conclusions
In this work we combine the idea of ensemble learning with GDML to apply it to the coarse-graining problem. GDML is a kernel method to learn molecular force fields from data, which models nonlinearity and multibody effects without the need to provide a functional form for the potential. The GDML approach was originally proposed to learn molecular forces from quantum simulation data. When quantum calculations are used, the error on the force matching loss could in principle be zero, and a few thousand points are enough to build an accurate, smooth, and conservative force field. However, when applied to coarse-graining, the force matching loss contains a nonzero term due to the dimensionality reduction, and the learning problem becomes very noisy. For this reason, many more data points from atomistic simulations are needed to learn a CG potential of mean force. The large amount of input data would presently hinder the application of GDML to the CG problem. In order to circumvent this problem, we use ensemble learning. The basic idea consists in breaking down the learning problem into small batches that can be more easily solved, and combining the resulting models into a final solution. Following this approach, we do not train one single GDML model using all the data, but propose a 2-layer training scheme: in the first layer, we generate $K$ data batches, each containing a number of points far below the GDML limit; $K$ models are trained on the different batches and are combined into a final model by taking the average. We show that the prediction of the 2-layer CG model accurately reproduces the thermodynamics of the atomistic model.
Consistently with previous work (wang2019), we show that, when applying machine learning methods to design force fields for molecular systems, the addition of physical constraints enforces the proper asymptotic behavior of the model. In the design of CG potentials, physical constraints can be introduced by means of a prior potential energy term that prevents the appearance of spurious effects in nonphysical regions of the configurational landscape.
A good GDML model should be able to construct a smooth and globally connected conservative force field. However, when the 2-layer approach is used, some of the molecular configurations with high free energy are poorly sampled in the training set, introducing large errors in the resulting model. In order to solve this problem, we sample the data uniformly in the low dimensional space defined by two collective coordinates rather than uniformly from the simulation time series. In the example of alanine dipeptide discussed here, the dihedral angles $\phi$ and $\psi$ are chosen as collective coordinates.
In our previous work, we proposed CGnet (wang2019), a neural network approach to design CG models. The overall free energy reconstruction obtained with the GDML model is comparable in accuracy to what was obtained with CGnet when the training set is sufficiently large. However, the GDML model is significantly more accurate when the training set is small, indicating that a kernel approach is data-efficient and could in principle provide more accurate CG models, especially with small training sets.
However, there are still several challenges in applying GDML to the coarse-graining of macromolecular systems. In larger systems, a more general definition is needed for the collective coordinates defining the low dimensional space used for the uniform sampling of the training batches. These collective coordinates could in principle be extracted from the trajectory data (RohrdanzEtAl_AnnRevPhysChem13_MountainPasses; NoeClementi_COSB17_SlowCVs), for instance by means of time-lagged Independent Component Analysis (tICA) (PerezHernandez2013; SchwantesPande_JCTC13_TICA; ZieheMueller_ICANN98_TDSEP; Belouchrani1997; Molgedey_94), kernel PCA (Klaus1997; Klaus1998; Klaus2001), or diffusion maps (RohrdanzClementi_JCP134_DiffMaps).

The decomposition of the large input data set into an ensemble of small batches has been used here to solve memory issues when training a GDML model. However, the computation is still expensive, and we expect it to become even more so as the size of the molecular system increases. As the number of data batches and the batch size grow, the Nyström approximation of the kernel or other numerical approaches may be a promising way to increase the computational efficiency.
As with the neural network model, the GDML model trained by force matching can capture the thermodynamics of the system, but there is no guarantee that the dynamics are also preserved. Alternative approaches, such as spectral matching, need to be defined to solve this problem (Nuske2019, ).
Finally, the GDML model presented here is trained on a specific molecule and is not directly transferable to different systems. Ultimately, a transferable CG model would be needed for general application to large systems that cannot be simulated atomistically. The trade-off between accuracy and transferability in CG models is an open research question that we will investigate in future work.
V Supplementary Material
See Supplementary Material for more details about the hyperparameter search, a discussion on the prior energy, and more information on the descriptors used in the GDML.
Acknowledgements.
We thank Eugen Hruska and Feliks Nüske for comments on the manuscript. This work was supported by the National Science Foundation (CHE-1738990, CHE-1900374, and PHY-1427654), the Welch Foundation (C-1570), the MATH+ excellence cluster (AA1-6, EF1-2), the Deutsche Forschungsgemeinschaft (SFB 1114/A04), the European Commission (ERC CoG 772230 “ScaleCell”) and the Einstein Foundation Berlin (Einstein Visiting Fellowship to CC). Simulations have been performed on the computer clusters of the Center for Research Computing at Rice University, supported in part by the Big-Data Private-Cloud Research Cyberinfrastructure MRI award (NSF grant CNS-1338099), and on the clusters of the Department of Mathematics and Computer Science at Freie Universität Berlin. K.R.M. acknowledges partial financial support by the German Ministry for Education and Research (BMBF) under Grants 01IS14013A-E, 01IS18025A, 01IS18037A, 01GQ1115 and 01GQ0850; the Deutsche Forschungsgemeinschaft (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689; and the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (No. 2017-0-00451, No. 2017-0-01779). S.C. acknowledges the BASLEARN - TU Berlin/BASF Joint Laboratory, co-financed by TU Berlin and BASF SE. Part of this research was performed while the authors were visiting the Institute for Pure and Applied Mathematics (IPAM), which is supported by the National Science Foundation (Grant No. DMS-1440415).

The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
 [1] K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw. How fast-folding proteins fold. Science, 334:517–520, 2011.
 [2] I. Buch, M. J. Harvey, T. Giorgino, D. P. Anderson, and G. De Fabritiis. High-throughput all-atom molecular dynamics simulations using distributed computing. J. Chem. Inf. Model., 50:397–403, 2010.
 [3] M. Shirts and V. S. Pande. Screen savers of the world unite! Science, 290:1903–1904, 2000.
 [4] R. O. Dror, A. C. Pan, D. H. Arlow, D. W. Borhani, P. Maragakis, Y. Shan, H. Xu, and D. E. Shaw. Pathway and mechanism of drug binding to G-protein-coupled receptors. Proc. Natl. Acad. Sci. USA, 108:13118–13123, 2011.
 [5] D. Shukla, Y. Meng, B. Roux, and V. S. Pande. Activation pathway of Src kinase reveals intermediate states as targets for drug design. Nat. Commun., 5:3397, 2014.

 [6] N. Plattner and F. Noé. Protein conformational plasticity and complex ligand binding kinetics explored by atomistic simulations and markov models. Nat. Commun., 6:7653, 2015.
 [7] N. Plattner, S. Doerr, G. D. Fabritiis, and F. Noé. Protein-protein association and binding mechanism resolved in atomic detail. Nat. Chem., 9:1005–1011, 2017.
 [8] F. Paul, C. Wehmeyer, E. T. Abualrous, H. Wu, M. D. Crabtree, J. Schöneberg, J. Clarke, C. Freund, T. R. Weikl, and F. Noé. Protein-ligand kinetics on the seconds timescale from atomistic simulations. Nat. Commun., 8:1095, 2017.
 [9] F. Noé, A. Tkatchenko, K.-R. Müller, and C. Clementi. Machine learning for molecular simulation. Ann. Rev. Phys. Chem., in press, 2020.
 [10] D. Frenkel and B. Smit. Understanding Molecular Simulation: From Algorithms to Applications. Academic Press, 2nd edition, 2001.
 [11] G. M. Torrie and J. P. Valleau. Nonphysical sampling distributions in monte carlo free-energy estimation: Umbrella sampling. J. Comput. Phys., 23(2):187–199, 1977.
 [12] J. Kästner. Umbrella sampling. Wiley Interdisciplinary Reviews: Computational Molecular Science, 1(6):932–942, 2011.
 [13] R. H. Swendsen and J.-S. Wang. Replica monte carlo simulation of spin-glasses. Phys. Rev. Lett., 57:2607–2609, Nov 1986.
 [14] R. M. Neal. Sampling from multimodal distributions using tempered transitions. Stat. Comput., 6(4):353–366, Dec 1996.
 [15] R. Rajamani, K. J. Naidoo, and J. Gao. Implementation of an adaptive umbrella sampling method for the calculation of multidimensional potential of mean force of chemical reactions in solution. Proteins, 24:1775–1781, 2003.
 [16] J. Preto and C. Clementi. Fast recovery of free energy landscapes via diffusion-map-directed molecular dynamics. Phys. Chem. Chem. Phys., 16:19181–19191, 2014.
 [17] G. R. Bowman, D. L. Ensign, and V. S. Pande. Enhanced Modeling via Network Theory: Adaptive Sampling of Markov State Models. J. Chem. Theory Comput., 6(3):787–794, 2010.
 [18] C. Clementi. Coarse-grained models of protein folding: Toy-models or predictive tools? Curr. Opin. Struct. Biol., 18:10–15, 2008.
 [19] A. Davtyan, N. P. Schafer, W. Zheng, C. Clementi, P. G. Wolynes, and G. A. Papoian. AWSEM-MD: Protein structure prediction using coarse-grained physical potentials and bioinformatically based local structure biasing. J. Phys. Chem. B, 116(29):8494–8503, 2012.
 [20] S. Izvekov and G. A. Voth. A multiscale coarse-graining method for biomolecular systems. J. Phys. Chem. B, 109(7):2469–2473, 2005.
 [21] S. J. Marrink, A. H. de Vries, and A. E. Mark. Coarse grained model for semi-quantitative lipid simulations. J. Phys. Chem. B, 108(2):750–760, 2004.
 [22] F. Müller-Plathe. Coarse-graining in polymer simulation: From the atomistic to the mesoscopic scale and back. ChemPhysChem, 3(9):754–769, Sep 2002.
 [23] W. G. Noid. Perspective: Coarse-grained models for biomolecular systems. J. Chem. Phys., 139(9):090901, 2013.
 [24] S. O. Nielsen, C. F. Lopez, G. Srinivas, and M. L. Klein. A coarse grain model for n-alkanes parameterized from surface tension data. J. Chem. Phys., 119(14):7043–7049, 2003.
 [25] S. Matysiak and C. Clementi. Optimal combination of theory and experiment for the characterization of the protein folding landscape of S6: How far can a minimalist model go? J. Mol. Biol., 343:235–248, 2004.
 [26] S. Matysiak and C. Clementi. Minimalist protein model as a diagnostic tool for misfolding and aggregation. J. Mol. Biol., 363:297–308, 2006.
 [27] J. Chen, J. Chen, G. Pinamonti, and C. Clementi. Learning effective molecular models from experimental observables. J. Chem. Theory Comput., 14(7):3849–3858, 2018.
 [28] A. P. Lyubartsev and A. Laaksonen. Calculation of effective interaction potentials from radial distribution functions: A reverse monte carlo approach. Phys. Rev. E, 52(4):3730–3737, 1995.
 [29] M. Praprotnik, L. D. Site, and K. Kremer. Multiscale simulation of soft matter: From scale bridging to adaptive resolution. Ann. Rev. Phys. Chem., 59(1):545–571, 2008.
 [30] Y. Wang, W. G. Noid, P. Liu, and G. A. Voth. Effective force coarse-graining. Phys. Chem. Chem. Phys., 11(12):2002, 2009.
 [31] M. S. Shell. The relative entropy is fundamental to multiscale and inverse thermodynamic problems. J. Chem. Phys., 129(14):144108, 2008.

 [32] F. Noé, S. Olsson, J. Köhler, and H. Wu. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457), 2019.
 [33] F. Noé, G. D. Fabritiis, and C. Clementi. Machine learning for protein folding and dynamics. Curr. Op. Struct. Biol., 60:77–84, 2020.
 [34] K. T. Schütt, M. Gastegger, A. Tkatchenko, K.-R. Müller, and R. J. Maurer. Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions. Nat. Commun., 10(1):5024, 2019.
 [35] J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett., 98:146401, 2007.
 [36] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett., 104(13), 2010.
 [37] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108(5), 2012.
 [38] A. P. Bartók, M. J. Gillan, F. R. Manby, and G. Csányi. Machine-learning approach for one- and two-body corrections to density functional theory: Applications to molecular and condensed water. Phys. Rev. B, 88(5), 2013.
 [39] J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci., 8(4):3192–3203, 2017.
 [40] A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. R. Kermode, G. Csányi, and M. Ceriotti. Machine learning unifies the modeling of materials and molecules. Sci. Adv., 3(12), 2017.

 [41] K. T. Schütt, F. Arbabzadah, S. Chmiela, K.-R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nat. Commun., 8:13890, 2017.
 [42] J. S. Smith, B. Nebgen, N. Lubbers, O. Isayev, and A. E. Roitberg. Less is more: Sampling chemical space with active learning. J. Chem. Phys., 148(24):241733, 2018.
 [43] K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller. SchNet – a deep learning architecture for molecules and materials. J. Chem. Phys., 148(24):241722, 2018.
 [44] A. Grisafi, D. M. Wilkins, G. Csányi, and M. Ceriotti. Symmetry-adapted machine learning for tensorial properties of atomistic systems. Phys. Rev. Lett., 120(3), 2018.
 [45] G. Imbalzano, A. Anelli, D. Giofré, S. Klees, J. Behler, and M. Ceriotti. Automatic selection of atomic fingerprints and reference configurations for machine-learning potentials. J. Chem. Phys., 148(24):241730, 2018.
 [46] T. T. Nguyen, E. Székely, G. Imbalzano, J. Behler, G. Csányi, M. Ceriotti, A. W. Götz, and F. Paesani. Comparison of permutationally invariant polynomials, neural networks, and gaussian approximation potentials in representing water interactions through many-body expansions. J. Chem. Phys., 148(24):241725, 2018.
 [47] L. Zhang, J. Han, H. Wang, W. A. Saidi, R. Car, and W. E. End-to-end symmetry preserving interatomic potential energy model for finite and extended systems. arXiv:1805.09003, 2018.
 [48] L. Zhang, J. Han, H. Wang, R. Car, and W. E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett., 120:143001, 2018.
 [49] T. Bereau, R. A. DiStasio, A. Tkatchenko, and O. A. von Lilienfeld. Non-covalent interactions across organic and biological subsets of chemical space: Physics-based potentials parametrized from machine learning. J. Chem. Phys., 148:241706, 2018.
 [50] S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller. Machine learning of accurate energy-conserving molecular force fields. Sci. Adv., 3:e1603015, 2017.
 [51] S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun., 9:3887, 2018.
 [52] A. Mardt, L. Pasquali, H. Wu, and F. Noé. VAMPnets: Deep learning of molecular kinetics. Nat. Commun., 9:5, 2018.
 [53] H. Wu, A. Mardt, L. Pasquali, and F. Noé. Deep generative markov state models. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3975–3984. Curran Associates, Inc., 2018.

 [54] C. Wehmeyer and F. Noé. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. J. Chem. Phys., 148:241703, 2018.
 [55] C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, and V. S. Pande. Variational encoding of complex dynamics. arXiv:1711.08576, 2017.
 [56] J. M. L. Ribeiro, P. Bravo, Y. Wang, and P. Tiwary. Reweighted autoencoded variational bayes for enhanced sampling (RAVE). J. Chem. Phys., 149:072301, 2018.
 [57] S. T. John and G. Csányi. Many-body coarse-grained interactions using gaussian approximation potentials. J. Phys. Chem. B, 121(48):10934–10949, 2017.
 [58] L. Zhang, J. Han, H. Wang, R. Car, and W. E. DeePCG: constructing coarsegrained models via deep neural networks. arXiv:1802.08549, 2018.
 [59] J. Wang, S. Olsson, C. Wehmeyer, A. Pérez, N. E. Charron, G. de Fabritiis, F. Noé, and C. Clementi. Machine learning of coarsegrained molecular dynamics force fields. ACS Cent. Sci., April 2019.
 [60] W. G. Noid, J.-W. Chu, G. S. Ayton, V. Krishna, S. Izvekov, G. A. Voth, A. Das, and H. C. Andersen. The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models. J. Chem. Phys., 128(24):244114, 2008.
 [61] S. Chmiela, H. E. Sauceda, I. Poltavsky, K.-R. Müller, and A. Tkatchenko. sGDML: Constructing accurate and data efficient molecular force fields using machine learning. Comput. Phys. Commun., 240:38–45, 2019.
 [62] L. Boninsegna, R. Banisch, and C. Clementi. A data-driven perspective on the hierarchical assembly of molecular structures. J. Chem. Theory Comput., 14(1):453–460, 2018.
 [63] A. V. Sinitskiy, M. G. Saunders, and G. A. Voth. Optimal number of coarse-grained sites in different components of large biomolecular complexes. J. Phys. Chem. B, 116(29):8363–8374, 2012.
 [64] W. Wang and R. Gómez-Bombarelli. Coarse-graining autoencoders for molecular dynamics. npj Comput. Mater., 5(1), 2019.

 [65] V. N. Vapnik. An overview of statistical learning theory. IEEE Trans. Neur. Net., 10, 1999.
 [66] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
 [67] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer-Verlag New York, 1999.
 [68] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. J. Artif. Intell. Res., 11:169–198, 1999.
 [69] R. Polikar. Ensemble based systems in decision making. IEEE Circuits Syst. Mag., 6(3):21–45, 2006.

 [70] L. Rokach. Ensemble-based classifiers. Artif. Intell. Rev., 33(1):1–39, Feb 2010.
 [71] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123–140, 1996.
 [72] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Comput., 4(1):1–58, 1992.
 [73] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell., 12(10):993–1001, Oct 1990.
 [74] R. E. Schapire. The strength of weak learnability. Mach. Learn., 5(2):197–227, Jun 1990.
 [75] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach. Learn., 51(2):181–207, May 2003.
 [76] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Mach. Learn., 36(1):105–139, Jul 1999.
 [77] M. A. Rohrdanz, W. Zheng, and C. Clementi. Discovering mountain passes via torchlight: methods for the definition of reaction coordinates and pathways in complex macromolecular reactions. Ann. Rev. Phys. Chem., 64:295–316, 2013.
 [78] F. Noé and C. Clementi. Collective variables for the study of long-time kinetics from molecular trajectories: theory and methods. Curr. Opin. Struct. Biol., 43:141–147, 2017.
 [79] G. PérezHernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys., 139(1), 2013.
 [80] C. R. Schwantes and V. S. Pande. Improvements in markov state model construction reveal many non-native interactions in the folding of NTL9. J. Chem. Theory Comput., 9:2000–2009, 2013.
 [81] A. Ziehe and K.-R. Müller. TDSEP — an efficient algorithm for blind separation using time structure. In ICANN 98, pages 675–680. Springer Science and Business Media, 1998.
 [82] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines. A blind source separation technique using second-order statistics. IEEE Trans. Signal Process., 45(2):434–444, Feb 1997.
 [83] L. Molgedey and H. G. Schuster. Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett., 72:3634–3637, 1994.

 [84] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, editors, Artificial Neural Networks — ICANN’97, pages 583–588, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.
 [85] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10(5):1299–1319, 1998.
 [86] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw., 12(2):181–201, March 2001.
 [87] M. A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi. Determination of reaction coordinates via locally scaled diffusion map. J. Chem. Phys., 134:124116, 2011.
 [88] F. Nüske, L. Boninsegna, and C. Clementi. Coarse-graining molecular systems by spectral matching. J. Chem. Phys., 151(4):044116, 2019.