Molecular enhanced sampling with autoencoders: On-the-fly collective variable discovery and accelerated free energy landscape exploration

12/30/2017 ∙ by Wei Chen, et al. ∙ University of Illinois at Urbana-Champaign

Macromolecular and biomolecular folding landscapes typically contain high free energy barriers that impede efficient sampling of configurational space by standard molecular dynamics simulation. Biased sampling can artificially drive the simulation along pre-specified collective variables (CVs), but success depends critically on the availability of good CVs associated with the important collective dynamical motions. Nonlinear machine learning techniques can identify such CVs but typically do not furnish an explicit relationship with the atomic coordinates necessary to perform biased sampling. In this work, we employ auto-associative artificial neural networks ("autoencoders") to learn nonlinear CVs that are explicit and differentiable functions of the atomic coordinates. Our approach offers substantial speedups in exploration of configurational space, and is distinguished from existing approaches by its capacity to simultaneously discover and directly accelerate along data-driven CVs. We demonstrate the approach in simulations of alanine dipeptide and Trp-cage, and have developed an open-source and freely-available implementation within OpenMM.


1 Introduction

The predictive capacity of molecular dynamics (MD) calculations on large biomolecules is limited by a disparity between the large time scales for conformational motions and the short periods accessible to simulation 1, 2, 3. This metastable trapping of simulations behind large free energy barriers is a consequence of rough free energy surfaces (FES) and leads to incomplete sampling of the thermally accessible phase space 1, 2, 4, 5. Comprehensive sampling of all biologically relevant system states is required to compute converged thermodynamic averages, sample the structurally and functionally relevant molecular configurations, and explore large-scale collective motions critical to biological function such as allostery, substrate gating, and ligand binding 1, 2, 3.

To confront this limitation, a plethora of enhanced sampling techniques have been developed to accelerate barrier crossing in MD simulation 4, 5. These methodologies can be broadly partitioned into two categories: tempering / generalized ensemble techniques, and collective variable (CV) biasing 5. We distinguish enhanced sampling methods to perform accelerated exploration of configurational space from path-based techniques such as transition path sampling (TPS) 6, 7, 8, 9, 10, string methods 11, 12, 13, 14, and forward flux sampling (FFS) 15, 16 that quantify transition pathways and rates between predefined system states. Tempering techniques include simulated annealing 17, the multicanonical algorithm 18, replica exchange 19, 20, and Hamiltonian exchange 21, 22, 23, 24, 25, which modify the system temperature and/or Hamiltonian to accelerate barrier crossing 5. Collective variable biasing techniques accelerate system dynamics along pre-selected CVs 5, 4, and include non-Boltzmann sampling approaches such as umbrella sampling (US) 26, hyperdynamics 27, conformational flooding 28, metadynamics 29, 30, 31, adiabatic free energy dynamics (AFED) 32, temperature accelerated molecular dynamics (TAMD) 33 / driven adiabatic free energy dynamics (d-AFED) 34, temperature accelerated dynamics (TAD) 35, blue moon sampling 36, 37, 38, adaptive biasing force (ABF) 39, and thermodynamic integration (TI) 40, 41. Where "good" CVs coincident with the important collective motions that separate metastable states are available, CV biasing is typically more computationally efficient than tempering approaches, since it performs targeted acceleration along pre-selected order parameters 5, 42. In instances where such CVs are not known or poor choices are made, CV biasing can fail badly and tempering approaches generally prove more efficient and successful. It is normally difficult to intuit good CVs for all but the simplest systems.

Machine learning presents a systematic means to discover data-driven CVs. Conceptually, these approaches posit that the molecular dynamics in the 3N-dimensional space of the Cartesian coordinates of the N atoms are effectively restrained to a low-dimensional "intrinsic manifold" containing the slow dynamics, to which the remaining fast degrees of freedom are slaved 4, 43. The validity of this assumption has been demonstrated many times and corresponds to the emergence of a small number of collective modes from cooperative couplings between system degrees of freedom 44, 43, 45, 46, 47, 48, 49, 50, 51. CV discovery algorithms are therefore essentially dimensionality reduction or manifold learning approaches 44, 4, and the CVs are "good" in that they parameterize the intrinsic geometry of the phase space and can correspond to the slow motions of the system over the manifold 44, 45. Data-driven CV discovery suffers from a fundamental difficulty: to perform good sampling of configurational space one requires good CVs that are explicit functions of the atomic coordinates, but to discover these CVs one needs to perform good sampling of configurational space to generate data for CV discovery. This is the root of the biased sampling "chicken-and-egg problem" identified by Clementi and co-workers 4. Accordingly, data-driven CV discovery must typically be performed in an iterative fashion by interleaving rounds of accelerated sampling and CV discovery: each round of sampling provides more comprehensive sampling of configurational space, and each round of CV discovery determines collective variables parameterizing the slow dynamical motions over the space explored by the biased calculations.

Linear dimensionality reduction approaches such as principal components analysis (PCA) 52, 48, 51 and multidimensional scaling (MDS) 53 are straightforward to apply, and furnish CVs that are explicit functions of the atomic coordinates. As inherently linear approaches, however, the CVs are typically poor descriptors for complex biomolecules possessing convoluted and nonlinear intrinsic manifolds 4, 44, 45. Kernel PCA presents a means to alleviate this problem by applying a known nonlinear transformation of the atomic coordinates prior to dimensionality reduction 54, but the specification of appropriate kernels can be almost as difficult as intuiting the CVs themselves 44. Nonlinear manifold learning approaches include local tangent space alignment 55, locally linear embedding (LLE) 56, 57, Isomap 49, 58, 59, 60, Laplacian eigenmaps 61, Hessian eigenmaps 62, semidefinite embedding (SDE) / maximum variance unfolding (MVU) 59, diffusion maps 63, 45, 64, 65, 66, and sketch maps 67, 68, 69. These approaches have demonstrated great success in parameterizing nonlinear intrinsic manifolds of complex bio- and macromolecules by discovering nonlinear CVs corresponding to concerted structural motions 63, 45, 49, 50, 65, 66, 70, 71, 72. For the purposes of CV biasing, however, these approaches all share a critical deficiency: they do not furnish the mapping of atomic coordinates to the CVs. An explicit and differentiable mapping from the atomic coordinates to the CVs is required to perform biased MD simulations in the CVs, in order to propagate biasing forces in the CVs to forces on atoms 1, 4, 44, 73, 74. (Biased Monte Carlo requires that the CVs be explicit, but not necessarily differentiable, functions.) Without this mapping, it is not possible to perform biasing directly in CVs discovered by nonlinear learning.

A small number of approaches have been developed to perform indirect and approximate biasing in CVs for which the atomic mapping is unknown. We previously introduced perhaps the simplest approach, which correlates CVs with candidate physical variables (e.g., dihedral angles, radius of gyration) and performs biased sampling in these proxy variables 63. High-throughput screening can sieve putative physical variables and identify those best correlated with the CVs 75, 76, 77. This approach is, however, typically rather unsatisfactory, since it surrenders sampling in the CVs, and is very sensitive to the quality of the physical proxies. A related approach fits the parameters of CVs with predefined functional forms 78, but the definition of appropriate functions can be almost as difficult as defining the CVs themselves. A more sophisticated approach expresses atomic coordinates as a sum of localized basis functions over a small number of "landmarks" in the CV embedding 1, 60, 73, 79, and projects new data by performing a weighted nearest-neighbors embedding 73, 79 or solving a linear optimization problem to minimize the projection error 1, 60. Accordingly, direct sampling in the CVs is relinquished for an interpolative and approximate basis function expansion 44. An elegant strategy termed diffusion-map-directed molecular dynamics (DM-d-MD) proposed by Clementi and co-workers accelerates sampling by initializing unbiased simulations in poorly sampled regions of the CV projection 66, 72. Kevrekidis and co-workers recently proposed intrinsic map dynamics (iMapD) as an ingenious approach that first employs diffusion maps to construct a nonlinear parameterization of the local manifold, then combines boundary detection and local principal components analysis to smoothly extend the boundary and initialize short bursts of unbiased simulations in unexplored regions of configurational space 80. Both DM-d-MD and iMapD use CVs discovered by nonlinear manifold learning to define the limits of current exploration of configurational space, then judiciously initialize new unbiased simulations to drive further exploration and accelerate sampling. However, neither approach actually performs biased sampling directly in the CVs. This can make phase space exploration inefficient by presenting difficulties for surmounting high free energy barriers, down which unbiased simulations can rapidly tumble.

In this work we present a new approach based on auto-associative artificial neural networks ("autoencoders") 81, 82, 83, 84, 85 to discover nonlinear CVs and perform biasing along these coordinates. Crucially, neural networks offer a flexible and powerful means to discover nonlinear CVs and furnish an explicit and differentiable function of the atomic coordinates. These CVs may then be implemented directly in any CV biasing approach for enhanced sampling, such as umbrella sampling or metadynamics. This feature makes autoencoders uniquely suited to CV discovery for enhanced sampling, and stands in contrast to existing nonlinear machine learning approaches that require appealing to proxy variables, basis set projections, or local linear dimensionality reduction. We term our approach Molecular Enhanced Sampling with Autoencoders (MESA). In this article we introduce the theoretical and mathematical underpinnings of MESA, demonstrate it in applications to alanine dipeptide and Trp-cage, and make the approach freely available to the community through an open source plugin to the molecular simulation package OpenMM 86, 87, 88.

The structure of this paper is as follows. In the next section, we establish the theoretical and mathematical underpinnings of MESA including autoencoders, data-augmentation, the L-method for dimensionality identification, umbrella sampling, and the weighted histogram analysis method (WHAM). In Section 3, we demonstrate our approach to discover nonlinear CVs, perform enhanced sampling in these coordinates, and efficiently recover free energy surfaces for alanine dipeptide and Trp-cage. In Section 4, we close with our conclusions and outlook for future work and development.

2 Methods

2.1 Nonlinear dimensionality reduction using autoencoders

2.1.1 Auto-associative artificial neural networks (autoencoders)

Artificial neural networks (ANNs) are collections of simple computational nodes ("artificial neurons") linked together into a network 89, and are so called because of their similarity to biological neural networks. ANNs have risen to prominence as a flexible tool to perform nonlinear regression and classification due to their large expressive power and predictive capacity 90, with simple multilayer networks proven to be universal approximators of any nonlinear input-output relationship 89, 91. In this work, we focus on a class of auto-associative ANNs known as autoencoders, which are specifically designed to perform unsupervised nonlinear dimensionality reduction and furnish explicit and differentiable functions for the nonlinear projection 82, 92, 93. The structure of a prototypical 5-layer autoencoder is shown in Fig. 1. In this section we review the key mathematical details of these networks that we exploit within our CV discovery and enhanced sampling framework.

Figure 1: An auto-associative artificial neural network (autoencoder) with a 4-8-2-8-4 topology. The first half of the network passes inputs $\mathbf{z}$ through a hidden layer into a bottleneck layer containing the low-dimensional nonlinear projection $\boldsymbol{\xi}$. The second half of the network passes $\boldsymbol{\xi}$ through a hidden layer to approximately reconstruct the inputs in the output layer as $\hat{\mathbf{z}}$. Each node sums input signals from nodes in the previous layer according to adjustable weights plus a constant bias, and passes them through an activation function to generate its own output. Training the network parameters (weights and biases) to accurately reproduce its inputs (i.e., to autoencode) corresponds to discovery of a low-dimensional projection in the bottleneck layer capturing the salient features of the data. Image constructed using code downloaded from http://www.texample.net/tikz/examples/neural-network with the permission of the author Kjell Magne Fauske.

Architecture. The fundamental ansatz of the autoencoder is that there exists a low-dimensional subspace – the so-called "intrinsic manifold" 45 – containing most of the variance in the high-dimensional data $\mathbf{z} \in \mathbb{R}^d$. Training of the autoencoder can discover both the nonlinear projection $\boldsymbol{\xi} = \mathbf{f}(\mathbf{z})$ of the high-dimensional data to the intrinsic manifold and an approximate reconstruction $\hat{\mathbf{z}} = \mathbf{g}(\boldsymbol{\xi})$ of the high-dimensional states from their low-dimensional projections 82, 94. The architecture of the network encodes this ansatz within a five-layer, symmetric, fully-connected, feedforward topology. The projection function is encoded in the first three layers, in which an ensemble of $N$ high-dimensional data points $\{\mathbf{z}_n\}_{n=1}^{N}$ comprising system snapshots harvested from a molecular dynamics trajectory are fed to an input layer containing $d$ nodes, passed to the first hidden layer containing $M$ nodes, and finally projected down into a bottleneck layer comprising $K$ nodes containing the low-dimensional projections $\{\boldsymbol{\xi}_n\}$. The reconstruction function is encoded in the last three layers, in which the low-dimensional projections in the bottleneck layer are passed through the second hidden layer of $M$ nodes, and finally to the $d$-node output layer containing the reconstructed data points $\{\hat{\mathbf{z}}_n\}$. This is known as a $d$-$M$-$K$-$M$-$d$ autoencoder architecture. The number of input and output nodes $d$ is specified by the dimensionality of the data. The number of hidden nodes $M$ is flexible and may be tuned via cross-validation to trade off network expressive flexibility against complexity; choosing $M \approx 2d$, as in the 4-8-2-8-4 network of Fig. 1, often provides a good initial guess. The number of bottleneck nodes $K$ specifies the intrinsic dimensionality of the system. We present below an automated approach to ascertain an appropriate value of $K$.
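To make the architecture concrete, the following minimal sketch constructs the 4-8-2-8-4 network of Fig. 1 in Keras (the library used in our implementation, Section 2.5). The layer widths and use of the functional API are illustrative assumptions rather than an excerpt of our production code.

```python
# Illustrative sketch of a d-M-K-M-d autoencoder in Keras, sized to the
# 4-8-2-8-4 example of Fig. 1. Inputs are assumed scaled to [-1, 1] to
# match the tanh output activations.
from keras.models import Model
from keras.layers import Input, Dense

d, M, K = 4, 8, 2                                # input, hidden, bottleneck widths

inputs = Input(shape=(d,))                       # layer 1: linear input layer
h1 = Dense(M, activation='tanh')(inputs)         # layer 2: hidden layer
bottleneck = Dense(K, activation='tanh')(h1)     # layer 3: bottleneck -> CVs xi
h2 = Dense(M, activation='tanh')(bottleneck)     # layer 4: hidden layer
outputs = Dense(d, activation='tanh')(h2)        # layer 5: reconstruction z-hat

autoencoder = Model(inputs, outputs)             # full map z -> z-hat (Eqns. 1-2)
encoder = Model(inputs, bottleneck)              # explicit CV map xi = f(z) (Eqn. 1)
```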

Activation functions. Denoting the input, output, and bias associated with node $j$ of layer $l$ as $x^{(l)}_j$, $y^{(l)}_j$, and $b^{(l)}_j$, the weight of the connection between node $i$ in layer $l-1$ and node $j$ in layer $l$ as $w^{(l)}_{ij}$, and the activation function of layer $l$ as $\sigma^{(l)}$, we can build up an explicit mathematical mapping for $\boldsymbol{\xi} = \mathbf{f}(\mathbf{z})$. The inputs to the input layer are simply the components of the high-dimensional data points $\mathbf{z}$. The input to any node in any other layer is the weighted sum of the outputs of all of the nodes in the previous layer plus a bias, $x^{(l)}_j = \sum_i w^{(l)}_{ij} y^{(l-1)}_i + b^{(l)}_j$, and the output of that node is the result of the activation function acting on its input, $y^{(l)}_j = \sigma^{(l)}(x^{(l)}_j)$. The input layer possesses linear activation functions (i.e., identity mappings) whereas the remaining layers possess nonlinear activation functions that we select to be hyperbolic tangents 95. Importantly for our application, these smooth activation functions possess analytical first derivatives. Summing over the first three layers, the analytical expression for the nonlinear projection is,

$$\xi_k = \sigma^{(3)}\Big(\sum_{j} w^{(3)}_{jk}\, \sigma^{(2)}\Big(\sum_{i} w^{(2)}_{ij}\, z_i + b^{(2)}_j\Big) + b^{(3)}_k\Big) \qquad (1)$$

and for the nonlinear reconstruction is,

$$\hat{z}_m = \sigma^{(5)}\Big(\sum_{k} w^{(5)}_{km}\, \sigma^{(4)}\Big(\sum_{j} w^{(4)}_{jk}\, \xi_j + b^{(4)}_k\Big) + b^{(5)}_m\Big) \qquad (2)$$

Training. Training of the autoencoder amounts to training the network to optimally reconstruct its own inputs (i.e., to autoencode). Mathematically, this corresponds to specifying the network weights and biases to minimize the error function,

$$E = \frac{1}{N}\sum_{n=1}^{N} \big\lVert \mathbf{z}_n - \hat{\mathbf{z}}_n \big\rVert^2 + \sum_{l} \gamma_l \big\lVert \mathbf{w}^{(l)} \big\rVert^2 \qquad (3)$$

The first term is the reconstruction error, and the second term is a regularizing penalty known as weight decay that is typically included to prevent overfitting by controlling the growth of the network weights 82, 95. In this work, we choose to set $\gamma_l = 0$ for all layers to eliminate the weight decay term, and instead regularize network training using early stopping, an approach that plays essentially the same role in guarding against overfitting without the need to tune the $\gamma_l$ hyperparameters 96, 97. Early stopping is implemented by randomly selecting 80% of the data as the training partition and the remaining 20% as the validation partition, and terminating training when we either reach the maximum number of training epochs or the error function evaluated over the validation partition no longer decreases for 30 consecutive epochs (i.e., early stopping). In each epoch, the neural network parameters are updated using mini-batch stochastic gradient descent with momentum 89, 98, 99, 100.
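A hedged sketch of this training protocol follows, where `Z` is assumed to be an $N \times d$ NumPy array of (pre-shuffled) training configurations and the learning rate, momentum, and batch size are illustrative values, not tuned settings from our study.

```python
# Sketch of the training loop: mini-batch SGD with momentum, an 80/20
# train/validation split, and early stopping with 30-epoch patience.
# Note: `lr` is named `learning_rate` in newer Keras versions, and
# validation_split holds out the *last* 20% of rows, so Z should be
# shuffled beforehand to mimic a random partition.
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping

autoencoder.compile(optimizer=SGD(lr=0.01, momentum=0.9), loss='mse')
autoencoder.fit(Z, Z,                        # autoencode: targets equal inputs
                validation_split=0.2,        # hold out 20% for validation
                epochs=1000,                 # maximum number of training epochs
                batch_size=32,
                callbacks=[EarlyStopping(monitor='val_loss', patience=30)])
```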

Dimensionality determination. The number of nodes $K$ in the bottleneck layer specifies the dimensionality of the nonlinear projection. We empirically determine the dimensionality of the intrinsic manifold from the data by training $d$-$M$-$K$-$M$-$d$ autoencoder architectures over a range of $K$, where $M$ is tuned via cross-validation 95. An appropriate value of $K$ is determined by computing the fraction of variance explained (FVE) by the reconstructed outputs,

$$\mathrm{FVE} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \qquad (4)$$

where $\mathrm{RSS} = \sum_{n=1}^{N} \lVert \mathbf{z}_n - \hat{\mathbf{z}}_n \rVert^2$ is the residual sum of squares, $\mathrm{TSS} = \sum_{n=1}^{N} \lVert \mathbf{z}_n - \bar{\mathbf{z}} \rVert^2$ is the total sum of squares, and $\bar{\mathbf{z}} = \frac{1}{N}\sum_{n} \mathbf{z}_n$ is the mean input vector. The emergence of a plateau or knee in the FVE curve at a particular value of $K = K^*$ indicates that the predominance of variance in the data exists within a $K^*$-dimensional intrinsic manifold. We determine the location of the knee in an automated fashion using the L-method of Salvador and Chan 101. For $K^* \ll d$ and a high FVE in excess of 70%, we assert that the autoencoder has discovered that the high-dimensional data in $\mathbb{R}^d$ admit a low-dimensional representation in $\mathbb{R}^{K^*}$ from which the original data can be approximately reconstructed. The outputs of the bottleneck layer in the trained autoencoder define the CVs $\boldsymbol{\xi}$ parameterizing the low-dimensional intrinsic manifold. The explicit mapping of the high-dimensional input data to these CVs is furnished by the projection $\boldsymbol{\xi} = \mathbf{f}(\mathbf{z})$ in Eqn. 1. Provided the activation functions of all of the nodes in the network are smooth functions, this explicit mapping also possesses analytical first derivatives with respect to the input data.
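A compact NumPy sketch of this procedure is given below; the knee finder is a simplified, unweighted variant of Salvador and Chan's two-segment linear fit, and the function names are ours.

```python
import numpy as np

def fve(Z, Z_hat):
    """Fraction of variance explained (Eqn. 4): FVE = 1 - RSS/TSS."""
    rss = np.sum((Z - Z_hat) ** 2)              # residual sum of squares
    tss = np.sum((Z - Z.mean(axis=0)) ** 2)     # total sum of squares
    return 1.0 - rss / tss

def l_method_knee(K_vals, fve_vals):
    """Locate the knee of the FVE-vs-K curve: fit straight lines to the left
    and right segments at each candidate split and return the K minimizing
    the total squared fit error (simplified L-method)."""
    K_vals, fve_vals = np.asarray(K_vals, float), np.asarray(fve_vals, float)
    best_K, best_err = None, np.inf
    for c in range(1, len(K_vals) - 1):         # candidate knee positions
        err = 0.0
        for seg in (slice(0, c + 1), slice(c, None)):
            coef = np.polyfit(K_vals[seg], fve_vals[seg], 1)
            err += np.sum((fve_vals[seg] - np.polyval(coef, K_vals[seg])) ** 2)
        if err < best_err:
            best_K, best_err = int(K_vals[c]), err
    return best_K
```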

2.1.2 Elimination of translational and rotational invariances

Our application of autoencoders is to learn nonlinear low-dimensional CVs describing the important configurational motions of biomolecules. The training data comprise snapshots of the molecule harvested from molecular dynamics simulations recording the Cartesian coordinates of the constituent atoms. When studying the configurational motions of a molecule in an isotropic medium, we typically wish to resolve internal structural reconfigurations and exclude trivial changes in rotational orientation or center of mass location. Accordingly, we must properly treat the input coordinates $\mathbf{z}$ to eliminate rotation and translation. It is straightforward to eliminate translational degrees of freedom by subtracting the center of mass coordinates from each configuration prior to passing these data to the autoencoder. We eliminate rotational invariances using a data augmentation approach that is commonly used in deep learning to eliminate invariances and enhance training sets, and which has proved particularly valuable in image classification applications 102. For each translationally-centered molecular configuration $\mathbf{z}_n$, we generate an ensemble of $R$ additional configurations constructed by applying random three-dimensional rotations to $\mathbf{z}_n$. In training the autoencoder we then assert that the network learn that all of these inputs should "autoencode" to the same output, thereby teaching the network to disregard rotations as a relevant degree of freedom and discover CVs that are a function only of the internal molecular configuration. Mathematically, this requires that we modify the error function defined in Eqn. 3 to,

$$E = \frac{1}{NR}\sum_{n=1}^{N}\sum_{r=1}^{R} \Big\lVert \hat{\mathbf{z}}\big(R_r(\mathbf{z}_n)\big) - \mathcal{A}\big(R_r(\mathbf{z}_n);\, \mathbf{z}_{\mathrm{ref}}\big) \Big\rVert^2 \qquad (5)$$

where $R_r(\mathbf{z}_n)$ denotes a rotation of configuration $\mathbf{z}_n$ by a randomly selected angle around a randomly selected axis in 3D space, and $\mathcal{A}(R_r(\mathbf{z}_n); \mathbf{z}_{\mathrm{ref}})$ represents the optimal rotational alignment of the rotated configuration to a reference configuration $\mathbf{z}_{\mathrm{ref}}$ that has been previously translated such that its center of mass is coincident with the origin. The operation $\mathcal{A}$ can be efficiently performed using the Kabsch algorithm 103. Training of the autoencoder weights and biases to minimize Eqn. 5 teaches the network to map arbitrary rotations of a particular configuration to the alignment of that configuration to a specific reference structure. A deficiency of this approach is that we must select a particular reference configuration, and this selection can affect the particular set of trained weights and biases. We observe that bias introduced by the choice of reference state can be mitigated, and the autoencoder made robust to this particular choice, by employing $S$ reference configurations $\{\mathbf{z}^{(s)}_{\mathrm{ref}}\}_{s=1}^{S}$, each of which is reconstructed by a subset of the outputs of an augmented terminal layer. For example, we show in Fig. 2 an autoencoder with 4-8-2-8-(2×4) topology employing two reference configurations. The network is identical to that in Fig. 1, except that approximations to each of the two reference configurations are computed in the terminal layer. We then define the error function as the sum of the errors associated with each reference structure,

$$E = \frac{1}{NRS}\sum_{s=1}^{S}\sum_{n=1}^{N}\sum_{r=1}^{R} \Big\lVert \hat{\mathbf{z}}^{(s)}\big(R_r(\mathbf{z}_n)\big) - \mathcal{A}\big(R_r(\mathbf{z}_n);\, \mathbf{z}^{(s)}_{\mathrm{ref}}\big) \Big\rVert^2 \qquad (6)$$

allowing the weights and biases of the output layer to differ for each of the reference configurations. In this manner, we confine the effect of different reference structure rotations to the parameters of the terminal layer, and the weights and biases of the first four layers are trained to be invariant to the choice of reference structure. In practice, we find satisfactory performance for autoencoders trained with $S$=2 reference structures and $R$=16 random rotations.
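The augmentation itself reduces to generating random rotations and Kabsch-aligned targets, as in the following NumPy sketch. The function names and the QR-based rotation sampler are our own choices; `X` is an $n_{\mathrm{atoms}} \times 3$ array of centered coordinates, and flattening to a $d$-vector before feeding the network is omitted for brevity.

```python
import numpy as np

def random_rotation():
    """Sample a uniformly random proper rotation via QR decomposition."""
    A = np.random.normal(size=(3, 3))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))          # Haar-distributed orthogonal matrix
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]               # enforce det(Q) = +1 (no reflection)
    return Q

def kabsch_align(X, X_ref):
    """Optimally rotate centered coordinates X onto the centered reference."""
    C = X.T @ X_ref                      # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(C)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    return X @ (U @ D @ Vt)

def augment(X, X_ref, R=16):
    """Pair R randomly rotated copies of X (network inputs) with its Kabsch
    alignment to X_ref (network target), per Eqn. 5; the target is identical
    for every rotated copy, which is what teaches rotational invariance."""
    target = kabsch_align(X, X_ref)
    inputs = np.array([X @ random_rotation().T for _ in range(R)])
    return inputs, np.tile(target, (R, 1, 1))
```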

Figure 2: An autoencoder with a 4-8-2-8-(2×4) topology. The network topology is identical to that in Fig. 1, with the exception that the first four outputs of the output layer are used to compute the reconstruction error with respect to the first reference configuration, and the second four outputs with respect to the second reference configuration. The network is trained to minimize the sum of reconstruction errors over the two reference configurations. Image constructed using code downloaded from http://www.texample.net/tikz/examples/neural-network with the permission of the author Kjell Magne Fauske.

2.2 Enhanced sampling in autoencoder CVs

2.2.1 Umbrella sampling

The principal advantage of autoencoders over competing nonlinear dimensionality reduction techniques is that the mapping of the high-dimensional data – here molecular configurations – to the intrinsic manifold spanned by the discovered CVs is an explicit and differentiable function of the input coordinates given by Eqn. 1. Accordingly, biasing forces on the CVs can be analytically propagated to biasing forces on the atoms by repeated application of the chain rule. Enhanced sampling can then be conducted directly in the discovered CVs using any of the collective variable biasing techniques listed in Section 1. In this work we choose to use umbrella sampling, but other popular approaches such as metadynamics or adaptive biasing force calculations could be straightforwardly employed.

Umbrella sampling (US) was first introduced by Torrie and Valleau in 1977 26. The method proceeds by augmenting the configurational part of the classical Hamiltonian $H$, written in the coordinates of all constituent atoms, with a (typically harmonic) restraining potential $U_{\mathrm{bias}}(\boldsymbol{\xi})$ in a set of predefined collective variables $\boldsymbol{\xi}$. These potentials confine the system to a narrow region of CV space that looks locally flat, and within which the system can comprehensively sample the accessible configurational space 104, 105. By tiling CV space with a number of overlapping umbrella windows, free energy barriers can be surmounted and the full extent of CV space completely explored. Success of the approach depends crucially on the selection of "good" CVs that track the slow collective molecular motions of the system, adequately discriminate the metastable states, and resolve the free energy barriers within the system.

The bias-augmented system Hamiltonian $H_{\mathrm{biased}} = H + U_{\mathrm{bias}}(\boldsymbol{\xi})$ determines the force experienced by each atom $i$ through its first derivatives with respect to the atomic position $\mathbf{r}_i$,

$$\mathbf{F}_i = -\frac{\partial H_{\mathrm{biased}}}{\partial \mathbf{r}_i} = -\frac{\partial H}{\partial \mathbf{r}_i} - \frac{\partial U_{\mathrm{bias}}(\boldsymbol{\xi})}{\partial \mathbf{r}_i} = \mathbf{F}^{0}_i + \mathbf{F}^{\mathrm{bias}}_i \qquad (7)$$

where we identify $\mathbf{F}^{0}_i = -\partial H / \partial \mathbf{r}_i$ as the unbiased forces on the atoms arising from the system Hamiltonian in the absence of the biasing potential, and $\mathbf{F}^{\mathrm{bias}}_i = -\partial U_{\mathrm{bias}}(\boldsymbol{\xi}) / \partial \mathbf{r}_i$ as the artificial biasing forces on the atoms arising from the potential applied to the CVs. In performing a biased molecular dynamics simulation, the unbiased forces are computed from the molecular force field in exactly the same manner as in an unbiased calculation. These are supplemented by the biased forces derived from the artificial restraining potential applied to the CVs. We now proceed to derive analytical expressions for the biasing forces. In Section 2.5 we describe our computational implementation within biased molecular dynamics simulations.

We first recognize that the CVs $\boldsymbol{\xi}$ furnished by the bottleneck layer depend on the inputs $\mathbf{z}$ to the autoencoder through the map defined in Eqn. 1. In applications of nonlinear machine learning to biomolecular folding it is conventional to treat the solvent atom coordinates implicitly through their influence on the conformational ensemble explored by the biomolecule, but it is a possible extension of this work to explicitly consider solvent degrees of freedom 44. Accordingly, the autoencoder inputs $\mathbf{z}$ comprise a list of the Cartesian coordinates of the atoms constituting the biomolecule as a subset of the coordinates $\{\mathbf{r}_i\}$ of all atoms in the system. Armed with these nested dependencies, we may expand $\partial U_{\mathrm{bias}} / \partial \mathbf{r}_i$ as a product of three Jacobian matrices resulting from repeated application of the chain rule,

$$\frac{\partial U_{\mathrm{bias}}}{\partial \mathbf{r}_i} = \frac{\partial U_{\mathrm{bias}}}{\partial \boldsymbol{\xi}}\, \frac{\partial \boldsymbol{\xi}}{\partial \mathbf{z}}\, \frac{\partial \mathbf{z}}{\partial \mathbf{r}_i} \qquad (8)$$

where $\partial U_{\mathrm{bias}} / \partial \boldsymbol{\xi}$ is a 1-by-$K$ matrix, $\partial \boldsymbol{\xi} / \partial \mathbf{z}$ is a $K$-by-$d$ matrix, and $\partial \mathbf{z} / \partial \mathbf{r}_i$ is a $d$-by-3 matrix 1. Each of these three matrices possesses straightforward analytical expressions.

Assuming the standard choice of a harmonic biasing potential in each of the CVs, the artificial biasing potential can be written as,

$$U_{\mathrm{bias}}(\boldsymbol{\xi}) = \sum_{k=1}^{K} \frac{\kappa_k}{2}\big(\xi_k - \xi^{0}_k\big)^2 \qquad (9)$$

where $\kappa_k$ is the harmonic force constant and $\xi^{0}_k$ the harmonic center for CV $k$. The elements of the Jacobian matrix $\partial U_{\mathrm{bias}} / \partial \boldsymbol{\xi}$ immediately follow as,

$$\frac{\partial U_{\mathrm{bias}}}{\partial \xi_k} = \kappa_k\big(\xi_k - \xi^{0}_k\big) \qquad (10)$$

Appealing to Eqn. 1, the elements of the Jacobian matrix $\partial \boldsymbol{\xi} / \partial \mathbf{z}$ follow from the low-dimensional projection furnished by the trained autoencoder,

$$\frac{\partial \xi_k}{\partial z_i} = {\sigma^{(3)}}'\Big(\sum_{j} w^{(3)}_{jk}\, y^{(2)}_j + b^{(3)}_k\Big) \sum_{j} w^{(3)}_{jk}\, {\sigma^{(2)}}'\Big(\sum_{i'} w^{(2)}_{i'j}\, z_{i'} + b^{(2)}_j\Big)\, w^{(2)}_{ij} \qquad (11)$$

where the $w^{(l)}_{ij}$ and $b^{(l)}_j$ are the weights and biases of the trained network, $y^{(2)}_j = \sigma^{(2)}\big(\sum_i w^{(2)}_{ij} z_i + b^{(2)}_j\big)$ are the hidden-layer outputs, $\sigma^{(2)}$ and $\sigma^{(3)}$ are the activation functions associated with the second and third layers, and ${\sigma^{(2)}}'$ and ${\sigma^{(3)}}'$ are their first derivatives (for $\sigma = \tanh$, $\sigma' = 1 - \sigma^2$).

Since the inputs $\mathbf{z}$ constitute a subset of the atomic Cartesian coordinates, this final Jacobian matrix possesses a very simple form. For atoms $i$ whose coordinates $\mathbf{r}_i = (x_i, y_i, z_i)$ appear in the autoencoder inputs, the $d$-by-3 matrix $\partial \mathbf{z} / \partial \mathbf{r}_i$ contains three unit elements – one in the row of $\mathbf{z}$ holding $x_i$, one in the row holding $y_i$, and one in the row holding $z_i$ – with all other entries zero. For atoms $i$ that do not appear in the inputs, all matrix elements are zero. Accordingly, this matrix serves as a filter that passes forces onto only those atoms whose coordinates feature in the inputs to the autoencoder.
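The following NumPy sketch assembles Eqns. 8-11 into biasing forces, assuming tanh activations and row-major weight matrices `W2` ($M \times d$) and `W3` ($K \times M$). This mirrors what our ANN_Force plugin computes, but the code here is an illustrative re-implementation, not the plugin source.

```python
import numpy as np

def bias_forces(z, r, atom_idx, kappa, xi0, W2, b2, W3, b3):
    """Propagate harmonic biasing forces on the CVs back to Cartesian forces
    on the atoms (Eqns. 8-11). z is the d-vector of (consistently centered)
    biased-atom coordinates; r is the full n_atoms x 3 coordinate array;
    atom_idx lists the atoms, in order, whose coordinates make up z."""
    # Forward pass through the encoder (Eqn. 1).
    y2 = np.tanh(W2 @ z + b2)                 # hidden-layer outputs, shape (M,)
    xi = np.tanh(W3 @ y2 + b3)                # bottleneck outputs = CVs, shape (K,)

    # dU/dxi for the harmonic umbrella potential (Eqns. 9-10): shape (K,).
    dU_dxi = kappa * (xi - xi0)

    # dxi/dz from the analytical derivative of Eqn. 1 (Eqn. 11),
    # using tanh'(x) = 1 - tanh(x)^2.
    dxi_dy2 = (1.0 - xi ** 2)[:, None] * W3   # K x M
    dy2_dz = (1.0 - y2 ** 2)[:, None] * W2    # M x d
    dxi_dz = dxi_dy2 @ dy2_dz                 # K x d Jacobian

    # dz/dr_i is a sparse selector of unit entries, so the d-vector of
    # input-space gradients maps directly onto the biased atoms' coordinates.
    dU_dz = dU_dxi @ dxi_dz                   # shape (d,)
    forces = np.zeros_like(r)                 # atoms absent from z feel no bias
    forces[atom_idx] = -dU_dz.reshape(-1, 3)  # F_bias = -dU/dr
    return forces
```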

2.2.2 Weighted histogram analysis method (WHAM)

Having conducted a series of umbrella sampling simulations tiling CV space, we estimate the unbiased free energy surface $F(\boldsymbol{\xi})$ in these CVs by reweighting and combining the biased histograms recorded in each window using the weighted histogram analysis method (WHAM) 106, 107, 108, 109, 110. The WHAM estimator minimizes statistical errors in the estimated unbiased probability distribution $P(\boldsymbol{\xi})$ 107, 108, 109, 110, 106, 111, 112, 113, 114, and the free energy surface is then computed from the estimated unbiased distribution using the standard statistical mechanical relation 115,

$$\beta F(\boldsymbol{\xi}) = -\ln P(\boldsymbol{\xi}) + C \qquad (12)$$

where $\beta = 1/(k_B T)$ is the reciprocal temperature and $C$ is an arbitrary additive constant reflecting our ignorance of the absolute free energy scale. Free energy surfaces in (collective) variables other than those in which biased sampling was conducted are estimated from the biased simulation data and WHAM solution using the approach detailed in Ref. [63].
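Given the WHAM-estimated unbiased distribution, the conversion to a free energy surface is a one-liner. In the sketch below (ours, not an excerpt of any WHAM package), empty bins yield infinite free energies and the additive constant $C$ is fixed by zeroing the global minimum.

```python
import numpy as np

def fes_from_probability(P, kT=1.0):
    """Eqn. 12: beta*F = -ln P + C. Empty bins (P = 0) map to +inf; the
    arbitrary constant C is chosen so the global minimum sits at F = 0."""
    with np.errstate(divide='ignore'):
        F = -kT * np.log(P)
    return F - F[np.isfinite(F)].min()
```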

2.3 MESA: Interleaved on-the-fly CV discovery and enhanced sampling

We now present our novel framework for nonlinear CV discovery and accelerated sampling that we term MESA. The heart of the approach is the discovery of CVs that are explicit and differentiable functions of the atomic coordinates, which can then be used to perform enhanced sampling by surgical targeting of biased calculations in these collective coordinates. Having conducted additional sampling of the space, the data-driven CVs spanning this expanded space will, in general, differ from those parameterizing the original space, and we must retrain our autoencoder including the new data. Furthermore, the dimensionality of the system may change with location in configurational space, and new CVs can emerge as we conduct additional sampling 63, 80. This is the origin of the "chicken-and-egg problem" identified by Clementi and co-workers 4, and requires that we interleave successive rounds of CV discovery and biased sampling until the discovered CVs and the free energy surface they parameterize stabilize. This informs a six-step iterative protocol for the application of MESA, for which a schematic flowchart is provided in Fig. 3. We observe that the computational cost associated with autoencoder training is vastly lower than that of the biased molecular simulations for all but the most trivial of molecular systems. For example, execution of fifteen 2 ns biased simulations of the 20-residue Trp-cage mini-protein 116 in a bath of 2773 explicit water molecules required 8 GPU-hours, whereas training of an autoencoder over the 32,000 harvested molecular configurations required only 5 GPU-minutes.

Figure 3: Schematic flowchart of the iterative six-step application of MESA.

Step 1 – Generate initial training data. Perform an initial unbiased simulation to generate an initial set of atomic coordinates for the first round of autoencoder training. The CV discovery process performs better for larger data sets that explore larger volumes of configurational space, and so it is typically beneficial to conduct relatively long unbiased calculations. As a rule of thumb, the initial unbiased run is sufficiently good for initial CV discovery if the fraction of variance explained by the reconstructed outputs exceeds 40% (Eqn. 4). For systems possessing very rugged free energy surfaces, it can be useful to conduct this unbiased run at elevated temperature, employ a biased calculation in intuited CVs based on expert knowledge, or run calculations of a related but less computationally expensive model – such as a biomolecule in vacuum or implicit solvent, or a coarse-grained analog – as a form of transfer learning 1, 117.

Step 2 – Autoencoder CV discovery. Train an ensemble of $d$-$M$-$K$-$M$-$d$ autoencoders with different values of $K$ corresponding to various numbers of bottleneck nodes. $M$ is tuned at each value of $K$ by cross-validation. Perform training over all molecular configurations collected in all previous rounds of biased and unbiased sampling using the data-augmented training procedure described in Section 2.1.2 in which we eliminate translational and rotational invariances. It is typically good practice to train a number of autoencoders at each value of $K$, initializing each with different randomly assigned weights and biases and selecting the top performer as measured by the fraction of variance explained in the reconstructed outputs (Eqn. 4). Apply the L-method 101 to identify a knee in the FVE plot to determine the dimensionality of the intrinsic manifold and select an appropriate value of $K$.

Step 3 – Boundary detection. Map all molecular configurations to the low-dimensional intrinsic manifold using the trained autoencoder. Employ a boundary detection algorithm to define the $(K-1)$-dimensional boundary of the explored region of the $K$-dimensional manifold. Possible approaches include alpha-shapes 80, 118, 119 or "wrapping" algorithms 120. Here we employ a grid-based approach motivated by the normalized graph Laplacian that discretizes CV space to identify cells that are both sparsely populated and adjacent to densely populated regions; a code sketch follows the numbered list below.

  1. Divide the space spanned by the collected data into a cubic grid comprising cells with side lengths $\ell_k$. The size of the cells controls how finely we resolve the boundary and how far we extrapolate beyond the explored region. The optimal values are system dependent, but we find the algorithm to be quite robust to the particular choice. As a rule of thumb, taking $\ell_k$ equal to a small fraction of the linear range of the exploration in $\xi_k$ provides satisfactory results. Collect histograms over this grid and let $n_i$ denote the number of points falling within cell $i$.

  2. Compute $p_i = e^{-n_i}$ for each cell. This operation applies a Laplacian / exponential kernel with unit variance to each cell count, which has the effect of amplifying the difference between empty cells and non-empty cells.

  3. Let $\mathcal{N}_i$ be the set containing the indexes of the neighboring cells of cell $i$. Compute $q_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} p_j - p_i$. This operation measures the difference between the mean kernel-transformed population of the neighbors of each grid cell and that of the cell itself.

  4. Discard all cell indexes $i$ with positive $q_i$ values. Sort the remaining $q_i$ and select the $n_w$ most negative values (i.e., largest absolute values) as the cells in which to launch biased umbrella sampling runs. In this work we select $n_w$=10-20, but this can be tuned based on available computational resources.
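A minimal NumPy sketch of steps 1-4 above (our own illustrative implementation; the explicit loop over cells favors clarity over speed):

```python
import numpy as np
from itertools import product

def boundary_cells(counts, n_w=15):
    """Grid-based boundary detection: `counts` is the K-dimensional histogram
    of configurations projected onto the CVs; returns up to n_w cells that are
    sparsely populated but adjacent to well-sampled regions."""
    p = np.exp(-counts.astype(float))              # step 2: Laplacian kernel e^{-n_i}
    offsets = [off for off in product((-1, 0, 1), repeat=counts.ndim)
               if any(off)]                        # neighbor stencil for step 3
    q = np.zeros_like(p)
    for idx in np.ndindex(counts.shape):           # step 3: mean-neighbor difference
        neigh = [p[tuple(np.add(idx, off))] for off in offsets
                 if all(0 <= i + o < s for i, o, s in zip(idx, off, counts.shape))]
        q[idx] = np.mean(neigh) - p[idx]
    order = np.argsort(q, axis=None)               # step 4: most negative q first
    cells = [np.unravel_index(i, q.shape) for i in order[:n_w]]
    return [c for c in cells if q[c] < 0]          # discard cells with q_i >= 0
```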

We find this procedure performs quite well both in identifying the boundary of the explored region of the intrinsic manifold and in identifying internal holes within this region. It is also quite robust to the shape and dimensionality of the explored region. The boundary detection procedure is illustrated in Fig. 4.

Figure 4: Illustration of the boundary detection procedure for a two-dimensional intrinsic manifold parameterized by CVs $\xi_1$ and $\xi_2$. (a) Bin counts $n_i$ of molecular configurations projected into each cell of a cubic histogram spanning the explored region of the intrinsic manifold. (b) Application of the Laplacian kernel to each bin count, $p_i = e^{-n_i}$. (c) Differencing the Laplacian kernel of each cell count with the mean of its neighboring cells $\mathcal{N}_i$, $q_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} p_j - p_i$. (d) Selection of those cells containing the $n_w$ = 19 most negative $q_i$ values as those in which to launch umbrella sampling runs.

Step 4 – Enhanced sampling. Run umbrella sampling simulations in each of the identified boundary cells employing harmonic restraining potentials given by Eqn. 9. This step both advances the frontier of the explored region and fills internal holes by direct biasing in the identified CVs. In this work we select the umbrella potential centers $\boldsymbol{\xi}^0$ to be the centers of the identified cubic cells, but it is straightforward to place the biased runs closer to or further from the boundary. Appropriate values for the harmonic force constants $\kappa_k$ are strongly system dependent, and must typically be tuned by trial and error. For systems possessing high free energy barriers, it may be useful to adaptively tune $\kappa_k$ based on the distance between the center of the umbrella sampling point cloud and the potential center. Specifically, if the center of mass of the point cloud generated by the biased calculation falls outside of the grid cell, the harmonic force constant may be progressively increased up to a pre-specified maximum. In this manner, the biased simulation can adapt to the local gradient and topography of the underlying free energy landscape 80.

Step 5 – Convergence assessment. Determine whether the umbrella sampling runs have led to substantial new exploration of configurational space. We test this in two different ways. First, we project the new molecular configurations onto the intrinsic manifold using the autoencoder and determine whether the boundary of the explored region has changed after the addition of the new data. Second, we perform Step 2 to retrain autoencoders on the augmented data and identify whether the dimensionality of the intrinsic manifold and its parameterizing CVs have converged. The dimensionality of the intrinsic manifold is assessed by application of the L-method 101 to identify a knee in the FVE plot. Stabilization of the CVs is assessed by performing a linear regression of the old CVs in terms of the new, and looking for coefficients of determination in excess of 95%. This procedure allows for trivial linear transformations between the old and new CVs – sign changes and rotations – that do not affect the information content of the parameterization. If the boundary has not moved, the dimensionality has not changed, and the CVs have stabilized, then we terminate MESA. Otherwise, we perform a new round of interleaved CV discovery and biased sampling by cycling round the loop again from Step 2.
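The CV-stabilization test can be implemented in a few lines. The sketch below uses scikit-learn (an assumption of ours; any least-squares routine suffices), with `xi_old` and `xi_new` the projections of a common set of configurations through the previous and current autoencoders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def cvs_stabilized(xi_old, xi_new, r2_threshold=0.95):
    """Regress each old CV on the new CVs and require R^2 above threshold,
    so sign flips and rotations of the CVs do not trigger further iterations."""
    for k in range(xi_old.shape[1]):
        model = LinearRegression().fit(xi_new, xi_old[:, k])
        if model.score(xi_new, xi_old[:, k]) < r2_threshold:
            return False
    return True
```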

Step 6 – Compute free energy surface. Perform umbrella sampling over the complete space spanned by the converged CVs specified by the terminal autoencoder. Construct the optimal estimate of the unbiased free energy surface in the collective variables by solving the WHAM equations, and optionally reweight the unbiased free energy surface into arbitrary collective coordinates (Section 2.2.2).

This algorithm possesses a number of attractive features. First, all steps are largely automated, requiring minimal user intervention, including training of the autoencoder, boundary detection, and identification and launch of new umbrella runs. The only user input required beyond specifying the parameters of the algorithm is the judgment in Step 5 of whether the boundary and/or CVs are sufficiently similar to warrant termination of MESA. In principle, this step could also be automated. Second, by interleaving CV discovery and enhanced sampling, we converge towards a set of global data-driven CVs providing a good parameterization of the accessible configurational space. Third, the availability of the explicit mapping from the atomic coordinates to the CVs enables enhanced sampling directly in the CVs. This stands in contrast to existing approaches that initialize unbiased simulations at the edge of the explored region in CV space to drive sampling of unexplored configurational space 80, 66, 72 and/or perform sampling in proxy physical variables 63 or in an approximate basis function expansion 1, 60, 73, 78, 79. Fourth, the collective coordinates are useful not only for enhanced sampling, but also in providing understanding of the important slow collective motions governing the long-time evolution of the system. The explicit mapping between the atomic coordinates and CVs can help provide a transparent physical interpretation of these collective modes. Although these relations can be somewhat obscured by the complexity of the mapping function (Eqn. 1), the weights within the trained network can be quite informative in illuminating which atoms play important roles in each of the discovered CVs.

2.4 Molecular dynamics simulations

We demonstrate and validate MESA in applications to two peptides: alanine dipeptide and Trp-cage (Fig. 5). All molecular simulations were performed using the OpenMM 7.0 molecular dynamics package 86, 87, 88.

Figure 5: Molecular structures of the two peptides studied in this work. (a) Alanine dipeptide indicating the $\phi$ and $\psi$ backbone dihedrals. (b) Trp-cage indicating the N-terminal $\alpha$-helix (blue), $3_{10}$-helix (green), and C-terminal polyproline region (red). The orange arrows denote the vectors linking the $C_\alpha$ atoms of Asn-1 (A), Asp-9 (B), Ser-14 (C), and Ser-20 (D). The dihedral angle defined by these vectors measures the angle between the plane containing the $C_\alpha$ atoms of {Asn-1, Asp-9, Ser-14} and that containing those of {Asp-9, Ser-14, Ser-20}. This angle provides a measure of the global chirality of the molecule that is informative in understanding the CVs discovered by MESA. The Trp-6 side chain (explicitly visualized) is "caged" inside a hydrophobic core. All molecular renderings are constructed using VMD 121.

2.4.1 Alanine dipeptide

Simulations of alanine dipeptide (N-acetyl-L-alanine-N'-methylamide, AcAlaNHMe) were performed in vacuum using the Amber99sb force field 122. Calculations were performed at $T$=300 K and the equations of motion numerically integrated using a Langevin integrator with a friction coefficient of 1 ps$^{-1}$ and a 2 fs time step 123. All bond lengths were fixed using the LINCS algorithm 124. Lennard-Jones interactions were treated with no cutoff, and we employed the Lorentz-Berthelot combining rules to determine interaction parameters between dissimilar atom types 125. Electrostatic interactions were computed exactly in real space using no cutoff. A single 800 ps initial unbiased calculation was performed, and snapshots saved every 1 ps to generate an initial unbiased ensemble of 800 molecular configurations. Enhanced sampling calculations were performed using an in-house biasing force plugin to OpenMM to launch $n_w$=10-20 umbrella windows in each round employing harmonic restraining potentials of $\kappa$=3000 kJ/mol·(unit of CV)$^{-2}$ in each CV. Each umbrella calculation was performed for 100 ps, saving configurations every 1 ps. All calculations were performed on Intel i7-5820K chips (6 cores, 15 MB cache, 3.8 GHz overclocked), achieving execution speeds of 6 μs/day per core for unbiased simulations and 1.3 μs/day per core for biased calculations.

2.4.2 Trp-cage

Simulations of Trp-cage (NLYIQWLKDGGPSSGRPPPS; PDB ID: 1L2Y) were performed in a bath of 2773 TIP4P-Ew water molecules 126, treating the protein using the Amber03 force field 122. A single Cl$^-$ counterion was added to neutralize the net single positive charge of Trp-cage. Initial molecular configurations were downloaded from the Protein Data Bank 116, 127. Calculations were conducted at $T$=300 K using an Andersen thermostat 128 and $P$=1 atm using a Monte Carlo barostat 129, 130. Equations of motion were numerically integrated using the leapfrog Verlet algorithm with a 2 fs time step 131. The lengths of all bonds involving H atoms were fixed using the LINCS algorithm 124. Lennard-Jones interactions were subjected to a 1 nm cutoff, and Lorentz-Berthelot combining rules used to determine interaction parameters between dissimilar atom types 125. Electrostatic interactions were treated using particle mesh Ewald with a 1 nm cutoff 132. Three independent 10 ns unbiased simulations were conducted, saving snapshots every 20 ps, to generate an initial unbiased ensemble of 1500 molecular configurations. Enhanced sampling calculations were performed using an in-house biasing force plugin to OpenMM to launch $n_w$=10-20 umbrella windows in each round employing harmonic restraining potentials of a minimum of $\kappa$=3000 kJ/mol·(unit of CV)$^{-2}$ in each CV. Higher force constants were employed as necessary to achieve good sampling in rugged or steep regions of the landscape. Each umbrella calculation was performed for 2 ns, saving configurations every 20 ps. All calculations were performed on GeForce GTX 960 GPU cards, achieving execution speeds of 100 ns/day per card for the unbiased simulations and 95 ns/day per card for the biased calculations.

2.5 Implementation and open source availability

All molecular simulations were performed using the open source OpenMM 7.0 simulation package available for free download from http://openmm.org 86, 88. All artificial neural network construction and training was conducted using the Keras deep learning Python API, available from https://github.com/fchollet/keras and running on top of the Theano libraries 133. Umbrella sampling calculations were performed using an in-house biasing force plugin ANN_Force to OpenMM that computes all of the Jacobian matrix elements in Section 2.2.1 from a trained autoencoder, and calculates from these the atomic biasing forces for any given molecular configuration. We have implemented this plugin for both the CPU and CUDA platforms of OpenMM.

We have made all of our software implementations available for free and open source public download under the MIT License. The ANN_Force biasing force plugin to OpenMM is available from https://github.com/weiHelloWorld/ANN_Force. The enhanced sampling framework to train autoencoders, perform boundary detection, deploy and run biased umbrella runs, and post-process the biased simulation data is available from https://github.com/weiHelloWorld/accelerated_sampling_with_autoencoder.

3 Results and Discussion

We now discuss the application of MESA to two test peptides – alanine dipeptide and Trp-cage – to demonstrate, validate, and benchmark the approach (Fig. 5).

3.1 Alanine dipeptide

Alanine dipeptide is a canonical and well-studied test system for new biomolecular simulation methods 134, 63, 66, 72, 135, 136, 137, 138 (Fig. 5a). It possesses a 2D intrinsic manifold that is conventionally parameterized by the $\phi$ and $\psi$ backbone dihedrals and lies on the surface of a flat torus 134, 139, 66, 140. The free energy surface supported by this intrinsic manifold is very well studied, and in vacuum supports a landscape containing three local minima corresponding to the $C_{7eq}$, $C_5$, and $C_{7ax}$ states 139, 66. Interconversion between the $C_{7eq}$ and $C_5$ states occurs on the order of 50 ps, but transitions to the $C_{7ax}$ state are around 100-fold slower, making efficient exploration and recovery of converged free energy surfaces a good test of our approach 66. Unbiased and biased simulations of alanine dipeptide in vacuum were conducted as detailed in Section 2.4.1.

Generation of initial unbiased data (Step 1). We commenced application of MESA by conducting an 800 ps unbiased simulation initialized from the $C_{7eq}$ state. As illustrated by the Ramachandran plot in Fig. 6, this long unbiased simulation comprehensively explores the $C_{7eq}$ and $C_5$ local minima, but fails to transition even once over the high free energy barriers separating them from the $C_{7ax}$ state.

Figure 6: Projection into $\phi$-$\psi$ space of the initial 800 ps unbiased simulation trajectory used for the initial round of CV discovery. The underlying free energy landscape generated by direct umbrella sampling in $\phi$-$\psi$ space is plotted for reference. Contours of free energy are plotted in increments of 2 $k_B T$, where $\beta = 1/(k_B T)$ is the reciprocal temperature, $k_B$ is Boltzmann's constant, and $T$=298 K. The unbiased trajectory (red points) initialized from the $C_{7eq}$ state rapidly explores the upper left corner containing the $C_{7eq}$ and $C_5$ basins, but fails to transition into the $C_{7ax}$ state.

Iterative CV discovery and sampling (Steps 2-5). We then performed 10 rounds of interleaved CV discovery and enhanced sampling using MESA. Collective variable discovery was performed by training $d$-$M$-$K$-$M$-$d$ autoencoders over $d$=21-dimensional input data $\mathbf{z}$ comprising the Cartesian coordinates of the seven backbone atoms of alanine dipeptide. We employ two randomly selected reference configurations in our error function (Eqn. 6), such that the output data are 2×21=42-dimensional. The number of nodes in the hidden layers was specified to be $M$=40 by cross-validation. The number of bottleneck layer nodes $K$ in each iteration of MESA was determined by ascertaining the dimensionality of the intrinsic manifold by plotting the fraction of variance explained as a function of $K$ (Eqn. 4). As illustrated in Fig. A1 in the Appendix, we find a knee at $K^*$=2 in all 10 iterations, motivating the use of two bottleneck nodes in each loop of MESA. It is known that the intrinsic manifold of alanine dipeptide is 2D, so it is encouraging that our automated approach can accurately ascertain this from the data. Furthermore, it is known that the periodicity in the $\phi$ and $\psi$ dihedrals parameterizing the manifold endows it with the topology of a flat torus 134. As carefully investigated by Hashemian and Arroyo 134, topological periodicities in the intrinsic manifold can lead to inefficiencies in sampling and difficulties in CV determination. We shall see that these periodicities do not cause any such problems for MESA, but do introduce artifacts into the free energy surface. We quantify these effects below and present a protocol for their elimination.

We present in Fig. 7 the results of the 10 MESA iterations, illustrating the expansion of the exploration of the 2D intrinsic manifold, stabilization of the CVs and explored region, and projection of sampling points into $\phi$-$\psi$ space. The explored region of the intrinsic manifold converges within 10 iterations, with the frontier of the intrinsic manifold and CVs stabilizing between iterations 9 and 10. The projections into $\phi$-$\psi$ space show that the system is driven to sample the $C_{7ax}$ basin by iteration 6, indicating the ability of our approach to discover important collective motions and drive sampling in these collective coordinates.

Figure 7: Application of 10 rounds of MESA to alanine dipeptide in vacuum. Column 1 presents projections onto the intrinsic manifold of the current autoencoder spanned by all molecular configurations collected in the current iteration of biased sampling and all previous rounds of biased sampling. Points are colored by the $\phi$ dihedral angle. Crosses represent harmonic centers of umbrella sampling runs to be deployed in the current iteration that were identified by the boundary detection algorithm to advance the frontier of the explored region. Column 2 is identical to column 1, except points are colored by the $\psi$ dihedral angle. Column 3 presents Ramachandran plots containing the projection of molecular configurations into $\phi$-$\psi$ space. MESA rapidly drives sampling of the configurational space, and converges within 10 iterations. The biased sampling conducted in iteration 10 barely advances the frontier relative to iteration 9, and the CVs and explored regions are nearly identical up to a linear transformation.

Inspection of the first two columns of Fig. 7 indicates that there is no simple monotonic relationship between the autoencoder-discovered CVs and the $\phi$ and $\psi$ backbone dihedrals known to provide a good parameterization of the intrinsic manifold. The source of this apparent discrepancy is the inherent periodicity in these two dihedral angles, which prevents the emergence of a simple bijective mapping with the autoencoder CVs. Close inspection of Fig. 7 reveals the emergence of closed loops in $\phi$-$\psi$ space corresponding to wrapping of the enhanced sampling through the periodic boundary. Accordingly, the intrinsic manifold spanned by $\phi$-$\psi$ actually represents a projection of the topologically closed flat torus upon which the alanine dipeptide intrinsic manifold resides 134. Indeed, the flat torus can only be embedded in four dimensions or higher 141. Mathematically, four dimensions are required to preserve the topological structure of the flat torus and define an invertible map such that every point on the surface of the torus is uniquely located within the projection and the object is fully unfolded in the embedding without artificial intersections 141, 142, 143. We should therefore expect projections into lower dimensional spaces to contain regions where distant points on the flat torus are projected to identical points in the low-dimensional embedding 134. Such artifacts are visually apparent in the last row of Fig. 7, where the upper right quadrant of the blue ($\psi \approx$ 1 rad) loop containing configurations with $\phi \in$ [1 rad, 2 rad] appears to lie on top of configurations with $\phi \in$ [-2.5 rad, 0 rad]. In the present application these artifacts do not appear to be so severe as to introduce any difficulties for CV determination or enhanced sampling using MESA, but we do observe that they can impede sampling efficiency by causing exploration to become stuck within these loops for a few iterations before MESA is able to drive sampling in the other degree of freedom.

Free energy surface (Step 6). Having converged MESA, we collate all simulation snapshots collected over the 10 iterations to train a terminal autoencoder that will be used to perform umbrella sampling over the complete CV space. An analogous plot to Fig. 7 for this terminal autoencoder is presented in Fig. 8.

Figure 8: Projections of all molecular snapshots harvested over the course of the 10 MESA iterations into the 2D intrinsic manifold spanned by the CVs identified by the terminal autoencoder. (left) Projection into the 2D intrinsic manifold colored by the $\phi$ dihedral angle. (center) Projection into the 2D intrinsic manifold colored by the $\psi$ dihedral angle. (right) Ramachandran plot projection into $\phi$-$\psi$ space.

Using this terminal autoencoder, we conduct 200 umbrella sampling calculations applying biasing potentials to tile the intrinsic manifold spanned by the CVs $(\xi_1, \xi_2)$. We then analyze these biased trajectories to estimate the unbiased free energy surface $F(\xi_1, \xi_2)$ by solving the WHAM equations (Section 2.2.2). We present in Fig. 9a the free energy surface in CV space and in Fig. 9b the estimated uncertainties in the landscape computed by 10 rounds of bootstrap resampling with replacement. In Fig. 9c-d we present analogous plots for $F(\phi, \psi)$ and its estimated uncertainties computed by reweighting of our biased simulation data collected in $(\xi_1, \xi_2)$. As a point of comparison, we present in Fig. 9e-f an estimate of $F(\phi, \psi)$ and its uncertainties computed by performing umbrella sampling directly in the $(\phi, \psi)$ dihedral angles as a ground truth estimate of the free energy surface. Finally, we present in Fig. 9g a plot of the difference between the optimally aligned landscapes computed by reweighting umbrella sampling data in $(\xi_1, \xi_2)$ (Fig. 9c) and direct umbrella sampling in $(\phi, \psi)$ (Fig. 9e). Analysis of Fig. 9 reveals that MESA ably resolves the basins corresponding to the three metastable states $C_{7eq}$, $C_5$, and $C_{7ax}$, but Fig. 9g shows that the shape, location, and depth of the $C_{7ax}$ basin differ from those recovered from direct umbrella sampling in $(\phi, \psi)$, and a spurious metastable minimum is resolved at ($\phi$, $\psi$) ≈ (-2.5 rad, -0.75 rad). These artifacts in the free energy surface are a direct result of the periodicities in the underlying intrinsic manifold, which populates the surface of a flat torus and cannot be correctly embedded in a 2D projection 134.

Figure 9: Comparison of free energy surfaces of alanine dipeptide in vacuum obtained from 2D biased sampling in MESA CVs and direct sampling in the $(\phi, \psi)$ dihedrals. (a,b) Free energy surface and associated estimated uncertainties computed by biased sampling in MESA-identified CVs. Free energies are reported in units of $k_B T$, where $\beta = 1/(k_B T)$, $k_B$ is Boltzmann's constant, and $T$=298 K. (c,d) Free energy surface and associated uncertainties computed by reweighting the biased simulation data collected in $(\xi_1, \xi_2)$. (e,f) Free energy surface and associated uncertainties computed by conducting umbrella sampling directly in the $(\phi, \psi)$ dihedral angles. Uncertainties in the landscapes are computed by 10 rounds of bootstrap resampling with replacement. (g) Free energy difference landscape formed by optimally aligning and then subtracting the landscape in (e) from that in (c). Biased sampling in the MESA CVs misidentifies the location, shape, and depth of the $C_{7ax}$ well, and identifies a spurious metastable minimum at ($\phi$, $\psi$) ≈ (-2.5 rad, -0.75 rad).

A straightforward way to solve the projection problem with periodic variables is to simply increase the dimensionality of the autoencoder nonlinear projection by increasing the number of bottleneck layer nodes. The Whitney Embedding Theorem states that any d-dimensional manifold can be embedded (i.e., fully unfolded without spurious crossings) in R^k for k = (2d+1) 142. The theorem specifies the embedding dimensionality at or above which we are guaranteed to achieve a proper embedding, but depending on the topology of the manifold it may be possible to employ k < (2d+1). Accordingly, if we have prior knowledge that the intrinsic manifold of our system may contain periodicities, or we observe the emergence of closed loops in the CVs identified by MESA (e.g., Fig. 7), then we propose that the number of bottleneck nodes be increased to (2d+1), where d is the dimensionality of the intrinsic manifold identified by the FVE approach described in Section 2.1.1. Increasing dimensionality does come with increasing computational cost, since it is more expensive to train the autoencoder (the number of network parameters increases), run biased calculations (the analytical expression for the nonlinear projection becomes more complicated), and conduct umbrella sampling (due to the curse of dimensionality). It is also more challenging to interpret and visualize data in high-dimensional spaces. In practice, therefore, we suggest that the number of bottleneck nodes be increased one at a time and the dimensionality increase terminated when projections of the manifold cease to depend on the additional dimensions.
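As a concrete illustration of why four dimensions suffice for a periodic 2D manifold, the standard embedding of the flat torus maps each dihedral onto a unit circle in its own plane. The snippet below (an illustration, not part of MESA) shows that points adjacent across the periodic boundary remain adjacent in the 4D embedding even though they appear maximally separated in the naive 2D unwrapping.

```python
import numpy as np

def flat_torus(phi, psi):
    """Standard embedding of the flat torus T^2 into R^4: each periodic
    angle is mapped onto a unit circle in its own 2D plane."""
    return np.array([np.cos(phi), np.sin(phi), np.cos(psi), np.sin(psi)])

eps = 0.01
a = (np.pi - eps, 0.0)    # two points that are neighbors on the torus,
b = (-np.pi + eps, 0.0)   # but straddle the periodic boundary in phi
print(abs(a[0] - b[0]))                                 # ~6.26: far apart in naive 2D
print(np.linalg.norm(flat_torus(*a) - flat_torus(*b)))  # ~0.02: close in R^4
```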

In the present case, we know that k = 4 is sufficient to embed the flat torus containing the intrinsic manifold of alanine dipeptide 134, and so we train a new terminal autoencoder with k = 4 bottleneck nodes over the collated data from all 10 rounds of MESA conducted using k = 2 topologies. We then performed 1000 umbrella sampling simulations to tile the 4D space spanned by the autoencoder CVs and recovered an estimate of the unbiased FES by solving the WHAM equations. In Fig. 10a we present the free energy surface in (φ, ψ) computed by reweighting the umbrella sampling calculations, and in Fig. 10b the estimated uncertainties computed by 10 rounds of bootstrap resampling with replacement. In Fig. 10c, we present the difference between Fig. 10a and the surface recovered by direct umbrella sampling in (φ, ψ) (Fig. 9e). By fully expanding the flat torus into 4D space, we eliminate the projection problem associated with periodic variables. MESA ably identifies four CVs parameterizing the embedding of the flat torus, and projection of the 4D free energy surface into φ–ψ space shows excellent agreement with that recovered by direct sampling in these dihedral angles.
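A minimal sketch of such a terminal autoencoder follows, written in Keras for brevity; the layer sizes, activations, and training settings are illustrative assumptions rather than the production topology used in this work.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_in, n_hidden, n_cv = 21, 40, 4   # illustrative sizes; n_cv = 4 bottleneck nodes

inputs = keras.Input(shape=(n_in,))
h = layers.Dense(n_hidden, activation="tanh")(inputs)
cv = layers.Dense(n_cv, activation="tanh", name="bottleneck")(h)
h = layers.Dense(n_hidden, activation="tanh")(cv)
outputs = layers.Dense(n_in)(h)    # linear reconstruction of the coordinates

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, cv)  # explicit, differentiable CV mapping
autoencoder.compile(optimizer="adam", loss="mse")

# x: (n_frames, n_in) array of aligned atomic coordinates from all MESA rounds
# autoencoder.fit(x, x, epochs=200, batch_size=128, validation_split=0.1)
# cvs = encoder.predict(x)         # 4D CVs used to place umbrella windows
```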

Figure 10: Comparison of free energy surfaces of alanine dipeptide in vacuum obtained from 4D biased sampling in MESA CVs and direct sampling in (φ, ψ) space. (a,b) Free energy surface and associated uncertainties in (φ, ψ) computed by reweighting the biased simulation data collected in the CVs. Free energies are reported in units of k_BT, where k_B is Boltzmann’s constant and T = 298 K. (c) Free energy difference landscape formed by optimally aligning and then subtracting the landscape in Fig. 9e from that in (a). Sampling in the MESA-identified CVs in 4D fully unfolds the intrinsic manifold of alanine dipeptide, and the projected FES in φ–ψ space shows excellent agreement with that recovered by direct sampling in these dihedral angles.

3.2 Trp-cage

Trp-cage (PDB ID: 1L2Y) is a 20-residue mini-protein 116 with a ~4 μs folding time and a native state containing an α-helix, a 3₁₀-helix, a polyproline II helix, and a salt bridge that together “cage” Trp-6 in a hydrophobic core 144, 145, 146, 147, 148 (Fig. 5b). A fast-folding mini-protein containing secondary structural elements present in large proteins, Trp-cage has been extensively studied by both computation 144, 149, 148, 150, 151, 152, 153, 154, 155, 156, 157, 158 and experiment 145, 159, 160, 161, 162, 163 as the “hydrogen atom of protein folding” 147, 164. In particular, Juraszek and Bolhuis identified two distinct folding pathways using transition path sampling 148. Deng et al. used Markov state models to analyze a long 208 μs simulation to resolve similar patterns in overall structural changes and folding routes 156. Kim et al. ran dozens of folding simulations starting from unfolded states, then projected these trajectories into a 2D collective variable space identified by diffusion maps 144. Both Deng et al. and Kim et al. achieved good agreement of their estimated folding times with experiment, demonstrating the ability of MD simulations to quantitatively recapitulate biophysical folding processes. We employ Trp-cage as a more realistic and challenging test case for MESA. Simulations of Trp-cage in explicit water were conducted as detailed in Section 2.4.2.

Generation of initial unbiased data (Step 1). We ran three independent 10 ns unbiased simulations commencing from the Trp-cage native state downloaded from the Protein Data Bank (PDB ID: 1L2Y) 116, 127. Snapshots were saved every 20 ps to generate an initial unbiased training set of 1500 molecular configurations. These data achieve a maximum root mean squared deviation (RMSD) of the Cα atoms relative to the native state of 3.5 Å, and the first iteration of MESA results in an autoencoder with an FVE of ~60%. Accordingly, these unbiased data provide a reasonable starting ensemble for iterative application of MESA.
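A check of this kind can be scripted in a few lines; the sketch below uses the MDTraj library and hypothetical file names to measure the maximum Cα-RMSD of an unbiased trajectory from the native fold.

```python
import mdtraj as md

native = md.load("1l2y.pdb")                        # Trp-cage native state (PDB 1L2Y)
traj = md.load("unbiased_run1.dcd", top="1l2y.pdb") # one of the 10 ns unbiased runs

ca = native.topology.select("name CA")              # alpha-carbon indices
rmsd_nm = md.rmsd(traj, native, atom_indices=ca)    # MDTraj reports nanometers
print("max Calpha-RMSD: %.2f Angstrom" % (10.0 * rmsd_nm.max()))
```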

Iterative CV discovery and sampling (Steps 2-5). We then performed six rounds of interleaved CV discovery and enhanced sampling using MESA before achieving convergence. Collective variable discovery was performed by training autoencoders with a five-layer input–hidden–bottleneck–hidden–output topology over 180-dimensional input data z comprising the Cartesian coordinates of the 60 backbone atoms of Trp-cage. We again used two randomly selected reference configurations in our error function (Eqn. 6), so that the output data are 360-dimensional. The number of nodes in the hidden layers was tuned to 50 by cross-validation. The number of bottleneck layer nodes in each iteration of MESA was determined by plotting the FVE as a function of the number of bottleneck nodes using Eqn. 4. As illustrated in Fig. A2 in the Appendix, we find a knee at two bottleneck nodes in all six iterations, supporting the use of two bottleneck nodes in each iteration of MESA. This estimated dimensionality of the Trp-cage intrinsic manifold is consistent with prior work 148, 144, 154.
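The knee detection itself reduces to the L-method of Salvador and Chan 101: scan each candidate breakpoint, fit a straight line to the points on either side, and keep the breakpoint minimizing the size-weighted RMSE. A minimal sketch with illustrative (not measured) FVE values follows.

```python
import numpy as np

def _fit_rmse(x, y):
    """RMSE of the best straight-line fit to (x, y)."""
    coef = np.polyfit(x, y, 1)
    return np.sqrt(np.mean((np.polyval(coef, x) - y) ** 2))

def l_method_knee(x, y):
    """Locate the knee of y(x) by choosing the breakpoint whose two
    piecewise linear fits minimize the size-weighted total RMSE."""
    best, best_err = None, np.inf
    for i in range(2, len(x) - 1):        # each segment keeps >= 2 points
        err = (i * _fit_rmse(x[:i], y[:i])
               + (len(x) - i) * _fit_rmse(x[i:], y[i:])) / len(x)
        if err < best_err:
            best, best_err = x[i - 1], err
    return best

d = np.arange(1, 6)                        # candidate bottleneck sizes
fve = np.array([0.45, 0.80, 0.84, 0.86, 0.87])   # illustrative FVE values
print("knee at d =", l_method_knee(d, fve))      # -> knee at d = 2
```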

We present in Fig. 11 the results of the six MESA iterations illustrating accelerated sampling and exploration of the intrinsic manifold. We project the sampled molecular configurations into the 2D intrinsic manifold spanned by the two MESA CVs discovered in each iteration, colored according to three physical order parameters to illuminate the configurational diversity in the accelerated sampling: the RMSD of the Cα atoms relative to the native state, the end-to-end extent of the molecule measured between the Cα atoms in Asn-1 and Ser-20, and the sine of the non-local dihedral angle formed by the Cα atoms in Asn-1, Asp-9, Ser-14, and Ser-20. As illustrated in Fig. 5b, this dihedral measures the signed angle between the plane containing the Cα atoms of {Asn-1, Asp-9, Ser-14} and that containing those of {Asp-9, Ser-14, Ser-20}. The Cα-RMSD measures the departure of the configurational sampling from the native state, the end-to-end distance measures the linear extent of the peptide chain, and the sine of the dihedral measures the global chirality of the molecule: a positive value indicates that the N-terminal and C-terminal strands are arranged such that the molecule adopts a right-handed chirality, a negative value that they adopt a left-handed chirality, and a value of zero that the N-terminal and C-terminal strands lie in the same plane. Convergence of the frontier of the intrinsic manifold, the volume of configurational space explored, and stabilization of the identified CVs is achieved between iterations 5 and 6.
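These order parameters are straightforward to compute from the sampled snapshots; a sketch using MDTraj with hypothetical file names follows (residue numbers refer to the PDB resSeq convention).

```python
import mdtraj as md
import numpy as np

native = md.load("1l2y.pdb")
traj = md.load("mesa_snapshots.dcd", top="1l2y.pdb")
top = traj.topology

def ca(resseq):
    """Index of the alpha-carbon of the residue with the given resSeq."""
    return top.select(f"name CA and residue {resseq}")[0]

# Calpha-RMSD from the native fold
rmsd = md.rmsd(traj, native, atom_indices=top.select("name CA"))

# end-to-end distance between the Calpha atoms of Asn-1 and Ser-20
d_e2e = md.compute_distances(traj, [[ca(1), ca(20)]])[:, 0]

# signed non-local dihedral over the Calpha atoms of
# Asn-1, Asp-9, Ser-14, Ser-20; its sine reports global chirality
theta = md.compute_dihedrals(traj, [[ca(1), ca(9), ca(14), ca(20)]])[:, 0]
chirality = np.sin(theta)
```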

Figure 11: Application of six rounds of MESA to Trp-cage in explicit solvent. Each row corresponds to a particular MESA iteration in which all molecular configurations collected in the current and previous rounds of biased sampling are projected as points into the 2D intrinsic manifold spanned by the current CVs. Each column presents a different coloration of the projected data by selected physical order parameters that illuminate the increasing configurational diversity in each successive MESA iteration: the Cα-RMSD (column 1), the end-to-end distance measured between the Cα atoms in Asn-1 and Ser-20 (column 2), and the sine of the non-local dihedral angle formed by the Cα atoms in Asn-1, Asp-9, Ser-14, and Ser-20 (column 3). Crosses represent harmonic centers of umbrella sampling runs to be deployed in the current iteration that were identified by the boundary detection algorithm. MESA rapidly drives sampling of the accessible configurational space, and converges within six iterations. Umbrella sampling barely advances the frontier between iterations 5 and 6, and the CVs and explored regions are nearly identical up to a linear transformation.

Free energy surface (Step 6). Upon convergence of MESA, we compile all of the simulation snapshots collected over the six iterative cycles to train a terminal autoencoder with which to perform umbrella sampling over the intrinsic manifold. An analogous plot to Fig. 11 for this terminal autoencoder is presented in Fig. 12.

Figure 12: Projections of all molecular snapshots harvested over the course of the six MESA iterations into the 2D intrinsic manifold spanned by the CVs identified by the terminal autoencoder. Points are colored by (left) Cα-RMSD, (center) end-to-end distance, and (right) the sine of the non-local dihedral angle.

We then perform 30 umbrella sampling runs tiling the explored region of the intrinsic manifold spanned by the converged CVs identified by the terminal autoencoder. Finally, we estimate the unbiased free energy surface by solving the WHAM equations (Section 2.2.2). We present in Fig. 13a the free energy surface and in Fig. 13b the estimated uncertainties in the landscape computed by 10 rounds of bootstrap resampling with replacement. We superpose onto the landscape representative molecular configurations to illustrate the folding pathways, and also mark the native state (N) and a partially unfolded state (L) that is similar to the previously reported “loop structure” in which the 3₁₀-helix is unfolded while the other secondary structure elements remain intact 144, 148, 165, 166. Comparison of Fig. 13a with Fig. 12 reveals a physical interpretation of the CVs. We find one CV to be strongly correlated with the global molecular chirality (Pearson correlation coefficient of -0.92 with vanishingly small two-tailed p-value), indicating that MESA has discovered a CV that distinguishes left-handed, right-handed, and planar molecular configurations. Meanwhile, the other CV is strongly correlated with the end-to-end distance (Pearson correlation coefficient of -0.91 with vanishingly small two-tailed p-value), indicating that MESA identifies a second CV tracking the global linear extent of the peptide.
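Correlations of this kind can be reproduced directly from the projected data; a short sketch, assuming the bottleneck activations of the terminal encoder and the order parameters computed above (the CV-to-observable pairing is illustrative), follows.

```python
from scipy.stats import pearsonr

# cv: (n_frames, 2) bottleneck activations of the terminal encoder;
# chirality (sine of the non-local dihedral) and d_e2e as computed above
for j, (name, obs) in enumerate([("chirality", chirality),
                                 ("end-to-end", d_e2e)]):
    r, p = pearsonr(cv[:, j], obs)
    print(f"CV{j + 1} vs {name}: r = {r:+.2f}, two-tailed p = {p:.1e}")
```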

These two CVs describing the important collective motions of Trp-cage map out a rugged 2D folding landscape and resolve a low free energy folding pathway indicated by the blue arrows in Fig. 13a. Starting from extended configurations with an intact N-terminal α-helix, fully unfolded 3₁₀-helix, and different global chiralities – states B, C, D, E – the system moves along shallow valleys within a relatively flat free energy plateau to reach an intermediate state A where the hydrophobic core and a 7 Å salt bridge between Asp-9 and Arg-16 are formed. In this configuration, Trp-6 is buried in the hydrophobic core and the free energy of the system is 3-7 k_BT lower than the plateau. Intermediate state A is structurally quite similar to the native state, possessing a Cα-RMSD of 3 Å and global molecular planarity. The final transition to the native state N proceeds down a steep free energy valley corresponding to folding of the 3₁₀-helix and a further decrease in free energy.

The loop structure L represents a low-lying metastable state that is only 2 k_BT less stable than the native fold, and accessible over a 2 k_BT free energy barrier. Transformation of the native fold into the loop structure is accompanied by unfolding of the 3₁₀-helix and the concomitant adoption of a left-handed chirality. In results reported by Juraszek et al. 148 and Kim et al. 144, the L state is an important intermediate in the nucleation-condensation folding path.

Figure 13: Free energy surfaces for Trp-cage in explicit solvent obtained from 2D biased sampling in MESA CVs. (a,b) Free energy surface and associated estimated uncertainties computed by biased sampling in the MESA-identified CVs. (c,d) Free energy surface in (Cα-RMSD, helix-RMSD) and associated uncertainties computed by reweighting the biased simulation data collected in the CVs. Cα-RMSD is the RMSD of all Cα atoms in the peptide from the Trp-cage native fold. Helix-RMSD is the RMSD of the Cα atoms in the N-terminal α-helix (residues 2-8) from the Trp-cage native fold. Free energies are reported in units of k_BT, where k_B is Boltzmann’s constant and T = 298 K. Uncertainties in the landscapes are computed by 10 rounds of bootstrap resampling with replacement. Superposed on the landscape are representative molecular configurations in which the N-terminal α-helix region is colored in blue, the 3₁₀-helix in green, and the C-terminal polyproline region in red, with the Trp-6 side chain explicitly rendered.

In Fig. 13c we project our FES into the two physical order parameters (Cα-RMSD, helix-RMSD) – where helix-RMSD is the RMSD of the Cα atoms in the N-terminal α-helix (residues 2-8) from the Trp-cage native fold – in order to draw comparisons with landscapes and folding pathways previously reported by Juraszek et al. (Fig. 3) 148, Kim et al. (Fig. 2) 144, Paschek et al. (Fig. 3) 167, and Deng et al. (Fig. 5) 156. The FES mapped out in these order parameters by MESA shows good agreement with that computed by replica exchange molecular dynamics (REMD) by Juraszek et al. employing the OPLS-AA force field 168 and SPC water model 169. Our folding pathway also closely resembles the diffusion-collision I-P-N folding path resolved in that study using transition path sampling (TPS). Conversely, in our calculations the N-terminal α-helix is always intact, whereas Juraszek et al. observe it to unfold, opening up a second nucleation-condensation folding pathway via the L state. Interestingly, this folding path is only observable using TPS, and proceeds through regions of the FES unmapped by the REMD calculations, which – as the authors observe – had difficulty surmounting high free energy barriers and may not be fully converged. We note that preservation of α-helical content in the unfolded state is consistent with spectroscopy measurements conducted by Ahmed et al. 170 suggesting early formation of the α-helix 148. The Amber03 force field employed in this work possesses an elevated propensity for α-helicity relative to the OPLS-AA force field employed by Juraszek et al. 171, 148, 167, which we suggest may be at the root of the high stability of the α-helix observed in our simulations.

Similarly, our FES also shows good correspondence with that determined by Kim et al. 144 employing the Amber ff03w force field 172 and TIP4P/2005 water model 173. In particular, the line of metastable basins at helix-RMSD ≈ 0.3 Å maps to the F, I, J, and K basins in the vicinity of the native fold identified in that work. However, our landscape only extends up to helix-RMSD values of ~2 Å with the N-terminal α-helix always intact, whereas Kim et al. resolve helix-RMSD values of up to 4.5 Å. These authors achieve helix unfolding by high-temperature denaturation and estimate the FES from 25 independent folding trajectories that are terminated upon attaining the native fold. Accordingly – as the authors observe – the simulation data are not sampled from an equilibrated distribution, and the FES cannot be considered thermodynamically converged. We therefore suggest that the high-helix-RMSD configurations resolved by Kim et al. actually lie far higher in free energy, are not thermally accessible at room temperature, and are a result of the simulation protocol in which folding runs were initialized from thermally denatured starting structures. The folding pathway resolved by MESA shows good correspondence with the latter portion of the diffusion-collision pathway (Path B) identified by Kim et al. using diffusion maps. We also resolve the metastable loop structure L, termed LOOP-I by Kim et al., and the near-native intermediate B, termed HLX-I. We do not resolve the upper portion of the diffusion-collision pathway, nor the nucleation-condensation pathway (Path A), both of which correspond to folding of the initially unfolded N-terminal α-helix. Interestingly, the structural details and relative propensity of the nucleation-condensation pathways identified by Juraszek et al. and Kim et al. are not in agreement. Kim et al. ascribe these discrepancies to differences in the force field, but they are also likely a result of the nonequilibrium manner in which data were collected in that work 144. We also suggest that the root of the approximate symmetry in the horseshoe-like landscape resolved by Kim et al. is the global molecular chirality 144 – the same origin as the approximate symmetry in our own CV – and that this CV does not distinguish the two folding mechanisms. Careful inspection of Fig. 8 in that work, in which points are colored by helix-RMSD, suggests that both the diffusion-collision and nucleation-condensation pathways can proceed under each global chirality 144. In our work, we also observe the diffusion-collision path to proceed along both the left- and right-handed chiralities, but do not resolve the nucleation-condensation routes since the N-terminal α-helix remains intact.

Our folding pathways show the best correspondence with those identified by Paschek et al. 167, who employed REMD simulations using the Amber ff94 force field 174 and TIP3P water model 175. In excellent agreement with our calculations, these authors identified a single diffusion-collision folding pathway in good correspondence with that identified by both Juraszek et al. 148 and Kim et al. 144. Paschek et al. share our concern, however, that the excessive stability of helices in the protein force field may artificially overemphasize this folding route 167. Our results are also in good agreement with the work of Deng et al. 156, who used Markov state models to analyze a 208 μs simulation performed by Lindorff-Larsen et al. 176 employing the CHARMM22 force field 177, 178 and TIP3P water model 175. These researchers also resolved competing diffusion-collision and nucleation-condensation pathways, but found the former to be overwhelmingly more probable than the latter.

In sum, MESA achieves automated data-driven discovery of two CVs parameterizing the important collective motions of Trp-cage, in which it then executes accelerated sampling of configurational space. The resultant FES in these two CVs identifies the important (meta)stable configurational states, and our accelerated sampling protocol probes a ~20 k_BT range in free energy. The FES, and the folding pathways it contains, are in good agreement with prior work 148, 144, 167, 156, although in the present study we do not observe unfolding of the N-terminal α-helix and therefore do not resolve the nucleation-condensation folding pathway. The stability of the N-terminal α-helix is likely largely attributable to the favorability of helical secondary structure within the Amber03 force field 122, 171, 167, and it would be of interest to compare the CVs, FES, and folding pathways resolved by MESA under different choices of protein force fields and solvent models.

We observe that MESA also offers a means to enhance configurational sampling of particular secondary structural elements by tuning the error function used to train the autoencoder to focus on particular regions of the protein (Eqn. 6). Our current choice of error function trains the network to optimally reconstruct – under translation and rigid rotation – the protein coordinates, wherein the contribution of each atom to the reconstruction error is equally weighted. By adapting the error function to more heavily weight particular atoms, we can train the network to focus on accurate reconstruction of particular regions of the protein secondary structure. In the present case, placing more weight on the atoms residing within the N-terminal α-helix may nudge MESA to discover CVs tailored to this region and assist in improved sampling of secondary structural changes there.
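A per-atom weighting of this kind amounts to a one-line change in the reconstruction loss; the sketch below (in TensorFlow/Keras, with illustrative names) upweights a chosen set of atoms, assuming the network outputs flattened (x, y, z) coordinates already aligned to the reference configurations.

```python
import tensorflow as tf

def weighted_alignment_mse(atom_weights):
    """Build a per-atom weighted reconstruction loss.

    atom_weights: (n_atoms,) array, e.g. upweighting the Calpha atoms of
    the N-terminal alpha-helix to focus CV discovery on that region.
    Assumes y_true/y_pred are flattened (x, y, z) coordinates.
    """
    w = tf.repeat(tf.constant(atom_weights, dtype=tf.float32), 3)  # per coordinate
    def loss(y_true, y_pred):
        return tf.reduce_mean(w * tf.square(y_true - y_pred))
    return loss

# autoencoder.compile(optimizer="adam", loss=weighted_alignment_mse(weights))
```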

Benchmarking of MESA acceleration. In aggregate, the six rounds of MESA required 72 GPU-hours on a GeForce GTX 960 card, with the cost dominated by the 300 ns of molecular simulation rather than network training. To quantify the acceleration in exploration of configurational space offered by MESA, we conducted a 72 GPU-hour unbiased simulation commencing from the Trp-cage native state to generate a 300 ns simulation trajectory. Projection in Fig. 14a of the simulation snapshots resulting from this unbiased calculation into the CVs of the terminal autoencoder reveals that it is entirely confined to the native basin and samples a tiny fraction of the configurational space explored by MESA. The corresponding unbiased FES in Fig. 14b shows that the unbiased run probes a free energy range of only ~7 k_BT, compared to the far more expansive sampling and ~20 k_BT range explored by MESA (Fig. 13a).

Figure 14: Benchmarking MESA acceleration for Trp-cage. (a) Projection into the MESA-identified CVs of the collated configurations collected throughout the six iterations of MESA (blue), and of a 300 ns unbiased MD simulation commencing from the Trp-cage native state (red). Both MESA and the unbiased run require 72 GPU-hours of execution time. (b) The free energy surface computed from the unbiased simulation trajectory resolves only the native basin and samples a ~7 k_BT free energy range, compared to the far more comprehensive configurational sampling and ~20 k_BT range achieved by MESA (Fig. 13a).

4 Conclusions

In this work, we have introduced a new methodology – Molecular Enhanced Sampling with Autoencoders (MESA) – for data-driven order parameter discovery and accelerated sampling of the configurational space of macromolecular and biomolecular systems. The methodology requires minimal prior knowledge of the system, needing only an initial short simulation trajectory to seed iterative, on-the-fly discovery of low-dimensional collective variables and accelerated sampling of configurational space. The essence of our approach is to iteratively train auto-associative artificial neural networks (autoencoders) to learn low-dimensional projections into nonlinear collective variables (CVs) coincident with the important collective molecular motions and which – in contrast with other nonlinear manifold learning approaches – are explicit and differentiable functions of the atomic coordinates. The availability of this explicit mapping admits analytical expressions propagating biasing potentials in the CVs to biasing forces on the atoms, thereby enabling accelerated sampling directly in the data-driven CVs using collective variable biasing techniques. The capacity to apply biasing forces directly within the nonlinear CVs distinguishes our approach from existing data-driven accelerated sampling techniques that appeal to proxy variables, basis set projections, local linear dimensionality reduction, or judicious initialization of unbiased calculations to enhance sampling. We perform interleaved rounds of autoencoder CV discovery and biased sampling until we achieve convergence in the CVs and the region of explored configurational space. We have largely automated the MESA framework to require minimal user intervention, and made it freely available for public download (https://github.com/weiHelloWorld/accelerated_sampling_with_autoencoder) together with a plugin implementing the biasing forces within the OpenMM simulation package 86, 88 (https://github.com/weiHelloWorld/ANN_Force).
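The plugin evaluates these biasing forces analytically inside OpenMM; purely for illustration, the chain-rule propagation from a harmonic restraint in CV space to Cartesian forces can be sketched with automatic differentiation as follows (the function name and the use of TensorFlow are assumptions, not the plugin's implementation).

```python
import tensorflow as tf

def bias_forces(encoder, coords, centers, k=100.0):
    """Harmonic umbrella restraint in autoencoder CV space, propagated to
    Cartesian biasing forces by the chain rule via automatic differentiation.

    encoder : differentiable map from flattened coordinates to CVs
    coords  : (1, 3N) tensor of atomic coordinates
    centers : (n_cv,) harmonic restraint centers in CV space
    """
    x = tf.convert_to_tensor(coords, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        cv = encoder(x)                                    # xi(x)
        u_bias = 0.5 * k * tf.reduce_sum(tf.square(cv - centers))
    return u_bias, -tape.gradient(u_bias, x)               # F = -(dU/dxi)(dxi/dx)
```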

We have validated MESA in simulations of alanine dipeptide in vacuum and Trp-cage in explicit water. In each case we recover low-dimensional nonlinear CVs and efficiently drive sampling of configurational space to recover converged free energy surfaces (FES) resolving the important metastable states and folding pathways. Our application to alanine dipeptide illuminates the difficulties associated with periodicities in the intrinsic manifold – in this case a flat torus – and we demonstrate how to confront this issue by increasing the dimensionality of the embedding space to fully unfold the manifold and prevent spurious intersections. In the case of Trp-cage, we recover a FES and diffusion-collision folding mechanism in good agreement with prior work. The favorability of helical secondary structure within the protein force field means that we did not observe unfolding of the N-terminal α-helix, and therefore did not resolve the nucleation-condensation folding pathway. It would be of interest in future work to compare the MESA CVs and FES predictions under different choices of protein and solvent potential functions. Benchmarking of the acceleration offered by MESA in sampling the Trp-cage configurational space demonstrated a three-fold increase in the sampled range of free energy and a massive enhancement over the exploration achieved by unbiased calculations utilizing the same computational resources. We expect MESA to prove most valuable for large biomolecular systems about which there is little expert knowledge of the important conformational motions and where expensive computation is at a premium. Preliminary testing demonstrates that autoencoders with 15,000-1000-3-1000-15,000 topologies can be efficiently trained over 32,000 configurations in under 40 GPU-minutes, supporting the scalability of network training to large molecular systems. Generation of sufficient quantities of training data through biased simulation and robust application of regularization are of prime concern in achieving good network performance while avoiding overfitting in these sizable networks. We note that the computational burden of network training may be alleviated through coarse-grained representations of the molecule to the autoencoder employing, for example, only backbone atom coordinates or internal dihedral angles, and that of biased simulation through the use of coarse-grained or multi-resolution molecular models 179.

We see many avenues for future development of our approach. First, we have employed umbrella sampling as an efficient and embarrassingly parallel accelerated sampling approach, but anticipate that variants of metadynamics 29, 30, 31 may offer an attractive alternative by eliminating the need for boundary detection, initialization of swarms of independent umbrella runs, and adaptive tuning of umbrella potential force constants. Second, the development of bespoke error functions used to train the network presents a powerful means to incorporate prior system knowledge and focus the calculated CVs and accelerated sampling on particular regions and features of the system. For example, this approach may be of value in unfolding particular secondary structure elements, focusing sampling on enzymatic active sites, or actuating important allosteric motions. One drawback of the current error function is that there is no ranking of the CVs discovered by the autoencoder, such that the relative importance of the collective modes is not apparent. The hierarchical error functions proposed by Scholz and Vigário 82, 81 extract CVs within a nested hierarchy of subnetworks that discover successively higher dimensional projections that minimize the reconstruction error subject to the constraint that all lower dimensional projections also minimize the reconstruction error. This imposition of a rank ordering on the CVs can help interpret the relative importance of the identified collective motions and also stabilize the CV discovery 82. Third, rather than training a single network over all of the collated simulation data in each iteration to discover and refine good global CVs, we propose that it may be beneficial to train independent networks over different regions of the inherent manifold in order to discover good local CVs. In the spirit of local tangent space alignment 55 and iMapD 80, local CVs adapted to their neighborhood of configurational space can resolve the collective motions relevant to that region and improve the efficiency of the enhanced sampling. The biased sampling data furnished from each network may then be patched together using reweighting techniques 106, and the terminal landscape constructed in the union of all local CVs or a unified set of global CVs identified post hoc in the final MESA iteration.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. CHE-1664426.

Appendix

Figure A1: Determination of the dimensionality of the intrinsic manifold in each iteration of MESA applied to alanine dipeptide in vacuum. The fraction of variance explained (FVE) by the autoencoder-reconstructed outputs as a function of the number of bottleneck nodes was computed according to Eqn. 4. For each number of bottleneck nodes at each iteration, we independently initialized and trained five autoencoders with identical topologies over the same data and plotted the calculated distribution as a boxplot. The boxes span the upper and lower quartiles of the FVE distribution, while the whiskers show the range of the distribution with the exception of outliers that are rendered as points. A knee in each plot was ascertained using the L-method, which determines the two piecewise linear fits that best explain the data 101. The dimensionality of the intrinsic manifold is determined to be two for all 10 iterations of MESA. Boxplots were generated using the seaborn Python library (https://seaborn.pydata.org).
Figure A2: Determination of the dimensionality of the intrinsic manifold in each iteration of MESA applied to Trp-cage in water. Construction, interpretation, and analysis of the boxplots are identical to that in Fig. A1. The dimensionality of the intrinsic manifold is determined to be two for all six iterations of MESA.

References

  • 1 Hashemian, B.; Millán, D. and Arroyo, M., The Journal of Chemical Physics, 2013, 139(21).
  • 2 Bernardi, R. C.; Melo, M. C. and Schulten, K., Biochimica et Biophysica Acta, 2015, 1850(5), 872–877.
  • 3 Karplus, M. and Petsko, G. A., Nature, 1990, 347(6294), 631–639.
  • 4 Rohrdanz, M. A.; Zheng, W. and Clementi, C., Annual Review of Physical Chemistry, 2013, 64, 295–316.
  • 5 Abrams, C. and Bussi, G., Entropy, 2013, 16(1), 163–199.
  • 6 Dellago, C.; Bolhuis, P. G.; Csajka, F. S. and Chandler, D., The Journal of Chemical Physics, 1998, 108(5), 1964–1977.
  • 7 Bolhuis, P. G.; Chandler, D.; Dellago, C. and Geissler, P. L., Annual Review of Physical Chemistry, 2002, 53(1), 291–318.
  • 8 Rogal, J. and Bolhuis, P. G., The Journal of Chemical Physics, 2008, 129(22), 224107.
  • 9 Moroni, D.; Bolhuis, P. G. and van Erp, T. S., The Journal of Chemical Physics, 2004, 120(9), 4055–4065.
  • 10 van Erp, T. S.; Moroni, D. and Bolhuis, P. G., The Journal of Chemical Physics, 2003, 118(17), 7762–7774.
  • 11 Weinan, E.; Ren, W. and Vanden-Eijnden, E., Physical Review B, 2002, 66(5), 052301.
  • 12 Weinan, E.; Ren, W. and Vanden-Eijnden, E., Journal of Physical Chemistry B, 2005, 109(14), 6688–6693.
  • 13 Jónsson, H.; Mills, G. and Jacobsen, K. W. In Classical and Quantum Dynamics in Condensed Phase Simulations, Berne, B. J.; Ciccotti, G. and Coker, D. F., Eds.; World Scientific, Singapore, 1998; page 385.
  • 14 Sheppard, D.; Terrell, R. and Henkelman, G., The Journal of Chemical Physics, 2008, 128(13), 134106.
  • 15 Borrero, E. E. and Escobedo, F. A., The Journal of Chemical Physics, 2007, 127(16), 164101.
  • 16 Allen, R. J.; Valeriani, C. and ten Wolde, P. R., Journal of Physics: Condensed Matter, 2009, 21(46), 463102.
  • 17 Brooks, S. P. and Morgan, B. J., The Statistician, 1995, pages 241–257.
  • 18 Berg, B. A. and Neuhaus, T., Physical Review Letters, 1992, 68(1), 9.
  • 19 Hansmann, U. H., Chemical Physics Letters, 1997, 281(1), 140–150.
  • 20 Sugita, Y. and Okamoto, Y., Chemical Physics Letters, 1999, 314(1), 141–151.
  • 21 Sugita, Y. and Okamoto, Y., Chemical Physics Letters, 2000, 329(3), 261–270.
  • 22 Mitsutake, A.; Sugita, Y. and Okamoto, Y., Peptide Science, 2001, 60(2), 96–123.
  • 23 Fukunishi, H.; Watanabe, O. and Takada, S., The Journal of Chemical Physics, 2002, 116(20), 9058–9067.
  • 24 Okamoto, Y., Journal of Molecular Graphics and Modelling, 2004, 22(5), 425–439.
  • 25 Liu, P.; Huang, X.; Zhou, R. and Berne, B. J., The Journal of Physical Chemistry B, 2006, 110(38), 19018–19022.
  • 26 Torrie, G. M. and Valleau, J. P., Journal of Computational Physics, 1977, 23(2), 187–199.
  • 27 Voter, A. F., Physical Review Letters, 1997, 78(20), 3908.
  • 28 Grubmüller, H., Physical Review E, 1995, 52(3), 2893.
  • 29 Laio, A. and Parrinello, M., Proceedings of the National Academy of Sciences, 2002, 99(20), 12562–12566.
  • 30 Huber, T.; Torda, A. E. and van Gunsteren, W. F., Journal of Computer-Aided Molecular Design, 1994, 8(6), 695–708.
  • 31 Barducci, A.; Bussi, G. and Parrinello, M., Physical Review Letters, 2008, 100(2), 020603.
  • 32 Rosso, L.; Mináry, P.; Zhu, Z. and Tuckerman, M. E., The Journal of Chemical Physics, 2002, 116(11), 4389–4402.
  • 33 Maragliano, L. and Vanden-Eijnden, E., Chemical Physics Letters, 2006, 426(1), 168–175.
  • 34 Abrams, J. B. and Tuckerman, M. E., The Journal of Physical Chemistry B, 2008, 112(49), 15742–15757.
  • 35 Sørensen, M. R. and Voter, A. F., The Journal of Chemical Physics, 2000, 112(21), 9599–9606.
  • 36 den Otter, W. K. and Briels, W. J., The Journal of Chemical Physics, 1998, 109(11), 4139–4146.
  • 37 Carter, E.; Ciccotti, G.; Hynes, J. T. and Kapral, R., Chemical Physics Letters, 1989, 156(5), 472–477.
  • 38 Ciccotti, G. and Ferrario, M., Molecular Simulation, 2004, 30(11-12), 787–793.
  • 39 Darve, E.; Rodríguez-Gómez, D. and Pohorille, A., The Journal of Chemical Physics, 2008, 128(14), 144120.
  • 40 Kirkwood, J. G., The Journal of Chemical Physics, 1935, 3(5), 300–313.
  • 41 Straatsma, T. and Berendsen, H., The Journal of Chemical Physics, 1988, 89(9), 5876–5886.
  • 42 Wang, J. and Ferguson, A., Molecular Simulation, 2017, pages 1–18.
  • 43 Hegger, R.; Altis, A.; Nguyen, P. H. and Stock, G., Physical Review Letters, 2007, 98(2), 028102.
  • 44 Ferguson, A. L.; Panagiotopoulos, A. Z.; Kevrekidis, I. G. and Debenedetti, P. G., Chemical Physics Letters, 2011, 509(1), 1–11.
  • 45 Ferguson, A. L.; Panagiotopoulos, A. Z.; Debenedetti, P. G. and Kevrekidis, I. G., Proceedings of the National Academy of Sciences, 2010, 107(31), 13597–13602.
  • 46 Zhuravlev, P. I.; Materese, C. K. and Papoian, G. A., The Journal of Physical Chemistry B, 2009, 113(26), 8800–8812.
  • 47 Amadei, A.; Linssen, A. and Berendsen, H. J., Proteins: Structure, Function, and Bioinformatics, 1993, 17(4), 412–425.
  • 48 García, A. E., Physical Review Letters, 1992, 68(17), 2696.
  • 49 Das, P.; Moll, M.; Stamati, H.; Kavraki, L. E. and Clementi, C., Proceedings of the National Academy of Sciences, 2006, 103(26), 9885–9890.
  • 50 Stamati, H.; Clementi, C. and Kavraki, L. E., Proteins: Structure, Function, and Bioinformatics, 2010, 78(2), 223–235.
  • 51 Ichiye, T. and Karplus, M., Proteins: Structure, Function, and Bioinformatics, 1991, 11(3), 205–217.
  • 52 Pearson, K., The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1901, 2(11), 559–572.
  • 53 Troyer, J. M. and Cohen, F. E., Proteins: Structure, Function, and Bioinformatics, 1995, 23(1), 97–110.
  • 54 Schölkopf, B.; Smola, A. and Müller, K.-R. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
  • 55 Wang, J., Geometric Structure of High-Dimensional Data and Dimensionality Reduction, Springer, 2011.
  • 56 Roweis, S. T. and Saul, L. K., Science, 2000, 290(5500), 2323–2326.
  • 57 Zhang, Z. and Wang, J. In Advances in Neural Information Processing Systems, pages 1593–1600, 2007.
  • 58 Tenenbaum, J. B.; De Silva, V. and Langford, J. C., Science, 2000, 290(5500), 2319–2323.
  • 59 Weinberger, K. Q. and Saul, L. K., International Journal of Computer Vision, 2006, 70(1), 77–90.
  • 60 Li, C.-g.; Guo, J.; Chen, G.; Nie, X.-f. and Yang, Z. In 2006 International Conference on Machine Learning and Cybernetics, pages 3201–3206. IEEE, 2006.
  • 61 Belkin, M. and Niyogi, P. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
  • 62 Donoho, D. L. and Grimes, C., Proceedings of the National Academy of Sciences, 2003, 100(10), 5591–5596.
  • 63 Ferguson, A. L.; Panagiotopoulos, A. Z.; Debenedetti, P. G. and Kevrekidis, I. G., The Journal of Chemical Physics, 2011, 134(13), 135103.
  • 64 Coifman, R. R. and Lafon, S., Applied and Computational Harmonic Analysis, 2006, 21(1), 5–30.
  • 65 Rohrdanz, M. A.; Zheng, W.; Maggioni, M. and Clementi, C., The Journal of Chemical Physics, 2011, 134(12), 03B624.
  • 66 Preto, J. and Clementi, C., Physical Chemistry Chemical Physics, 2014, 16(36), 19181–19191.
  • 67 Ceriotti, M.; Tribello, G. A. and Parrinello, M., Proceedings of the National Academy of Sciences, 2011, 108(32), 13023–13028.
  • 68 Tribello, G. A.; Ceriotti, M. and Parrinello, M., Proceedings of the National Academy of Sciences, 2012, 109(14), 5196–5201.
  • 69 Ceriotti, M.; Tribello, G. A. and Parrinello, M., Journal of Chemical Theory and Computation, 2013, 9(3), 1521–1532.
  • 70 Ferguson, A. L.; Zhang, S.; Dikiy, I.; Panagiotopoulos, A. Z.; Debenedetti, P. G. and Link, A. J., Biophysical Journal, 2010, 99(9), 3056–3065.
  • 71 Zheng, W.; Rohrdanz, M. A.; Maggioni, M. and Clementi, C., The Journal of Chemical Physics, 2011, 134(14), 144109.
  • 72 Zheng, W.; Rohrdanz, M. A. and Clementi, C., The Journal of Physical Chemistry B, 2013, 117(42), 12769–12776.
  • 73 Spiwok, V. and Králová, B., The Journal of Chemical Physics, 2011, 135(22), 224504.
  • 74 Fiorin, G.; Klein, M. L. and Hénin, J., Molecular Physics, 2013, 111(22-23), 3345–3362.
  • 75 Ma, A. and Dinner, A. R., The Journal of Physical Chemistry B, 2005, 109(14), 6769–6779.
  • 76 Peters, B. and Trout, B. L., The Journal of Chemical Physics, 2006, 125(5), 054108.
  • 77 Peters, B.; Beckham, G. T. and Trout, B. L., The Journal of Chemical Physics, 2007, 127(3), 034109.
  • 78 Abrams, C. F. and Vanden-Eijnden, E., Chemical Physics Letters, 2012, 547, 114–119.
  • 79 Branduardi, D.; Gervasio, F. L. and Parrinello, M., The Journal of Chemical Physics, 2007, 126(5), 054103.
  • 80 Chiavazzo, E.; Covino, R.; Coifman, R. R.; Gear, C. W.; Georgiou, A. S.; Hummer, G. and Kevrekidis, I. G., Proceedings of the National Academy of Sciences, 2017, page 201621481.
  • 81 Scholz, M. and Vigário, R. In ESANN’2002 proceedings - European Symposium on Artificial Neural Networks, pages 439–444, 2002.
  • 82 Scholz, M.; Fraunholz, M. and Selbig, J. In Principal Manifolds for Data Visualization and Dimension Reduction; Springer, 2008; pages 44–67.
  • 83 Hinton, G. E. and Salakhutdinov, R. R., Science, 2006, 313(5786), 504–507.
  • 84 Yan, S.; Xu, D.; Zhang, B.; Zhang, H.-J.; Yang, Q. and Lin, S., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(1), 40–51.
  • 85 Wang, W.; Huang, Y.; Wang, Y. and Wang, L. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 490–497, 2014.
  • 86 Eastman, P.; Swails, J.; Chodera, J. D.; McGibbon, R. T.; Zhao, Y.; Beauchamp, K. A.; Wang, L.-P.; Simmonett, A. C.; Harrigan, M. P. and Stern, C. D., PLOS Computational Biology, 2017, 13(7), e1005659.
  • 87 Eastman, P.; Friedrichs, M. S.; Chodera, J. D.; Radmer, R. J.; Bruns, C. M.; Ku, J. P.; Beauchamp, K. A.; Lane, T. J.; Wang, L.-P. and Shukla, D., Journal of Chemical Theory and Computation, 2012, 9(1), 461–469.
  • 88 Friedrichs, M. S.; Eastman, P.; Vaidyanathan, V.; Houston, M.; Legrand, S.; Beberg, A. L.; Ensign, D. L.; Bruns, C. M. and Pande, V. S., Journal of Computational Chemistry, 2009, 30(6), 864–872.
  • 89 Hassoun, M. H., Fundamentals of Artificial Neural Networks, MIT press, 1995.
  • 90 McCulloch, W. S. and Pitts, W., The Bulletin of Mathematical Biophysics, 1943, 5(4), 115–133.
  • 91 Chen, T. and Chen, H., IEEE Transactions on Neural Networks, 1995, 6(4), 911–917.
  • 92 Kramer, M. A., AIChE Journal, 1991, 37(2), 233–243.
  • 93 Kirby, M. J. and Miranda, R., Neural Computation, 1996, 8(2), 390–402.
  • 94 Bourlard, H. and Kamp, Y., Biological Cybernetics, 1988, 59(4), 291–294.
  • 95 Friedman, J.; Hastie, T. and Tibshirani, R., The Elements of Statistical Learning, Vol. 1, Springer Series in Statistics, New York, 2001.
  • 96 Collobert, R. and Bengio, S. In Proceedings of the twenty-first international conference on Machine learning, page 23. ACM, 2004.
  • 97 Bengio, Y. In Neural networks: Tricks of the trade; Springer, 2012; pages 437–478.
  • 98 Rumelhart, D. E.; Durbin, R.; Golden, R. and Chauvin, Y., Backpropagation: Theory, architectures and applications, 1995, pages 1–34.
  • 99 Rumelhart, D. E.; Hinton, G. E. and Williams, R. J. In Cognitive Modeling, Polk, T. A. and Seifert, C. M., Eds.; MIT Press, 1988; chapter 8.
  • 100 Sutskever, I.; Martens, J.; Dahl, G. and Hinton, G. In International conference on machine learning, pages 1139–1147, 2013.
  • 101 Salvador, S. and Chan, P. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), pages 576–584. IEEE, 2004.
  • 102 Krizhevsky, A.; Sutskever, I. and Hinton, G. E. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • 103 Kabsch, W., Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 1976, 32(5), 922–923.
  • 104 Roux, B., Computer Physics Communications, 1995, 91(1-3), 275–282.
  • 105 Chipot, C. and Pohorille, A., Free Energy Calculations, Springer, 2007.
  • 106 Ferguson, A. L., Journal of Computational Chemistry, 2017, 38(18), 1583–1605.
  • 107 Kumar, S.; Rosenberg, J. M.; Bouzida, D.; Swendsen, R. H. and Kollman, P. A., Journal of Computational Chemistry, 1992, 13(8), 1011–1021.
  • 108 Bartels, C., Chemical Physics Letters, 2000, 331(5), 446–454.
  • 109 Ferrenberg, A. M. and Swendsen, R. H., Physical Review Letters, 1989, 63(12), 1195.
  • 110 Bennett, C. H., Journal of Computational Physics, 1976, 22(2), 245–268.
  • 111 Bartels, C. and Karplus, M., Journal of Computational Chemistry, 1997, 18(12), 1450–1462.
  • 112 Habeck, M., Physical Review Letters, 2012, 109(10), 100601.
  • 113 Zhu, F. and Hummer, G., Journal of Computational Chemistry, 2012, 33(4), 453–465.
  • 114 Gallicchio, E.; Andrec, M.; Felts, A. K. and Levy, R. M., The Journal of Physical Chemistry B, 2005, 109(14), 6722–6731.
  • 115 Hartmann, C.; Latorre, J. C. and Ciccotti, G., The European Physical Journal-Special Topics, 2011, 200(1), 73–89.
  • 116 Neidigh, J. W.; Fesinmeyer, R. M. and Andersen, N. H., Nature Structural & Molecular Biology, 2002, 9(6), 425.
  • 117 Sultan, M. M. and Pande, V. S., The Journal of Physical Chemistry B, 2017, in press, doi: 10.1021/acs.jpcb.7b06896.
  • 118 Edelsbrunner, H.; Kirkpatrick, D. and Seidel, R., IEEE Transactions on Information Theory, 1983, 29(4), 551–559.
  • 119 Edelsbrunner, H. and Mücke, E. P., ACM Transactions on Graphics, 1994, 13(1), 43–72.
  • 120 Edelsbrunner, H., Algorithms and Combinatorics, 2003, 25, 379–404.
  • 121 Humphrey, W.; Dalke, A. and Schulten, K., Journal of Molecular Graphics, 1996, 14(1), 33–38.
  • 122 Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A. and Case, D. A., Journal of Computational Chemistry, 2004, 25(9), 1157–1174.
  • 123 Schlick, T., Molecular Modeling and Simulation: An Interdisciplinary Guide, Vol. 21, Springer Science & Business Media, 2010.
  • 124 Hess, B.; Bekker, H.; Berendsen, H. J. and Fraaije, J. G., Journal of Computational Chemistry, 1997, 18(12), 1463–1472.
  • 125 Allen, M. and Tildesley, D., Computer Simulation of Liquids, Oxford University Press, New York, 1987.
  • 126 Horn, H. W.; Swope, W. C.; Pitera, J. W.; Madura, J. D.; Dick, T. J.; Hura, G. L. and Head-Gordon, T., The Journal of Chemical Physics, 2004, 120(20), 9665–9678.
  • 127 Berman, H. M.; Battistuz, T.; Bhat, T. N.; Bluhm, W. F.; Bourne, P. E.; Burkhardt, K.; Feng, Z.; Gilliland, G. L.; Iype, L.; Jain, S.; Fagan, P.; Marvin, J.; Padilla, D.; Ravichandran, V.; Schneider, B.; Thanki, N.; Weissig, H.; Westbrook, J. D. and Zardecki, C., Acta Crystallographica Section D: Biological Crystallography, 2002, 58(6), 899–907.
  • 128 Andersen, H. C., The Journal of Chemical Physics, 1980, 72(4), 2384–2393.
  • 129 Chow, K.-H. and Ferguson, D. M., Computer Physics Communications, 1995, 91(1-3), 283–289.
  • 130 Åqvist, J.; Wennerström, P.; Nervall, M.; Bjelic, S. and Brandsdal, B. O., Chemical Physics Letters, 2004, 384(4), 288–294.
  • 131 Hockney, R. W. and Eastwood, J. W., Computer Simulation Using Particles, CRC Press, 1988.
  • 132 Darden, T.; York, D. and Pedersen, L., The Journal of Chemical Physics, 1993, 98(12), 10089–10092.
  • 133 Al-Rfou, R.; Alain, G.; Almahairi, A.; Angermueller, C.; Bahdanau, D.; Ballas, N.; Bastien, F.; Bayer, J.; Belikov, A.; Belopolsky, A.; Bengio, Y.; Bergeron, A.; Bergstra, J.; Bisson, V.; Bleecher Snyder, J.; Bouchard, N.; Boulanger-Lewandowski, N.; Bouthillier, X.; de Brébisson, A.; Breuleux, O.; Carrier, P.-L.; Cho, K.; Chorowski, J.; Christiano, P.; Cooijmans, T.; Côté, M.-A.; Côté, M.; Courville, A.; Dauphin, Y. N.; Delalleau, O.; Demouth, J.; Desjardins, G.; Dieleman, S.; Dinh, L.; Ducoffe, M.; Dumoulin, V.; Ebrahimi Kahou, S.; Erhan, D.; Fan, Z.; Firat, O.; Germain, M.; Glorot, X.; Goodfellow, I.; Graham, M.; Gulcehre, C.; Hamel, P.; Harlouchet, I.; Heng, J.-P.; Hidasi, B.; Honari, S.; Jain, A.; Jean, S.; Jia, K.; Korobov, M.; Kulkarni, V.; Lamb, A.; Lamblin, P.; Larsen, E.; Laurent, C.; Lee, S.; Lefrancois, S.; Lemieux, S.; Léonard, N.; Lin, Z.; Livezey, J. A.; Lorenz, C.; Lowin, J.; Ma, Q.; Manzagol, P.-A.; Mastropietro, O.; McGibbon, R. T.; Memisevic, R.; van Merriënboer, B.; Michalski, V.; Mirza, M.; Orlandi, A.; Pal, C.; Pascanu, R.; Pezeshki, M.; Raffel, C.; Renshaw, D.; Rocklin, M.; Romero, A.; Roth, M.; Sadowski, P.; Salvatier, J.; Savard, F.; Schlüter, J.; Schulman, J.; Schwartz, G.; Serban, I. V.; Serdyuk, D.; Shabanian, S.; Simon, E.; Spieckermann, S.; Subramanyam, S. R.; Sygnowski, J.; Tanguay, J.; van Tulder, G.; Turian, J.; Urban, S.; Vincent, P.; Visin, F.; de Vries, H.; Warde-Farley, D.; Webb, D. J.; Willson, M.; Xu, K.; Xue, L.; Yao, L.; Zhang, S. and Zhang, Y., arXiv e-prints, 2016, http://arxiv.org/abs/1605.02688.
  • 134 Hashemian, B. and Arroyo, M., The Journal of Chemical Physics, 2015, 142(4), 044102.
  • 135 Chodera, J. D.; Swope, W. C.; Pitera, J. W.; Seok, C. and Dill, K. A., Journal of Chemical Theory and Computation, 2007, 3(1), 26–41.
  • 136 Hummer, G. and Kevrekidis, I. G., The Journal of Chemical Physics, 2003, 118(23), 10762–10773.
  • 137 Chodera, J. D.; Swope, W. C.; Pitera, J. W. and Dill, K. A., Multiscale Modeling & Simulation, 2006, 5(4), 1214–1226.
  • 138 Chodera, J. D.; Singhal, N.; Pande, V. S.; Dill, K. A. and Swope, W. C., The Journal of Chemical Physics, 2007, 126(15), 04B616.
  • 139 Bolhuis, P. G.; Dellago, C. and Chandler, D., Proceedings of the National Academy of Sciences, 2000, 97(11), 5877–5882.
  • 140 Jákli, I.; Jensen, S. J. K.; Csizmadia, I. G. and Perczel, A., Chemical Physics Letters, 2012, 547, 82–88.
  • 141 Patrangenaru, V. and Ellingson, L., Nonparametric Statistics on Manifolds and Their Applications to Object Data Analysis, CRC Press, 2015.
  • 142 Whitney, H., Annals of Mathematics, 1936, pages 645–680.
  • 143 Broomhead, D. S. and King, G. P., Physica D: Nonlinear Phenomena, 1986, 20(2-3), 217–236.
  • 144 Kim, S. B.; Dsilva, C. J.; Kevrekidis, I. G. and Debenedetti, P. G., The Journal of Chemical Physics, 2015, 142(8), 085101.
  • 145 Qiu, L.; Pabit, S. A.; Roitberg, A. E. and Hagen, S. J., Journal of the American Chemical Society, 2002, 124(44), 12952–12953.
  • 146 Seshasayee, A. S. N., Theoretical Biology and Medical Modelling, 2005, 2(1), 7.
  • 147 Heyda, J.; Kozisek, M.; Bednárova, L.; Thompson, G.; Konvalinka, J.; Vondrasek, J. and Jungwirth, P., The Journal of Physical Chemistry B, 2011, 115(28), 8910–8924.
  • 148 Juraszek, J. and Bolhuis, P., Proceedings of the National Academy of Sciences, 2006, 103(43), 15859–15864.
  • 149 Kim, S. B.; Gupta, D. R. and Debenedetti, P. G., Scientific Reports, 2016, 6, 25612.
  • 150 Hatch, H. W.; Stillinger, F. H. and Debenedetti, P. G., The Journal of Physical Chemistry B, 2014, 118(28), 7761–7769.
  • 151 Kannan, S. and Zacharias, M., PloS ONE, 2014, 9(2), e88383.
  • 152 Schug, A.; Wenzel, W. and Hansmann, U. H., The Journal of Chemical Physics, 2005, 122(19), 194711.
  • 153 Snow, C. D.; Zagrovic, B. and Pande, V. S., Journal of the American Chemical Society, 2002, 124(49), 14548–14549.
  • 154 Zhou, R., Proceedings of the National Academy of Sciences, 2003, 100(23), 13280–13285.
  • 155 Zhou, R., Proteins: Structure, Function, and Bioinformatics, 2003, 53(2), 148–161.
  • 156 Deng, N.-j.; Dai, W. and Levy, R. M., The Journal of Physical Chemistry B, 2013, 117(42), 12787–12799.
  • 157 Patriksson, A.; Adams, C. M.; Kjeldsen, F.; Zubarev, R. A. and van der Spoel, D., The Journal of Physical Chemistry B, 2007, 111(46), 13147–13150.
  • 158 Pitera, J. W. and Swope, W., Proceedings of the National Academy of Sciences, 2003, 100(13), 7587–7592.
  • 159 Meuzelaar, H.; Marino, K. A.; Huerta-Viga, A.; Panman, M. R.; Smeenk, L. E.; Kettelarij, A. J.; van Maarseveen, J. H.; Timmerman, P.; Bolhuis, P. G. and Woutersen, S., The Journal of Physical Chemistry B, 2013, 117(39), 11490–11501.
  • 160 Barua, B.; Lin, J. C.; Williams, V. D.; Kummler, P.; Neidigh, J. W. and Andersen, N. H., Protein Engineering, Design & Selection, 2008, 21(3), 171–185.
  • 161 Iavarone, A. T. and Parks, J. H., Journal of the American Chemical Society, 2005, 127(24), 8606–8607.
  • 162 Rovó, P.; Stráner, P.; Láng, A.; Bartha, I.; Huszár, K.; Nyitray, L. and Perczel, A., Chemistry-A European Journal, 2013, 19(8), 2628–2640.
  • 163 Adams, C. M.; Kjeldsen, F.; Patriksson, A.; van Der Spoel, D.; Gräslund, A.; Papadopoulos, E. and Zubarev, R. A., International Journal of Mass Spectrometry, 2006, 253(3), 263–273.
  • 164 Bolhuis, P. G., Frontiers in Bioscience, 2009, 14, 2801–2828.
  • 165 Kannan, S. and Zacharias, M., Proteins: Structure, Function, and Bioinformatics, 2009, 76(2), 448–460.
  • 166 Shao, Q.; Shi, J. and Zhu, W., The Journal of Chemical Physics, 2012, 137(12), 125103.
  • 167 Paschek, D.; Hempel, S. and García, A. E., Proceedings of the National Academy of Sciences, 2008, 105(46), 17754–17759.
  • 168 Kaminski, G. A.; Friesner, R. A.; Tirado-Rives, J. and Jorgensen, W. L., The Journal of Physical Chemistry B, 2001, 105(28), 6474–6487.
  • 169 Lindahl, E.; Hess, B. and Van Der Spoel, D., Journal of Molecular Modeling, 2001, 7(8), 306–317.
  • 170 Ahmed, Z.; Beta, I. A.; Mikhonin, A. V. and Asher, S. A., Journal of the American Chemical Society, 2005, 127(31), 10943–10950.
  • 171 Best, R. B.; Buchete, N.-V. and Hummer, G., Biophysical Journal, 2008, 95(1), L07–L09.
  • 172 Best, R. B. and Mittal, J., The Journal of Physical Chemistry B, 2010, 114(46), 14916–14923.
  • 173 Abascal, J. L. and Vega, C., The Journal of Chemical Physics, 2005, 123(23), 234505.
  • 174 Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W. and Kollman, P. A., Journal of the American Chemical Society, 1995, 117(19), 5179–5197.
  • 175 Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W. and Klein, M. L., The Journal of Chemical Physics, 1983, 79(2), 926–935.
  • 176 Lindorff-Larsen, K.; Piana, S.; Dror, R. O. and Shaw, D. E., Science, 2011, 334(6055), 517–520.
  • 177 MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiórkiewicz-Kuczera, J.; Yin, D. and Karplus, M., The Journal of Physical Chemistry B, 1998, 102(18), 3586–3616.
  • 178 MacKerell, A. D.; Feig, M. and Brooks, C. L., Journal of Computational Chemistry, 2004, 25(11), 1400–1415.
  • 179 Kmiecik, S.; Gront, D.; Kolinski, M.; Wieteska, L.; Dawid, A. E. and Kolinski, A., Chemical Reviews, 2016, 116(14), 7898–7936.