Molecular dynamics (MD) simulations have long been an important tool for studying molecular systems by providing atomistic insight into physicochemical processes that cannot be easily obtained through experimentation. A key step in extracting kinetic information from molecular simulation is the recovery of the slow dynamical modes that govern the long-time evolution of system coordinates within a low-dimensional latent space. The variational approach to conformational dynamics (VAC) Noé and Nuske (2013); Nüske et al. (2014)
has been successful in providing a mathematical framework through which the eigenfunctions of the underlying transfer operator can be estimated Noé and Clementi (2017); Klus et al. (2018)
. A special case of VAC which estimates linearly optimal slow modes from mean-free input coordinates is known as time-lagged independent component analysis (TICA) Pérez-Hernández et al. (2013); Noé and Nuske (2013); Nüske et al. (2014); Noé and Clementi (2015); Noé et al. (2016); Pérez-Hernández and Noé (2016); Schwantes and Pande (2013); Klus et al. (2018).
TICA is a widely used approach that has become a standard step in the Markov state modeling pipeline Husic and Pande (2018). However, it is restricted to forming linear combinations of the input coordinates and is unable to learn the nonlinear transformations that are typically required to recover high-resolution kinetic models of all but the simplest molecular systems. Schwantes et al. addressed this limitation by applying the kernel trick to TICA to learn nonlinear functions of the input coordinates Schwantes and Pande (2015)
. A special case of radial basis function kernels was realized by Noé and Nuske in the direct application of VAC using Gaussian functions Noé and Nuske (2013). Kernel TICA (kTICA), however, suffers from a number of drawbacks that have precluded its broad adoption. First, its implementation requires $O(N^2)$ memory and $O(N^3)$ computation time for a dataset of size $N$, which becomes intractable for large datasets. In practice, this issue can be adequately alleviated by selecting a small number of landmark points to which kernel TICA is applied, and then constructing interpolative projections of the remaining points using the Nyström extension Harrigan and Pande (2017). Second, kTICA is typically highly sensitive to the kernel hyperparameters Harrigan and Pande (2017), and extensive hyperparameter tuning is typically required to obtain acceptable results. Third, the use of the kernel trick and Nyström extension compromises the differentiability of the latent space projection. Exact computation of the gradient at any new point requires the expensive evaluation of the kernel function between the new point and all points in the training set, and gradient estimates based only on the landmarks are inherently approximate M. Sultan and Pande (2017). Due to the high cost and/or instability of these gradient estimates, the slow modes estimated through kTICA are impractical for use as collective variables (CVs) for enhanced sampling or reduced-dimensional modeling approaches that require exact gradients of the CVs with respect to the input coordinates.
Deep learning offers an attractive alternative to kTICA as a means to solve these challenges. Artificial neural networks are capable of learning nonlinear functions of arbitrary complexity Hassoun (1995); Chen and Chen (1995), are generically scalable to large datasets with training cost scaling linearly in the size of the training set, produce predictions that are relatively robust to the choice of network architecture and activation functions, and admit exact expressions for the derivatives of the learned CVs with respect to the input coordinates by automatic differentiation Bastien et al. (2012); Baydin et al. (2018); Paszke et al. (2017); Schmidhuber (2015)
. A number of approaches utilizing artificial neural networks to approximate eigenfunctions of the dynamical operator have been proposed. Time-lagged autoencoders Wehmeyer and Noé (2018) utilize auto-associative neural networks to reconstruct a time-lagged signal, with suitable collective variables extracted from the bottleneck layer. Variational dynamics encoders (VDEs) Hernández et al. (2018) combine a time-lagged reconstruction loss and autocorrelation maximization within a variational autoencoder. While the exact relationship between the regression approach employed in time-lagged autoencoders and the VAC framework is not yet formalized Wehmeyer and Noé (2018), variational autoencoders (VAEs) have already been studied as estimators of the dynamical propagator Hernández et al. (2018) and in furnishing collective variables for enhanced sampling Sultan et al. (2018). The most pressing limitation of VAE approaches to date is their restriction to the estimation of only the single leading eigenfunction of the dynamical propagator. The absence of the full spectrum of slow modes fails to expose the full richness of the underlying dynamics, and limits enhanced sampling calculations in the learned CVs to acceleration along a single coordinate that may be insufficient to drive all relevant conformational interconversions.
In this work, we propose a deep-learning based method to estimate the slow dynamical modes that we term the hierarchical dynamics encoder (HDE). HDEs take advantage of the VAC framework using a neural network architecture and loss function to recover the leading modes of the spectral hierarchy of eigenfunctions of the transfer operator that evolves equilibrium-scaled probability distributions through time. In a nutshell, the HDE discovers a nonlinear featurization of the input basis to pass to the VAC framework for estimation of the leading eigenfunctions of the transfer operator.
This approach shares much technical similarity with, and was in large part inspired by, the elegant VAMPNets approach developed by Noé and coworkers Mardt et al. (2018). Both approaches employ Siamese neural networks to discover nonlinear featurizations of an input basis that are optimized using the VAMP principle. The key distinguishing feature is the objective of the two approaches. VAMPNets seek to replace the entire MSM construction pipeline of featurization, dimensionality reduction, clustering, and construction of a kinetic model. As such, the VAMPNet neural networks employ softmax activations within their output layers to generate $m$-dimensional output vectors that can be interpreted as probability assignments to each of $m$ metastable states. This network architecture achieves nonlinear featurization of the input basis and soft clustering into metastable states. The output vectors are subsequently optimized using the VAMP principle to furnish a kinetic model over these soft/fuzzy states by approximating the eigenvectors of the transfer operator over the states. In contrast, HDEs do not perform any state space partitioning but rather seek to learn continuous nonlinear functions of the input data to generate a nonlinear basis set with which to approximate the eigenfunctions of the transfer operator. This is achieved using relatively simple neural network architectures with nonlinear or linear activation functions in the output layers and subsequent optimization of the network outputs using the VAMP principle. By eschewing any clustering, the HDE approximations to the transfer operator eigenfunctions are continuous, explicit, and differentiable functions of the input coordinates that can be used to infer the mechanisms of molecular conformational transitions, employed directly in enhanced sampling calculations, and passed to standard MSM construction pipelines to perform microstate clustering and estimation of discrete kinetic models.
The structure of this paper is as follows. We first derive the theoretical foundations of the HDE as a special case of VAC, and then demonstrate its efficacy against kTICA and state-of-the-art TICA-based MSMs in applications to 1D and 2D toy systems where the true eigenfunctions are known and in molecular simulations of alanine dipeptide and WW domain.
We first recapitulate transfer operator theory and the variational approach to conformational dynamics (VAC) Noé and Nuske (2013); Nüske et al. (2014); Schütte et al. (2001); Prinz et al. (2011); Schwantes and Pande (2015), choosing to borrow the notational convention from Ref. Prinz et al. (2011). We then demonstrate how the VAC specializes to TICA, kTICA, and HDEs.
II.1 Transfer operator theory
Given the probability distribution $p_t(\mathbf{x})$ of a system configuration $\mathbf{x}$ at time $t$ and the equilibrium probability distribution $\pi(\mathbf{x})$, we define $u_t(\mathbf{x}) = p_t(\mathbf{x})/\pi(\mathbf{x})$ and the transfer operator $\mathcal{T}(\tau)$, known formally as the Perron-Frobenius operator or propagator with respect to the equilibrium density Klus et al. (2018), such that
$$u_{t+\tau}(\mathbf{y}) = \mathcal{T}(\tau) \circ u_t(\mathbf{y}) = \frac{1}{\pi(\mathbf{y})} \int \mathrm{d}\mathbf{x}\; p_t(\mathbf{x}, \mathbf{y}; \tau)\, \pi(\mathbf{x})\, u_t(\mathbf{x}),$$
where $p_t(\mathbf{x}, \mathbf{y}; \tau)$ is a transition density describing the probability that a system at $\mathbf{x}$ at time $t$ evolves to $\mathbf{y}$ after a lag time $\tau$. In general, $p_t(\mathbf{x}, \mathbf{y}; \tau)$ depends not only on the current state $\mathbf{x}$ at time $t$, but also on the previous history, and is therefore time dependent. Under the Markovian assumption, which becomes an increasingly good approximation at larger lag times $\tau$, $p_t(\mathbf{x}, \mathbf{y}; \tau)$ becomes a time-homogeneous transition density $p(\mathbf{x}, \mathbf{y}; \tau)$ independent of $t$, and the transfer operator can be written as $u_{t+\tau}(\mathbf{y}) = \mathcal{T}(\tau) \circ u_t(\mathbf{y})$, where
$$\mathcal{T}(\tau) \circ u_t(\mathbf{y}) = \frac{1}{\pi(\mathbf{y})} \int \mathrm{d}\mathbf{x}\; p(\mathbf{x}, \mathbf{y}; \tau)\, \pi(\mathbf{x})\, u_t(\mathbf{x}).$$
If the system is at equilibrium, then it additionally obeys detailed balance such that
$$\pi(\mathbf{x})\, p(\mathbf{x}, \mathbf{y}; \tau) = \pi(\mathbf{y})\, p(\mathbf{y}, \mathbf{x}; \tau),$$
which demonstrates that $\mathcal{T}(\tau)$ is self-adjoint with respect to the inner product
$$\langle f, g \rangle_\pi = \int \mathrm{d}\mathbf{x}\; f(\mathbf{x})\, g(\mathbf{x})\, \pi(\mathbf{x}).$$
Let $\{\psi_i\}_{i=1}^{\infty}$ be the eigenfunctions of $\mathcal{T}(\tau)$,
$$\mathcal{T}(\tau) \circ \psi_i(\mathbf{x}) = \lambda_i\, \psi_i(\mathbf{x}),$$
corresponding to the eigenvalues $\{\lambda_i\}$ in non-ascending order $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \ldots$ The self-adjoint nature of $\mathcal{T}(\tau)$ implies that it possesses real eigenvalues and its eigenfunctions form a complete orthonormal basis Wu and Noé (2017); Nüske et al. (2014); Noé and Nuske (2013), with orthonormality relations
$$\langle \psi_i, \psi_j \rangle_\pi = \delta_{ij}.$$
Normalization of the transition density together with the assumption of ergodicity implies that the eigenvalue spectrum is bounded from above by a unique unit eigenvalue such that $1 = \lambda_1 > \lambda_2 \ge \lambda_3 \ge \ldots$ Nüske et al. (2014); Prinz et al. (2011). Any state $u_t$ at a specific time $t$ can be written as a linear expansion in this basis of eigenfunctions,
$$u_t(\mathbf{x}) = \sum_i \langle u_t, \psi_i \rangle_\pi\, \psi_i(\mathbf{x}).$$
The evolution of $u_t$ to $u_{t+k\tau}$ after a time period $k\tau$ can be written as
$$u_{t+k\tau}(\mathbf{x}) = \mathcal{T}(k\tau) \circ u_t(\mathbf{x}) = \sum_i \lambda_i^k\, \langle u_t, \psi_i \rangle_\pi\, \psi_i(\mathbf{x}) = \sum_i e^{-k\tau/t_i}\, \langle u_t, \psi_i \rangle_\pi\, \psi_i(\mathbf{x}),$$
where $t_i$ is the implied timescale corresponding to eigenfunction $\psi_i$ given by
$$t_i = -\frac{\tau}{\ln \lambda_i}.$$
This development demonstrates the value of the eigendecomposition of the transfer operator: access to its eigenvalues and eigenfunctions enables prediction of the time evolution of any state through Eq. II.1. Since the contribution of each mode decays on its implied timescale $t_i$, to approximate the long-time evolution it suffices to have access to the leading eigenvalue/eigenfunction pairs.
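The mapping from eigenvalues to implied timescales, and the resulting decay of each mode, can be illustrated with a short numerical sketch (the eigenvalues and lag time below are hypothetical values chosen purely for illustration):

```python
import numpy as np

# Hypothetical transfer-operator eigenvalues at lag time tau, in
# non-ascending order; lambda_1 = 1 is the stationary process.
tau = 1.0
lams = np.array([1.0, 0.95, 0.80, 0.40])

# Implied timescales t_i = -tau / ln(lambda_i) for the non-stationary modes.
timescales = -tau / np.log(lams[1:])
print(timescales)  # the slowest mode (lambda = 0.95) has the longest timescale

# After k steps each mode's amplitude is scaled by lambda_i**k = exp(-k*tau/t_i),
# so the fast modes decay first and only the leading modes persist at long times.
k = 10
print(lams ** k)
```

This makes concrete why only the leading eigenvalue/eigenfunction pairs are needed for the long-time behavior: after ten steps the fastest mode above has decayed by four orders of magnitude while the slowest retains most of its amplitude.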
II.2 Variational approach to conformational dynamics (VAC)
Under the VAC, we seek an orthonormal set of approximate eigenfunctions $\{\hat{\psi}_i\}$ to estimate $\{\psi_i\}$ under the orthogonality conditions given by Eq. 6. Typically we are interested not in the full spectrum but only in the leading eigenfunctions corresponding to the largest eigenvalues and thus longest implied timescales.
To learn $\hat{\psi}_2$, we note that any state function $f$ which is orthogonal to $\psi_1$ can be expressed as
$$f = \sum_{i \ge 2} c_i\, \psi_i,$$
where $c_i = \langle f, \psi_i \rangle_\pi$ are expansion coefficients, and
$$\frac{\langle f, \mathcal{T}(\tau) \circ f \rangle_\pi}{\langle f, f \rangle_\pi} = \frac{\sum_{i \ge 2} c_i^2\, \lambda_i}{\sum_{i \ge 2} c_i^2}.$$
Since this ratio is bounded from above by $\lambda_2$ by the variational principle Nüske et al. (2014); Schwantes and Pande (2015), we can exploit this fact to approximate the first non-trivial eigenfunction by searching for a $\hat{\psi}_2$ that maximizes $\langle \hat{\psi}_2, \mathcal{T}(\tau) \circ \hat{\psi}_2 \rangle_\pi$ subject to $\langle \hat{\psi}_2, \hat{\psi}_2 \rangle_\pi = 1$ and $\langle \hat{\psi}_2, \psi_1 \rangle_\pi = 0$. The learned $\hat{\psi}_2$ is an approximation to the first non-trivial eigenfunction $\psi_2$.
We can continue this procedure to approximate higher-order eigenfunctions. In general we approximate $\psi_i$ by maximizing
$$\hat{\lambda}_i = \langle \hat{\psi}_i, \mathcal{T}(\tau) \circ \hat{\psi}_i \rangle_\pi$$
under the orthogonality constraints
$$\langle \hat{\psi}_i, \hat{\psi}_j \rangle_\pi = \delta_{ij} \quad \text{for } j \le i.$$
In essence, the VAC procedure combines a variational principle Nüske et al. (2014) with a linear variational approach perhaps most familiar from quantum mechanics Szabo and Ostlund (2012). Given an arbitrary input basis $\{\chi_j(\mathbf{x})\}_{j=1}^{N}$, the eigenfunction approximations may be written as linear expansions
$$\hat{\psi}_i(\mathbf{x}) = \sum_{j=1}^{N} s_{ij}\, \chi_j(\mathbf{x}),$$
and maximization of the objective under the orthogonality constraints leads to the generalized eigenvalue problem
$$C(\tau)\, \mathbf{s}_i = \hat{\lambda}_i\, C(0)\, \mathbf{s}_i,$$
where $C(0)_{jk} = \langle \chi_j, \chi_k \rangle_\pi$ and $C(\tau)_{jk} = \langle \chi_j, \mathcal{T}(\tau) \circ \chi_k \rangle_\pi$. Here $\mathbf{s}_i$ is the (eigen)vector of linear expansion coefficients for the approximate eigenfunction $\hat{\psi}_i$, and $\hat{\lambda}_i$ is the associated eigenvalue. The spectrum of solutions of Eq. 15 yields the best linear estimates of the eigenfunctions of the transfer operator within the basis $\{\chi_j\}$. The generalized eigenvalue problem can be solved by standard techniques Watkins (2007).
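As a concrete illustration, the generalized eigenvalue problem can be solved with off-the-shelf routines. The sketch below uses SciPy's `eigh` on small random matrices standing in for $C(0)$ and $C(\tau)$ (the matrices are synthetic; only their symmetry structure matters here):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Random stand-ins for the correlation matrices over a 4-function basis:
# C0 symmetric positive definite, Ctau symmetric (as under detailed balance).
A = rng.normal(size=(4, 4))
C0 = A @ A.T + 4 * np.eye(4)
B = rng.normal(size=(4, 4))
Ctau = 0.05 * (B + B.T)

# Solve Ctau s_i = lambda_i C0 s_i; eigh returns eigenvalues in ascending
# order, so reverse to put the slowest (largest-lambda) modes first.
lams, S = eigh(Ctau, C0)
lams, S = lams[::-1], S[:, ::-1]

# The eigenvectors satisfy the C0-orthonormality condition S^T C0 S = I,
# which is the discrete analog of the pi-weighted orthonormality above.
print(np.allclose(S.T @ C0 @ S, np.eye(4)))  # True
```

The $C(0)$-orthonormality returned by the solver is exactly the constraint $\langle \hat{\psi}_i, \hat{\psi}_j \rangle_\pi = \delta_{ij}$ expressed in the expansion coefficients.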
II.3 Estimation of VAC equations from trajectory data
In practice, the inner products defining the VAC objective and constraints are not analytically available, but can be estimated from a trajectory $\{\mathbf{x}_t\}_{t=0}^{T}$ as time-averaged correlation functions over pairs of configurations separated by the lag time $\tau$,
$$\langle f, \mathcal{T}(\tau) \circ f \rangle_\pi \approx \frac{1}{T-\tau} \sum_{t=0}^{T-\tau} f(\mathbf{x}_t)\, f(\mathbf{x}_{t+\tau}).$$
The denominator follows similarly as
$$\langle f, f \rangle_\pi \approx \frac{1}{T-\tau} \sum_{t=0}^{T-\tau} \frac{1}{2}\left[ f(\mathbf{x}_t)^2 + f(\mathbf{x}_{t+\tau})^2 \right].$$
The full expression for Eq. 12 becomes
$$\hat{\lambda}_i \approx \frac{\sum_{t=0}^{T-\tau} \hat{\psi}_i(\mathbf{x}_t)\, \hat{\psi}_i(\mathbf{x}_{t+\tau})}{\sum_{t=0}^{T-\tau} \frac{1}{2}\left[ \hat{\psi}_i(\mathbf{x}_t)^2 + \hat{\psi}_i(\mathbf{x}_{t+\tau})^2 \right]}.$$
Similarly, Eq. 13 becomes
$$\frac{1}{T-\tau} \sum_{t=0}^{T-\tau} \frac{1}{2}\left[ \hat{\psi}_i(\mathbf{x}_t)\, \hat{\psi}_j(\mathbf{x}_t) + \hat{\psi}_i(\mathbf{x}_{t+\tau})\, \hat{\psi}_j(\mathbf{x}_{t+\tau}) \right] = \delta_{ij}.$$
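A minimal sketch of how these trajectory estimators might be implemented for a basis of input features; the function name and the symmetrized averaging over both members of each time-lagged pair are illustrative choices consistent with the estimators above:

```python
import numpy as np

def correlation_matrices(chi, lag):
    """Estimate C(0) and C(tau) from a featurized trajectory.

    chi : (T, n) array of the n basis functions evaluated at each frame.
    lag : lag time in trajectory steps.
    Both matrices are symmetrized over the time-lagged pairs, as is
    appropriate for reversible dynamics.
    """
    X, Y = chi[:-lag], chi[lag:]              # time-lagged pairs
    n_pairs = X.shape[0]
    C0 = (X.T @ X + Y.T @ Y) / (2 * n_pairs)
    Ctau = (X.T @ Y + Y.T @ X) / (2 * n_pairs)
    return C0, Ctau

# Usage on a synthetic 1D trajectory with the two basis functions [x, x**2]:
rng = np.random.default_rng(1)
x = rng.normal(size=10000)
chi = np.column_stack([x, x**2])
chi -= chi.mean(axis=0)                       # mean-free basis
C0, Ctau = correlation_matrices(chi, lag=10)
print(np.allclose(Ctau, Ctau.T))              # True: symmetric by construction
```

Symmetrizing $C(\tau)$ in this way enforces the detailed-balance structure at the level of the estimator, guaranteeing real eigenvalues for the finite-data problem.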
II.4 Time-lagged independent component analysis (TICA)
In TICA, we represent $\hat{\psi}_i$ as a linear combination of the molecular coordinates $\mathbf{x}$,
$$\hat{\psi}_i(\mathbf{x}) = \mathbf{s}_i^\top \mathbf{x} + b_i,$$
where $\mathbf{s}_i$ is a vector of linear expansion coefficients and $b_i$ is an additive constant. The orthogonality condition Eq. 13 of $\hat{\psi}_i$ relative to the stationary eigenfunction $\hat{\psi}_1 = \mathbb{1}$ becomes
$$\langle \hat{\psi}_i, \mathbb{1} \rangle_\pi = \mathbb{E}\left[ \mathbf{s}_i^\top \mathbf{x} + b_i \right] = 0.$$
It follows that
$$b_i = -\mathbf{s}_i^\top\, \mathbb{E}[\mathbf{x}],$$
and therefore Eq. 24 can be written in terms of the mean-free coordinates as
$$\hat{\psi}_i(\mathbf{x}) = \mathbf{s}_i^\top \left( \mathbf{x} - \mathbb{E}[\mathbf{x}] \right).$$
II.5 Kernel TICA (kTICA)
One way of generalizing TICA to learn nonlinear features is through feature engineering. Specifically, if we can find a nonlinear mapping $\phi$ that maps configurations $\mathbf{x}$ to appropriate features $\phi(\mathbf{x})$, we can apply TICA to these features. However, designing good nonlinear features typically requires expert knowledge or expensive data preprocessing techniques. Therefore, instead of finding an explicit mapping $\phi$, an alternative approach is to apply the kernel trick using a kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$ that defines an inner product between $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ as a similarity measure in the feature space and does not require an explicit definition of $\phi$.
To apply the kernel trick to TICA, we need to reformulate TICA in terms of this kernel function. It can be shown that the coefficient vector $\mathbf{s}_i$ in Eq. II.4 is a linear combination of the featurized training configurations (see Supplementary Information of Ref. Schwantes and Pande (2015)) and may therefore be written as
$$\mathbf{s}_i = \sum_t \beta_{it}\, \phi(\mathbf{x}_t),$$
where $\beta_{it}$ are expansion coefficients. To obtain a nonlinear transformation, we replace the linear similarity measure, which is the inner product of two vectors, with a symmetric nonlinear kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$. This transforms the TICA objective and orthogonality constraints into expressions involving only kernel evaluations, which define the objective function and orthogonality constraints of kTICA. As detailed in Ref. Schwantes and Pande (2015), these expressions can be simplified to a generalized eigenvalue problem over the Gram matrix $K_{tt'} = k(\mathbf{x}_t, \mathbf{x}_{t'})$ that admits efficient solution by standard techniques Watkins (2007).
Although kTICA enables recovery of nonlinear eigenfunctions, it does have some significant drawbacks. First, it has high time and space complexity. The Gram matrix takes $O(N^2)$ time to compute and requires $O(N^2)$ memory, and the generalized eigenvalue problem takes $O(N^3)$ time to solve, which severely limits its application to large datasets. Second, results can be sensitive to the choice of kernel and there exist no rigorous guidelines for the choice of an appropriate kernel for a particular application. This limits the generalizability of the approach. Third, the kernels are typically highly sensitive to the choice of hyperparameters. For example, the Gaussian (radial basis function) kernel is sensitive to noise for small kernel widths $\sigma$, which leads to overfitting and overestimation of the implied timescales. A large $\sigma$, on the other hand, typically approaches linear TICA results, which undermines its capacity to learn nonlinear transformations Harrigan and Pande (2017); Sultan et al. (2018). This hyperparameter sensitivity typically requires significant expenditure of computational effort to tune these values in order to obtain satisfactory results Schwantes and Pande (2015); Harrigan and Pande (2017). Fourth, kTICA does not furnish an explicit expression for the mapping projecting configurations into the nonlinear feature space Hernández et al. (2018). Accordingly, it is not straightforward to apply kernel TICA within enhanced sampling protocols that require the learned latent variables to be explicit and differentiable functions of the input coordinates.
One way to ameliorate the first deficiency by reducing memory usage and computation time is to employ a variant of kTICA known as landmark kTICA. This approach selects $M \ll N$ landmarks from the dataset, computes an $M$-by-$M$ Gram matrix, and then uses the Nyström approximation to estimate the original $N$-by-$N$ Gram matrix Harrigan and Pande (2017).
II.6 Hierarchical dynamics encoder (HDE)
We now introduce the HDE approach, which employs a neural network to learn nonlinear approximations to the leading eigenfunctions directly, without requiring the kernel trick. The neural network maps a configuration $\mathbf{x}$ to a $k$-dimensional output $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_k(\mathbf{x})]$, where $k$ is the number of slow modes we want to learn. A linear variational method is then applied to obtain the corresponding approximations $\{\hat{\psi}_i\}_{i=1}^{k}$ such that they form an orthonormal basis set that minimizes the network loss function.
The method proceeds as follows. Given a neural network with $k$-dimensional outputs, we feed in a training set of time-lagged configuration pairs $\{(\mathbf{x}_t, \mathbf{x}_{t+\tau})\}$ and train the network with loss function
$$L = \sum_{i=1}^{k} g(\hat{\lambda}_i),$$
where $g$ is a monotonically decreasing function. Minimizing $L$ is equivalent to maximizing the sum over the $\hat{\lambda}_i$ and therefore maximizing the sum over the implied timescales $t_i$ (Eq. 9). The $\hat{\psi}_i$ correspond to linear combinations of the neural network outputs,
$$\hat{\psi}_i(\mathbf{x}) = \sum_{j=1}^{k} s_{ij}\, f_j(\mathbf{x}).$$
Minimization of the loss function by gradient descent requires the derivative of $L$ with respect to the neural network parameters $\theta$,
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{k} \frac{\partial L}{\partial g(\hat{\lambda}_i)} \frac{\partial g(\hat{\lambda}_i)}{\partial \hat{\lambda}_i} \frac{\partial \hat{\lambda}_i}{\partial \theta}.$$
The first two partial derivatives are straightforwardly calculated by automatic differentiation once a choice for $g$ has been made. The third partial derivative requires a little more careful consideration. We first expand this final derivative using Eq. 15 to make explicit the dependence of $\hat{\lambda}_i$ on the correlation matrices $C(0)$ and $C(\tau)$,
$$\frac{\partial \hat{\lambda}_i}{\partial \theta} = \frac{\partial \hat{\lambda}_i}{\partial C(0)} \frac{\partial C(0)}{\partial \theta} + \frac{\partial \hat{\lambda}_i}{\partial C(\tau)} \frac{\partial C(\tau)}{\partial \theta}.$$
To the best of our knowledge, no existing computational graph frameworks provide gradients for generalized eigenvalue problems. Accordingly, we rewrite Eq. 15 as follows. We first apply a Cholesky decomposition to $C(0)$, such that
$$C(0) = L L^\top,$$
where $L$ is a lower triangular matrix. We then left multiply both sides of Eq. 15 by $L^{-1}$ to obtain
$$L^{-1} C(\tau)\, \mathbf{s}_i = \hat{\lambda}_i\, L^\top \mathbf{s}_i.$$
Defining $\tilde{C} = L^{-1} C(\tau) L^{-\top}$ and $\tilde{\mathbf{s}}_i = L^\top \mathbf{s}_i$, we convert the generalized eigenvalue problem into a standard eigenvalue problem with a symmetric matrix,
$$\tilde{C}\, \tilde{\mathbf{s}}_i = \hat{\lambda}_i\, \tilde{\mathbf{s}}_i,$$
where the Cholesky decomposition assures numerical stability. Now, Eq. 41 becomes
$$\frac{\partial \hat{\lambda}_i}{\partial \theta} = \frac{\partial \hat{\lambda}_i}{\partial \tilde{C}} \left[ \frac{\partial \tilde{C}}{\partial C(0)} \frac{\partial C(0)}{\partial \theta} + \frac{\partial \tilde{C}}{\partial C(\tau)} \frac{\partial C(\tau)}{\partial \theta} \right],$$
where all terms are computable using automatic differentiation: $\partial \hat{\lambda}_i / \partial \tilde{C}$ from routines for a symmetric matrix eigenvalue problem; $\partial \tilde{C} / \partial C(0)$ and $\partial \tilde{C} / \partial C(\tau)$ from those for Cholesky decomposition, matrix inversion, and matrix multiplication; and $\partial C(0) / \partial \theta$ and $\partial C(\tau) / \partial \theta$ by applying the chain rule to Eq. 16 and 17 and computing the derivatives through the neural network.
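The reduction of the generalized eigenvalue problem to a standard symmetric one via the Cholesky decomposition can be verified numerically; the matrices below are random stand-ins for $C(0)$ and $C(\tau)$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Random stand-ins: C0 symmetric positive definite, Ctau symmetric.
A = rng.normal(size=(3, 3))
C0 = A @ A.T + 3 * np.eye(3)
B = rng.normal(size=(3, 3))
Ctau = 0.1 * (B + B.T)

# Cholesky decomposition C0 = L L^T ...
L = np.linalg.cholesky(C0)
Linv = np.linalg.inv(L)

# ... converts Ctau s = lam C0 s into the standard symmetric problem
# Ctilde stilde = lam stilde, with Ctilde = L^-1 Ctau L^-T and stilde = L^T s.
Ctilde = Linv @ Ctau @ Linv.T
lams, Stilde = np.linalg.eigh(Ctilde)
S = Linv.T @ Stilde                     # recover s = L^-T stilde

# The recovered pairs satisfy the original generalized eigenvalue equation.
print(np.allclose(Ctau @ S, C0 @ S * lams))  # True
```

Because every operation in this reduction (Cholesky factorization, triangular inversion, matrix products, symmetric eigendecomposition) has a known derivative, the whole pipeline is differentiable end-to-end in standard automatic differentiation frameworks.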
Training is performed by passing time-lagged pairs $(\mathbf{x}_t, \mathbf{x}_{t+\tau})$ to the HDE and updating the network parameters using mini-batch gradient descent with Adam Kingma and Ba (2014), employing the automatic differentiation expression for $\partial L / \partial \theta$ to minimize the loss function. To prevent overfitting, we shuffle all $(\mathbf{x}_t, \mathbf{x}_{t+\tau})$ pairs and reserve a small portion as a validation set with which to implement early stopping. Training is terminated when the validation loss no longer decreases for a pre-specified number of epochs. A schematic diagram of the HDE is shown in Fig. 1.
We note that we do not learn $\hat{\psi}_i$ directly in our neural network, but obtain it as a weighted linear combination of the network outputs $f_j$. Specifically, during training we learn not only the weights of the neural network, but also the linear expansion coefficients $s_{ij}$ that yield $\hat{\psi}_i$ from the $f_j$. Conceptually, one can consider the neural network as learning an optimal nonlinear featurization of the input basis to pass through the VAC framework. After training is complete, a new out-of-sample configuration $\mathbf{x}$ can be passed through the neural network to produce $\mathbf{f}(\mathbf{x})$, which is then transformed via a linear operator to yield the eigenfunction estimates $\hat{\psi}_i(\mathbf{x})$. Since the neural network is fully differentiable and the final transformation is linear, the HDE mapping is explicit and differentiable, making it well suited to applications in enhanced sampling.
An important choice in our network design is the function $g$ within our loss function. In theory it can be any monotonically decreasing function, but motivated by Refs. Mardt et al. (2018) and Wu and Noé (2017), we find that choosing
$$g(\hat{\lambda}_i) = -\hat{\lambda}_i^2,$$
such that the loss function corresponds to the negative VAMP-2 score, yields good performance and possesses strong theoretical grounding. Specifically, the VAMP-2 score may be interpreted as the cumulative kinetic variance Noé and Clementi (2015); Husic and Pande (2018), analogous to the cumulative explained variance in principal component analysis Pearson (1901), but where the VAMP-2 score measures kinetic rather than conformational information. The VAMP-2 score can also be considered to measure the closeness of the approximate eigenfunctions to the true eigenfunctions of the transfer operator, with this score achieving its maximum value when the approximations become exact Mardt et al. (2018). This choice may also be generalized to the VAMP-r score Wu and Noé (2017). We have also observed good performance using $g(\hat{\lambda}_i) = \tau / \ln(\hat{\lambda}_i)$, which corresponds to maximizing the sum of implied timescales (Eq. 9).
The HDE learning protocol is simple and efficient. It requires only $O(N)$ memory and $O(N)$ computation time for a dataset of size $N$, which makes it well suited to large datasets. It also does not require selection of a kernel function to achieve appropriate nonlinear embeddings Sultan et al. (2018); Hernández et al. (2018). Instead we appeal to the universal approximation theorem Hassoun (1995); Chen and Chen (1995), which loosely states that a neural network with one or more hidden layers and a sufficient number of hidden nodes can approximate any nonlinear continuous function without the need to impose a kernel. Lifting the requirement for kernel selection and its attendant hyperparameter tuning is a significant advantage of the present approach over kernel-based techniques. Training such a simple neural network is possible using standard techniques such as stochastic gradient descent and is largely insensitive to the choice of network architecture, hyperparameters, and activation functions. Using a default learning rate, a large batch size, and a sufficiently large network gives excellent results, as we demonstrate below. Furthermore, we use the same number of hidden layers, number of nodes in each hidden layer, and hidden layer activation functions for all four applications in this work, demonstrating the simplicity, robustness, and generalizability of the HDE approach.
III.1 1D 4-well potential
In our first example, we consider a simple 1D 4-well potential $V(x)$ defined in Ref. Schwantes and Pande (2015) and illustrated in Fig. 2. The eigenfunctions of the transfer operator for this system are exactly calculable, and this provides a simple initial demonstration of HDEs. We construct a transition matrix $T$ as a discrete approximation of the transfer operator by dividing the domain into 100 evenly-spaced bins and computing the transition probability of moving from bin $i$ to bin $j$ as
$$T_{ij} = \frac{1}{Z_i}\, e^{-\beta (V_j - V_i)/2},$$
where $V_i$ and $V_j$ are the potential energies at the centers of bins $i$ and $j$ and $Z_i$ is the normalization factor for bin $i$ such that the total transition probability associated with bin $i$ sums to unity. We then define a unit-time transition matrix $T(1) = T$, and a transition matrix of lag time $\tau$ as $T(\tau) = T^\tau$. In this model, we use a fixed lag time $\tau$ for both theoretical analysis and model learning.
By computing the eigenvectors of the transition matrix $T(\tau)$, we recover the equilibrium distribution $\pi$ as the leading left eigenvector. The corresponding eigenfunctions of the transfer operator are given by $\psi_i = \phi_i / \pi$, where $\phi_i$ are the left eigenvectors of the transition matrix. The top four eigenfunctions are shown in Fig. 3, where the first eigenfunction corresponds to the trivial stationary process, and the next three to transitions over the three energy barriers.
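The discrete transfer-operator construction can be sketched as follows; the stand-in multiwell potential, inverse temperature, and the restriction of moves to nearest-neighbor bins are illustrative assumptions rather than the exact model used in this work:

```python
import numpy as np

beta = 1.0
x = np.linspace(-1, 1, 100)             # 100 evenly-spaced bins
V = 2 * np.cos(4 * np.pi * x)           # stand-in multiwell potential (4 minima)

# Transition matrix with moves between adjacent bins only, weighted by
# the Boltzmann factor and row-normalized so each row sums to unity.
n = len(x)
T = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n:
            T[i, j] = np.exp(-beta * (V[j] - V[i]) / 2)
T /= T.sum(axis=1, keepdims=True)

# The left eigenvector with eigenvalue 1 gives the equilibrium distribution
# pi; transfer-operator eigenfunctions then follow as psi_i = phi_i / pi.
lams, phi = np.linalg.eig(T.T)
order = np.argsort(-lams.real)
lams, phi = lams.real[order], phi[:, order].real
pi = phi[:, 0] / phi[:, 0].sum()
psi = phi / pi[:, None]                 # psi[:, 0] is the constant stationary mode

print(np.isclose(lams[0], 1.0), np.all(pi > 0))  # True True
```

Because the chain is tridiagonal it is similar to a symmetric matrix, so its spectrum is real, and the unit eigenvalue and positive stationary distribution follow from the Perron-Frobenius theorem.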
Using the calculated transition matrix $T(\tau)$, we generate a 5,000,000-step trajectory over the 1D landscape by initializing the system within a randomly selected bin and then propagating its dynamics forward through time under the action of $T(\tau)$ Schwantes and Pande (2015). The state of the system at any time is represented by the 1D $x$-coordinate of the particle. We then adopt a fixed lag time and learn the top three non-trivial eigenfunctions using both kTICA and HDE. We note that for kTICA, we cannot compute the full 5,000,000-by-5,000,000 Gram matrix, so instead we select 200 landmarks by K-means clustering and use the landmark approximation with a Gaussian kernel. In contrast, the HDE has no difficulty processing all data points. We use two hidden layers with 100 neurons each, giving a final architecture of [1, 100, 100, 3]. The same activation function is used for all layers, and we employ a VAMP-2 loss function. The HDE network is constructed and trained within Keras Chollet ; Chollet et al. (2018).
The results for kTICA and HDE are shown in Fig. 4. We find that both methods are in excellent agreement with the theoretical eigenfunctions. The small deviation between the estimated timescales for both methods and the theoretical timescales is a result of noise in the simulated data. This result demonstrates the capacity of HDEs to recover the eigenfunctions of a simple 1D system in quantitative agreement with the theoretical results and excellent correspondence with kTICA.
III.2 Ring potential
We now consider the more complicated example of a 2D modified ring potential $V(x, y)$. This potential contains a narrow ring-shaped potential valley in which $V = 0$ and four barriers of heights 1.0, 1.3, 0.5, and 8.0 $k_BT$. The expression for the potential is given by Eq. 47, and it is plotted in Fig. 5.
We use the same procedure outlined in the previous example to generate the theoretical eigenfunctions of the transfer operator and to simulate a trajectory using a Markov state model. The region of interest is discretized into 50-by-50 bins. In this model, we use a fixed lag time $\tau$ for both theoretical analysis and model learning. The transition probability of moving from bin $i$ to bin $j$ is given by Eq. 48, where $Z_i$ is the normalization factor for bin $i$ such that the total transition probability associated with bin $i$ sums to unity.
Once again we compute the first three non-stationary theoretical eigenfunctions of $\mathcal{T}(\tau)$ from the transition matrix and illustrate these in Fig. 6. We then numerically simulate a 5,000,000-step trajectory over the 2D landscape under the action of the transition matrix, and pass these data to a 1000-landmark kTICA model employing a Gaussian kernel and to an HDE with the same hidden layer architecture and loss function as in the previous example. The state of the system at any time is defined by the 2D $(x, y)$-coordinate pair representing the particle location. Again, small deviations between the estimated and theoretical timescales should be expected due to noise and the finite length of the simulated data.
The kTICA results employing the optimal bandwidth $\sigma$ of the Gaussian kernel are shown in Fig. 6. Although kTICA gives a reasonable approximation of the eigenfunctions within the ring where data are densely populated, the agreement outside of the ring is much worse. This is due to an intrinsic limitation of the Gaussian kernel: a small $\sigma$ leads to an accurate representation near the landmarks but poor predictions for regions far away, while a large $\sigma$ produces better predictions far away at the expense of local accuracy. Moreover, the kTICA results depend sensitively on both the number of landmarks and the bandwidth of the Gaussian kernel. In Fig. 7 we report the test loss given by Eq. 38 on a dynamical trajectory of length 1,000,000 for kTICA models with different kernel bandwidths and numbers of landmarks selected by K-means clustering. The approximations of the leading non-trivial eigenfunctions are reported in Fig. A1 in the Appendix. We note that only when we use a large number of landmarks can we achieve reasonable results, which leads to expensive computations in both landmark selection and calculation of the Gram matrix. Moreover, even with a large number of landmarks, the range of $\sigma$ values for which satisfactory results are obtained is still quite small and requires substantial tuning.
In contrast, the HDE with architecture [2, 100, 100, 3] shows excellent agreement with the theoretical eigenfunctions without any tuning of the network architecture, activation functions, or loss function. The HDE eigenvalues closely match those extracted by kTICA, and the HDE eigenfunctions show superior agreement with the theoretical results compared to kTICA. This result demonstrates the capacity of HDEs to leverage the flexibility and robustness of neural networks in approximating arbitrary continuous functions over compact sets.
III.3 Alanine dipeptide
Having demonstrated HDEs on toy models, we now consider their application to alanine dipeptide in water as a simple but realistic application to molecular data. Alanine dipeptide (N-acetyl-L-alanine-N-methylamide) is a simple 22-atom peptide that stands as a standard test system for new biomolecular simulation methods. The molecular structure of alanine dipeptide annotated with the four backbone dihedral angles that largely dictate its configurational state is presented in Fig. 8. A 200 ns simulation of alanine dipeptide in TIP3P water and modeled using the Amber99sb-ILDN forcefield was conducted at T = 300 K and P = 1 bar using the OpenMM 7.3 simulation suite Eastman et al. (2012, 2017). Lennard-Jones interactions were switched smoothly to zero at a 1.4 nm cutoff, and electrostatics treated using particle-mesh Ewald with a real space cutoff of 1.4 nm and a reciprocal space grid spacing of 0.12 nm. Configurations were saved every 2 ps to generate a trajectory comprising 1,000,000 configurations. The instantaneous state of the peptide is represented by the Cartesian coordinates of its 22 atoms, where the influence of the solvent molecules is treated implicitly through their influence on the peptide configuration. In this case the theoretical eigenfunctions of the transfer operator are unavailable, and we instead compare the HDE results against those of kTICA.
The 45 pairwise distances between the 10 heavy atoms were used as features with which to perform kTICA, employing a Gaussian kernel, 5000 landmarks obtained from K-means clustering, and a lag time of $\tau$ = 20 ps. Intramolecular pairwise distances or the contact matrix are favored within biomolecular simulations as an internal coordinate frame representation that is invariant to translational and rotational motions Sittel et al. (2014). The leading three eigenfunctions discovered by kTICA employing a manually tuned kernel bandwidth are shown in Fig. 9, superposed upon the Ramachandran plot in the backbone $\phi$ and $\psi$ torsional angles that are known to be good discriminators of the molecular metastable states Ferguson et al. (2011); Ren et al. (2005); Laio and Parrinello (2002); Barducci et al. (2008); Bolhuis et al. (2000); Valsson and Parrinello (2014); Stamati et al. (2010). The timescales of the fourth and higher-order modes lie below the $\tau$ = 20 ps lag time and so cannot be resolved by this model. Accordingly, we select the three leading slow modes for analysis. From Fig. 9 it is apparent that the first slow mode captures transitions along the $\phi$ torsion, while the second and third characterize transitions between metastable basins separated along the $\psi$ torsion.
The trajectory is then analyzed at the same lag time using a [45, 100, 100, 3] HDE employing the same hidden layer architecture and loss function as the previous examples. The HDE eigenfunctions illustrated in Fig. 9 are in excellent agreement with those learned by kTICA, but importantly the implied timescale of the HDE leading mode is 17% slower than that extracted by kTICA. What is the origin of this discrepancy?
The current state-of-the-art methodology to approximate the eigenfunctions of the transfer operator is to construct a Markov state model (MSM) over a microstate decomposition furnished by TICA Pande et al. (2010); Schwantes and Pande (2013); Husic and Pande (2018); Trendelkamp-Schroer et al. (2015); Sultan and Pande (2017); Mittal and Shukla (2018); Harrigan et al. (2017); Wehmeyer et al. (2018); Scherer et al. (2018). Applying TICA to the 45 pairwise distances between the heavy atoms, we construct an MSM using PyEMMA Scherer et al. (2015), and present the results in Fig. 9. We see that the MSM eigenfunctions are in excellent accord with those learned by HDEs and kTICA; however, while HDEs nearly exactly match the implied timescales of the MSM and the results reported in Ref. Trendelkamp-Schroer et al. (2015), kTICA substantially underestimates the timescale of the slowest mode.
The underlying reason for this failure is that the spatial resolution in feature space is limited by the number of landmarks, and if the Euclidean distance in feature space does not closely correspond to kinetic distance then much finer spatial resolution is required to resolve the correct slow modes. For alanine dipeptide, pairwise distances between heavy atoms are not well correlated with the slowest dynamical modes – here, rotations around backbone dihedrals – and therefore landmark kTICA models built on these features have difficulty resolving the slowest mode. This issue may be alleviated by employing more landmarks, but it quickly becomes computationally intractable on commodity hardware to use significantly more than approximately 5000 landmarks. Another option is to use features that are better correlated with the slow modes. For alanine dipeptide, it is known that the backbone dihedrals are good features, and if we perform kTICA using these input features we do achieve much better estimates of the implied timescale of the leading mode ($t_2$ = 1602 ps). In general, however, a good feature set is not known a priori, and for poor choices it is typically not possible to obtain good performance even with large numbers of landmarks.
Our numerical investigations also show that the implied timescales extracted by HDEs are very robust to the particular choice of lag time. The reliable inference of implied timescales from MSMs requires that they be converged with respect to lag time, and slow convergence presents an impediment to the realization of high time-resolution MSMs. The yellow bars in Fig. 10 present the implied timescales of the leading three eigenfunctions computed from the mean over five HDEs constructed with lag times of τ = 10, 40, and 100 ps. It is clear that the implied timescales are robust to the choice of lag time over a relatively large range.
Given the implied timescales evaluated at four different lag times – τ = 10, 20, 40, and 100 ps – we have an opportunity to assess the dynamical robustness of HDEs by subjecting them to a variant of the Chapman-Kolmogorov test employed in the construction and validation of MSMs Mardt et al. (2018). As discussed in Section II.1, the VAC framework is founded on the Markovian assumption that the transfer operator is homogeneous in time. In the previous two toy examples, this assumption holds by construction due to the way the trajectory data were generated. For a real system like alanine dipeptide, however, there is no such guarantee. Here we test a necessary condition for the Markovian assumption to hold. In general we have

$$\mathcal{T}(n\tau) = \mathcal{T}_{(n-1)\tau}(\tau) \circ \cdots \circ \mathcal{T}_{\tau}(\tau) \circ \mathcal{T}_{0}(\tau),$$

where $\mathcal{T}_{t}(\tau)$ denotes the transfer operator propagating the system from time $t$ to time $t+\tau$.
If the Markovian assumption holds, then the transfer operators are independent of the starting time $t$, such that

$$\mathcal{T}(n\tau) = \left[\mathcal{T}(\tau)\right]^{n}.$$
The corresponding eigenvalues and eigenfunctions are

$$\hat{\lambda}_i(n\tau) = \lambda_i(\tau)^{n}, \qquad \hat{\psi}_i(n\tau) = \psi_i(\tau),$$

where $\hat{\lambda}_i(n\tau)$ and $\hat{\psi}_i(n\tau)$ are the estimated eigenvalues and eigenfunctions of $\mathcal{T}(n\tau)$, and $\lambda_i(\tau)$ and $\psi_i(\tau)$ those of $\mathcal{T}(\tau)$. Appealing to Eq. 50, it follows that
$$\hat{t}_i(n\tau) = -\frac{n\tau}{\ln \hat{\lambda}_i(n\tau)} = -\frac{\tau}{\ln \lambda_i(\tau)} = t_i(\tau),$$

providing a means to compare the consistency of HDEs constructed at different choices of τ. In particular, the implied timescales for the eigenfunctions of $\mathcal{T}(n\tau)$ estimated from an HDE constructed at a lag time $n\tau$ should be well approximated by those estimated from an HDE constructed at a lag time $\tau$. If this is not the case, then the assumption of Markovianity is invalidated for this choice of lag time.
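The identity above is straightforward to verify numerically: propagating an eigenvalue to a longer lag via the Markov property leaves the implied timescale unchanged. A short self-contained check (the eigenvalue 0.95 and base lag of 20 ps are illustrative values, not data from the paper):

```python
import numpy as np

def implied_timescale(eigval, lag):
    """Implied timescale t = -lag / ln(lambda) for a transfer-operator eigenvalue."""
    return -lag / np.log(eigval)

tau = 20.0           # base lag time (ps), illustrative
lam = 0.95           # hypothetical leading non-stationary eigenvalue at lag tau

for n in (1, 2, 5):
    # Markov property: lambda_i(n*tau) = lambda_i(tau)**n
    lam_n = lam ** n
    # The implied timescale is then invariant: t_i(n*tau) == t_i(tau)
    assert np.isclose(implied_timescale(lam_n, n * tau),
                      implied_timescale(lam, tau))
```

Deviations from this invariance at some lag time signal a breakdown of the Markovian assumption at that resolution, which is precisely what the Chapman-Kolmogorov test probes.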
We present in Fig. 10 the predicted implied timescales over the range of lag times τ = 2–120 ps calculated from an HDE constructed at a lag time of τ = 20 ps. These predictions are in excellent accord with the implied timescales directly computed from HDEs constructed at lag times of τ = 10, 40, and 100 ps, demonstrating satisfaction of the Chapman-Kolmogorov test and the Markovian property of the system over this range of lag times Mardt et al. (2018); Wehmeyer et al. (2018); Chodera et al. (2006); Pande et al. (2010); Husic and Pande (2018).
III.4 WW domain
Our final example considers a 1137 μs simulation of the folding dynamics of the 35-residue WW domain protein performed in Ref. Lindorff-Larsen et al. (2011). We use the 595 pairwise distances between all Cα atoms to train a TICA-based MSM and a [595, 100, 100, 2] HDE with the same hidden layer architecture and loss function as all previous examples. We use a lag time of τ = 400 ns (2000 steps) for both models and focus on the two slowest modes; the implied timescales of higher-order modes lie close to or below the lag time and so cannot be reliably resolved by this model. The slow modes discovered by the TICA-based MSM and the HDE are shown in Fig. 11a projected onto the two leading TICA eigenfunctions (tIC1, tIC2) to furnish a consistent basis for comparison. The MSM and HDE eigenfunctions are in excellent agreement, exhibiting Pearson correlation coefficients of ρ = 0.99 and 0.98 for the top two modes, respectively. The implied timescales inferred by the two methods are also in good agreement.
An important aspect of model quality is the convergence rate of the implied timescales with respect to the lag time τ. The lag time must be sufficiently large that the state decomposition is Markovian – dynamical mixing within states is faster than the lag time and interconversion between states is slower – but it is desirable that the lag time be as short as possible to produce a model with high time resolution Pande et al. (2010). Better approximations of the leading eigenfunctions of the transfer operator typically lead to convergence of the implied timescales at shorter lag times. We construct 10 independent HDE models and 10 independent TICA-based MSMs over the WW domain trajectory data and report in Fig. 11b the mean and 95% confidence intervals of the implied timescale convergence with lag time. The HDE exhibits substantially faster convergence than the TICA-based MSM, particularly in the second eigenfunction. This suggests that the eigenfunctions identified by the HDE, although strongly linearly correlated with those identified by the TICA-based MSM, provide a better approximation to the leading slow modes of the transfer operator and produce a superior state decomposition. We attribute this observation to the ability of the HDE to capture complex nonlinear relationships in the data within a continuous state representation. The MSM, by contrast, depends on TICA coordinates founded on a linear variational approximation to the transfer operator eigenfunctions, which subsequently inform the construction of an inherently discretized microstate transition matrix from which the MSM eigenvectors are computed. This result demonstrates the viability of HDEs to furnish a high-resolution model of the slow system dynamics without the need for any system discretization and at a higher time resolution (i.e., lower lag time) than is possible with the current state-of-the-art TICA-based MSM protocol.
In this work we proposed a new framework, termed the hierarchical dynamics encoder (HDE), for the discovery of a hierarchy of nonlinear slow modes of a dynamical system from trajectory data. The framework is built on transfer operator theory and uses a flexible neural network to learn an optimal nonlinear basis from the input representation of the system. Compared to kernel TICA and variational dynamics encoders, the HDE framework has several advantages. It is capable of simultaneously learning an arbitrary number of eigenfunctions while guaranteeing their orthogonality. It requires only O(N) memory and computation time for a dataset of size N, which makes it amenable to large data sets such as those commonly encountered in biomolecular simulations. The neural network architecture does not require the selection of a kernel function or the adjustment of hyperparameters that can strongly affect the quality of the results and be tedious and challenging to tune Sultan et al. (2018); Hernández et al. (2018). In fact, we find that training such a fully-connected feed-forward neural network is straightforward, cheap, and insensitive to batch size, learning rate, and architecture. Finally, the HDE is a parametric model that provides an explicit and differentiable mapping from a configuration x to the learned approximations of the leading eigenfunctions of the transfer operator. These slow collective variables are then ideally suited for use in collective variable-based enhanced sampling methods.
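On top of the learned neural basis, the linear variational step an HDE performs reduces to a generalized eigenvalue problem between the instantaneous and time-lagged covariance matrices of the basis outputs. The following sketch illustrates that step with a whitening-based solver; the function name, the symmetrized reversible covariance estimator, and the whitening route are our illustrative assumptions, not the exact implementation in the released code:

```python
import numpy as np

def vac_modes(chi_0, chi_tau):
    """Solve C_0t v = lambda C_00 v for mean-free basis time series.

    chi_0, chi_tau: (T, k) arrays of basis-function values at times t and t + tau.
    Returns eigenvalues (descending) and the expansion coefficients of the
    approximate transfer-operator eigenfunctions in the basis.
    """
    chi_0 = chi_0 - chi_0.mean(axis=0)
    chi_tau = chi_tau - chi_tau.mean(axis=0)
    n = chi_0.shape[0]
    C00 = chi_0.T @ chi_0 / n                                # instantaneous covariance
    C0t = (chi_0.T @ chi_tau + chi_tau.T @ chi_0) / (2 * n)  # symmetrized time-lagged covariance
    # Whiten with C00^{-1/2}, then diagonalize the resulting symmetric matrix.
    s, U = np.linalg.eigh(C00)
    W = U @ np.diag(s ** -0.5) @ U.T
    evals, V = np.linalg.eigh(W @ C0t @ W)
    order = np.argsort(evals)[::-1]                          # slowest modes first
    return evals[order], W @ V[:, order]
```

A basis feature with lag-τ autocorrelation λ yields a leading eigenvalue of λ, from which the implied timescale follows as -τ/ln λ; in the HDE the basis itself is optimized by backpropagating through this eigenvalue spectrum.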
The HDE framework possesses close connections with a number of existing methodologies. In the one-dimensional limit, HDEs are formally equivalent to variational dynamics encoders (VDEs) with an exclusive autocorrelation loss, subject to Gaussian noise Hernández et al. (2018). VDEs, however, cannot currently generalize to multiple dimensions due to the lack of an orthogonality constraint on the learned eigenfunctions. By using the more general VAMP principle for non-reversible processes and targeting membership state probabilities rather than learning continuous functions, VAMPNets are obtained Mardt et al. (2018).
With regard to the analysis of molecular simulation trajectories, we anticipate that the flexibility and high time resolution of HDE models will help resolve and understand the important long-time conformational changes governing biomolecular folding and function. Moreover, it is straightforward to replace TICA-based MSMs with HDE-based MSMs, maintaining the large body of theoretical and practical understanding of MSM construction while delivering the advantages of HDEs in improved approximation of the slow modes and superior microstate decomposition. With regard to enhanced sampling in molecular simulation, the differentiable nature of HDE coordinates naturally enables biasing along the HDE collective variables (CVs) using well-established accelerated sampling techniques such as umbrella sampling, metadynamics, and adaptive biasing force. The efficiency of these techniques depends crucially on the choice of “good” CVs coincident with the important underlying dynamical modes governing the long-time evolution of the system. A number of recent works have employed neural networks to learn nonlinear CVs describing the directions of highest variance within the data Chen and Ferguson (2017); Chen et al. (2018); Ribeiro et al. (2018); Ribeiro and Tiwary (2018). However, the directions of highest variance are not guaranteed to correspond to the slow directions of the dynamics. To date, only variational dynamics encoders have been used to learn and bias sampling in a slow CV Sultan et al. (2018), but, as observed above, the VDE is limited to approximating only the leading eigenfunction of the transfer operator. HDEs open the door to performing accelerated sampling within the full spectrum of all relevant eigenfunctions of the transfer operator.
In a similar vein, HDEs may also be profitably incorporated into adaptive sampling approaches that do not apply artificial biasing forces, but rather smartly initialize short unbiased simulations on the edge of the explored domain Shamsi et al. (2018); Preto and Clementi (2014); Weber and Pande (2011); Zimmerman and Bowman (2015); Bowman et al. (2010); Hinrichs and Pande (2007); Doerr and De Fabritiis (2014). The dynamically meaningful projections of the data onto the HDE collective variables are anticipated to better resolve the dynamical frontier than dimensionality reduction approaches based on maximal preservation of variance, and therefore to better direct sampling along the slow conformational pathways. In sum, we expect HDEs to find a variety of applications not just in the context of molecular simulations, but also more broadly within the analysis of dynamical systems.
This material is based upon work supported by the National Science Foundation under Grant No. CHE-1841805. H.S. acknowledges support from the Molecular Software Sciences Institute (MolSSI) Software Fellows program (NSF grant ACI-1547580) Krylov et al. (2018); Wilkins-Diehr and Crawford (2018). We are grateful to D.E. Shaw Research for sharing the WW domain simulation trajectories.
Software to perform slow mode discovery, along with an API compatible with scikit-learn and Keras, Jupyter notebooks to reproduce the results presented in this paper, documentation, and examples are freely available under the MIT license at https://github.com/hsidky/hde.
- Noé and Nuske (2013) F. Noé and F. Nuske, Multiscale Modeling & Simulation 11, 635 (2013).
- Nüske et al. (2014) F. Nüske, B. G. Keller, G. Pérez-Hernández, A. S. J. S. Mey, and F. Noé, Journal of Chemical Theory and Computation 10, 1739 (2014).
- Noé and Clementi (2017) F. Noé and C. Clementi, Current Opinion in Structural Biology 43, 141 (2017).
- Klus et al. (2018) S. Klus, F. Nüske, P. Koltai, H. Wu, I. Kevrekidis, C. Schütte, and F. Noé, Journal of Nonlinear Science 28, 985 (2018).
- Pérez-Hernández et al. (2013) G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, The Journal of Chemical Physics 139, 07B604_1 (2013).
- Noé and Clementi (2015) F. Noé and C. Clementi, Journal of Chemical Theory and Computation 11, 5002 (2015).
- Noé et al. (2016) F. Noé, R. Banisch, and C. Clementi, Journal of Chemical Theory and Computation 12, 5620 (2016).
- Pérez-Hernández and Noé (2016) G. Pérez-Hernández and F. Noé, Journal of Chemical Theory and Computation 12, 6118 (2016).
- Schwantes and Pande (2013) C. R. Schwantes and V. S. Pande, Journal of Chemical Theory and Computation 9, 2000 (2013).
- Husic and Pande (2018) B. E. Husic and V. S. Pande, Journal of the American Chemical Society 140, 2386 (2018).
- Schwantes and Pande (2015) C. R. Schwantes and V. S. Pande, Journal of Chemical Theory and Computation 11, 600 (2015).
- Harrigan and Pande (2017) M. P. Harrigan and V. S. Pande, bioRxiv , 123752 (2017).
- M. Sultan and Pande (2017) M. M. Sultan and V. S. Pande, Journal of Chemical Theory and Computation (2017).
- Hassoun (1995) M. H. Hassoun, Fundamentals of Artificial Neural Networks (MIT press, 1995).
- Chen and Chen (1995) T. Chen and H. Chen, IEEE Transactions on Neural Networks 6, 911 (1995).
- Bastien et al. (2012) F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, arXiv preprint arXiv:1211.5590 (2012).
- Baydin et al. (2018) A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, Journal of Machine Learning Research 18, 1 (2018).
- Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, (2017).
- Schmidhuber (2015) J. Schmidhuber, Neural Networks 61, 85 (2015).
- Wehmeyer and Noé (2018) C. Wehmeyer and F. Noé, The Journal of Chemical Physics 148, 241703 (2018).
- Hernández et al. (2018) C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, and V. S. Pande, Physical Review E 97, 062412 (2018).
- Sultan et al. (2018) M. M. Sultan, H. K. Wayment-Steele, and V. S. Pande, Journal of Chemical Theory and Computation 14, 1887 (2018).
- Mardt et al. (2018) A. Mardt, L. Pasquali, H. Wu, and F. Noé, Nature Communications 9, 5 (2018).
- Schütte et al. (2001) C. Schütte, W. Huisinga, and P. Deuflhard, in Ergodic Theory, Analysis, and Efficient Simulation of Dynamical Systems (Springer, 2001) pp. 191–223.
- Prinz et al. (2011) J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé, The Journal of Chemical Physics 134, 174105 (2011).
- Wu and Noé (2017) H. Wu and F. Noé, arXiv preprint arXiv:1707.04659 (2017).
- Szabo and Ostlund (2012) A. Szabo and N. S. Ostlund, Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory (Courier Corporation, 2012).
- Watkins (2007) D. S. Watkins, The Matrix Eigenvalue Problem: GR and Krylov Subspace Methods, Vol. 101 (SIAM, 2007).
- Kingma and Ba (2014) D. P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014).
- Pearson (1901) K. Pearson, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559 (1901).
- (31) F. Chollet, “Keras,” https://github.com/fchollet/keras.
- Chollet et al. (2018) F. Chollet et al., Astrophysics Source Code Library (2018).
- Eastman et al. (2012) P. Eastman, M. S. Friedrichs, J. D. Chodera, R. J. Radmer, C. M. Bruns, J. P. Ku, K. A. Beauchamp, T. J. Lane, L.-P. Wang, and D. Shukla, Journal of Chemical Theory and Computation 9, 461 (2012).
- Eastman et al. (2017) P. Eastman, J. Swails, J. D. Chodera, R. T. McGibbon, Y. Zhao, K. A. Beauchamp, L.-P. Wang, A. C. Simmonett, M. P. Harrigan, and C. D. Stern, PLOS Computational Biology 13, e1005659 (2017).
- Sittel et al. (2014) F. Sittel, A. Jain, and G. Stock, The Journal of Chemical Physics 141, 07B605_1 (2014).
- Ferguson et al. (2011) A. L. Ferguson, A. Z. Panagiotopoulos, P. G. Debenedetti, and I. G. Kevrekidis, The Journal of Chemical Physics 134, 04B606 (2011).
- Ren et al. (2005) W. Ren, E. Vanden-Eijnden, P. Maragakis, and W. E, The Journal of Chemical Physics 123, 134109 (2005).
- Laio and Parrinello (2002) A. Laio and M. Parrinello, Proceedings of the National Academy of Sciences 99, 12562 (2002).
- Barducci et al. (2008) A. Barducci, G. Bussi, and M. Parrinello, Physical Review Letters 100, 020603 (2008).
- Bolhuis et al. (2000) P. G. Bolhuis, C. Dellago, and D. Chandler, Proceedings of the National Academy of Sciences 97, 5877 (2000).
- Valsson and Parrinello (2014) O. Valsson and M. Parrinello, Physical Review Letters 113, 090601 (2014).
- Stamati et al. (2010) H. Stamati, C. Clementi, and L. E. Kavraki, Proteins: Structure, Function, and Bioinformatics 78, 223 (2010).
- Humphrey et al. (1996) W. Humphrey, A. Dalke, and K. Schulten, Journal of Molecular Graphics 14, 33 (1996).
- Pande et al. (2010) V. S. Pande, K. Beauchamp, and G. R. Bowman, Methods 52, 99 (2010).
- Trendelkamp-Schroer et al. (2015) B. Trendelkamp-Schroer, H. Wu, F. Paul, and F. Noé, The Journal of Chemical Physics 143, 11B601_1 (2015).
- Sultan and Pande (2017) M. M. Sultan and V. S. Pande, The Journal of Physical Chemistry B (2017).
- Mittal and Shukla (2018) S. Mittal and D. Shukla, Molecular Simulation 44, 891 (2018).
- Harrigan et al. (2017) M. P. Harrigan, M. M. Sultan, C. X. Hernández, B. E. Husic, P. Eastman, C. R. Schwantes, K. A. Beauchamp, R. T. McGibbon, and V. S. Pande, Biophysical Journal 112, 10 (2017).
- Wehmeyer et al. (2018) C. Wehmeyer, M. K. Scherer, T. Hempel, B. E. Husic, S. Olsson, and F. Noé, Living Journal of Computational Molecular Science 1, 5965 (2018).
- Scherer et al. (2018) M. K. Scherer, B. E. Husic, M. Hoffmann, F. Paul, H. Wu, and F. Noé, arXiv preprint arXiv:1811.11714 (2018).
- Scherer et al. (2015) M. K. Scherer, B. Trendelkamp-Schroer, F. Paul, G. Pérez-Hernández, M. Hoffmann, N. Plattner, C. Wehmeyer, J.-H. Prinz, and F. Noé, Journal of Chemical Theory and Computation 11, 5525 (2015).
- Chodera et al. (2006) J. D. Chodera, W. C. Swope, J. W. Pitera, and K. A. Dill, Multiscale Modeling & Simulation 5, 1214 (2006).
- Lindorff-Larsen et al. (2011) K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, Science 334, 517 (2011).
- Chen and Ferguson (2017) W. Chen and A. L. Ferguson, arXiv preprint arXiv:1801.00203 (2017).
- Chen et al. (2018) W. Chen, A. R. Tan, and A. L. Ferguson, The Journal of Chemical Physics 149, 072312 (2018).
- Ribeiro et al. (2018) J. M. L. Ribeiro, P. Bravo, Y. Wang, and P. Tiwary, The Journal of Chemical Physics 149, 072301 (2018).
- Ribeiro and Tiwary (2018) J. M. L. Ribeiro and P. Tiwary, Journal of Chemical Theory and Computation (2018).
- Shamsi et al. (2018) Z. Shamsi, K. J. Cheng, and D. Shukla, The Journal of Physical Chemistry B 122, 8386 (2018).
- Preto and Clementi (2014) J. Preto and C. Clementi, Physical Chemistry Chemical Physics 16, 19181 (2014).
- Weber and Pande (2011) J. K. Weber and V. S. Pande, Journal of Chemical Theory and Computation 7, 3405 (2011).
- Zimmerman and Bowman (2015) M. I. Zimmerman and G. R. Bowman, Journal of Chemical Theory and Computation 11, 5747 (2015).
- Bowman et al. (2010) G. R. Bowman, D. L. Ensign, and V. S. Pande, Journal of Chemical Theory and Computation 6, 787 (2010).
- Hinrichs and Pande (2007) N. S. Hinrichs and V. S. Pande, The Journal of Chemical Physics 126, 244101 (2007).
- Doerr and De Fabritiis (2014) S. Doerr and G. De Fabritiis, Journal of Chemical Theory and Computation 10, 2064 (2014).
- Krylov et al. (2018) A. Krylov, T. L. Windus, T. Barnes, E. Marin-Rimoldi, J. A. Nash, B. Pritchard, D. G. Smith, D. Altarawy, P. Saxe, C. Clementi, et al., The Journal of Chemical Physics 149, 180901 (2018).
- Wilkins-Diehr and Crawford (2018) N. Wilkins-Diehr and T. D. Crawford, Computing in Science & Engineering 20, 26 (2018).