Variational Selection of Features for Molecular Kinetics

by   Martin K. Scherer, et al.
Freie Universität Berlin

The modeling of atomistic biomolecular simulations using kinetic models such as Markov state models (MSMs) has had many notable algorithmic advances in recent years. The variational principle has opened the door for a nearly fully automated toolkit for selecting models that predict the long-time kinetics from molecular dynamics simulations. However, one yet-unoptimized step of the pipeline involves choosing the features, or collective variables, from which the model should be constructed. In order to build intuitive models, these collective variables are often sought to be interpretable and familiar features, such as torsional angles or contact distances in a protein structure. However, previous approaches for evaluating the chosen features rely on constructing a full MSM, which in turn requires additional hyperparameters to be chosen, and hence leads to a computationally expensive framework. Here, we present a method to optimize the feature choice directly, without requiring the construction of the final kinetic model. We demonstrate our rigorous preprocessing algorithm on a canonical set of twelve fast-folding protein simulations, and show that our procedure leads to more efficient model selection.



There are no comments yet.


page 5

page 7

page 10

page 11


VAMPnets: Deep learning of molecular kinetics

Here we develop a deep learning framework for molecular kinetics from mo...

Collective variable discovery in the age of machine learning: reality, hype and everything in between

Understanding kinetics and thermodynamics profile of biomolecules is nec...

Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations

Markov state models (MSMs) and Master equation models are popular approa...

Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models

We present a machine learning framework for modeling protein dynamics. O...

Transferable neural networks for enhanced sampling of protein dynamics

Variational auto-encoder frameworks have demonstrated success in reducin...

Automated design of collective variables using supervised machine learning

Selection of appropriate collective variables for enhancing sampling of ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

The first step in analyzing the states, equilibrium behavior, or kinetics of complex molecules based on molecular dynamics (MD) simulations is typically the choice of a suitable set of features describing the atomic configurations. This choice is particularly important when the goal is to compute kinetic quantities, such as transition rates or committor probabilities, as these quantities are sensitive to resolving the transitions between the long-lived (metastable) states 

Hänggi and Talkner (1983); Shalloway (1996); Du et al. (1998); Rohrdanz, Zheng, and Clementi (2013); Prinz, Chodera, and Noé (2014); Noé and Clementi (2017). Typically, these transformations input the raw Cartesian coordinates produced from the MD simulation and output a new set of coordinates that is translation- and rotation-invariant. In proteins, biologically-motivated choices include the backbone dihedral angles (torsions), or the pairwise distances between all amino acid residues taken between -carbons or the closest pairs of (heavy) atoms. Many such coordinate sets for proteins are imaginable, such as further transformations of the aforementioned contact distances, the solvent-accessible surface area (SASA) of protein residues or other groups, or sidechain torsion angles.

A number of kinetic analysis frameworks critically depend on the set of features used as input, in particular Markov state models (MSMs) Schütte et al. (1999); Swope, Pitera, and Suits (2004); Singhal, Snow, and Pande (2004); Noé et al. (2007); Chodera et al. (2007); Prinz et al. (2011), master equation models Sriraman, Kevrekidis, and Hummer (2005); Buchete and Hummer (2008), diffusion maps Rohrdanz et al. (2011), and methods to select optimal reaction coordinates Peters (2006); E and Vanden-Eijnden (2006). The analysis framework employed affects what is a meaningful definition of optimality that can be used to select input features.

Here we discuss optimal feature selection in particular with MSMs in mind. In the first generation of MSMs, whose aim was to construct a Markov chain between a few metastable states that partition state space 

Schütte et al. (1999); Swope, Pitera, and Suits (2004); Noé et al. (2007); Chodera et al. (2007), it was already noted that the accuracy with which MSMs could make long-time predictions depended critically on the choice of features. Utilized featurization methods included torsion angles Hummer (2005); Chodera et al. (2006); Noé et al. (2007); Kube and Weber (2007), principal components in torsion or Cartesian coordinates Altis et al. (2007); Noé et al. (2009), contact pairs Jayachandran et al. (2006); Yang and Roux (2008), as well as other transformations involving such attributes as secondary structure Buchete and Hummer (2008); Muff and Caflisch (2008)

. From these representations, clustering methods can be used to determine the MSM states, which can be subsequently checked for adherence to the Markovian approximation. The second generation of MSMs is characterized by the finding that the predictive power of an MSM depends upon its states being chosen such that a good discretization of the eigenfunctions of the Markov operator is obtained 

Sarich, Noé, and Schütte (2010); Prinz et al. (2011). These eigenfunctions are collective variables that indicate the rare-event processes connecting the metastable states, and the task of discretizing them well translates into the task of using input features that allow to resolve them.

At this time, we mainly had visual diagnostic tools to assess the predictive performance of MSMs, such as the implied timescales test Swope, Pitera, and Suits (2004) and the Chapman-Kolmogorow test Prinz et al. (2011). The choice of the hyperparameters of MSMs, such as number of states and the set of input features, remained a trial and error procedure. This changed in 2013, when Noé and Nüske presented a variational principle that quantifies how well a given set of coordinates, features or a given MSM resolve the Markov operator eigenfunctions, and thus the slowest processes Noé and Nüske (2013); Nüske et al. (2014)

. This variational approach to conformational dynamics (VAC) has been highly developed in the past five years: time-lagged independent component analysis (TICA), an algorithm devised in machine learning 

Molgedey and Schuster (1994), has been shown to be the optimal linear approximator to the Markov operator eigenfunctions Pérez-Hernández et al. (2013); an embedding of the eigenfunctions approximated by VAC or TICA into a kinetic map has been proposed, in which distance are related to transition times Noé and Clementi (2015); Noé, Banisch, and Clementi (2016); and hierarchical Pérez-Hernández and Noé (2016)

and kernel-based estimators 

Schwantes and Pande (2015); Harmeling et al. (2003) have been proposed, and methods to estimate VAC/TICA from short off-equilibrium trajectories Wu et al. (2017). Recently, VAC has been generalized to the variational approach for Markov processes (VAMP), which can accommodate nonreversible dynamics Wu and Noé (2017).

A key insight is that the variational principle defines a score which can be used with standard machine learning approaches for hyperparameter selection of MSMs. This was pioneered in Ref. McGibbon and Pande, 2015

which shows that using VAC with cross-validation is a tool for selecting the statistically optimal number of MSM states. Using a VAC-derived kinetic variance as a score, optimal feature selection was discussed in Refs. 

Scherer et al., 2015. The entire MSM pipeline was subject to VAC-optimization in Ref. Husic et al., 2016, revealing general trends in what makes a good MSM for fast protein folding. VAMP was used for hyperparameter optimization Wu and Noé (2017); Mardt et al. (2018)

, and in order to define the loss functions for VAMPnets, a deep learning method to infer MSMs from data 

Mardt et al. (2018).

The construction of an MSM involves (a) choosing an appropriate transformation to collective variables referred to as features, (b) optionally performing a basis set transformation using TICA (or, alternatively, stopping here and using the TICA result as the kinetic model), (c) decomposing the transformed trajectories into states, and (d) approximating a Markovian transition matrix from the state decomposition. Accounting for the relatively established use of just a few clustering algorithms for the state decomposition Scherer et al. (2015); Husic et al. (2016), steps (b) and (c) have largely been automated using the VAC—and, more recently, the VAMP. However, step (a) is difficult to automate using current methods. In contrast, it is straightforward to select a random number of states, construct a few hundred cross-validated MSMs, and identify which number of states achieves the highest VAMP score. While this can in principle also be done for collective variable transformations as well, repeated construction of the entire MSM pipeline for a variety of input features becomes computationally extremely demanding. A complete search of a hyperparameter space (features, TICA dimension, clustering method and number of clusters, etc.) is clearly unfeasible. Hence a variationally optimal method for feature selection that does not require going through the additional steps and choices of building an MSM would be very useful.

Motivated by these difficulties, in this paper we describe an approach that introduces a theoretically rigorous method for the the choice of input features and an accompanying algorithm which enables the researcher to quantify this choice. This could in principle also be used to automate feature selection. We study this approach by applying the method to a canonical dataset of twelve fast-folding proteins simulated near their experimental melting temperatures Kubelka, Hofrichter, and Eaton (2004); Lindorff-Larsen et al. (2011). These systems switch between the folded and unfolded state rapidly, and each trajectory dataset contains at least 10 instances of both folding and unfolding. This dataset, which contains fast folding proteins possessing a variety of secondary structure combinations, has been frequently used in its entirety to investigate methods advances Beauchamp et al. (2012); Dickson and Brooks III (2013); Weber, Jack, and Pande (2013); Husic et al. (2016). After evaluating features on all twelve proteins, we build representative MSMs to illustrate an example analysis.

Associated code is available at The computation of VAMP scores is implemented in the VAMP estimator method of the PyEMMA software package Scherer et al. (2015), which can be found at

Ii Theory

Here we summarize the necessary theory underlying VAMP-based feature selection, starting with MSMs and the VAC before continuing on to the Koopman matrix and the VAMP used in this work. The more practically inclined reader may proceed to the Sec. III—the bottom line of this section is that the VAMP score of a given set of features can be computed by implementing equations (12-21). For a more detailed theoretical discussion, we refer the reader to Refs. Wu et al., 2017Wu and Noé, 2017, and the publication “Identification of kinetic order parameters for non-equilibrium dynamics” by Paul et al. in this issue.

ii.1 The Markov state model transition matrix

MSMs model dynamics as a coarse-grained Markov chain on sets that partition the state space. For this discussion, we assume a single long trajectory, although the method may be applied to a distributed set of independent trajectories. The Markov chain is described by the conditional transition probabilities,


where represents the probability that the system is in set at time conditioned upon it being in set at time . These probabilities can be estimated from trajectories initiated from local equilibrium distributions, and do not require that the system is in global equilibrium Prinz et al. (2011).

The conditional transition probabilities are gathered in a square probability transition matrix, where each entry represents the conditional probability of transitioning from the set described by row index to the set described by column index at the defined lag time . When a sufficiently long lag time and adequate sets have been chosen, the dynamics can be approximated as Markovian, and the are independent of the history of the system. Thus, many independent trajectories obtained from distributed simulations can be threaded together through common sets (henceforth, “states”).

Once a simulation dataset has been divided into states, the

estimates are obtained from an analysis of observed transition counts. The observed transition counts are converted into conditional transition probabilities such that the dynamics are reversible. For MSM analysis, the eigendecomposition of the reversible transition matrix contains information about the thermodynamics and dynamics of the modeled system. Its stationary distribution is given by the eigenvector corresponding to the unique maximum eigenvalue of

. The remaining eigenvalues, which are restricted to the interval for , correspond to dynamical processes within the system, and define timescales as a function of the eigenvalue and the MSM lag time:


where is the MSM approximation to the dynamical propagator at lag time , and the absolute values are used by convention to avoid the imaginary timescales resulting from projection of the system dynamics.

To test whether the Markovian assumption is appropriate for an MSM approximated at a given lag time, the timescales can be plotted as a function of increasing lag time to observe if they have converged to an approximately constant value at the lag time of the estimator; this is referred to as validating the model’s implied timescales. Evaluation of the implied timescales is special case of a more general validation tool, the Chapman-Kolmogorov test, which evaluates the appropriateness of the Markovian assumption according to adherence to the propertyPrinz et al. (2011),


ii.2 The variational approach for conformational dynamics (VAC)

The eigenvalues and eigenvectors of the transition probability matrix correspond to the stationary and dynamical processes in the system, and these quantities can be used to interpret MD simulation data once states have been determined and the MSM is constructed Prinz et al. (2011); Husic and Pande (2018). The discrete eigenvectors are in fact approximations to continuous eigenfunctions, and the transition probability matrix is a finite-dimensional approximation to an infinite-dimensional continuous linear operator called the transfer operator, which describes the system dynamics Schütte et al. (1999); Prinz et al. (2011). In an MSM the system is described by a disjoint set of discrete states, which can be represented by a basis set of indicator functions , i.e.,


Importantly, the VAC generalizes beyond MSMs and does not require an indicator basis set Noé and Nüske (2013); Nüske et al. (2014). Rather, the VAC can be applied to an arbitrary basis set . Given a basis set, we then transform all observed in the dataset and estimate two covariance matrices from the transformed data: , the covariance matrix, and , the time-lagged covariance matrix111For more details on this estimation, see Ref. Wu et al., 2017, §II C.,


where the

are matrices containing the feature vectors

, and the subscript on the expected value indicates that is sampled from the stationary distribution on the interval , for total time.

With these estimates, we can proceed to solve the generalized eigenvalue problem,


where , and the matrix provides vectors of expansion coefficients , which can be substituted into the ansatz to obtain the approximated eigenfunctions  Noé and Nüske (2013); Nüske et al. (2014); Wu et al. (2017):


Since the ansatz is linear in the vector , characterizes the linear VAC. The expansion coefficients are chosen such that the maximize the Rayleigh trace,


where is the Kronecker delta. The definition of the score turns MSM estimation into a machine learning problem, where tools such as cross-validation can be used to determine hyperparameters McGibbon and Pande (2015).

ii.3 Estimating Koopman matrix and VAMP score

In Ref. Wu and Noé, 2017, a new variational approach is introduced that does not require the operator it approximates to be reversible, nor does it require simulation data that are in equilibrium. The operator is the Koopman operator, which is approximated by the Koopman matrix. The corresponding generalized master equation is given by,


where and are matrices storing the feature transformations and , respectively.

Wu and Noé (2017) show that optimal choices for and

can be determined using the singular value decomposition (SVD) of the Koopman matrix and setting

and to its top left and right singular functions, respectively222Ref. Wu and Noé, 2017, Thm. 1.. For an MSM, and are basis sets of indicator functions as defined in Eqn. (4), and Eqn. (10) is equivalent to Eqn. (1)333To see the equivalence, as an intermediate step write ..

A new variational principle—the VAMP—can then be applied to approximate the singular values from the singular functions, by maximizing the Rayleigh trace as in Eqn. (9) for the dominant singular values, except without requiring ,


where and perform the expected value over the starting points of the time windows and , respectively (which is no longer required to represent a stationary sample) and is the total time.

The VAMP was designed to be amenable to nonreversible processes; therefore it is useful to permit and to be different. This enables an adapted description of the dynamics that is different at the beginning and end of a transition of duration , because the system may have changed to the extent that it makes sense to adapt a new basis for the system after the lag time. Although we typically have stationary dynamics in MD datasets, we use the VAMP because we do not need to enforce reversibility in the dynamics as in the VAC, nor do we need to perform the statistically unfavorable reweighting 444While the Koopman reweighting estimator introduced in Ref. Wu et al., 2017 removes bias, it has a relatively large variance. described in Ref. Wu et al., 2017. For the VAMP, we require the following three covariance matrices,


In order to perform the approximation of , we require and to be invertible; thus, the basis set must be decorrelated, or whitened, before proceeding555See Ref. Wu and Noé, 2017, Appendix F for details.. Once we have obtained decorrelated basis sets, we can perform the approximation,


with (the first singular values), and contain the corresponding left and right singular vectors, and the bars indicate matrices that have been produced from whitened data 666When the whitening as suggested in Ref. Wu and Noé, 2017 is used, and become identity matrices when calculated from the whitened data. However, other whitening procedures (or their omission when and are already invertible) will produce different results, hence our more general formulation..

The matrices and are used to calculate the optimal and as follows:


where is the whitened basis,


From we can write the VAMP- score:


Previously, in the VAC, we summed the dominant eigenvalues of the MSM transition matrix to obtain the model’s score, which is variationally bounded from above. In a subsequent work, Noé and Clementi (2015) demonstrated that the sum of the squared eigenvalues is also variationally bounded from above, and can be maximized to obtain the kinetic variance described by the approximator. In fact, any nonincreasing weight can be applied to the eigenvalues and the variational principle will still hold Gross, Oliveira, and Kohn (1988). Thus, we can sum the highest singular values, each raised to an exponent , to obtain the VAMP- score, where VAMP is the variational approach for Markov processes Wu and Noé (2017). VAMP- is analogous to the Rayleigh trace, and VAMP- is analogous to the kinetic variance introduced in Ref. Noé and Clementi, 2015.

Iii Methods

Figure 1: Representative structures for twelve fast-folding proteins simulated by Lindorff-Larsen et al. (2011). When the trajectory input file corresponded to the folded state, this structure was used. Otherwise, a folded structure was hand-selected from the trajectory dataset in order to represent a naïve alignment choice that precedes analysis. In some cases, disordered tails of the pictured proteins are not shown. These folded states were used for the aligned Cartesian coordinate features.
Figure 2: Overview of protocol. From the MD simulations, we extract a set of features. In the feature space we compute a VAMP score to obtain a measure for the model approximation quality. Further steps towards a kinetic model involve a transformation, e.g. TICA or VAMP projection to maximize auto-correlation, and clustering in this transformed space. In our study, we use the first five VAMP singular vectors for this step. Once discretized, the trajectories are then used to estimate a MSM. After validating the MSM, another VAMP score can be computed. All steps except the direct scoring of the feature space are established methods.

iii.1 Input features

We apply the described algorithm to a set of fast-folding protein trajectory datasets Lindorff-Larsen et al. (2011) (Fig. 1) in order to calculate scores for several commonly used features in Markov state modeling. We investigate the choices of input features:

  1. Aligned Cartesian coordinates. We first use Cartesian coordinates, which have been aligned—i.e., translated and rotated to minimize the root mean square deviation—to the folded structure for the simulation dataset.777The folded structures were chosen visually, before analysis, to replicate a naïve choice of reference frame.

  2. Distance-based features. As a baseline distance feature, the closest heavy-atom contacts between each pair of residues in the protein is recorded in nanometers, excluding proximal pairs at indices and . In addition to using this distance, denoted as , we also apply several transformations to :

    1. [label=.]

    2. .

  3. Contact features. To form the contact features, we apply five different cutoff values on the residue minimum distances in order to encode a contact in a binary manner. In other words, we calculate,

    for  Å.

  4. Solvent accessible surface area (SASA). SASA is computed by the Shrake-Rupley rolling ball algorithmShrake and Rupley (1973) as implemented in the MDTraj software package McGibbon et al. (2015). This algorithm estimates the surface exposed to solvent by moving a small sphere beyond the van der Waals radii of each atom of the protein. The radius was set to 1.4 Å, and the number of grid points to model the sphere was set to 960. The SASA for each residue is calculated from the sum of the SASA results for each of its component atoms.

  5. Dihedral features. The dihedral features contain the backbone torsional angles and as well as all possible side chain torsions up to . Each torsional angle is transformed into its sine and cosine in order to maintain periodicity.

  6. Combined distance transformation and dihedral features. Finally, we concatenate the transformed distance feature with the dihedral features.

iii.2 Covariance matrix estimation and cross-validation of scores

We will now sketch the algorithm for calculating the cross-validated VAMP-2 score of a feature space. As an input we have feature trajectories with lengths frames in . From these time series we estimate a covariance , cross-covariance and time-shifted covariance matrix (recall Eqns. (12)-(14)) by an online algorithm Chan, Golub, and LeVeque (1982).

In order to compute a cross validated VAMP score, we estimate a number of covariance matrix triplets , and . To do so, the time series is split into running blocks of the form:


where denotes the time-shift in each trajectory , is the current position in time, and the frames between inclusively and exclusively. We then randomly assign each block of the running series to one of the running covariances.

Figure 3: VAMP-2 scores of the five slowest processes for all test systems and all defined features at a lag time of

ns. Error bars represent standard errors across 50 cross-validation splits.

To estimate a set of covariance matrices, we perform running moments to

covariance matrices. For each pair of blocks as yielded by the sliding window with respect to  (22), we draw one matrix and update its blocks with the running moment, i.e.,


where denotes is the total number of time steps contained in all blocks, and .

To compute a cross validated score, we split the covariance matrices into disjoint training and test sets with indices according to a -fold strategy in which we subdivide the set into groups and choose one of the groups as test set. We then aggregate the covariance matrices,


with an appropriate weight , according to the number of samples in the relevant set. Then we calculate the VAMP- score for both the training and test data sets, yielding scores , . Utilizing the VAMP- score from Eqn. (21) and letting , one can obtain a CV score. The approximated prediction/consistency error is then .

Thus, following the notation in Sec. II and Eqns. (19) and (20), we can summarize the calculation of the cross-validated VAMP- score as follows, where is the Frobenius norm 888This is because the Frobenius norm is equivalent to the -Schatten norm for ; see Ref. Wu and Noé, 2017, Sec. 3.2., and and are calculated from the training set:


We note that the construction of an MSM is not required to obtain the score.

iii.3 Markov state models

To verify the previously computed VAMP scores on the input feature space, we build MSMs for Homeodomain, Protein G, and WW domain. We expect to see a similar relative score for the MSMs constructed from different feature sets, which is related to the timescales of the scored dynamical processes.

After obtaining the VAMP singular vectors as described above, the MSM estimation pipeline can be described as follows: prior to clustering, we project onto the first five right singular vectors of the VAMP basis, analogously to the established TICA method Pérez-Hernández et al. (2013). We then cluster this space into a set of cluster centers via -means, and estimate a maximum likelihood, reversible MSM on this discretization. For each MSM we perform a block splitting of the discrete trajectories to perform a cross-validated VAMP-2 scoring on the MSM as described in Sec. II and Ref. Wu and Noé, 2017. For further analysis we switch from the maximum likelihood estimate to a Bayesian approach Trendelkamp-Schroer et al. (2015), in order to compute errors. We evaluate the timescales and some representative structures for the slowest processes. Finally, we want to ensure that the models have predictive power under the Markovian approximation, so we evaluate the implied timescales and conduct a Chapman-Kolmogorov test (recall Eqns. (2) and (3)). This procedure is outlined in Fig. 2.

Iv Results

In Fig. 3 we show VAMP-2 scores for the five slowest processes for all twelve fast-folding proteins at a lag time of ns computed in the input space as described in the previous section. For the systems BBA, Villin, Trp-Cage, Chignolin, and Protein B, the combination of flexible torsions and the residue contact distance transformation (feature definition 6) yields the best overall score (henceforth referred to as the “combined” feature set). For Homeodomain, we find three almost equal scores for three different features, namely residue minimum distances and its transformation , and the combined feature. For -repressor, Homeodomain, and a3D the untransformed contact distances are superior, closely followed by distances transformed by (henceforth simply “transformed distances”). For BBL, the distances transformed by perform best. The contact-based features scores increase with the magnitude of the cutoff. For all systems, the per-residue SASA feature yielded the lowest score.

Figure 4: Bayesian MSM Trendelkamp-Schroer et al. (2015) timescales of the five slowest processes in Homeodomain for five feature sets and three different numbers of cluster centers. The SASA timescales are too fast to visualize.
Figure 5: Bayesian MSM Trendelkamp-Schroer et al. (2015) timescales of the five slowest processes in Protein G for five feature sets and five different numbers of cluster centers. The SASA timescales are too fast to visualize.

To visualize results for a couple of example systems, we consider Homeodomain and Protein G. In Figs. 4 and 5, we show the timescales for a set of three different -means clusterings, namely , , and cluster centers. The slowest processes are consistently covered best by the combined feature set. Both systems show a slight increase of the timescale of the slowest process for finer discretizations, which is expected in the absence of cross-validation Sarich, Noé, and Schütte (2010); McGibbon and Pande (2015). The timescales of the second and following slowest processes appear to be more stable with respect to to the number of cluster centers.

Our analysis shows that the torsions feature alone can capture only processes happening on timescales half as long as those processes that can be described after combining torsions with the transformed contact distances. The transformed contact distance feature alone is relatively effective, but its effectiveness is increased by combining it with flexible torsions. Aligned Cartesian coordinates yield timescales in same regime as the torsion model. The SASA feature does not capture any slow processes, as we already expected from the low score in the feature space. Construction of an MSM for further analysis should involve investigating many state decompositions (i.e., numbers of microstates) under cross-validation McGibbon and Pande (2015). An optimized state decomposition will (by definition) increase the timescales obtained by the MSM, while the use of a training/validation dataset split will in general decrease the timescales for every state decomposition by avoiding overfitting.

We note that the slowest process in the best-performing Protein G MSMs is extremely slow ( ms). We visualize the dominant process of a 2000-microstate MSM built from transformed distances in Fig. 6 and see that the process represents conversion between the folded structure (B; red) and an unfolded structure (A; blue) consistent regions of order and disorder. It has been demonstrated through the use of subsequent simulations that this dataset is undersampled Schwantes, Shukla, and Pande (2016), so it is likely that this process is observed only a few times and does not represent folding in general 999We constructed MSMs with cross-validated state decompositions and still observed this timescale..

Figure 6: Representative structures of Protein G of the slowest process (transition ) captured by a BayesianMSM Trendelkamp-Schroer et al. (2015) with a lag time of  ns, built upon 2000 microstates and combined transformed distances and flexible torsions feature.
Figure 7: Comparison of maximum-likelihood MSM and feature VAMP scores for Homeodomain, WW-domain, and Protein G for four representative feature sets. The MSMs were constructed using 1000 microstates. Both sets of error bars are obtained from 50 cross-validation splits. The feature VAMP columns are reproduced from Fig. 3.

Although we restrict our analysis to the most dominant process, we can see that the differences in MSM VAMP- scores for 1000-microstate MSMs built from different features correspond to the trends observed in the previously calculated VAMP- feature scores (Fig. 7). The VAMP analysis of feature sets can be viewed as pessimistic relative to the full MSM variational analysis (see discussion in Sec. V, to follow). However, while MSMs tend to improve or preserve variational scores, this is observed to occur across the board, without causing a change in trends. Thus, we observe that, for the goal of choosing the best-performing MSM, we likely could have restricted ourselves to building models from only the best-performing features in the initial analysis.

Figure 8: First 20 singular values of VAMP computed on aligned Cartesian coordinates, transformed () distances, and the combination of transformed distances and flexible torsions.
Figure 9: Bayesian MSM Trendelkamp-Schroer et al. (2015) timescales of the five slowest processes in WW domain for five feature sets and three different numbers of cluster centers. The SASA timescales are too fast to visualize.

For all systems except WW domain, the aligned Cartesian coordinates (XYZ) show a worse performance than distance-based features. The WW domain, however, shows the largest score for aligned Cartesian coordinates, which is significantly higher than all the other feature sets. Because we only observe this result in one system, we take a closer look and plot the first twenty singular values for three feature sets (Fig. 8) as well as the five slowest MSM processes obtained using five representative feature sets and three different numbers of clusters (Fig. 9). By visualizing the extremes of the dominant MSM eigenvector, we see that the aligned coordinates capture the folding process (Fig. 10). In contrast, the slowest process of the combined feature only picks up rather fast loop movements, with the three -sheets already formed101010 These results were verified for several starting structures of WW domain in addition to the one visualized in Fig. 1.. This discrepancy shows why we see a significantly different VAMP score in feature space for aligned Cartesian coordinates in contrast to the other features.

Although the feature VAMP singular values are consistently higher for the WW domain when evaluated on the aligned Cartesian coordinates (Fig. 8), the MSM timescales show that transformed distances and the combined feature produce similar timescales for the slowest process in the MSM (Fig. 9). However, for subsequent processes, the MSM timescales decrease for transformed distance-based features, but remain high for the aligned Cartesian coordinates. This shows why the Cartesian coordinates perform the best according to the VAMP score, which we recall incorporates the sum of multiple singular values using Eqn. (21).

When we compare the ability to build a converged MSM with both Cartesian coordinates and the combined feature (Figs. S1 and S2 in the supplementary materials), we observe better performance for the combined feature in terms of long term predictions of the kinetics according to the Chapman-Kolmogorov test. In our discussion of the WW domain results, we have performed representative analyses for selection, validation, and investigation of a kinetic model. Before using a kinetic model of this (or any) system built from any of these feature sets, it would be necessary to further investigate the meaningfulness of the processes in a given feature set in the context of the desired analysis.

Figure 10: Representative structures for the slowest process in WW domain (red blue) extracted from a BayesianMSM with states each. A shows the feature residue minimum distance transformed with combined with flexible torsions and B uses Cartesian coordinates aligned with the folded state.

V Discussion

The introduction of variational principles into MD analyses over the past five years have provided the tools to optimize the approximation quality of not only reversible MSMs, but also the features used to construct reversible or nonreversible dynamical models from features, either as models themselves or as a step toward eventual MSM construction. With these tools, we have presented a rigorous algorithm to optimize the feature selection step in traditional MSM construction. While subsequent steps such as TICA, cluster discretization, and transition matrix approximation have been relatively optimized, feature selection has remained a challenge for constructing optimal models.

The method presented in this study demonstrates how recent algorithmic advances enable objective evaluation given a set of feature options. By optimizing features before building MSMs, we can evaluate their performance directly and more efficiently, instead of in combination with whatever other modeling choices are needed to obtain the MSM. We showed that selecting features using the VAMP differentiates them according to the approximation quality of the Koopman matrix obtained from those features. Constructing MSMs from these feature sets (i.e., from their right singular VAMP vectors) and evaluating their timescales maintains and verifies the ranking from performing the VAMP scoring on the features themselves. We expect that the practical use of this algorithm will be to quickly eliminate poorly performing features sets, so that further optimization can be performed on subsequent parameter choices using the better options for features. Our results show that contact distances transformed by combined with flexible torsions is a consistently well-performing feature set for the folding of small globular proteins, and are likely to perform well for a broader class of conformational transitions in biomolecules. Additional and more specific optimization of features for a class of molecular systems can be done by following the methodology described here.

As shown, constructing MSMs tends to improve model approximation quality due to the state discretization (e.g., transition regions can be finely discretized, which is known to produce better models Sarich, Noé, and Schütte (2010)). The discretization process involved in MSM construction will affect features differently depending on the extent to which they are already discretized. For example, nontransformed contact distances will be finely discretized along the coordinate according to the MSM states, whereas a binary contact map cannot be finely discretized with the MSM state decomposition. Thus, a continuous coordinate may demonstrate substantial improvement according to a variational score upon MSM construction, whereas an already-discretized coordinate is not expected to improve with MSM construction.

Since the first applications of the MSM framework to the analysis of biomolecules simulated with MD, the decision of how to transform the raw Cartesian coordinates from a simulation dataset into well-performing features has been an omnipresent challenge. While the categorical aspect of feature selection prohibits pure automation (except, perhaps, in the case of treatment with neural networks 

Mardt et al. (2018)), the method presented in this work enables a direct evaluation of features from which predictive models can be more efficiently built.


We are grateful to Tim Hempel, Nuria Plattner, Andreas Mardt, and Simon Olsson for helpful discussions. Funding is acknowledged from 760 European Commission (ERC CoG 772230 “ScaleCell”), 761 Deutsche Forschungsgemeinschaft (SFB 1114 Projects 762 C03 and A04, SFB 740 Project D07, NO825/2-2), and 763 the MATH+ cluster (Project EF1-2).



  • Hänggi and Talkner (1983) P. Hänggi and P. Talkner, “Memory index of first-passage time: a simple measure of non-Markovian character,” Phys. Rev. Lett. 51, 2242 (1983).
  • Shalloway (1996) D. Shalloway, “Macrostates of classical stochastic systems,” J. Chem. Phys. 105, 9986–10007 (1996).
  • Du et al. (1998) R. Du, V. S. Pande, A. Y. Grosberg, T. Tanaka,  and E. S. Shakhnovich, “On the transition coordinate for protein folding,” J. Chem. Phys. 108, 334–350 (1998).
  • Rohrdanz, Zheng, and Clementi (2013) M. A. Rohrdanz, W. Zheng,  and C. Clementi, “Discovering mountain passes via torchlight: methods for the definition of reaction coordinates and pathways in complex macromolecular reactions,” Annu. Rev. Phys. Chem. 64, 295–316 (2013).
  • Prinz, Chodera, and Noé (2014) J.-H. Prinz, J. D. Chodera,  and F. Noé, “Spectral rate theory for two-state kinetics,” Phys. Rev. X 4, 011020 (2014).
  • Noé and Clementi (2017) F. Noé and C. Clementi, “Collective variables for the study of long-time kinetics from molecular trajectories: theory and methods,” Curr. Opin. Struct. Biol. 43, 141–147 (2017).
  • Schütte et al. (1999) C. Schütte, A. Fischer, W. Huisinga,  and P. Deuflhard, “A direct approach to conformational dynamics based on hybrid monte carlo,” J. Comput. Phys. 151, 146–168 (1999).
  • Swope, Pitera, and Suits (2004) W. C. Swope, J. W. Pitera,  and F. Suits, “Describing protein folding kinetics by molecular dynamics simulations. 1. theory,” The Journal of Physical Chemistry B 108, 6571–6581 (2004).
  • Singhal, Snow, and Pande (2004) N. Singhal, C. D. Snow,  and V. S. Pande, “Using path sampling to build better Markovian state models: predicting the folding rate and mechanism of a tryptophan zipper beta hairpin,” J. Chem. Phys. 121, 415–425 (2004).
  • Noé et al. (2007) F. Noé, I. Horenko, C. Schütte,  and J. C. Smith, “Hierarchical analysis of conformational dynamics in biomolecules: transition networks of metastable states,” J. Chem. Phys. 126, 04B617 (2007).
  • Chodera et al. (2007)

    J. D. Chodera, N. Singhal, V. S. Pande, K. A. Dill,  and W. C. Swope, “Automatic discovery of metastable states for the construction of markov models of macromolecular conformational dynamics,” J. Chem. Phys. 

    126, 04B616 (2007).
  • Prinz et al. (2011) J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte,  and F. Noé, “Markov models of molecular kinetics: Generation and validation,” J. Chem. Phys. 134, 174105 (2011).
  • Sriraman, Kevrekidis, and Hummer (2005) S. Sriraman, I. G. Kevrekidis,  and G. Hummer, “Coarse master equation from bayesian analysis of replica molecular dynamics simulations,” J. Phys. Chem. B 109, 6479–6484 (2005).
  • Buchete and Hummer (2008) N.-V. Buchete and G. Hummer, “Coarse master equations for peptide folding dynamics,” J. Phys. Chem. B 112, 6057–6069 (2008).
  • Rohrdanz et al. (2011) M. A. Rohrdanz, W. Zheng, M. Maggioni,  and C. Clementi, “Determination of reaction coordinates via locally scaled diffusion map,” J. Chem. Phys. 134, 03B624 (2011).
  • Peters (2006) B. Peters, “Using the histogram test to quantify reaction coordinate error,” J. Chem. Phys. 125, 241101 (2006).
  • E and Vanden-Eijnden (2006) W. E and E. Vanden-Eijnden, “Towards a theory of transition paths,” J. Stat. Phys. 123, 503 (2006).
  • Hummer (2005) G. Hummer, “Position-dependent diffusion coefficients and free energies from bayesian analysis of equilibrium and replica molecular dynamics simulations,” New J. Phys. 7, 34 (2005).
  • Chodera et al. (2006) J. D. Chodera, W. C. Swope, J. W. Pitera,  and K. A. Dill, “Long-time protein folding dynamics from short-time molecular dynamics simulations,” Multiscale Model. Simul. 5, 1214–1226 (2006).
  • Kube and Weber (2007) S. Kube and M. Weber, “A coarse graining method for the identification of transition rates between molecular conformations,” J. Chem. Phys. 126, 024103 (2007).
  • Altis et al. (2007)

    A. Altis, P. H. Nguyen, R. Hegger,  and G. Stock, “Dihedral angle principal component analysis of molecular dynamics simulations,” J. Chem. Phys. 

    126, 244111 (2007).
  • Noé et al. (2009) F. Noé, C. Schütte, E. Vanden-Eijnden, L. Reich,  and T. R. Weikl, “Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations,” Proc. Natl. Acad. Sci. 106, 19011–19016 (2009).
  • Jayachandran et al. (2006) G. Jayachandran, M. R. Shirts, S. Park,  and V. S. Pande, “Parallelized-over-parts computation of absolute binding free energy with docking and molecular dynamics,” J. Chem. Phys. 125, 084901 (2006).
  • Yang and Roux (2008) S. Yang and B. Roux, “Src kinase conformational activation: thermodynamics, pathways, and mechanisms,” PLoS Comput. Biol. 4, e1000047 (2008).
  • Muff and Caflisch (2008) S. Muff and A. Caflisch, “Kinetic analysis of molecular dynamics simulations reveals changes in the denatured state and switch of folding pathways upon single-point mutation of a -sheet miniprotein,” Proteins: Struct., Funct., Bioinf. 70, 1185–1195 (2008).
  • Sarich, Noé, and Schütte (2010) M. Sarich, F. Noé,  and C. Schütte, “On the approximation quality of Markov state models,” Multiscale Model. Simul. 8, 1154–1177 (2010).
  • Noé and Nüske (2013) F. Noé and F. Nüske, “A variational approach to modeling slow processes in stochastic dynamical systems,” Multiscale Model. Simul. 11, 635–655 (2013).
  • Nüske et al. (2014) F. Nüske, B. G. Keller, G. Pérez-Hernández, A. S. Mey,  and F. Noé, “Variational approach to molecular kinetics,” J. Chem. Theory Comput. 10, 1739–1752 (2014).
  • Molgedey and Schuster (1994) L. Molgedey and H. G. Schuster, “Separation of a mixture of independent signals using time delayed correlations,” Phys. Rev. Lett. 72, 3634 (1994).
  • Pérez-Hernández et al. (2013) G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis,  and F. Noé, “Identification of slow molecular order parameters for Markov model construction,” J. Chem. Phys. 139, 07B604_1 (2013).
  • Noé and Clementi (2015) F. Noé and C. Clementi, “Kinetic distance and kinetic maps from molecular dynamics simulation,” J. Chem. Theory Comput. 11, 5002–5011 (2015).
  • Noé, Banisch, and Clementi (2016) F. Noé, R. Banisch,  and C. Clementi, “Commute maps: Separating slowly mixing molecular configurations for kinetic modeling,” J. Chem. Theory Comput. 12, 5620–5630 (2016).
  • Pérez-Hernández and Noé (2016) G. Pérez-Hernández and F. Noé, “Hierarchical time-lagged independent component analysis: computing slow modes and reaction coordinates for large molecular systems,” J. Chem. Theory Comput. 12, 6118–6129 (2016).
  • Schwantes and Pande (2015) C. R. Schwantes and V. S. Pande, “Modeling molecular kinetics with tica and the kernel trick,” J. Chem. Theory Comput. 11, 600–608 (2015).
  • Harmeling et al. (2003) S. Harmeling, A. Ziehe, M. Kawanabe,  and K.-R. Müller, “Kernel-based nonlinear blind source separation,” Neural Comput. 15, 1089–1124 (2003).
  • Wu et al. (2017) H. Wu, F. Nüske, F. Paul, S. Klus, P. Koltai,  and F. Noé, “Variational koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations,” J. Chem. Phys. 146, 154104 (2017).
  • Wu and Noé (2017) H. Wu and F. Noé, “Variational approach for learning Markov processes from time series data,” arXiv preprint arXiv:1707.04659  (2017).
  • McGibbon and Pande (2015) R. T. McGibbon and V. S. Pande, “Variational cross-validation of slow dynamical modes in molecular kinetics,” J. Chem. Phys. 142, 03B621_1 (2015).
  • Scherer et al. (2015) M. K. Scherer, B. Trendelkamp-Schroer, F. Paul, G. Pérez-Hernández, M. Hoffmann, N. Plattner, C. Wehmeyer, J.-H. Prinz,  and F. Noé, “Pyemma 2: A software package for estimation, validation, and analysis of Markov models,” J. Chem. Theory Comput. 11, 5525–5542 (2015).
  • Husic et al. (2016) B. E. Husic, R. T. McGibbon, M. M. Sultan,  and V. S. Pande, “Optimized parameter selection reveals trends in Markov state models for protein folding,” J. Chem. Phys. 145, 194103 (2016).
  • Mardt et al. (2018) A. Mardt, L. Pasquali, H. Wu,  and F. Noé, “Vampnets for deep learning of molecular kinetics,” Nat. Commun. 9, 5 (2018).
  • Kubelka, Hofrichter, and Eaton (2004) J. Kubelka, J. Hofrichter,  and W. A. Eaton, “The protein folding ‘speed limit’,” Curr. Opin. Struct. Biol. 14, 76–88 (2004).
  • Lindorff-Larsen et al. (2011) K. Lindorff-Larsen, S. Piana, R. O. Dror,  and D. E. Shaw, “How fast-folding proteins fold,” Science 334, 517–520 (2011).
  • Beauchamp et al. (2012) K. A. Beauchamp, R. McGibbon, Y.-S. Lin,  and V. S. Pande, “Simple few-state models reveal hidden complexity in protein folding,” Proc. Natl. Acad. Sci. 109, 17807–17813 (2012).
  • Dickson and Brooks III (2013) A. Dickson and C. L. Brooks III, “Native states of fast-folding proteins are kinetic traps,” J. Am. Chem. Soc. 135, 4729–4734 (2013).
  • Weber, Jack, and Pande (2013) J. K. Weber, R. L. Jack,  and V. S. Pande, “Emergence of glass-like behavior in Markov state models of protein folding dynamics,” J. Am. Chem. Soc. 135, 5501–5504 (2013).
  • Husic and Pande (2018) B. E. Husic and V. S. Pande, “Markov state models: From an art to a science,” J. Am. Chem. Soc. 140, 2386–2396 (2018).
  • (48) For more details on this estimation, see Ref. wu2017variational, §II C.
  • (49) Ref. wu2017vamp, Thm. 1.
  • (50) To see the equivalence, as an intermediate step write .
  • (51) While the Koopman reweighting estimator introduced in Ref. wu2017variational removes bias, it has a relatively large variance.
  • (52) See Ref. wu2017vamp, Appendix F for details.
  • (53) When the whitening as suggested in Ref. wu2017vamp is used, and become identity matrices when calculated from the whitened data. However, other whitening procedures (or their omission when and are already invertible) will produce different results, hence our more general formulation.
  • Gross, Oliveira, and Kohn (1988) E. K. Gross, L. N. Oliveira,  and W. Kohn, “Rayleigh-ritz variational principle for ensembles of fractionally occupied states,” Phys. Rev. A 37, 2805 (1988).
  • (55) The folded structures were chosen visually, before analysis, to replicate a naïve choice of reference frame.
  • Shrake and Rupley (1973) A. Shrake and J. Rupley, “Environment and exposure to solvent of protein atoms. lysozyme and insulin,” J. Mol. Biol. 79, 351–371 (1973).
  • McGibbon et al. (2015) R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane,  and V. S. Pande, “Mdtraj: a modern open library for the analysis of molecular dynamics trajectories,” Biophys. J. 109, 1528–1532 (2015).
  • Chan, Golub, and LeVeque (1982) T. F. Chan, G. H. Golub,  and R. J. LeVeque, “Updating formulae and a pairwise algorithm for computing sample variances,” in COMPSTAT 1982 5th Symposium held at Toulouse 1982 (Springer, 1982) pp. 30–41.
  • (59) This is because the Frobenius norm is equivalent to the -Schatten norm for ; see Ref. wu2017vamp, Sec. 3.2.
  • Trendelkamp-Schroer et al. (2015) B. Trendelkamp-Schroer, H. Wu, F. Paul,  and F. Noé, “Estimation and uncertainty of reversible Markov models,” J. Chem. Phys. 143, 11B601_1 (2015).
  • Schwantes, Shukla, and Pande (2016) C. R. Schwantes, D. Shukla,  and V. S. Pande, “Markov state models and tICA reveal a nonnative folding nucleus in simulations of NuG2,” Biophys. J. 110, 1716–1719 (2016).
  • (62) We constructed MSMs with cross-validated state decompositions and still observed this timescale.
  • (63) These results were verified for several starting structures of WW domain in addition to the one visualized in Fig. 1.