I Introduction
The first step in analyzing the states, equilibrium behavior, or kinetics of complex molecules based on molecular dynamics (MD) simulations is typically the choice of a suitable set of features describing the atomic configurations. This choice is particularly important when the goal is to compute kinetic quantities, such as transition rates or committor probabilities, as these quantities are sensitive to resolving the transitions between the long-lived (metastable) states Hänggi and Talkner (1983); Shalloway (1996); Du et al. (1998); Rohrdanz, Zheng, and Clementi (2013); Prinz, Chodera, and Noé (2014); Noé and Clementi (2017). Typically, these transformations take as input the raw Cartesian coordinates produced by the MD simulation and output a new set of coordinates that is translation- and rotation-invariant. In proteins, biologically motivated choices include the backbone dihedral angles (torsions), or the pairwise distances between all amino acid residues, taken between carbons or between the closest pairs of (heavy) atoms. Many such coordinate sets for proteins are imaginable, such as further transformations of the aforementioned contact distances, the solvent-accessible surface area (SASA) of protein residues or other groups, or side-chain torsion angles. A number of kinetic analysis frameworks critically depend on the set of features used as input, in particular Markov state models (MSMs) Schütte et al. (1999); Swope, Pitera, and Suits (2004); Singhal, Snow, and Pande (2004); Noé et al. (2007); Chodera et al. (2007); Prinz et al. (2011), master equation models Sriraman, Kevrekidis, and Hummer (2005); Buchete and Hummer (2008), diffusion maps Rohrdanz et al. (2011), and methods to select optimal reaction coordinates Peters (2006); E and Vanden-Eijnden (2006). The analysis framework employed affects which definition of optimality is meaningful for selecting input features.
Here we discuss optimal feature selection in particular with MSMs in mind. In the first generation of MSMs, whose aim was to construct a Markov chain between a few metastable states that partition state space
Schütte et al. (1999); Swope, Pitera, and Suits (2004); Noé et al. (2007); Chodera et al. (2007), it was already noted that the accuracy with which MSMs could make longtime predictions depended critically on the choice of features. Utilized featurization methods included torsion angles Hummer (2005); Chodera et al. (2006); Noé et al. (2007); Kube and Weber (2007), principal components in torsion or Cartesian coordinates Altis et al. (2007); Noé et al. (2009), contact pairs Jayachandran et al. (2006); Yang and Roux (2008), as well as other transformations involving such attributes as secondary structure Buchete and Hummer (2008); Muff and Caflisch (2008). From these representations, clustering methods can be used to determine the MSM states, which can be subsequently checked for adherence to the Markovian approximation. The second generation of MSMs is characterized by the finding that the predictive power of an MSM depends upon its states being chosen such that a good discretization of the eigenfunctions of the Markov operator is obtained
Sarich, Noé, and Schütte (2010); Prinz et al. (2011). These eigenfunctions are collective variables that indicate the rare-event processes connecting the metastable states, and the task of discretizing them well translates into the task of using input features that allow them to be resolved. At that time, we mainly had visual diagnostic tools to assess the predictive performance of MSMs, such as the implied timescales test Swope, Pitera, and Suits (2004) and the Chapman-Kolmogorov test Prinz et al. (2011). The choice of the hyperparameters of MSMs, such as the number of states and the set of input features, remained a trial-and-error procedure. This changed in 2013, when Noé and Nüske presented a variational principle that quantifies how well a given set of coordinates, features, or a given MSM resolves the Markov operator eigenfunctions, and thus the slowest processes Noé and Nüske (2013); Nüske et al. (2014)
. This variational approach to conformational dynamics (VAC) has been extensively developed in the past five years: time-lagged independent component analysis (TICA), an algorithm devised in machine learning Molgedey and Schuster (1994), has been shown to be the optimal linear approximator to the Markov operator eigenfunctions Pérez-Hernández et al. (2013); an embedding of the eigenfunctions approximated by VAC or TICA into a kinetic map has been proposed, in which distances are related to transition times Noé and Clementi (2015); Noé, Banisch, and Clementi (2016); hierarchical Pérez-Hernández and Noé (2016) and kernel-based estimators Schwantes and Pande (2015); Harmeling et al. (2003) have been proposed, as have methods to estimate VAC/TICA from short off-equilibrium trajectories Wu et al. (2017). Recently, the VAC has been generalized to the variational approach for Markov processes (VAMP), which can accommodate non-reversible dynamics Wu and Noé (2017). A key insight is that the variational principle defines a score which can be used with standard machine learning approaches for hyperparameter selection of MSMs. This was pioneered in Ref. McGibbon and Pande, 2015,
which shows that using the VAC with cross-validation is a tool for selecting the statistically optimal number of MSM states. Using a VAC-derived kinetic variance as a score, optimal feature selection was discussed in Ref. Scherer et al., 2015. The entire MSM pipeline was subjected to VAC optimization in Ref. Husic et al., 2016, revealing general trends in what makes a good MSM for fast protein folding. VAMP has been used for hyperparameter optimization Wu and Noé (2017); Mardt et al. (2018), and to define the loss functions for VAMPnets, a deep learning method to infer MSMs from data Mardt et al. (2018).

The construction of an MSM involves (a) choosing an appropriate transformation to collective variables referred to as features, (b) optionally performing a basis set transformation using TICA (or, alternatively, stopping here and using the TICA result as the kinetic model), (c) decomposing the transformed trajectories into states, and (d) approximating a Markovian transition matrix from the state decomposition. Given the relatively established use of just a few clustering algorithms for the state decomposition Scherer et al. (2015); Husic et al. (2016), steps (b) and (c) have largely been automated using the VAC—and, more recently, the VAMP. However, step (a) is difficult to automate using current methods. In contrast, it is straightforward to select a random number of states, construct a few hundred cross-validated MSMs, and identify which number of states achieves the highest VAMP score. While this can in principle also be done for collective variable transformations, repeated construction of the entire MSM pipeline for a variety of input features becomes computationally extremely demanding. A complete search of the hyperparameter space (features, TICA dimension, clustering method and number of clusters, etc.) is clearly infeasible. Hence a variationally optimal method for feature selection that does not require going through the additional steps and choices of building an MSM would be very useful.
Motivated by these difficulties, in this paper we introduce a theoretically rigorous method for the choice of input features and an accompanying algorithm which enables the researcher to quantify this choice. This could in principle also be used to automate feature selection. We study this approach by applying the method to a canonical dataset of twelve fast-folding proteins simulated near their experimental melting temperatures Kubelka, Hofrichter, and Eaton (2004); Lindorff-Larsen et al. (2011). These systems switch between the folded and unfolded states rapidly, and each trajectory dataset contains at least 10 instances of both folding and unfolding. This dataset, which contains fast-folding proteins possessing a variety of secondary structure combinations, has frequently been used in its entirety to investigate methodological advances Beauchamp et al. (2012); Dickson and Brooks III (2013); Weber, Jack, and Pande (2013); Husic et al. (2016). After evaluating features on all twelve proteins, we build representative MSMs to illustrate an example analysis.
Associated code is available at github.com/markovmodel/feature_selection. The computation of VAMP scores is implemented in the VAMP estimator method of the PyEMMA software package Scherer et al. (2015), which can be found at github.com/markovmodel/PyEMMA.
II Theory
Here we summarize the necessary theory underlying VAMP-based feature selection, starting with MSMs and the VAC before continuing to the Koopman matrix and the VAMP used in this work. The more practically inclined reader may proceed to Sec. III—the bottom line of this section is that the VAMP score of a given set of features can be computed by implementing Eqns. (12)-(21). For a more detailed theoretical discussion, we refer the reader to Refs. Wu et al., 2017, Wu and Noé, 2017, and the publication "Identification of kinetic order parameters for nonequilibrium dynamics" by Paul et al. in this issue.
II.1 The Markov state model transition matrix
MSMs model dynamics as a coarsegrained Markov chain on sets that partition the state space. For this discussion, we assume a single long trajectory, although the method may be applied to a distributed set of independent trajectories. The Markov chain is described by the conditional transition probabilities,
(1) $p_{ij}(\tau) = \mathbb{P}\left[x_{t+\tau} \in S_j \mid x_t \in S_i\right],$

where $p_{ij}(\tau)$ represents the probability that the system is in set $S_j$ at time $t+\tau$ conditioned upon it being in set $S_i$ at time $t$. These probabilities can be estimated from trajectories initiated from local equilibrium distributions, and do not require that the system is in global equilibrium Prinz et al. (2011).

The conditional transition probabilities are gathered in a square probability transition matrix $\mathbf{T}(\tau)$, where each entry represents the conditional probability of transitioning from the set described by the row index to the set described by the column index at the defined lag time $\tau$. When a sufficiently long lag time and adequate sets have been chosen, the dynamics can be approximated as Markovian, and the $p_{ij}(\tau)$ are independent of the history of the system. Thus, many independent trajectories obtained from distributed simulations can be threaded together through common sets (henceforth, "states").
Once a simulation dataset has been divided into states, the $p_{ij}(\tau)$ estimates are obtained from an analysis of observed transition counts. The observed transition counts are converted into conditional transition probabilities such that the dynamics are reversible. For MSM analysis, the eigendecomposition of the reversible transition matrix contains information about the thermodynamics and dynamics of the modeled system. Its stationary distribution is given by the eigenvector corresponding to the unique maximum eigenvalue $\lambda_1 = 1$ of $\mathbf{T}(\tau)$. The remaining eigenvalues $\lambda_i$, which are restricted to the interval $(-1, 1)$ for $i > 1$, correspond to dynamical processes within the system, and define timescales as a function of the eigenvalue and the MSM lag time:

(2) $t_i = -\frac{\tau}{\ln \left|\lambda_i(\tau)\right|},$

where $\lambda_i(\tau)$ is the $i$-th eigenvalue of $\mathbf{T}(\tau)$, the MSM approximation to the dynamical propagator at lag time $\tau$, and the absolute values are used by convention to avoid the imaginary timescales resulting from projection of the system dynamics.
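Eqn. (2) can be evaluated directly from the eigenvalues of a transition matrix. The following sketch uses a small hypothetical 3-state matrix (not taken from the paper's data) to illustrate the computation with NumPy:

```python
import numpy as np

def implied_timescales(T, lag_time):
    """Implied timescales t_i = -lag_time / ln|lambda_i| from Eqn. (2).

    The stationary eigenvalue lambda_1 = 1 is discarded; the absolute
    value guards against negative eigenvalues from projection error.
    """
    eigvals = np.sort(np.linalg.eigvals(T).real)[::-1]
    return -lag_time / np.log(np.abs(eigvals[1:]))

# Hypothetical 3-state row-stochastic transition matrix at lag time 1 ns
T = np.array([[0.90, 0.08, 0.02],
              [0.10, 0.85, 0.05],
              [0.02, 0.08, 0.90]])
timescales = implied_timescales(T, lag_time=1.0)  # in ns, slowest first
```

In a lag-time scan, this function would be applied to matrices estimated at increasing values of the lag time.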
To test whether the Markovian assumption is appropriate for an MSM approximated at a given lag time, the timescales can be plotted as a function of increasing lag time to observe whether they have converged to an approximately constant value at the lag time of the estimator; this is referred to as validating the model's implied timescales. Evaluation of the implied timescales is a special case of a more general validation tool, the Chapman-Kolmogorov test, which evaluates the appropriateness of the Markovian assumption according to adherence to the property Prinz et al. (2011),

(3) $\mathbf{T}(k\tau) \approx \mathbf{T}(\tau)^k.$
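The property in Eqn. (3) can be checked numerically by comparing powers of a model estimated at lag $\tau$ with a model estimated directly at lag $k\tau$; a minimal sketch with a toy two-state matrix (not the paper's estimator):

```python
import numpy as np

def ck_deviation(T_tau, T_ktau, k):
    """Maximum elementwise deviation between T(tau)^k and a model
    estimated directly at lag time k*tau (Eqn. (3))."""
    return np.max(np.abs(np.linalg.matrix_power(T_tau, k) - T_ktau))

# For perfectly Markovian dynamics, the model at lag 2*tau equals T(tau)^2
T = np.array([[0.95, 0.05],
              [0.10, 0.90]])
T2 = T @ T  # stands in for a model re-estimated at lag 2*tau
deviation = ck_deviation(T, T2, k=2)  # ~0 for Markovian dynamics
```

In practice the comparison is performed per metastable set and with statistical error bars, but the quantity being compared is the one above.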
II.2 The variational approach for conformational dynamics (VAC)
The eigenvalues and eigenvectors of the transition probability matrix correspond to the stationary and dynamical processes in the system, and these quantities can be used to interpret MD simulation data once states have been determined and the MSM is constructed Prinz et al. (2011); Husic and Pande (2018). The discrete eigenvectors are in fact approximations to continuous eigenfunctions, and the transition probability matrix is a finite-dimensional approximation to an infinite-dimensional continuous linear operator called the transfer operator, which describes the system dynamics Schütte et al. (1999); Prinz et al. (2011). In an MSM the system is described by a disjoint set of discrete states, which can be represented by a basis set of indicator functions $\chi_i$, i.e.,

(4) $\chi_i(x) = \begin{cases} 1, & x \in S_i, \\ 0, & \text{otherwise.} \end{cases}$
Importantly, the VAC generalizes beyond MSMs and does not require an indicator basis set Noé and Nüske (2013); Nüske et al. (2014). Rather, the VAC can be applied to an arbitrary basis set $\boldsymbol{\chi}(x) = (\chi_1(x), \ldots, \chi_m(x))^\top$. Given a basis set, we then transform all configurations $x_t$ observed in the dataset and estimate two covariance matrices from the transformed data: $\mathbf{C}_{00}$, the covariance matrix, and $\mathbf{C}_{0\tau}$, the time-lagged covariance matrix (for more details on this estimation, see Ref. Wu et al., 2017, §II C),

(5) $\mathbf{C}_{00} = \mathbb{E}_t\left[\boldsymbol{\chi}(x_t)\,\boldsymbol{\chi}(x_t)^\top\right]$

(6) $\mathbf{C}_{0\tau} = \mathbb{E}_t\left[\boldsymbol{\chi}(x_t)\,\boldsymbol{\chi}(x_{t+\tau})^\top\right]$
where the expectation $\mathbb{E}_t$ indicates that $x_t$ is sampled from the stationary distribution over the interval $t \in [0, T-\tau]$, for total simulation time $T$. With these estimates, we can proceed to solve the generalized eigenvalue problem,

(7) $\mathbf{C}_{0\tau}\,\mathbf{B} = \mathbf{C}_{00}\,\mathbf{B}\,\hat{\boldsymbol{\Lambda}},$
where $\hat{\boldsymbol{\Lambda}} = \operatorname{diag}(\hat{\lambda}_1, \ldots, \hat{\lambda}_m)$, and the matrix $\mathbf{B}$ provides vectors of expansion coefficients $\mathbf{b}_i$, which can be substituted into the ansatz to obtain the approximated eigenfunctions Noé and Nüske (2013); Nüske et al. (2014); Wu et al. (2017):

(8) $\hat{\psi}_i(x) = \sum_j b_{ji}\,\chi_j(x) = \mathbf{b}_i^\top \boldsymbol{\chi}(x).$
Since the ansatz (8) is linear in the vector $\mathbf{b}_i$, it characterizes the linear VAC. The expansion coefficients are chosen such that they maximize the Rayleigh trace,

(9) $\max_{\mathbf{B}} \sum_{i=1}^{k} \mathbf{b}_i^\top \mathbf{C}_{0\tau}\,\mathbf{b}_i \quad \text{subject to} \quad \mathbf{b}_i^\top \mathbf{C}_{00}\,\mathbf{b}_j = \delta_{ij},$

where $\delta_{ij}$ is the Kronecker delta. The definition of the score turns MSM estimation into a machine learning problem, where tools such as cross-validation can be used to determine hyperparameters McGibbon and Pande (2015).
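The procedure of Eqns. (5)-(9) can be made concrete in a few lines: estimate the two covariance matrices from a mean-free feature time series and solve the generalized eigenvalue problem of Eqn. (7) by whitening. This is a simplified symmetrized estimator, assuming reversible, well-sampled data, not the paper's production implementation:

```python
import numpy as np

def linear_vac(X, lag):
    """Linear VAC/TICA sketch: solve C_0tau B = C_00 B Lambda (Eqn. (7)).

    X: (n_frames, m) array of mean-free feature vectors chi(x_t).
    Returns eigenvalues (descending) and expansion coefficients B.
    """
    X0, Xt = X[:-lag], X[lag:]
    C00 = X0.T @ X0 / len(X0)
    # Symmetrized time-lagged covariance, enforcing reversibility
    C0t = 0.5 * (X0.T @ Xt + Xt.T @ X0) / len(X0)
    s, U = np.linalg.eigh(C00)
    W = U / np.sqrt(s)                  # whitening transform C00^{-1/2}
    lam, Bw = np.linalg.eigh(W.T @ C0t @ W)
    order = np.argsort(lam)[::-1]
    return lam[order], W @ Bw[:, order]
```

Substituting the columns of the returned coefficient matrix into the ansatz of Eqn. (8) yields the approximated eigenfunctions, and the eigenvalues can be converted to timescales via Eqn. (2).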
II.3 Estimating the Koopman matrix and VAMP score
In Ref. Wu and Noé, 2017, a new variational approach was introduced that requires neither the operator it approximates to be reversible nor the simulation data to be in equilibrium. The operator is the Koopman operator, which is approximated by the Koopman matrix $\mathbf{K}$. The corresponding generalized master equation is given by,

(10) $\mathbb{E}\left[\mathbf{g}(x_{t+\tau})\right] = \mathbf{K}^\top\,\mathbb{E}\left[\mathbf{f}(x_t)\right],$

where $\mathbf{f}$ and $\mathbf{g}$ store the feature transformations applied at times $t$ and $t+\tau$, respectively.
Wu and Noé (2017) show that optimal choices for $\mathbf{f}$ and $\mathbf{g}$ can be determined using the singular value decomposition (SVD) of the Koopman matrix, setting $\mathbf{f}$ and $\mathbf{g}$ to its top left and right singular functions, respectively (Ref. Wu and Noé, 2017, Thm. 1). For an MSM, $\mathbf{f}$ and $\mathbf{g}$ are basis sets of indicator functions as defined in Eqn. (4), and Eqn. (10) is equivalent to Eqn. (1). A new variational principle—the VAMP—can then be applied to approximate the singular values from the singular functions, by maximizing the Rayleigh trace as in Eqn. (9) for the dominant singular values, except without requiring reversibility,

(11) $\max_{\mathbf{f},\mathbf{g}} \sum_{i=1}^{k} \mathbb{E}_0\left[f_i(x_t)\,g_i(x_{t+\tau})\right] \quad \text{subject to} \quad \mathbb{E}_0\left[f_i(x_t)\,f_j(x_t)\right] = \mathbb{E}_\tau\left[g_i(x_t)\,g_j(x_t)\right] = \delta_{ij},$

where $\mathbb{E}_0$ and $\mathbb{E}_\tau$ perform the expected value over the starting points of the time windows $[0, T-\tau]$ and $[\tau, T]$, respectively (which are no longer required to represent a stationary sample), and $T$ is the total time.
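The stated equivalence between Eqns. (10) and (1) for an indicator basis can be verified numerically: with one-hot features, the estimated Koopman matrix $\mathbf{K} = \mathbf{C}_{00}^{-1}\mathbf{C}_{0\tau}$ is exactly the row-normalized transition count matrix. A small sketch (toy discrete trajectory, chosen for illustration):

```python
import numpy as np

def koopman_indicator(dtraj, n_states, lag):
    """Koopman matrix K = C00^{-1} C_0tau for a one-hot (indicator) basis.

    For this basis, K coincides with the non-reversible maximum-likelihood
    MSM transition matrix: K_ij = c_ij / sum_j c_ij.
    """
    X = np.eye(n_states)[dtraj]          # chi(x_t): one-hot encoding
    X0, Xt = X[:-lag], X[lag:]
    C00 = X0.T @ X0 / len(X0)
    C0t = X0.T @ Xt / len(X0)
    return np.linalg.solve(C00, C0t)     # C00^{-1} C_0tau

dtraj = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0])
K = koopman_indicator(dtraj, n_states=2, lag=1)
# Rows of K sum to one, as for a transition matrix
```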
The VAMP was designed to be amenable to non-reversible processes; therefore it is useful to permit $\mathbf{f}$ and $\mathbf{g}$ to be different. This enables an adapted description of the dynamics that differs at the beginning and end of a transition of duration $\tau$, because the system may have changed to the extent that it makes sense to adopt a new basis after the lag time. Although we typically have stationary dynamics in MD datasets, we use the VAMP because we need neither to enforce reversibility in the dynamics as in the VAC, nor to perform the statistically unfavorable reweighting described in Ref. Wu et al., 2017 (while the Koopman reweighting estimator introduced in Ref. Wu et al., 2017 removes bias, it has a relatively large variance). For the VAMP, we require the following three covariance matrices,

(12) $\mathbf{C}_{00} = \mathbb{E}_0\left[\boldsymbol{\chi}(x_t)\,\boldsymbol{\chi}(x_t)^\top\right]$

(13) $\mathbf{C}_{0\tau} = \mathbb{E}_0\left[\boldsymbol{\chi}(x_t)\,\boldsymbol{\chi}(x_{t+\tau})^\top\right]$

(14) $\mathbf{C}_{\tau\tau} = \mathbb{E}_\tau\left[\boldsymbol{\chi}(x_t)\,\boldsymbol{\chi}(x_t)^\top\right]$
In order to approximate $\mathbf{K}$, we require $\mathbf{C}_{00}$ and $\mathbf{C}_{\tau\tau}$ to be invertible; thus, the basis set must be decorrelated, or whitened, before proceeding (see Ref. Wu and Noé, 2017, Appendix F for details). Once we have obtained decorrelated basis sets, we can perform the approximation,

(15) $\bar{\mathbf{K}} \approx \bar{\mathbf{U}}'\,\operatorname{diag}(\sigma_1, \ldots, \sigma_k)\,\bar{\mathbf{V}}'^\top,$

with $\sigma_1 \geq \cdots \geq \sigma_k$ the first $k$ singular values, $\bar{\mathbf{U}}'$ and $\bar{\mathbf{V}}'$ the matrices containing the corresponding left and right singular vectors, and the bars indicating matrices that have been produced from whitened data. (When the whitening suggested in Ref. Wu and Noé, 2017 is used, $\bar{\mathbf{C}}_{00}$ and $\bar{\mathbf{C}}_{\tau\tau}$ become identity matrices when calculated from the whitened data. However, other whitening procedures, or their omission when $\mathbf{C}_{00}$ and $\mathbf{C}_{\tau\tau}$ are already invertible, will produce different results, hence our more general formulation.)
The matrices $\bar{\mathbf{U}}'$ and $\bar{\mathbf{V}}'$ are used to calculate the optimal $\mathbf{f}$ and $\mathbf{g}$ as follows:

(16) $\mathbf{f}(x) = \mathbf{U}^\top \boldsymbol{\chi}(x)$

(17) $\mathbf{g}(x) = \mathbf{V}^\top \boldsymbol{\chi}(x)$

where the transformation matrices $\mathbf{U}$ and $\mathbf{V}$ map from the whitened basis back to the original one,

(19) $\mathbf{U} = \mathbf{C}_{00}^{-1/2}\,\bar{\mathbf{U}}'$

(20) $\mathbf{V} = \mathbf{C}_{\tau\tau}^{-1/2}\,\bar{\mathbf{V}}'$
From the singular values we can write the VAMP-$r$ score:

(21) $\hat{R}_r = \sum_{i=1}^{k} \sigma_i^{\,r}.$
Previously, in the VAC, we summed the dominant eigenvalues of the MSM transition matrix to obtain the model's score, which is variationally bounded from above. In a subsequent work, Noé and Clementi (2015) demonstrated that the sum of the squared eigenvalues is also variationally bounded from above, and can be maximized to obtain the kinetic variance described by the approximator. In fact, any non-increasing weight can be applied to the eigenvalues and the variational principle will still hold Gross, Oliveira, and Kohn (1988). Thus, we can sum the $k$ highest singular values, each raised to an exponent $r$, to obtain the VAMP-$r$ score, where VAMP is the variational approach for Markov processes Wu and Noé (2017). VAMP-1 is analogous to the Rayleigh trace, and VAMP-2 is analogous to the kinetic variance introduced in Ref. Noé and Clementi, 2015.
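Putting Eqns. (12)-(15) and (21) together, a compact sketch of the (non-cross-validated) VAMP-$r$ score of a mean-free feature time series, assuming full-rank covariance matrices (this is an illustration, not the paper's production estimator):

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root C^{-1/2} of a symmetric
    positive-definite matrix C."""
    s, U = np.linalg.eigh(C)
    return U @ np.diag(s ** -0.5) @ U.T

def vamp_score(X, lag, r=2, k=None):
    """VAMP-r score (Eqn. (21)): sum of the top-k singular values of the
    whitened Koopman matrix (Eqn. (15)), each raised to the power r."""
    X0, Xt = X[:-lag], X[lag:]
    n = len(X0)
    C00, C0t, Ctt = X0.T @ X0 / n, X0.T @ Xt / n, Xt.T @ Xt / n
    K_bar = inv_sqrt(C00) @ C0t @ inv_sqrt(Ctt)   # whitened Koopman matrix
    sigma = np.linalg.svd(K_bar, compute_uv=False)
    return np.sum(sigma[:k] ** r)
```

A higher score indicates that the feature set resolves more of the slow kinetic content; the score is what the remainder of the paper computes, feature set by feature set, under cross-validation.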
III Methods

III.1 Input features
We apply the described algorithm to a set of fast-folding protein trajectory datasets Lindorff-Larsen et al. (2011) (Fig. 1) in order to calculate scores for several features commonly used in Markov state modeling. We investigate the following choices of input features:

Aligned Cartesian coordinates. We first use Cartesian coordinates, which have been aligned—i.e., translated and rotated to minimize the root mean square deviation—to the folded structure for the simulation dataset. (The folded structures were chosen visually, before analysis, to replicate a naïve choice of reference frame.)

Distance-based features. As a baseline distance feature, the closest heavy-atom distance between each pair of residues in the protein is recorded in nanometers, excluding proximal residue pairs. In addition to using this distance, denoted $d$, we also apply several nonlinear transformations to $d$.

Contact features. To form the contact features, we apply five different cutoff values to the residue minimum distances in order to encode each contact in a binary manner. In other words, we calculate $b = 1$ if $d < d_\mathrm{c}$ and $b = 0$ otherwise, for cutoff values $d_\mathrm{c}$ given in Å.

Solvent-accessible surface area (SASA). SASA is computed by the Shrake-Rupley rolling ball algorithm Shrake and Rupley (1973) as implemented in the MDTraj software package McGibbon et al. (2015). This algorithm estimates the surface exposed to solvent by rolling a small sphere over the van der Waals radii of each atom of the protein. The probe radius was set to 1.4 Å, and the number of grid points used to model the sphere was set to 960. The SASA for each residue is calculated as the sum of the SASA results for each of its component atoms.

Dihedral features. The dihedral features contain the backbone torsional angles $\phi$ and $\psi$ as well as all possible side-chain torsion angles. Each torsional angle is transformed into its sine and cosine in order to maintain periodicity.

Combined distance transformation and dihedral features. Finally, we concatenate the transformed distance feature with the dihedral features.
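The per-feature computations in the list above are simple array transformations. The following sketch illustrates the binary contact encoding and the periodic dihedral encoding; the cutoff values shown are placeholders, not the paper's cutoffs:

```python
import numpy as np

def contact_features(d_nm, cutoffs_angstrom):
    """Binary contacts from residue minimum distances (nm): one block of
    indicator features per cutoff (cutoffs given in Angstrom)."""
    d = np.asarray(d_nm)
    blocks = [(d < c / 10.0).astype(float) for c in cutoffs_angstrom]
    return np.concatenate(blocks, axis=-1)

def dihedral_features(angles_rad):
    """Encode each torsion angle as (cos, sin) to maintain periodicity."""
    a = np.asarray(angles_rad)
    return np.concatenate([np.cos(a), np.sin(a)], axis=-1)

# Example: two residue-pair distances over three frames, placeholder cutoffs
d = np.array([[0.35, 0.90],
              [0.45, 0.55],
              [0.80, 0.30]])
contacts = contact_features(d, cutoffs_angstrom=[5.0, 8.0])  # shape (3, 4)
torsions = dihedral_features(np.array([[-np.pi, 0.5]]))      # shape (1, 4)
```

The combined feature set is then simply a column-wise concatenation of the transformed-distance array with the dihedral array, frame by frame.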
III.2 Covariance matrix estimation and cross-validation of scores
We now sketch the algorithm for calculating the cross-validated VAMP-2 score of a feature space. As input we have $N$ feature trajectories with lengths of $T_i$ frames, $i \in \{1, \ldots, N\}$. From these time series we estimate a covariance matrix $\mathbf{C}_{00}$, cross-covariance matrix $\mathbf{C}_{0\tau}$, and time-shifted covariance matrix $\mathbf{C}_{\tau\tau}$ (recall Eqns. (12)-(14)) by an online algorithm Chan, Golub, and LeVeque (1982).

In order to compute a cross-validated VAMP score, we estimate a number of covariance matrix triplets $(\mathbf{C}_{00}^{(j)}, \mathbf{C}_{0\tau}^{(j)}, \mathbf{C}_{\tau\tau}^{(j)})$. To do so, the time series is split into running blocks of the form:

(22) $B_t^{(i)} = \left(x_t^{(i)}, x_{t+1}^{(i)}, \ldots, x_{t+\tau-1}^{(i)}\right),$

where $\tau$ denotes the time-shift in each trajectory $i$, $t$ is the current position in time, and $B_t^{(i)}$ contains the frames between $t$ inclusively and $t+\tau$ exclusively. We then randomly assign each block of the running series to one of the running covariances.
To estimate a set of covariance matrices, we accumulate running moments into the covariance matrix accumulators. For each pair of blocks $(B_t^{(i)}, B_{t+\tau}^{(i)})$ yielded by the sliding window with respect to Eqn. (22), we draw one matrix triplet $j$ and update it with the running moment, i.e.,

(23) $\mathbf{C}_{00}^{(j)} \mathrel{+}= \frac{1}{M_j} \sum_{x_t \in B_t^{(i)}} \boldsymbol{\chi}(x_t)\,\boldsymbol{\chi}(x_t)^\top,$

and analogously for $\mathbf{C}_{0\tau}^{(j)}$ and $\mathbf{C}_{\tau\tau}^{(j)}$, where $M_j$ denotes the total number of time steps contained in all blocks assigned to triplet $j$.
To compute a cross-validated score, we split the covariance matrices into disjoint training and test sets with indices $I_\mathrm{train}$ and $I_\mathrm{test}$ according to a $k$-fold strategy in which we subdivide the set into $k$ groups and choose one of the groups as the test set. We then aggregate the covariance matrices,

(24) $\mathbf{C}_{00}^{\mathrm{train}} = \sum_{j \in I_\mathrm{train}} w_j\,\mathbf{C}_{00}^{(j)}$

(25) $\mathbf{C}_{0\tau}^{\mathrm{train}} = \sum_{j \in I_\mathrm{train}} w_j\,\mathbf{C}_{0\tau}^{(j)}$

(26) $\mathbf{C}_{\tau\tau}^{\mathrm{train}} = \sum_{j \in I_\mathrm{train}} w_j\,\mathbf{C}_{\tau\tau}^{(j)}$

(and analogously for the test set) with an appropriate weight $w_j$, chosen according to the number of samples in the relevant set. Then we calculate the VAMP score for both the training and test data sets, yielding scores $S_\mathrm{train}$ and $S_\mathrm{test}$. Utilizing the VAMP-$r$ score from Eqn. (21) and letting $r = 2$, one can obtain a cross-validated (CV) score. The approximated prediction/consistency error is then $S_\mathrm{train} - S_\mathrm{test}$.
Thus, following the notation in Sec. II and Eqns. (19) and (20), we can summarize the calculation of the cross-validated VAMP-2 score as follows, where $\|\cdot\|_F$ is the Frobenius norm (the Frobenius norm is equal to the Schatten-$r$ norm for $r = 2$; see Ref. Wu and Noé, 2017, Sec. 3.2), and $\mathbf{U}$ and $\mathbf{V}$ are calculated from the training set:

(27) $S_\mathrm{test} = \left\| \left(\mathbf{U}^\top \mathbf{C}_{00}^{\mathrm{test}}\,\mathbf{U}\right)^{-1/2} \mathbf{U}^\top \mathbf{C}_{0\tau}^{\mathrm{test}}\,\mathbf{V} \left(\mathbf{V}^\top \mathbf{C}_{\tau\tau}^{\mathrm{test}}\,\mathbf{V}\right)^{-1/2} \right\|_F^2.$
We note that the construction of an MSM is not required to obtain the score.
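The full procedure of Eqns. (22)-(27)—block-wise covariance accumulation, train/test aggregation, and scoring the training-set singular functions on the test covariances—can be sketched as follows. This is a simplified version with hypothetical parameter choices: contiguous non-overlapping blocks are assigned uniformly at random to accumulators, and a single fold is scored:

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """C^{-1/2} for a symmetric positive (semi-)definite matrix C,
    with small eigenvalues floored at eps for numerical stability."""
    s, U = np.linalg.eigh(C)
    return U @ np.diag(np.maximum(s, eps) ** -0.5) @ U.T

def block_covariances(X, lag, n_acc, block_len, seed=0):
    """Accumulate per-block covariance sums into n_acc accumulators
    (Eqns. (22)-(23), simplified to contiguous non-overlapping blocks)."""
    rng = np.random.default_rng(seed)
    X0, Xt = X[:-lag], X[lag:]
    acc = [{"C00": 0.0, "C0t": 0.0, "Ctt": 0.0, "n": 0} for _ in range(n_acc)]
    for start in range(0, len(X0) - block_len + 1, block_len):
        a = X0[start:start + block_len]
        b = Xt[start:start + block_len]
        j = rng.integers(n_acc)          # random assignment of the block pair
        acc[j]["C00"] += a.T @ a
        acc[j]["C0t"] += a.T @ b
        acc[j]["Ctt"] += b.T @ b
        acc[j]["n"] += len(a)
    return acc

def aggregate(acc, idx):
    """Weighted aggregation of accumulators (Eqns. (24)-(26))."""
    n = sum(acc[j]["n"] for j in idx)
    return tuple(sum(acc[j][key] for j in idx) / n
                 for key in ("C00", "C0t", "Ctt"))

def cv_vamp2(acc, test_idx, k=None):
    """Train the singular functions on the training set, then score them
    on the test covariances (Eqn. (27))."""
    train_idx = [j for j in range(len(acc)) if j not in test_idx]
    C00, C0t, Ctt = aggregate(acc, train_idx)
    A, B = inv_sqrt(C00), inv_sqrt(Ctt)
    Ubar, _, Vbar_t = np.linalg.svd(A @ C0t @ B)
    U, V = (A @ Ubar)[:, :k], (B @ Vbar_t.T)[:, :k]
    C00t, C0tt, Cttt = aggregate(acc, test_idx)
    M = inv_sqrt(U.T @ C00t @ U) @ U.T @ C0tt @ V @ inv_sqrt(V.T @ Cttt @ V)
    return np.sum(np.linalg.svd(M, compute_uv=False) ** 2)
```

Averaging `cv_vamp2` over all folds yields the cross-validated score reported per feature set; as noted above, no MSM needs to be built.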
III.3 Markov state models
To verify the previously computed VAMP scores on the input feature space, we build MSMs for Homeodomain, Protein G, and WW domain. We expect the MSMs constructed from different feature sets to show a similar relative ranking of scores, which is related to the timescales of the scored dynamical processes.

After obtaining the VAMP singular vectors as described above, the MSM estimation pipeline can be described as follows: prior to clustering, we project onto the first five right singular vectors of the VAMP basis, analogously to the established TICA method Pérez-Hernández et al. (2013). We then cluster this space into a set of cluster centers via $k$-means, and estimate a maximum-likelihood, reversible MSM on this discretization. For each MSM we perform a block splitting of the discrete trajectories to perform a cross-validated VAMP-2 scoring on the MSM as described in Sec. II and Ref. Wu and Noé, 2017. For further analysis we switch from the maximum-likelihood estimate to a Bayesian approach Trendelkamp-Schroer et al. (2015) in order to compute errors. We evaluate the timescales and some representative structures for the slowest processes. Finally, we want to ensure that the models have predictive power under the Markovian approximation, so we evaluate the implied timescales and conduct a Chapman-Kolmogorov test (recall Eqns. (2) and (3)). This procedure is outlined in Fig. 2.
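The MSM estimation step itself reduces to counting transitions between discrete states and normalizing. The following is a minimal stand-in for the estimator used in the pipeline, with simple count symmetrization as a surrogate for the full reversible maximum-likelihood optimization implemented in, e.g., PyEMMA:

```python
import numpy as np

def estimate_msm(dtraj, n_states, lag):
    """Estimate an MSM transition matrix from a discrete trajectory.

    Counts are symmetrized as a simple surrogate for reversibility; the
    paper's pipeline uses a proper reversible maximum-likelihood estimator.
    """
    C = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        C[i, j] += 1.0
    C_sym = 0.5 * (C + C.T)
    return C_sym / C_sym.sum(axis=1, keepdims=True)
```

As a sanity check, simulating a two-state Markov chain and re-estimating it with this function recovers the generating transition probabilities to within sampling error.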
IV Results
In Fig. 3 we show VAMP-2 scores for the five slowest processes for all twelve fast-folding proteins, computed in the input space at a fixed lag time as described in the previous section. For the systems BBA, Villin, Trp-cage, Chignolin, and Protein B, the combination of flexible torsions and the residue contact distance transformation (feature definition 6) yields the best overall score (henceforth referred to as the "combined" feature set). For Homeodomain, we find almost equal scores for three different features, namely the residue minimum distances, their transformed counterparts, and the combined feature. For λ-repressor, Homeodomain, and α3D the untransformed contact distances are superior, closely followed by their transformed counterparts (henceforth simply "transformed distances"). For BBL, the transformed distances perform best. The contact-based feature scores increase with the magnitude of the cutoff. For all systems, the per-residue SASA feature yielded the lowest score.
To visualize results for a couple of example systems, we consider Homeodomain and Protein G. In Figs. 4 and 5, we show the timescales for three $k$-means clusterings with increasing numbers of cluster centers. The slowest processes are consistently covered best by the combined feature set. Both systems show a slight increase of the timescale of the slowest process for finer discretizations, which is expected in the absence of cross-validation Sarich, Noé, and Schütte (2010); McGibbon and Pande (2015). The timescales of the second and subsequent slowest processes appear to be more stable with respect to the number of cluster centers.
Our analysis shows that the torsions feature alone can capture only processes happening on timescales half as long as those that can be described after combining torsions with the transformed contact distances. The transformed contact distance feature alone is relatively effective, but its effectiveness is increased by combining it with flexible torsions. Aligned Cartesian coordinates yield timescales in the same regime as the torsion model. The SASA feature does not capture any slow processes, as we already expected from its low score in the feature space. Construction of an MSM for further analysis should involve investigating many state decompositions (i.e., numbers of microstates) under cross-validation McGibbon and Pande (2015). An optimized state decomposition will (by definition) increase the timescales obtained by the MSM, while the use of a training/validation dataset split will in general decrease the timescales for every state decomposition by avoiding overfitting.
We note that the slowest process in the best-performing Protein G MSMs is extremely slow (on the order of milliseconds). We visualize the dominant process of a 2000-microstate MSM built from transformed distances in Fig. 6 and see that the process represents conversion between the folded structure (B; red) and an unfolded structure (A; blue) with consistent regions of order and disorder. It has been demonstrated through the use of subsequent simulations that this dataset is undersampled Schwantes, Shukla, and Pande (2016), so it is likely that this process is observed only a few times and does not represent folding in general (we constructed MSMs with cross-validated state decompositions and still observed this timescale).
Although we restrict our analysis to the most dominant process, we can see that the differences in MSM VAMP scores for 1000-microstate MSMs built from different features correspond to the trends observed in the previously calculated VAMP feature scores (Fig. 7). The VAMP analysis of feature sets can be viewed as pessimistic relative to the full MSM variational analysis (see the discussion in Sec. V, to follow). However, while MSMs tend to improve or preserve variational scores, this occurs across the board, without causing a change in trends. Thus, we observe that, for the goal of choosing the best-performing MSM, we likely could have restricted ourselves to building models from only the best-performing features in the initial analysis.
For all systems except WW domain, the aligned Cartesian coordinates (XYZ) show worse performance than the distance-based features. The WW domain, however, shows the largest score for aligned Cartesian coordinates, which is significantly higher than that of all the other feature sets. Because we only observe this result in one system, we take a closer look and plot the first twenty singular values for three feature sets (Fig. 8) as well as the five slowest MSM processes obtained using five representative feature sets and three different numbers of clusters (Fig. 9). By visualizing the extremes of the dominant MSM eigenvector, we see that the aligned coordinates capture the folding process (Fig. 10). In contrast, the slowest process of the combined feature only picks up rather fast loop movements, with the three sheets already formed (these results were verified for several starting structures of WW domain in addition to the one visualized in Fig. 1). This discrepancy shows why we see a significantly different VAMP score in feature space for aligned Cartesian coordinates in contrast to the other features.
Although the feature VAMP singular values are consistently higher for the WW domain when evaluated on the aligned Cartesian coordinates (Fig. 8), the MSM timescales show that the transformed distances and the combined feature produce similar timescales for the slowest process in the MSM (Fig. 9). However, for subsequent processes, the MSM timescales decrease for the transformed distance-based features, but remain high for the aligned Cartesian coordinates. This shows why the Cartesian coordinates perform best according to the VAMP score, which we recall incorporates the sum of multiple singular values via Eqn. (21).
When we compare the ability to build a converged MSM with both the Cartesian coordinates and the combined feature (Figs. S1 and S2 in the supplementary material), we observe better performance for the combined feature in terms of long-term predictions of the kinetics according to the Chapman-Kolmogorov test. In our discussion of the WW domain results, we have performed representative analyses for the selection, validation, and investigation of a kinetic model. Before using a kinetic model of this (or any) system built from any of these feature sets, it would be necessary to further investigate the meaningfulness of the processes in a given feature set in the context of the desired analysis.
V Discussion
The introduction of variational principles into MD analyses over the past five years has provided the tools to optimize the approximation quality not only of reversible MSMs, but also of the features used to construct reversible or non-reversible dynamical models, either as models themselves or as a step toward eventual MSM construction. With these tools, we have presented a rigorous algorithm to optimize the feature selection step in traditional MSM construction. While subsequent steps such as TICA, cluster discretization, and transition matrix approximation have been relatively well optimized, feature selection has remained a challenge for constructing optimal models.
The method presented in this study demonstrates how recent algorithmic advances enable objective evaluation of a given set of feature options. By scoring features before building MSMs, we can evaluate their performance directly and more efficiently, instead of in combination with whatever other modeling choices are needed to obtain the MSM. We showed that selecting features using the VAMP differentiates them according to the approximation quality of the Koopman matrix obtained from those features. Constructing MSMs from these feature sets (i.e., from their right singular VAMP vectors) and evaluating their timescales maintains and verifies the ranking obtained by performing the VAMP scoring on the features themselves. We expect that the practical use of this algorithm will be to quickly eliminate poorly performing feature sets, so that further optimization can be performed on subsequent parameter choices using the better feature options. Our results show that transformed contact distances combined with flexible torsions constitute a consistently well-performing feature set for the folding of small globular proteins, and are likely to perform well for a broader class of conformational transitions in biomolecules. Additional and more specific optimization of features for a class of molecular systems can be done by following the methodology described here.
As shown, constructing MSMs tends to improve model approximation quality due to the state discretization (e.g., transition regions can be finely discretized, which is known to produce better models Sarich, Noé, and Schütte (2010)). The discretization involved in MSM construction affects features differently depending on the extent to which they are already discrete. For example, non-transformed contact distances will be finely discretized along the coordinate according to the MSM states, whereas a binary contact map cannot be further refined by the MSM state decomposition. Thus, a continuous coordinate may improve substantially according to a variational score upon MSM construction, whereas an already-discretized coordinate is not expected to improve with MSM construction.
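The effect described above can be illustrated with synthetic data: a hidden two-state jump process observed through a noisy one-dimensional coordinate (a hypothetical stand-in for, e.g., a contact distance). Discretizing the continuous coordinate into indicator functions, as an MSM state decomposition would, typically raises the variational score. All parameters here (switching probability, noise level, lag time, number of bins) are illustrative choices, and the scoring function is the same sketch as above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden two-state Markov chain with a small switching probability.
T, p_switch = 50_000, 0.01
flips = rng.random(T) < p_switch
states = np.empty(T, dtype=int)
states[0] = 0
for t in range(1, T):
    states[t] = states[t - 1] ^ int(flips[t])

# Noisy 1-D observable of the hidden state (hypothetical feature).
x = states + 0.4 * rng.standard_normal(T)

def vamp2(X, lag=10, eps=1e-10):
    """VAMP-2 score sketch: squared Frobenius norm of the whitened
    time-lagged covariance matrix, plus 1 for the constant function."""
    X0 = X[:-lag] - X[:-lag].mean(axis=0)
    X1 = X[lag:] - X[lag:].mean(axis=0)
    n = len(X0)
    C00, C01, C11 = X0.T @ X0 / n, X0.T @ X1 / n, X1.T @ X1 / n

    def isqrt(C):
        s, U = np.linalg.eigh(C)
        keep = s > eps
        return U[:, keep] @ np.diag(s[keep] ** -0.5) @ U[:, keep].T

    K = isqrt(C00) @ C01 @ isqrt(C11)
    return float(np.sum(K ** 2)) + 1.0

# Feature set A: the raw continuous coordinate (one linear feature).
score_linear = vamp2(x[:, None])

# Feature set B: the same coordinate discretized into 10 indicator
# functions, mimicking a fine MSM state decomposition along x.
edges = np.quantile(x, np.linspace(0.1, 0.9, 9))
indicators = np.eye(10)[np.digitize(x, edges)]
score_msm = vamp2(indicators)

print(score_linear, score_msm)
```

The indicator basis can represent the (approximately step-shaped) slow singular function of the two-state process more accurately than a single linear feature, so its score is expected to be higher; an already-binary observable would gain nothing from this refinement.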
Since the first applications of the MSM framework to the analysis of biomolecules simulated with MD, the decision of how to transform the raw Cartesian coordinates of a simulation dataset into well-performing features has been an omnipresent challenge. While the categorical aspect of feature selection precludes full automation (except, perhaps, in the case of treatment with neural networks
Mardt et al. (2018)), the method presented in this work enables a direct evaluation of features from which predictive models can be built more efficiently.

Acknowledgements
We are grateful to Tim Hempel, Nuria Plattner, Andreas Mardt, and Simon Olsson for helpful discussions. Funding is acknowledged from the European Commission (ERC CoG 772230 “ScaleCell”), Deutsche Forschungsgemeinschaft (SFB 1114 Projects C03 and A04, SFB 740 Project D07, NO 825/2-2), and the MATH+ cluster (Project EF1-2).
References
 Hänggi and Talkner (1983) P. Hänggi and P. Talkner, “Memory index of first-passage time: a simple measure of non-Markovian character,” Phys. Rev. Lett. 51, 2242 (1983).
 Shalloway (1996) D. Shalloway, “Macrostates of classical stochastic systems,” J. Chem. Phys. 105, 9986–10007 (1996).
 Du et al. (1998) R. Du, V. S. Pande, A. Y. Grosberg, T. Tanaka, and E. S. Shakhnovich, “On the transition coordinate for protein folding,” J. Chem. Phys. 108, 334–350 (1998).
 Rohrdanz, Zheng, and Clementi (2013) M. A. Rohrdanz, W. Zheng, and C. Clementi, “Discovering mountain passes via torchlight: methods for the definition of reaction coordinates and pathways in complex macromolecular reactions,” Annu. Rev. Phys. Chem. 64, 295–316 (2013).
 Prinz, Chodera, and Noé (2014) J.-H. Prinz, J. D. Chodera, and F. Noé, “Spectral rate theory for two-state kinetics,” Phys. Rev. X 4, 011020 (2014).
 Noé and Clementi (2017) F. Noé and C. Clementi, “Collective variables for the study of long-time kinetics from molecular trajectories: theory and methods,” Curr. Opin. Struct. Biol. 43, 141–147 (2017).
 Schütte et al. (1999) C. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard, “A direct approach to conformational dynamics based on hybrid Monte Carlo,” J. Comput. Phys. 151, 146–168 (1999).
 Swope, Pitera, and Suits (2004) W. C. Swope, J. W. Pitera, and F. Suits, “Describing protein folding kinetics by molecular dynamics simulations. 1. Theory,” J. Phys. Chem. B 108, 6571–6581 (2004).
 Singhal, Snow, and Pande (2004) N. Singhal, C. D. Snow, and V. S. Pande, “Using path sampling to build better Markovian state models: predicting the folding rate and mechanism of a tryptophan zipper beta hairpin,” J. Chem. Phys. 121, 415–425 (2004).
 Noé et al. (2007) F. Noé, I. Horenko, C. Schütte, and J. C. Smith, “Hierarchical analysis of conformational dynamics in biomolecules: transition networks of metastable states,” J. Chem. Phys. 126, 04B617 (2007).

 Chodera et al. (2007) J. D. Chodera, N. Singhal, V. S. Pande, K. A. Dill, and W. C. Swope, “Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics,” J. Chem. Phys. 126, 04B616 (2007).
 Prinz et al. (2011) J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé, “Markov models of molecular kinetics: Generation and validation,” J. Chem. Phys. 134, 174105 (2011).
 Sriraman, Kevrekidis, and Hummer (2005) S. Sriraman, I. G. Kevrekidis, and G. Hummer, “Coarse master equation from Bayesian analysis of replica molecular dynamics simulations,” J. Phys. Chem. B 109, 6479–6484 (2005).
 Buchete and Hummer (2008) N.-V. Buchete and G. Hummer, “Coarse master equations for peptide folding dynamics,” J. Phys. Chem. B 112, 6057–6069 (2008).
 Rohrdanz et al. (2011) M. A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi, “Determination of reaction coordinates via locally scaled diffusion map,” J. Chem. Phys. 134, 03B624 (2011).
 Peters (2006) B. Peters, “Using the histogram test to quantify reaction coordinate error,” J. Chem. Phys. 125, 241101 (2006).
 E and Vanden-Eijnden (2006) W. E and E. Vanden-Eijnden, “Towards a theory of transition paths,” J. Stat. Phys. 123, 503 (2006).
 Hummer (2005) G. Hummer, “Position-dependent diffusion coefficients and free energies from Bayesian analysis of equilibrium and replica molecular dynamics simulations,” New J. Phys. 7, 34 (2005).
 Chodera et al. (2006) J. D. Chodera, W. C. Swope, J. W. Pitera, and K. A. Dill, “Long-time protein folding dynamics from short-time molecular dynamics simulations,” Multiscale Model. Simul. 5, 1214–1226 (2006).
 Kube and Weber (2007) S. Kube and M. Weber, “A coarse graining method for the identification of transition rates between molecular conformations,” J. Chem. Phys. 126, 024103 (2007).

 Altis et al. (2007) A. Altis, P. H. Nguyen, R. Hegger, and G. Stock, “Dihedral angle principal component analysis of molecular dynamics simulations,” J. Chem. Phys. 126, 244111 (2007).
 Noé et al. (2009) F. Noé, C. Schütte, E. Vanden-Eijnden, L. Reich, and T. R. Weikl, “Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations,” Proc. Natl. Acad. Sci. 106, 19011–19016 (2009).
 Jayachandran et al. (2006) G. Jayachandran, M. R. Shirts, S. Park, and V. S. Pande, “Parallelized-over-parts computation of absolute binding free energy with docking and molecular dynamics,” J. Chem. Phys. 125, 084901 (2006).
 Yang and Roux (2008) S. Yang and B. Roux, “Src kinase conformational activation: thermodynamics, pathways, and mechanisms,” PLoS Comput. Biol. 4, e1000047 (2008).
 Muff and Caflisch (2008) S. Muff and A. Caflisch, “Kinetic analysis of molecular dynamics simulations reveals changes in the denatured state and switch of folding pathways upon single-point mutation of a β-sheet miniprotein,” Proteins: Struct., Funct., Bioinf. 70, 1185–1195 (2008).
 Sarich, Noé, and Schütte (2010) M. Sarich, F. Noé, and C. Schütte, “On the approximation quality of Markov state models,” Multiscale Model. Simul. 8, 1154–1177 (2010).
 Noé and Nüske (2013) F. Noé and F. Nüske, “A variational approach to modeling slow processes in stochastic dynamical systems,” Multiscale Model. Simul. 11, 635–655 (2013).
 Nüske et al. (2014) F. Nüske, B. G. Keller, G. Pérez-Hernández, A. S. Mey, and F. Noé, “Variational approach to molecular kinetics,” J. Chem. Theory Comput. 10, 1739–1752 (2014).
 Molgedey and Schuster (1994) L. Molgedey and H. G. Schuster, “Separation of a mixture of independent signals using time delayed correlations,” Phys. Rev. Lett. 72, 3634 (1994).
 Pérez-Hernández et al. (2013) G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, “Identification of slow molecular order parameters for Markov model construction,” J. Chem. Phys. 139, 07B604_1 (2013).
 Noé and Clementi (2015) F. Noé and C. Clementi, “Kinetic distance and kinetic maps from molecular dynamics simulation,” J. Chem. Theory Comput. 11, 5002–5011 (2015).
 Noé, Banisch, and Clementi (2016) F. Noé, R. Banisch, and C. Clementi, “Commute maps: Separating slowly mixing molecular configurations for kinetic modeling,” J. Chem. Theory Comput. 12, 5620–5630 (2016).
 Pérez-Hernández and Noé (2016) G. Pérez-Hernández and F. Noé, “Hierarchical time-lagged independent component analysis: computing slow modes and reaction coordinates for large molecular systems,” J. Chem. Theory Comput. 12, 6118–6129 (2016).
 Schwantes and Pande (2015) C. R. Schwantes and V. S. Pande, “Modeling molecular kinetics with tICA and the kernel trick,” J. Chem. Theory Comput. 11, 600–608 (2015).
 Harmeling et al. (2003) S. Harmeling, A. Ziehe, M. Kawanabe, and K.-R. Müller, “Kernel-based nonlinear blind source separation,” Neural Comput. 15, 1089–1124 (2003).
 Wu et al. (2017) H. Wu, F. Nüske, F. Paul, S. Klus, P. Koltai, and F. Noé, “Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations,” J. Chem. Phys. 146, 154104 (2017).
 Wu and Noé (2017) H. Wu and F. Noé, “Variational approach for learning Markov processes from time series data,” arXiv preprint arXiv:1707.04659 (2017).
 McGibbon and Pande (2015) R. T. McGibbon and V. S. Pande, “Variational cross-validation of slow dynamical modes in molecular kinetics,” J. Chem. Phys. 142, 03B621_1 (2015).
 Scherer et al. (2015) M. K. Scherer, B. Trendelkamp-Schroer, F. Paul, G. Pérez-Hernández, M. Hoffmann, N. Plattner, C. Wehmeyer, J.-H. Prinz, and F. Noé, “PyEMMA 2: A software package for estimation, validation, and analysis of Markov models,” J. Chem. Theory Comput. 11, 5525–5542 (2015).
 Husic et al. (2016) B. E. Husic, R. T. McGibbon, M. M. Sultan, and V. S. Pande, “Optimized parameter selection reveals trends in Markov state models for protein folding,” J. Chem. Phys. 145, 194103 (2016).
 Mardt et al. (2018) A. Mardt, L. Pasquali, H. Wu, and F. Noé, “VAMPnets for deep learning of molecular kinetics,” Nat. Commun. 9, 5 (2018).
 Kubelka, Hofrichter, and Eaton (2004) J. Kubelka, J. Hofrichter, and W. A. Eaton, “The protein folding ‘speed limit’,” Curr. Opin. Struct. Biol. 14, 76–88 (2004).
 Lindorff-Larsen et al. (2011) K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, “How fast-folding proteins fold,” Science 334, 517–520 (2011).
 Beauchamp et al. (2012) K. A. Beauchamp, R. McGibbon, Y.-S. Lin, and V. S. Pande, “Simple few-state models reveal hidden complexity in protein folding,” Proc. Natl. Acad. Sci. 109, 17807–17813 (2012).
 Dickson and Brooks III (2013) A. Dickson and C. L. Brooks III, “Native states of fast-folding proteins are kinetic traps,” J. Am. Chem. Soc. 135, 4729–4734 (2013).
 Weber, Jack, and Pande (2013) J. K. Weber, R. L. Jack, and V. S. Pande, “Emergence of glass-like behavior in Markov state models of protein folding dynamics,” J. Am. Chem. Soc. 135, 5501–5504 (2013).
 Husic and Pande (2018) B. E. Husic and V. S. Pande, “Markov state models: From an art to a science,” J. Am. Chem. Soc. 140, 2386–2396 (2018).
 (48) For more details on this estimation, see Ref. Wu et al. (2017), §II C.
 (49) Ref. Wu and Noé (2017), Thm. 1.
 (50) To see the equivalence, as an intermediate step write .
 (51) While the Koopman reweighting estimator introduced in Ref. Wu et al. (2017) removes bias, it has a relatively large variance.
 (52) See Ref. Wu and Noé (2017), Appendix F for details.
 (53) When the whitening suggested in Ref. Wu and Noé (2017) is used, C00 and C11 become identity matrices when calculated from the whitened data. However, other whitening procedures (or their omission when C00 and C11 are already invertible) will produce different results, hence our more general formulation.
 Gross, Oliveira, and Kohn (1988) E. K. Gross, L. N. Oliveira, and W. Kohn, “Rayleigh-Ritz variational principle for ensembles of fractionally occupied states,” Phys. Rev. A 37, 2805 (1988).
 (55) The folded structures were chosen visually, before analysis, to replicate a naïve choice of reference frame.
 Shrake and Rupley (1973) A. Shrake and J. Rupley, “Environment and exposure to solvent of protein atoms. Lysozyme and insulin,” J. Mol. Biol. 79, 351–371 (1973).
 McGibbon et al. (2015) R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane, and V. S. Pande, “MDTraj: a modern open library for the analysis of molecular dynamics trajectories,” Biophys. J. 109, 1528–1532 (2015).
 Chan, Golub, and LeVeque (1982) T. F. Chan, G. H. Golub, and R. J. LeVeque, “Updating formulae and a pairwise algorithm for computing sample variances,” in COMPSTAT 1982 5th Symposium held at Toulouse 1982 (Springer, 1982) pp. 30–41.
 (59) This is because the Frobenius norm is equivalent to the Schatten p-norm for p = 2; see Ref. Wu and Noé (2017), Sec. 3.2.
 Trendelkamp-Schroer et al. (2015) B. Trendelkamp-Schroer, H. Wu, F. Paul, and F. Noé, “Estimation and uncertainty of reversible Markov models,” J. Chem. Phys. 143, 11B601_1 (2015).
 Schwantes, Shukla, and Pande (2016) C. R. Schwantes, D. Shukla, and V. S. Pande, “Markov state models and tICA reveal a nonnative folding nucleus in simulations of NuG2,” Biophys. J. 110, 1716–1719 (2016).
 (62) We constructed MSMs with crossvalidated state decompositions and still observed this timescale.
 (63) These results were verified for several starting structures of WW domain in addition to the one visualized in Fig. 1.