I Introduction
Recent advances in computer hardware and software has recently enabled the generation of extensive and high throughput molecular dynamics (MD) trajectories.lindorff2011fast; shaw2021anton
These facilitates the thermodynamic and kinetic study of biomolecular processes such as protein folding, proteinligand binding and conformational dynamics to name a few. These simulations often produce large amount of high dimensional data which require special rigorous techniques for analyzing the thermodynamics and kinetics of the molecular processes.
In recent years, Markov state modeling approach swope2004describing; husic2018markov; mcgibbon2015variational; prinz2011markov has been greatly developed and utilized for understanding longtimescale behavior of dynamical systems and stateofthe art software packages such as Pyemmascherer2015pyemma and MSMBuilder harrigan2017msmbuilder were introduced. Markov state models provide a master equation that describes the dynamic evolution of the system using a simple transition matrix.scherer2015pyemma Markovianity in these system means the kinetics are modeled by memoryless jumps between states in the state space. Combined with the advances in the MD simulations, the framework for MSM construction has been greatly advanced to a robust set of methods to analyze a dynamical system. In an MSM the molecular conformation space is discretized into coarsegrain states, where the interconversions between microstates within a macrostate are fast compared to transitions between different macrostates.sidky2019high Markov state models have previously been used to investigate kinetics and thermodynamic properties of biophysical systems such as protein folding noe2009constructing; bowman2009progress; voelz2010molecular; beauchamp2012simple, proteinligand held2011mechanisms; plattner2015protein binding and protein conformational changes.kohlhoff2014cloud; banerjee2015conformational; sadiq2012kinetic
There are several steps in the pipeline of Markov model construction: The first step involves featurization where relevant MD coordinates such as distances, contact maps or torsion angles are chosen.mcgibbon2015mdtraj
This is followed by a dimension reduction step that maintains the slow collective variables using methods such as time independent component analysis (TICA)
perez2013identification; schwantes2013improvements or dynamic mode decomposition (DMD) mezic2005spectral; schmid2010dynamic; tu2013dynamic or other variants of these techniques. The resulting lowdimensional space is then discretized into discrete states.chodera2007automatic; wu2015gaussian; weber2017set This is usually done using Kmeans clustering.scherer2015pyemma A transition matrix is then built on the discretized trajectories which describes the time evolution of processes using a lagtime. prinz2011markov; bowman2009progress This transition matrix can be further processed through its Eigendecomposition to find the equilibrium and kinetic properties of the system.prinz2011markov Finally, fuzzy clustering methods such as PCCA are often used to produce a more interpretable coarsegrained model.roblitz2013fuzzyAs noted above, there are multiple steps where hyperparameters must be carefully chosen to construct the Markov model. The quality of the constructed MSM is highly dependent on these steps and this has brought many research opportunities into optimizing the pipeline of MSM using various techniques.
prinz2011markov; noe2013variational; schwantes2013improvements; husic2017ward; mcgibbon2015variational; husic2016optimized; scherer2019variational Moreover, complex dynamical systems require the optimal choice of model parameters which requires physical and chemical intuition about the model. Therefore, suboptimal choices of model parameters can lead to poor results in learning the dynamics from trajectory. Recently a variational approach for conformational dynamics (VAC) has been proposed which help in selection of optimal Markov models by defining a score that measures the optimality of the given Markov model for governing the true kinetics.wu2017variational; noe2013variational; scherer2019variational; nuske2014variational; scherer2019variational; mcgibbon2015variational VAC states that given a set of n orthogonal functions of state space, their time autocorrelations at lagtimeare the lower bounds to the true eigenvalues of the Markov operator.
nuske2014variational This is equivalent to underestimating the relaxation time scales and overestimating the relaxation rates.wu2017variational; noe2013variational; nuske2014variational Before VAC, the tools to diagnose the performance of MSMs were mainly visual such as implied timescale plot (ITS)swope2004describing and ChapmanKolomogorov testhusic2018markov (CKtest). Variational approach enabled the objective comparison of different model choices for the same lag time. noe2013variational The VAC has recently been generalized to variational approach for Markov processes (VAMP).wu2017variational; wu2020variationalVAMP was proposed for the general case of irreversible and nonstationary time series data and is based on a singular value decomposition of the Koopman operator.
wu2020variationalUsing this variational principle, Mardt and coworkers introduced VAMPNets to replace the whole pipeline of Markov model construction with a deep learning framework.
mardt2018vampnets A VAMPNet maps the configurationto a lowdimensional fuzzy state space where each timepoint has a membership probability to each of the metastable states. VAMPNets are then optimized by a variational score (VAMP2). This framework was further developed by directly learning the eigenfuctions of the spectral decomposition of the transfer operator that propagates equilibrium probability distribution in time with a state free reversible VAMPNet (SRV).
chen2019nonlinear Physical constraints of reversibility and stochasticity were further added to the model to have a valid transition matrix enabling the computations of transition rates and other equilibrium properties from outofequilibrium simulations.mardt2020deep; mardt2021progress However, a proper representation of the molecules is not discussed in these models and traditional distance matrices or contact maps are often used.Conformational heterogeneity of proteins during folding can complicate the selection of features for building a dynamical model. This is even more true in the case of disordered proteins.lohr2021kinetic Gaining interpretable few state kinetic model of protein folding using MD trajectories is still highly desirable and can be achieved by MSM approaches using carefully chosen features. For this, one would choose a set of features for the system such as distances, dihedral angels or the rootmeansquare deviation to some reference structure. Using more complicated feature functions such as convolutional layers on the distance matrices was proposed to enhance the kinetic resolution of the model.lohr2021kinetic Graph neural network have previously been used in molecular feature representation as a promising tool in a variety of applications to predict the properties of the system of interest or energies and forces.schutt2017schnet; husic2020coarse Battaglia first introduced graph neural networks with convolutional operations and graph message passing.battaglia2018relational; kipf2016semi Currently there are various types of graph neural networks that differ in how the message passing operations are done between nodes and edges and how the output of the network is generated. Traditionally, distance maps or contact maps were used to represent the structure of the molecules. A more natural way of representing proteins is by using graphs where nodes represents atoms and the edges representing the bonds (real or unreal) connecting them. This representation is rotationally invariant by construction. Recent advances of protein structure prediction has greatly exploited the advances in geometric deep learning bronstein2017geometric and graph neural networks kipf2016semi; bruna2013spectral. Combining the VAMPNet framework with graph neural network improves the kinetic resolution of the resulting low dimensional model where a smaller lagtime can be chosen to build the transition matrix. In the original VAMPNet the dynamics is directly coarsegrained into a few coarsegrained states without learning a lowdimensional latent space representation. However, using graph neural networks, we show that the learned embeddings for the graphs can represent useful information about the dynamic system. Furthermore, using a graph attention network velivckovic2017graph gives us useful insights about the importance of different nodes and edges for different metastable states.
Ii Methods
A Markov model estimates the dynamics by a transition density which is the probability density of transition to state
at time given that the system was at state at time :(1) 
Where and are two different states of the system and is the lagtime of the model from which the transition probability density is built. Using this definition of transition density the time evolution of the ensemble of states in the system can be written as:
(2) 
In this equation acts as a propagator which propagates the dynamics of the system in time. However, this definition of propagator assumes a reversible and stationary dynamical systems.prinz2011markov For the general case of nonreversible and nonstationary dynamics, Koopman operator is used.wu2017variational Koopman theory enables feature transformations into a feature space where the dynamics evolve on average linearly. Koopman operator acts like a transition matrix for nonlinear dynamics and describes the conditional future expectation values for a fixed lag time . In the Koopman theory the Markov dynamics at a lagtime is approximated by a linear model of the following form:
(3) 
In equation above and are feature transformations to a space where dynamics evolve on average linearly. This approximations is exact in the infinitedimensional feature transformations however it was shown that given a large enough lagtime the low dimensional feature transformations can become optimal.wu2017variational Equation 3 can be interpreted as a finite rank approximation of the socalled Koopman operator.mezic2013analysis The optimal Koopman matrix to minimize the regression error from equation 3 is:
(4) 
Where the meanfree covariance matrices for data transformation are defined as:
(5) 
However, the regression error has no information about the choice of feature transformations and and can lead to trivial solutions for these feature transformations.mardt2018vampnets
On the other hand, VAMP provides useful scoring functions that can be used to find optimal feature transformations. VAMP is based on singular value decomposition of Koopman operator and is used to optimize the feature functions and does not have the limitations of time reversibility and stationary dynamics. VAMP states that given a set of orthogonal candidate functions, their timeautocorrelations are the lower bounds to the true Koopman eigenvalues. This provides a variational score such as the sum of estimated eigenvalues that can be maximized to find the optimal kinetic model. Wu and Noe showed that the optimal choice of
and in equation 3 are obtained using the singular value decomposition of the Koopman matrix and setting and to its top left and right singular functions respectively.wu2017variational; wu2020variationalThe VAMP2 score is then defined as :
(6) 
The left and right singular functions of the Koopman operator are always equal to the constant function 1. Therefore, 1 is added to basis functions. Maximum VAMP2 score is achieved when the top m left and right Koopman singular functions are in the span (, …) and (, …,
) respectively. VAMP2 also maximizes the kinetic variance captured by the model.
Feature transformations and can be learned with neural network in the so called VAMPNet where there are 2 parallel lobes each receiving MD configurations and . As done in the original VAMPNet, we assume the two lobes have similar parameters and use a unique basis set . The training is done by maximizing the VAMP2 score to learn the lowdim state space produced by a softmax function. Since K is a Markovian model it is expected to fulfill the ChapmanKolmogrov (CK) equation
(7) 
for any value n>=1 where and indicate models estimated at a lag time of and respectively.
The implied timescales of the process are computed as follows:
(8) 
Where is the eigenvalue of the Koopman matrix built at a lagtime . The smallest lagtime is chosen where the implied timescales are approximately constant in . After having chosen the lagtime we test whether the CK test holds within statistical uncertainty.
ii.1 Protein Graph representation
Each structure is represented in terms of an attributed graph where are node features and are the edge features that captures the relations between nodes. We have tested different Graph Neural networks (GNNs) for their ability to learn higher resolution kinetic models from MD trajectories of protein folding simulations.xie2019graph In all these different GNNs the node embeddings are initialized randomly and the edge embeddings are the Gaussian expanded distances between the adjacent nodes using the following formula:
(9) 
Where is the distance between atoms and , Å, , Å. and are the maximum and minimum distances respectively for constructing the gaussian expanded edge features. Unless noted otherwise, all the graphs are built using M nearest neighbors for the atoms of the protein with edges built from the gaussian expanded distances between atoms.
ii.1.1 Graph Convolution layer
In this type of graph neural network, protein graph is represented as where contains features of the nodes and
contains the edge attributes of the graph. A separate graph is constructed for configurations at each timestep of the simulation. The initial node feature representations are randomly initialized. However, a onehot vector representation based on the atom type or the amino acid type can also be used. During training the node embeddings
for node at layer are updated using the following equations:sanyal2020proteingcn; xie2018crystal(10) 
(11) 
(12) 
Where denotes elementwise multiplication and denotes concatenation,
is the sigmoid function as the nonlinearity and
is the edgegating mechanism introduced by Marcheggianimarcheggiani2017encoding to incorporate different interaction strength among neighbors into the model. , , and are gate weight matrix, convolution weight matrix, gate bias and convolution bias respectively for k’th layer of the graph convolution layer. To capture the embedding of the whole graph, we use graph pooling where graph embeddings are generated using the learnt node embeddings.hamilton2017inductive; ranjan2020asap The embedding for the whole graph is done through a pooling function where we average over the embedding of all the nodes.(13) 
Other types of pooling such as hierarchical pooling can also be applied to make a more complicated model.ying2018hierarchical
ii.1.2 SchNet
Another type of GNN is SchNet which was introduced by Schutt and others to use continuousfilter convolutions for predicting forces and energies of small molecules according to quantum mechanical models.schutt2017schnet; battaglia2018relational. This was modified by Husic et al.husic2020coarse
to learn a coarsegrained representation of molecules using graph Neural network. SchNet is employed here to learn feature representation of nodes for learning dynamics of protein folding. This is a subunit of our model for learning feature representations of molecules on the graph level. The initial node features or embeddings are initialized randomly but a onehot encoding based on node type (amino acid) can also be used. These embeddings are later learned and updated during training of the network by a few rounds of message passing through nodes and edges of the graph. Node embeddings are updated in multiple interaction blocks as implemented in the original SchNet.
schutt2017schnetEach interaction layer contains continuous convolution between nodes. The edge attributes are obtained by using a radial basis function e.g. Gaussians centered at different distances.
(14) 
These edge attributes are then fed into a filtergenerating network that maps to a dimensional filter. This filter is then applied to node embeddings as (continuous filter convolution):
(15) 
(16) 
(17) 
is a dense neural network and b is an atomwise linear layer as noted in the original paper.schutt2017schnet Note that the sum is over every atom in the neighborhood of atom . Multiple interaction blocks allow all atoms to interact with each other in the network and therefore allow the model to express complex multibody interactions. We enhanced the standard SchNet architecture by adding an attention layer that learns the importance of the edge embeddings for updating the embedding of incoming node in the next layer. The attention weight is learned using a softmax function between embedding of the node and its neighbors. denotes the concatenation of node features of neighboring nodes where is the query node. The node embeddings are updated in each interaction block, which can contain a residual block to avoid gradient annihilation as done in deep neural networks.he2016deep
The residual connection is followed by a nonlinear activation function of the output of continuous filter convolution
as:(18) 
The trainable function g involves linear layers and a nonlinearity. We used a hyperbolic tangent as the activation as proposed by Husic et al.husic2020coarse The output of final SchNet interaction block is fed into an atomwise dense network. The learned embedding of nodes after several SchNet layers is then fed into a pooling layer as described previously to produce a graph embedding for each timestep.
ii.2 Model selection and Hyperparameters
In GraphVAMPNet, instead of traditional features such as dihedral angles, distance matrices and contact maps, we use a general graph neural network, a more natural representation of molecules and proteins. We have implemented two different graph neural networks (GraphConvLayer and SchNet). The GraphVAMPNet built from each of these GNN layers have several model hyperparameters including the dimension of feature space (number of output states) and the lag time . To resolve relaxation timescales, we need at least
output neurons in the last layer of the network since the softmax function removes one degree of freedom. The models are trained by maximizing the VAMP2 score on the training set and hyperparameters are optimized using a a crossvalidated VAMP2 score for the validation set using a ratio of 0.7 for training and 0.3 for the validation set. To have a fair comparison between different feature representations we trained model with similar number of layers (4) and similar number of neurons per layer (16). In general, increasing the dimension of feature space makes the dynamical model more accurate, but it may result in overfitting when the dimension is very large. A higher dimensional feature space is also harder to interpret as the model seeks a low dimensional representation. Therefore, in this study, we experiment on systems with 5state output models unless stated otherwise. There are multiple hyperparameters in the model that must be selected. These include the architecture of GCN, the number of clusters, number of neighbors for making the graph, time step of analyzing the simulation. Here we have used 5 clusters for TrpCage and NTL9 and 4 clusters for Villin. In the case of Villin using a 5state led to finding only 4 states after using a cutoff value of 0.95 on the cluster probabilities which is the reason we have used a value of 4 for this protein. The hyperparameters chosen for each protein are shown in table 1.
system 



Batchsize 





TrpCage  4  16  5  1000  0.0005  20  7  16  
Villin  4  16  4  1000  0.0005  35  10  16  
NTL9  4  16  5  1000  0.0005  39  10  16 
Iii Results
We tested the performance of our GraphVampNet method on 3 different protein folding systems including TrpCage (pdb: 2JOF)barua2008trp, Villin (pdb: 2F4K)kubelka2006sub and NTL9 (pdb: 2hba)cho2014energetically
. The Graph Neural network was implemented using PyTorch and the deeptime
hoffmann2021deeptime package was used for VAMPNet. Pyemmascherer2015pyemma was used for free energy landscape plots. Adam was used as the optimizer in all models. GNN provides a framework to learn the feature transformations in VAMPNets that is invariant to permutation, rotation and reflection. Moreover, the graph embeddings can be learned with the GraphVAMPNet framework which correspond to different dynamic states during the simulation. In order to visualize the graph embeddings in 2D we have also transformed the graph embeddings in the last layer of GraphVAMPNet into 2D and trained the model by maximizing the VAMP2 score. The free energy landscape on the graph embeddings shows highly separated metastable states separated by high energy transition regions. The low energy metastable states correspond to the regions of high fidelity for metastable assignment probabilities with higher than 0.95. It is important to note that this is only done for visualization purposes and higher dimensional (16) embeddings are used for finding the metastable states in these complex protein folding systems. Furthermore, the present results do not depend on enforcing reversibility to the learned transition matrix. However, this can be done by Koopman reweightingwu2017variational or learning the reweighting vectors during training in the VAMPNet framework.mardt2020deepAlthough we have tried different Graph Convolution networks in the GraphVAMPNet approach, our results showed that SchNet has the best performance with the highest VAMP2 score among all. Therefore, we present the results of SchNet in the main paper. The average VAMP2 scores are calculated from the validation set 10 different training for each system and compared (table S1). The VAMP2 score for SchNet in each system for training and validation set are plotted against the training epoch and shows a converging behavior after 100 epochs (figure S1).
iii.1 Trpcage
We test our GraphVAMPNet model an ultralong 208 explicit solvent simulation of K8A mutation of the 20residue TrpCage TC10b at 290 K provided by DE Shaw group.lindorff2011fast The folded state of TrpCage contains an helix (residues 28), a helix and a polyproline II helix.meuzelaar2013folding The tryptophan residue (Trp6) is caged at the center of the protein.
A VAMPNet was built for different types of feature learning in neural networks. The average VAMP2 score of the validation set for 10 training runs were compared between different types of feature learning Neural Networks (Standard VAMPNet, and two graph layers) in table S1. SchNet showed the highest average VAMP2 score among different types of features leanings. The average VAMP2 score for training and validation set of TrpCage for 10 different training examples is shown in figure S1A. The VAMP2 score shows a converging behavior after 100 epochs of training. Since SchNet showed the highest VAMP2 score we use this type of Graph Neural network for the rest of our analysis. The implied timescales learned using SchNet is shown in figure 2A. Implied timescale plots for GraphConvLayer and standard VAMPNet are shown in figures S2A,B. Standard VAMPNet using distances shows higher variance in the implied timescales than the VAMPNet built using SchNet. A faster convergence of implied timescales for SchNet is also observed compared to standard VAMPNet. All 4 timescales in SchNet converge after 20 ns which is the lagtime we choose to build the kinetic Koopman matrix. Moreover, a closer look at the implied timescales shows that the timescales learnt from standard VAMPNet layer are also smaller (e.g. the fourth timescale) than the implied timescales in SchNet. According to the variational approach for Markov processes, a model with longer implied timescale corresponds to less modeling error of the true dynamics of the system.nuske2014variational To validate the resulting GraphVAMPNet, we conduct a CKtest which compares the transition probability between pairs of states at time predicted by a model at lagtime . CKtest shows (figure 2B) excellent prediction of transition probabilities even at large timescales ns at a lagtime of 20 ns.
We next analyzed the resulting coarsegrained states built from VAMPNet using SchNet as feature transformation. The folded state (S4) possesses 18 of the total distribution and the unfolded state (S1) has 69.5 of the total distribution which is in great agreement with other studies on this dataset using Markov state models.sidky2019high; ghorbani2021variational
. GraphVAMPNet produces an embedding for each timestep of the simulation which is then turned into a membership assignment using a softmax function. This higher dimensional embedding (16 dimension) can be visualized using dimensionality reduction methods such as tSNE (figure S3A) which shows the separated clusters based on their maximum membership assignment probability. To have a better visualization of the lowdimensional space learned by the model, we also trained a GraphVAMPNet where in the last layer we linearly transformed the learned graph embedding into 2D and trained the model by maximizing the VAMP2 score. Other parameters were kept similar to the main SchNet. The 2D free energy landscape (FEL) for this embedding is shown in figure 2C. This low energy states in the FEL correspond to the states with high cluster assignment probability. This lowdimensional FEL shows the ability of the GraphVAMPNet to produce an interpretable and highly clustered embedding of graphs for simulation of proteins. The learned 2D embedding of graphs during TrpCage Folding is shown in figure 2D where the states with more than 0.95 cluster assignment probability are colored. The enhancement of the SchNet with attention gives an interpretable model where we can analyze the nodes and edges in the graph that are most important in each coarsegrained cluster. The scaled attention scores for TrpCage are shown in figure 3. The cage residue Trp6 shows a high attention score in most clusters due to being in the center of the protein and having a high number of connections in the graph. In the unfolded state (S1) most residues have high attention score only with their close neighbors on the sequence which is due to high level of dynamic and no defined structure of the unfolded state. On the other hand, other clusters such as S2 show different hot spot regions for their attention scores. In this hairpinlike structure, residues Ala4, Ser14 and Pro17 which make the groove have high attention scores. A two step folding mechanism has been proposed for TrpCage that involves an intermediate state with a salt bridge between Asp9 and Arg16.
zhou2003trp Breaking this saltbridge is thought to be a limiting step in folding of TrpCage. Surprisingly, our model puts high attention scores on residues Arg16 and Asp9 in metastable state S3 which also has a 10 probability.iii.2 Villin
Villin is a 35residue protein and is known as one of the smallest proteins that can fold autonomously. It is composed of 3 helices denoted as helix1 (residues 48), helix2 (residues 1518), helix3 (residues 2332) and a compact hydrophobic core. The double mutant of Villin with replacement of two Lys residues with uncharged Norleucine (Nle) was simulated by DE Shaw lindorff2011fast group and is studied here. Hernandez and coworkers hernandez2018variational used a variational dynamic encoder to produce a lowdimensional embedding of Villin folding trajectories using distance maps. The optimized TICA for this protein used a lagime of 44 ns according to hyperparameter optimization. husic2016optimized
We built VAMPNet with different types of feature functions and compared the average VAMP2 score of validation set for 10 different training between them. VAMPNet based on SchNet showed the highest VAMP2 score for the same number of states (4). The VAMP2 score for training and validation set are shown in figure S1B for SchNet which shows a converging behavior for 100 epochs of training. The VAMPNet built using SchNet shows an extremely fast convergence of implied timescales even after 20 ns (figure 4A) which gives a high resolution kinetic model for villin folding. On the other hand, VAMPNet built with standard VAMPNet shows a slow convergence for implied timescales where the first timescale converges after 40 ns of lagtime (figure S2). The timescales of the processes are also higher for SchNet than in standard VAMPNet model which also demonstrates the higher accuracy of GraphVAMPNet than VAMPNet with simple distance matrix. The CK test for SchNet (figure 4B) shows excellent markovian behavior of the model built using a lagtime of 20 ns at large timescales ns. GraphVAMPNet also provides a latent embedding for graphs which is another advantage of GNN features compared with standard VAMPNet layer. The 16dim graph embedding for Villin was further reduced to 2 dimensions using tSNE (figure S3B). This 2D latent embedding shows highly separated clusters. To have a more interpretable embedding we have trained a VAMPNet using SchNet where we linearly transformed the last embedding layer into 2 dimensions and trained the model using similar parameters as before. The 2D embedding of Villin learned using GraphVAMPNet is shown in figure 4D where datapoints with a cluster assignment probability higher than 0.75 are colored based on their corresponding state. The FEL on this 2D embedding is shown in figure 4C. This FEL features highly separated clusters with low energy minima corresponding to the center of clusters and the transition regions having low membership assignment probabilities. The representative structure of each cluster of Villin (Misfolded: S0, Unfolded: S1, Partiallyfolded: S2, Folded: S3) are shown in figure 5 which are colored based on the average attention score for each residue in that cluster. The folded state (S3) shows a high attention score for residue Arg13. This is in agreement with previous study by Mardt et al.mardt2021progress who used distance map features and attention on neighboring residues. We found residues Gln25 and Nle29 to also have high attention scores in the folded state. These residues are in the central hydrophobic core of the protein and have high number of connections in the graphs built for the folded state. The partially folded (S2) state has similar attention scores to the folded state except for residue Lys7 which shows high attention score in partiall folded (S2) but not in folded (S3). The misfolded state (S0) has high attention score for helix 2 residues which is also in agreement with work done by Mardt et. al.mardt2021progress In general the N and Cterminal of the protein due to their high flexibility are given low attention scores. Hernandez and coworkers hernandez2018variational used variational dynamic encoders to reduce the complex nonlinear folding of Villin into a single embedding and used a saliency map to find important Ca contacts for folding of Villin. They found residues Lys29 and His27 to be important for the folding of Villin. We found these residues to have high attention scores in our model for partially folded and folded states.
iii.3 Ntl9
As our last example, we tested the GraphVAMPNet on the NTL9 (residues 139) folding dataset from DE Shaw group.lindorff2011fast. We uniformly sampled the 1.11 ms trajectory using a lagtime of 5 ns. Mardt et al.mardt2018vampnets previously used a 5layer VAMPNet with contact maps between neighboring heavyatoms to coarsegrain the NTL9 simulation into metastable states. They showed that the relaxation timescales by a 5state VAMPNet correspond to a 40state MSM. Their implied timescales showed a converging behavior after about 320 ns which they chose as the lagtime of the Koopman matrix.
The comparison of VAMP2 score for VAMPNet built using different neural network feature transformations is shown in table S1. Standard VAMPNet based on the distance maps shows a similar VAMP2 score compared to SchNet based VAMPNet. The training and validation VAMP2 score for NTL9 is shown in figure S1C which shows a highly converging behavior after 100 epochs. The implied timescales for SchNet layer if shown in figure 6A. The convergence of implied timescale for SchNet (figure 6A) and standard VAMPNet (figure S2F) and the magnitude of the implied timescales are similar. A lagtime of 200 ns was chosen to build the koopman matrix. The CK test (figure 6C) for SchNet using a lagtime of 200 ns shows the markovianity of the model even at high timescales of 2000 ns. The tSNE for the 16 dimensional embedding of NTL9 is shown in figure S3C which shows separated clusters. As described for other proteins, we have trained a SchNet based VAMPNet for NTL9 with a 2D embedding. Figure 6D shows the 2D embedding of NTL9 which is colored by states higher than 0.95 cluster membership probability. The FEL in this 2D embedding (figure 6C) shows low energy metastable states separated by transition regions that correspond to points where model is uncertain about their membership. Representative structures of each cluster in NTL9 are shown in figure 7 (colored based on residue attention scores). The folded state (S2) and unfolded state (S0) posses 75.5 and 18.7 of the total probability distribution. Schwantes et al.schwantes2013improvements used a TICA MSM for NTL9 and showed that the slowest timescale ( 18s) corresponds to the folding process while the faster timescales correspond to transition between different registershifted states. Register shift in each strand can also shift the hydrophobic core contacts. For instance based on their study, register shift in strand 3 produces a shift in the core packing in which Phe29 is packed. Interestingly in our model, high attention scores are given to the betastrand residues such as Phe5, Phe29, Phe31 and Ile18.
Iv Discussion and Conclusion
MSM construction has previously been a complex process which involved multiple steps such as feature selection, dimension reduction, clustering, estimating the transition matrix K. Each of these steps requires choosing some hyperparameters and suboptimal choices could lead to poor kinetic model of the system with lower kinetic resolution.
scherer2019variational Variational approach for conformational dynamics (VAC) and its more general form the variational approach for Markov processes (VAMP) have recently guided the optimal choice of hyperparameters.wu2017variational; nuske2014variational; scherer2019variational A crossvalidated variational scores is usually used to find the set of features with the highest crossvalidated VAMP2 score.scherer2019variationalThe endtoend deep learning framework VAMPNet was proposed by Mardt et al.mardt2018vampnets to replace the MSM construction pipeline by training a neural network that maps the molecular configurations x to a fuzzy state space where each point has a membership probability to each of the metastable states. VAMPNet is trained by maximizing a VAMP score allowing us to find the optimal state space which enables linear propagation of states through a transition matrix. VAMPNets are not restricted to stationary and equilibrium MD and can be used as general case for nonstationary and nonequilibrium processes. The fewstate coarsegrained MSM in the case of VAMPNet is learned without the loss of model quality as is the case in standard pipelines such as PCCA.roblitz2013fuzzy Due to the endtoend nature of deep neural networks, VAMPNets require less expertise to build an MSM. The framework of VAMPNet were further developed into statefree reversible VAMPNets (SRVs) not to approximate MSMs but rather to directly learn nonlinear approximations to the slowest dynamics modes of a MD system obeying detailed balance.chen2019nonlinear In SRVs, the transfer operator rather than soft metastable state assignment, directly employs the variational approach under detailed balance to approximate the slow modes of equilibrium dynamics. Ferguson and coworkers sidky2019high showed that MSMs constructed from nonlinear SRV approximations would permit the use of shorter lagtimes and therefore furnish the models with higher kinetic resolution.
Despite the success of VAMPNet, the feature selection is still a process that must be done with caution. Traditionally distance maps are used as a general feature representation of protein dataset. However this representation does not preserve the graphlike structure of proteins as it does not capture the 3d structure and models the protein as points on a regular grid. In this work we have focused on representation learning of VAMPNet using graphrepresentation of protein to get a higherresolution kinetic model where a smaller lagtime can be chosen. Graph representation of molecules is shown to be effective in extracting different properties using deep learning. Recently there has been a large amount of work in the area of geometric deep learning bronstein2017geometric that has graph based approaches for representing graph structures. These methods enable automatic learning of the best representation (embedding) from raw data of atoms and bonds for different types of predictions. kearnes2016molecular; marcheggiani2017encoding
These methods have been applied to various tasks such as molecular feature extraction
duvenaud2015convolutional; kearnes2016molecular protein function prediction gligorijevic2021structure and protein design strokach2020fast to name a few. Park et al.park2021accurate proposed a machine learning framework (GNNFF) a graph neural network to predict atomic forces from local environments and showed its high performance in force prediction accuracy and speed. Introduction of graph message passing enhances the model ability to recognize symmetries and invariances (permutation, rotation and translation) in the system. Hierarchical pooling from atomlevel to residue level and then to the protein level enables the model to learn global transition between different metastable states that involves atomicscale detailed dynamics. Xie and coworkers xie2019graph developed graph dynamical networks (GdyNet) to investigate atomic scale dynamics in material systems where each atom or node in the graph has a membership probability to the metastable states. Graph representation of materials in their model enabled encoding of local environment that is permutation, rotation and reflection invariant. The symmetry in materials facilitated identifying similar local environment throughout the material and learning of the local dynamics. This type of approach can be used to learn local dynamics in some biophysical problems such as nucleation and aggregation where local environment is important.The introduction of Graph Neural networks to VAMPNets enables the higher resolution of the kinetic model and higher interpretability. A large increase in VAMP2 score is observed when switching from distance based features to graphbased. This suggests the usefullness and representation capability of GNN for further improving the kinetic embedding of MD simulation. We have tested the GraphVAMPNet with two different types of graph neural networks (Graph Convolution layer and SchNet) on three longtimescale protein folding trajectories. The GraphVAMPNet showed a higher VAMP2 score than the standard VAMPNet and the implied timescales showed a faster convergence in GraphVAMPNet due to an efficient representation learning as opposed to standard VAMPNet. This enables choosing a lower lagtime for building the dynamical model and improves the kinetic resolution of the resulting Markov model. The graph embeddings resulted from GraphVAMPNet are highly interpretable and shows clustered data in low energy minima in a free energy landscape. Furthermore, the addition of attention mechanism into SchNet enables us to decipher the residues and bonds that most contribute to each of the metastable states. However care must be taken when interpreting the attention scores returned by the model. One main obstacle of GNNs is that they cannot go deeper than a few layers (3 or 4) and suffer from oversmoothing problems in which all node representations tend to become similar to one another. An architecture that enables training deeper networks is the residual or skip connections as deployed in ResNet architecture is used here to train a deep neural network.he2016deep
Due to the flexibility of graph representation of molecules, other physical properties of atoms or amino acids such as electric charge, hydrophobicity. can be encoded into node or edge features in order to enhance the physical and chemical interpretability of the model. Moreover, hierarchical pooling layers can be applied to learn the dynamics at different resolutions of the molecule.ying2018hierarchical
Our GraphVAMPNet is inherently transferable. This means theoretically given a sufficient amount of dynamical data, transfer learning can be leveraged to reduce the number of trajectories needed for studying the dynamics of a particular system of interest.
Timereversibility and stochasticity of the transition matrix are the two physical constraints that are needed to get a valid symmetric transition matrix for analyses such as transition path theory (TPT). Physical constraints added to the VAMPNet and the resulting model called revDMSM was successfully applied to study kinetics of disordered proteins. This allowed to have a valid transition matrix and therefore rates could be quantified for interesting processes. This revDMSM was further extended by including experimental observables into the model as well as a novel hierarchical coarsegraining to give different levels of detail. These physical constraints can be further added into GraphVAMPNet to obtain a valid and high resolution transition matrix.
In summary, our GraphVAMPNet automates the feature selection in VAMPNet to be learned from graph message passing on the molecular graphs which is a general approach to understand the coarsegrained dynamics.
Acknowledgements
This work was partially supported by the National Heart, Lung and Blood institute at the National Institute of Health for B.R.B. and M.G.; in addition, it was partially supported by the National Science Foundation (grant number CHE2029900) to J.B.K. The authors acknowledge the biowulf highperformance computation center at National Institutes of Health and Lobos for providing the time and resources for this project. We also would like to thank DE. Shaw research group for providing the simulation trajectories.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this article.
Data Availability
The source code for the analysis can be found at github page: github.com/ghorbanimahdi73/GraphVampNet
Supplementary Materials
The training and validation loss and the results of GraphVAMPNet model with for Trpcage, Villin and NTL9 can be found in the supplementary information.