Repository for 'Interpretable embeddings from molecular simulations using gaussian mixture variational autoencoders'
Extracting insight from the enormous quantity of data generated from molecular simulations requires the identification of a small number of collective variables whose corresponding low-dimensional free-energy landscape retains the essential features of the underlying system. Data-driven techniques provide a systematic route to constructing this landscape, without the need for extensive a priori intuition into the relevant driving forces. In particular, autoencoders are powerful tools for dimensionality reduction, as they naturally force an information bottleneck and, thereby, a low-dimensional embedding of the essential features. While variational autoencoders ensure continuity of the embedding by assuming a unimodal Gaussian prior, this is at odds with the multi-basin free-energy landscapes that typically arise from the identification of meaningful collective variables. In this work, we incorporate this physical intuition into the prior by employing a Gaussian mixture variational autoencoder (GMVAE), which encourages the separation of metastable states within the embedding. The GMVAE performs dimensionality reduction and clustering within a single unified framework, and is capable of identifying the inherent dimensionality of the input data, in terms of the number of Gaussians required to categorize the data. We illustrate our approach on two toy models, alanine dipeptide, and a challenging disordered peptide ensemble, demonstrating the enhanced clustering effect of the GMVAE prior compared to standard VAEs. The resulting embeddings appear to be promising representations for constructing Markov state models, highlighting the transferability of the dimensionality reduction from static equilibrium properties to dynamics.READ FULL TEXT VIEW PDF
Repository for 'Interpretable embeddings from molecular simulations using gaussian mixture variational autoencoders'
Particle-based computer simulations can provide unprecedented mechanistic insight into the driving forces of complex molecular systems, in contexts ranging from biochemistry to materials science [1, 2, 3]. These simulations rely on numerical integration of the relevant equations of motion as a means to navigate the system’s conformational space. Due to the high dimensionality of this space, which prevents the exhaustive enumeration of all microstates, exploration is typically achieved through importance sampling 
. Conformational sampling leads to an estimate of the potential energy landscape (PEL), which follows a Boltzmann distribution at equilibrium. Unfortunately, characterization of the PEL suffers from the so-calledcurse of dimensionality 
—organization of the data in the high-dimensional space is challenging due to low population density. This problem is often remedied by projecting the PEL onto a lower-dimensional manifold, i.e., by performing a dimensionality reduction. By averaging over presumably unimportant degrees of freedom, the resulting low-dimensional surface represents afree-energy landscape (FEL). The ideal FEL distinguishes between microstates that are separated by large barriers on the PEL, yielding a partitioning of configuration space into collections of microstates, i.e., metastable basins. If all the largest barriers are accounted for, intra-basin diffusion will occur much faster than inter-barrier crossing events, allowing an accurate, albeit coarse-grained, description of both the static and dynamical properties of the system.
The essential degrees of freedom that define the low-dimensional representation, commonly referred to as collective variables (CVs), are traditionally identified through expert physical/chemical intuition that is often rather specific for the particular system or process of interest [6, 7, 8, 9]. Beyond the characterization of the FEL, these CVs can also be used for enhanced sampling , or for the construction of low-dimensional configuration-space discretizations, for instance when building Markov state models (MSMs) 
. Although the manual selection of CVs can be extremely effective for practitioners with insight into the system, the approach is difficult to extend systematically and is susceptible to missing unanticipated or subtle features of the FEL that may nonetheless play an important role in the relevant phenomena. Data-driven techniques provide an alternative route by inferring the important features directly from the data. There is a long history of methods for finding an optimal low-dimensional representation from a given set of data, employing both linear (e.g., principal component analysis
, time-lagged independent component analysis) and nonlinear (e.g., Isomap , Sketchmap ) transformations.
In the last couple years, there has been a growing interest in applying (deep) neural networks to automate the discovery of CVs[16, 17, 18, 19, 20]. One architecture that stands out as conceptually appealing is the autoencoder . An autoencoder is a bow-tie-shaped network that forces an information compression in the bottleneck region. While the first half of the network (the encoder) reduces the input to a predefined lower dimension, the second half (the decoder) aims at transforming from the low-dimensional to the original representation. The weights of the neural network are tuned to minimize an objective or loss function, which typically penalizes deviations between input and output data. As such, the autoencoder aims at discovering a latent space (embedding) that faithfully describes the essential features of the high-dimensional input data. This makes autoencoders well suited for constructing low-dimensional FELs from molecular simulation data [22, 23, 24].
Traditional autoencoders lack continuity in the latent space, preventing interpolation between training points and, thus, its generative ability. Variational autoencoders (VAEs) remedy this limitation by modeling the input probability distribution using Bayesian inference. VAEs enable sampling new data from the learned distribution (i.e., VAEs are generative models), and are also well-suited to provide interpretable and disentangled data representations in the low-dimensional space . Within the VAE framework, the latent distribution is forced to resemble a predefined probability distribution, called the prior
. Although the VAE framework does not impose any particular prior distribution, it is often chosen as a normal distribution for computational convenience. This prior induces an “anti-clustering” effect in the latent space, which can prohibit the identification of meaningful clusters and impede the construction of optimal FELs from molecular simulations. The autoencoder-based approaches were recently extended to explicitly incorporate the temporal nature of the data via a time-lag in the network architecture[27, 28]. These time-lagged autoencoders aim to retain information about the slowest dynamical modes sampled in the underlying simulation trajectory and, as a consequence, may encourage metastable clustering in the latent space. However, they are also limited in terms of characterizing the hierarchy of long timescale processes , and only indirectly address the anti-clustering issue.
In this work, we propose to directly acknowledge the multi-basin structure of an ideal FEL by employing a Gaussian mixture model as the prior distribution for the VAE latent space. The resulting Gaussian mixture variational autoencoder
(GMVAE) retains the computational ease and reconstruction fidelity of traditional VAEs, while enforcing a more faithful description of the underlying physics: the resulting FEL clearly distinguishes between metastable basins separated by large free-energy barriers. We demonstrate the benefits of the GMVAE approach through explicit comparisons with the traditional VAE for two widely-studied toy models and for the standard benchmark system for conformational dynamics, alanine dipeptide, as well as a more challenging disordered peptide ensemble. To ensure the presence of distinct distributions in the latent space, the GMVAE introduces a categorical variable that (probabilistically) assigns each input configuration to the set of clusters. Thus, the GMVAE simultaneously performs dimensionality reduction and unsupervised clustering. Remarkably, the GMVAE clustering is capable of identifying the inherent dimensionality of the input data, in terms of the number of Gaussians required to categorize the data. In the case of hierarchical input data (i.e., data with distinct dimensionality depending on the level of resolution), we show that the GMVAE makes a reasonable prediction for the number of clusters, independent of the given hyperparameter, based on the dimensionality of the latent space and characteristics of the data. Beyond the representation of static equilibrium properties, by constructing MSMs from the GMVAE embedding, we show that our approach is also a promising avenue for accurately describing the long timescaledynamical properties of the data. In contrast to recent deep neural-network approaches that aim to directly model the propagator of the system’s dynamics [31, 32], the construction of MSMs from the learned FEL offers a different strategy: explicitly testing to what extent a representation appropriate for the statics is directly amenable for the dynamics.
Autoencoders are special types of neural networks that are used for the task of representation learning in an unsupervised manner. They are composed of two connected parts: the encoder compresses the input signal to a low-dimensional representation, whereas, the decoder aims to reconstruct the input at full dimensionality from the reduced-space representation. The reconstruction loss, usually defined as either the mean-squared error or cross-entropy between the input, , and the output,
, is minimized via backpropagation. Since the bottleneck dimension is typically much less than the original dimension, autoencoders learn the most compact representation of the input. Furthermore, because neural networks are universal function approximators, the learned data projections can generally preserve much more of the relevant information than with PCA or other basic linear projection techniques. Figure1 shows the schematic structure of autoencoder with mean-squared error loss. There are different types of autoencoders which are tailored for special tasks. For instance, sparse autoencoders impose sparsity constraints during optimization, whereas convolutional autoencoders utilize convolutional layers instead of fully-connected layers, in which case they learn the optimal filters. Variational autoencoders, which model the latent space probabilistically, are used for generative purposes, i.e., they can create new samples that look like the ones in the training dataset without simple data replication.
Variational autoencoders were introduced in . In general, the theory of VAEs is approached from two different perspectives: variational inference and neural networks. This section starts with the former interpretation and then illustrates the connection between them. We mostly follow the notation and reasoning used in . The input data and the latent variable are denoted by and , respectively.
The objective of the VAE is to find the posterior distribution , which can be written in terms of the likelihood , the prior , and the marginal probability density of , , using Bayes law as
The denominator is called the evidence and it could, in principle, be calculated using
once the prior is selected. However, the calculation is typically intractable, as it needs to be evaluated over all configurations of the latent variable . Therefore, the posterior is approximated using variational inference with a chosen easy-to-evaluate family of distributions , e.g., Gaussian functions, where is the variational parameter of the distribution. In particular, is inferred using
by reformulating the problem within an optimization framework, such that the Kullback-Leibler divergence betweenand is minimized. The KL divergence between and is defined as
Equation 1 is then inserted into the posterior definition.
Since the expectation is taken over , can be moved out of the expectation.
The initial objective of minimizing the KL divergence between the exact and the approximate posterior is equivalent to maximizing the ELBO (Evidence Lower BOund), defined in Equation 5.
Equation 5 can also be rewritten in terms of a different KL divergence:
Here the neural network perspective comes into play, as depicted schematically in Figure 3(a). acts like an encoder (inference), and transforms the data into the latent variable . On the other hand, (which can also be parametrized with the network parameter as 111Both of the notations are used interchangeably.) generates the data from the latent representation, analogous to a decoder (generator). The parameters correspond to the weights and biases of the neural networks. Note that the initial aim is to minimize , which is equivalent to minimizing the RHS of Equation 6. The first term enforces the encoder to be similar to the chosen prior , which acts as a regularization, whereas the second term on the RHS deals with how well the reconstructions match the original input.
In order to use Equation 6 in an optimization procedure, both the family of distributions for inference, , as well as the prior distribution, , must be specified. The most common assumption is that (
) is a unimodal Gaussian distribution with mean(0) and diagonal covariance (). Then, has a closed form solution:
where is the dimension of the Gaussian and tr denotes the trace. Although the unimodal Gaussian assumption simplifies the calculations, it also restricts the possible latent space representations, and may hinder the performance of the variational autoencoder by pushing the latent space to be described by highly-overlapping clusters.
(a) The VAE and (b) GMVAE architectures. In the probabilistic graph representation, circle nodes represent the random variables, and directed edges represent statistical dependencies between the variables in the two ends. Dot nodes are used to indicate the parameters of the model, while some of the nodes are intentionally filled to differentiate the observed random variables from the non-observed ones which are left empty.
This section is largely distilled from the discussion and insights presented in . The term Gaussian mixture variational autoencoder is open to misinterpretations. There exist several distinct architectures given this name, with variations in the choice of generative or inference models [30, 35, 36, 37]. In the present work, we take both the approximate posterior, (i.e., the family of distribution functions for inference), , and the latent space distribution (i.e., the prior), , to be Gaussian mixtures. Note that we have introduced a categorical variable, , which identifies which Gaussian each particular data point belongs to. The inference model can be written as
The latent space is composed of distinct Gaussians, i.e., is assumed to be Gaussian, where . Thus, the approximate posterior becomes a Gaussian mixture.
Similar to Equation 5, the ELBO can be written as
where the number of Gaussians, , is a hyperparameter, and the subscript m is used to distinguish from the VAE ELBO. can be written as using conditioning without any assumptions. Then, by assuming that is conditionally independent of , i.e., (see the graph representation in Figure 1(b)), the joint probability can be expressed as
Similar to the VAE, the third and fourth terms represent regularization and reconstruction contributions to the loss, respectively. The initial prior on is selected as a uniform multinomial distribution, while can be interpreted as a conditional entropy, reflecting how informative is on . To directly control the impact of the clustering relative to the other loss terms during training, we introduced a weighting factor, , on the mutual information between and :
Figure 3 presents a more detailed schematic of the GMVAE architecture, while Table 1 presents a summary of the probability distributions utilized in the model. First, data points are probabilistically assigned to clusters (NN(Q)).
represents these cluster assignment probabilities, and has multinomial distribution. Since each cluster is assumed to have Gaussian distribution in the latent space, the mean and variance of each of these Gaussians () are learned via the encoder part of the neural network (NN(Q)). The low-dimensional representation, , is then obtained by first sampling and then taking the expected value of these samples, i.e.,
. As the first step in decoding, the moments of the corresponding low-dimensional representationis learned by NN(P) from each Gaussian-distributed individual cluster , which is then followed by a sampling operation.
in the decoder is assumed to be uniformly distributed among theclusters. Next, using the encodings, ’s, the associated reconstructions are obtained again by sampling from the by the NN(P). Similar to the encoder, the decoder obtains a fixed reconstruction by taking the expected value of ’s.
The clustering within the GMVAE is probabilistic, i.e., each data point is assigned membership probabilities (between and ) to each of the clusters. Since most configurations are assigned predominantly to a single cluster, we perform a hard cluster assignment by assigning each data point to the cluster with highest membership probability. However, in cases where a configuration has similar membership probabilities for multiple clusters, this simple assignment may introduce errors when determining properties (e.g., transition probabilities) of the clusters. Thus, we also considered a different approach by enforcing a thresholding value for cluster assignment. More specifically, each configuration is only assigned to a cluster if the largest membership probability is above a chosen cut-off value. A naive coring scheme followed the thresholding operation such that the points that had been identified as noise were assigned back to their previous cluster index for all other dynamical analyses.
The GMVAE algorithm was implemented in Tensorflow, and is available at https://github.com/yabozkurt/gmvae. Training was performed in all cases with fully connected layers, using the Adam optimization algorithm 
. The Softmax activation function was used for probabilistic cluster assignments, while ReLu activation functions were employed in all hidden layers. The means were obtained without any activation, whereas Softplus activation was employed to obtain the variances. Table2 shows the values of the hyperparameters for each example system. Default values were employed wherever the parameters are not specified. Number of nodes (NN()) columns correspond to the neural networks labeled in Figure 3. NN(Q) performs probabilistic cluster assignments, NN(Q) is for learning the moments of each Gaussian distribution in the encoding, whereas NN(P) and NN(P) are for the decoding of the and , respectively. The lengths of the “Number of nodes” entries correspond to the number of hidden layers. Hyperparameter optimization was carried out as follows. The number of nodes was initialized as [16, 16]. The number of nodes in the decoder (NN(P)) was then increased whenever a large and non-decreasing reconstruction loss was observed. Our overall observation for the considered examples is that the learning rate and batch size should be kept relatively low to promote the formation of distinct cluster. The VAE results (with unimodal Gaussian prior) that are provided as comparison are obtained using , while keeping the remaining parameters equal to the values in the corresponding GMVAE model.
|1D 4-well||Müller-Brown||Dipeptide||- I||- II|
|Number of clusters (k)||4||5||8||10||6|
|Input dimension (n)||1||2||25||60||126|
|Latent dimension (d)||1||1||2||2||2|
|Number of nodes (NN(Q))||[16, 16]||||||[16, 16]|||
|Number of nodes (NN(Q))||[16, 16]||||||[16, 16]|||
|Number of nodes (NN(P))||[16, 16]||||||[16, 16]|||
|Number of nodes (NN(P))||[16, 16]||||||[16, 16]|||
Number of epochs
Markov state models (MSMs) represent the dynamics generated by a molecular simulation trajectory as a series of memoryless jumps between a discrete set of states . Given a configuration-space discretization, a transition probability matrix, , is obtained by counting the transitions between pairs of states within a given lag time, , and then performing a maximum likelihood optimization 
. The eigenvalues of, , are related to characteristic timescales of the system’s dynamics:
where is the timescale corresponding to the eigenvalue, . The time lag parameter is typically chosen by performing the “implied timescale test”, which assesses the Markovianity of through the convergence of its timescales with increasing . In other words, is plotted as a function of , and is then chosen as small as possible such that the largest timescales are sufficiently converged. Once is chosen, the accuracy of is determined via the Chapman-Kolmogorov (CK) test, which compares the estimated and predicted probability decay out of a given state. The predicted values are obtained using the CK equation, i.e., using the Markovian property of the model:
where is the probability of transitioning from state to state within time , and is a positive integer. The CK test is often performed on metastables states of the system—collections of quickly interconverting microstates.
Within the standard Markov-state-modeling workflow, microstates are typically defined on low-dimensional projections of the full-dimensional configuration space. Therefore, obtaining a relevant transformation of the molecular simulation data is the key. To this end, time-lagged independent component analysis (TICA) [13, 42] is one of the most commonly used dimensionality reduction methods, as its objective is to maximize the autocorrelation of the data at the given lag time, making it especially well suited for kinetic modeling purposes. Metastable states are typically obtained via a dynamical coarse-graining procedure, e.g., PCCA+ 
whose objective is to retain an accurate description of the dominant eigenvectors of the transition probability matrix. The resulting metastable states are then used as representative collections of microstates for performing the CK test. In many cases, a coarse-grained MSM at the resolution of the metastable states is constructed, providing an easily interpretable, albeit often qualitative, picture of the long timescale processes. In this study, the GMVAE performs the dimensionality reduction and clustering simultaneously, yielding a coarse-grained description of configuration space directly, without the need for further dynamical clustering. The (coarse-grained) MSMs are constructed from the discretized trajectories obtained using the simple cluster assignment based on the GMVAE membership probabilities as described in Section2.3.1. MSM construction and analysis was performed using the PyEMMA package .
The helical propensity of the peptide was determined using the Lifson-Roig perspective, which assigns each residue to either a helical (h) or coil (c) state, according to the dihedral angles along the peptide backbone (i.e., the Ramachandran plot) [45, 46]. Therefore, the number of different conformations of the peptide is limited to , where is the number of residues; 15 for . The propensity of residue to be part of a “helical segment”, , is then defined as the probability that residue as well as its two neighboring residues are simultaneously found in a helical state. The average fraction of helical segments, , is obtained by averaging over all residue positions: . To distinguish between partial helical structures occuring at the N- and C-terminus ends of the peptide backbone, we define and . Note that the terminus residue from each end is not taken into consideration.
The dRMSD measures the average deviation of internal distances from the corresponding distances in a reference structure, and is calculated as
where represents the conformation at time t, is the conformation for the reference structure, and denotes the Euclidean norm. Note that, unlike other RMSD metrics, no pre-alignment of structures is required. In this study, due to the large fluctuations of the end residues, two residues from each end of the peptide were excluded in the dRMSD calculations. dRMSD was calculated using the positions of the atoms only. Helix, hairpin-like, and extended (coil) structures were separately considered as reference structures as illustrated in Figure S12.
Variational autoencoders (VAEs) have been previously applied for dimensionality reduction of molecular simulation data [28, 17, 47]. VAEs typically employ a normal distribution to represent both the prior distribution in the latent space and the family of distributions for variational inference. In this work, we extend traditional VAEs by representing these distributions with Gaussian mixture models. The resulting Gaussian mixture VAE (GMVAE) adopts the physics-based viewpoint that an optimal embedding of the simulation data should give rise to a free-energy landscape (FEL) with well-separated clusters of configurations, which correspond to metastable states that are separated by large barriers along the high-dimensional potential energy landscape. The GMVAE introduces a categorical variable, , which represents the various underlying Gaussian distributions to which each configuration will be (probabilistically) assigned. As a consequence, the approach simultaneously performs a dimensionality reduction and clustering, while enabling direct control over the organization of configurations in the latent space. We demonstrate the properties of this architecture by considering two model systems and molecular simulations of alanine dipeptide as well as a more challenging disordered peptide ensemble. In the following, X represents the dimensional input. The latent variable in the bottleneck is represented by .
We first consider a single particle in one-dimension interacting with a 4-well external potential, which has been previously employed for testing methods associated with constructing MSMs [48, 49]. Figure 3(a) presents the potential, whose functional form and simulation details are given in Section S.II. We employ a GMVAE with a latent space dimension of 1, which assesses the clustering performance of the architecture in the absence of any dimensionality reduction. The GMVAE was trained with 4 according to the parameters in Table 2. Figure 3(b)
presents the confusion matrix of the resulting model, which quantifies the probability that the model assigns a predicted label (x-axis) given the true label (y-axis). The true labels were determined using a coarse-grained representation of the system, where four metastable states are defined based on simple dividing surfaces, chosen as the maxima of the barriers between each potential well (dashed vertical lines in Figure3(a)). The GMVAE assigns the state labels with 97 overall accuracy.
Figure 3(c) shows a normalized histogram of values. Without dimensionality reduction, the GMVAE largely retains the description of the input space within the latent dimension. As a consequence, the decoder is able to quite accurately reconstruct the input from the latent variable (See Figure S1). This behavior is in stark contrast to traditional VAEs, which employ a Gaussian prior to represent the latent space distribution. As a result, anti-clustering effects can arise, leading to highly overlapping clusters of data in the reduced space. To demonstrate this effect, we constructed a traditional VAE for the present example. Figure 3(d) presents the corresponding normalized histogram of values. In this case, even without a reduction in dimension, significant information is lost due to the constraint of the assumed prior distribution.
To further characterize the quality of the GMVAE clustering, we constructed an MSM from the trajectories of the predicted cluster IDs. Figure 4(a) presents the standard implied timescale test, which assesses the convergence of the characteristic timescales with increasing lag time parameter . Convergence indicates that the simulation dynamics, within the discrete-state representation, can be described within a Markovian approximation. The grey area indicates timescales that cannot be resolved by the model, since they are faster than the chosen lag time. From the test, the MSM with was chosen for further analysis. The accuracy of this model was assessed with the Chapman-Kolmogorov test, which compares the simulated and predicted decay of probability from a chosen set of metastable states. Figure 4(b) demonstrates that the predicted “cluster dynamics” accurately represent the long timescale kinetic properties of the underlying simulation trajectory.
To assess both the dimensionality reduction and clustering performance of the GMVAE approach, we next consider a single Brownian particle in two dimensions interacting with an external Müller-Brown potential. The trajectory data was generated as the procedure suggested in  with the standard parameters  (see Section S.III for more details). As depicted in Figure 5(a), the resulting FEL contains two deep minima along with a less stable intermediate state. We employ a GMVAE that is trained with a latent space dimension of 1 and with 5, according to the parameters in Table 2.
Despite employing 5, the resulting GMVAE model identified only 3 states with non-zero membership probabilities. Thus, somewhat remarkably, the GMVAE architecture was able to identify the inherent organization of the input data in the high-dimensional space, independent of the hyperparameter . Figure 5(b) shows the identified clusters. We define the true cluster labels in this case using linear dividing surfaces, as shown in Figure 1(a). Figure 5(c) presents the confusion matrix from the GMVAE model with respect to these defined labels. Although it appears that there are errors in assigning state 1, this error is sensitively dependent on the precise definition of the true label dividing surfaces. Moreover, the overall classification accuracy is actually , since state 1 corresponds to a very rarely sampled intermediate state. The model also demonstrates relatively high reconstruction accuracy (See Figures 1(b) and 1(c)). Figures 5(d) and 5(e) present normalized histograms of values obtained from the GMVAE model and a traditional VAE model trained on the same data, respectively. The low-dimensional representations obtained from the GMVAE clearly demonstrate a better separation of metastable states. Additionally, the ability of the GMVAE to learn a nonlinear manifold is demonstrated in Figure S3, with respect to the linear embedding obtained using time-lagged independent component analysis (TICA).
To further characterize the quality of the GMVAE clustering, we again constructed an MSM from the trajectories of the predicted cluster IDs. The implied timescale test (Figure 6(a)) shows two dominant processes. The MSM with was chosen for further analysis. Figure 6(b)) presents the Chapman-Kolmogorov test, which further verifies the accuracy of the GMVAE embedding.
Alanine dipeptide is a representative model system for the characterization of conformational dynamics. Previous work [51, 31, 52, 27, 49] has shown that the , backbone dihedral angles act as ideal collective variables for describing the metastable configurational basins and associated transition kinetics, making it an excellent system for testing the GMVAE framework within a more realistic molecular simulation context. Since in general the optimal set of input features is unknown a priori, we use this example to test the ability of the GMVAE to identify the proper collective variables from a larger set of input features. More specifically, we consider as input features both the normalized pairwise distances between heavy atoms as well as the , dihedral angles (obtained from 
). The pairwise distances were pre-processed using a kurtosis filter (with the threshold value of 0.03, see FigureS4 for more detail), to reduce the input dimension by removing the low-variance features. The dihedral angles were pre-processed by applying and transformations in order to account for periodicity . Figure 7(a) shows the FEL in the backbone dihedral angle space, with four labeled metastable basins corresponding to , , , P, and conformations . The gray lines are drawn for reference and do not represent any sort of optimal dividing surface.
Figure 8(a) presents the two-dimensional embedding found using the GMVAE, and Figure 8(b) shows the simultaneously-obtained 6 clusters (indexed from to ) as a part of the GMVAE algorithm. The GMVAE again obtains a FEL that better separates clusters of conformations, relative to a standard VAE (Figure S8). The distribution of these clusters on the Ramachandran plot (Figure 7(b)) already strongly indicates their suitability for a kinetic analysis. The GMVAE clustering distinguishes all 5 of the metastable states, as well as a transition region between the and states (cluster 4). An MSM was again constructed from the coarse GMVAE cluster assignments. The implied timescale and Chapman-Komolgorov tests are presented in Figure 10, demonstrating the accuracy of this kinetic model.
We found in this example that, unlike the toy systems, the clustering obtained using the GMVAE did not appear to be completely robust. In particular, the precise clustering probabilities depend on the random effects of the training procedure (e.g., random weight initialization and the random shuffling of the input data). This issue was most pronounced for the lowest populated state, whose probability differs from the other states by two orders of magnitude (Figure 4(b)). As a consequence, the state was not always sufficiently separated from the state, resulting in a loss of one of the resolved kinetic processes (although the accuracy of the MSM remained intact, see Figure S7). Despite this issue, the obtained FEL appeared rather robust with respect to changes in the random factors during training. We observed a much more robust clustering for all other applications considered.
As a more challenging test, we consider simulation trajectories of the capped helix forming peptide AC-(AAQAA)-NH2, which is a representative system for investigating helix-coil transitions. We employ a coarse-grained model , which describes the dominant attractive interactions, e.g., hydrogen bonding and effective hydrophobic interactions between side chains, with simple potentials between the and atoms. These interactions are the minimum required to sample the proper range of structures, (i.e., helix, coil, and hairpin-like). This model also represents excluded volume effects in near-atomic detail, which was demonstrated to be important for accurately characterizing the helix-coil kinetics. Here we employ a parametrization of the model that most accurately reproduces the experimental cooperativity of the helix-coil transition for . As a result, hairpin-like structures appear to have relatively low metastability (similar to the intermediate state in the Müller-Brown example, and the state in alanine dipeptide), as we discuss further below. The model and simulation protocol are discussed further in the Supporting Information, and also in [56, 57]. The considered simulation trajectories correspond to a disordered ensemble of peptide configurations, representing a stringent test for dimensionality and clustering methods .
Similar to alanine dipeptide, the set of and augmented dihedral angles along the peptide backbone were used as conformational descriptors. Thus, the input dimension is for the 15-residue peptide. We chose to consider only a latent space dimension of 2, given that the ultimate goal of dimensionality reduction is often to reduce the high-dimensional description to something that is easily visualizable. Unlike the simple model systems above, the number of clusters, , is completely unclear a priori. In fact, we expect that this ensemble to have a hierarchical structure, such that differing number of clusters may be appropriate depending on the chosen level of resolution. While we initially considered the GMVAE with varying number of clusters, we found that the number of “non-zero clusters” (i.e., clusters with a significant probability of configuration assignment) was extremely insensitive to this choice, as discussed below. The GMVAE was trained according to the parameters in Table 2. Also in contrast to the previous examples, there is no definitive reference kinetic model with corresponding known metastable states. Instead, the analysis below assesses the GMVAE embedding and clustering (in terms of both statics and kinetics) with respect to the landscapes obtained using a standard VAE and also following the standard MSM workflow (i.e., TICA [13, 42], see Section 2.4 for more details).
Panels (a) and (b) of Figure 11 show the FELs obtained using the GMVAE and the traditional VAE, respectively. As in the model systems, the GMVAE method results in a latent space description with highly separated clusters, while the traditional VAE yields more overlapping states. The two-dimensional TICA landscape (Figure S19) also separates a number of clearly distinct states, although there are large diffuse regions with relatively low free-energy values. The clusters obtained via the GMVAE are shown in Figure 11(a). Despite employing and obtaining a landscape that appears to have approximately 10 distinct basins, only 7 states (labeled ) were assigned non-zero membership probabilities (see Figure S9). Since standard metrics for analyzing peptide configurations do not yield a clear organization of the ensemble into a small number of metastable states, the distribution of these quantities are expected to be highly overlapping, even for a good clustering of the input data. Thus, to more easily visualize the characteristics of the GMVAE clusters, we applied a thresholding scheme, which removes configurations without a membership probability greater than 0.95 (see Section 2.3.1 for details and Figure S10 for cluster populations). Figure 11(b) shows 5 representative structures closest to the cluster centers. We stress that these images are intended to give the reader a rough idea of the types of structures contained in each cluster, but do not characterize the variance of structures within the clusters. This is a disordered ensemble and each cluster necessarily contains a diversity of structures. Nevertheless, Figure 11(b) indicates that the GMVAE successfully distinguishes between distinct secondary structures within the simulation data.
To characterize the structural properties of the clusters quantitatively, we calculated the distribution of the average fraction of helical segments, . Figure 12(a) presents a heat map of in the latent space. High values (represented by blue) indicate the presence of helix and helix-like structures, whereas the lower values point to either hairpin- or coil-like secondary structures. There is an apparent trend of decreasing average helical content from the lower-right to upper-left regions of the latent space (i.e., from cluster to ). The VAE and TICA landscapes demonstrate similar trends (Figures 22(b) and 18(b), respectively), although the VAE does not characterize partially-helical structures as clearly as the GMVAE. Figure S11 presents the intra-cluster distributions of , which can be used to assess the quality of the clustering (relative to an alternative clustering). We expect that an optimal clustering will result in tight, unimodal distributions. The GMVAE clustering yields seemingly good distributions for the most and least helical clusters, while the partially-helical clusters appear broader and somewhat bimodal. For comparison, we consider three alternative clusterings obtained by performing a -means clustering on a given landscape followed by the PCCA+ dynamical coarse-graining method  to define a set of metastable states (see Section 2.4 for more details): (i) an alternative clustering of the GMVAE landscape (Figure S16), (ii) a clustering on the VAE landscape (Figure S24), and (iii) a clustering on the TICA landscape (Figure S20). The alternative clustering scheme on the GMVAE landscape, (i), does not improve the intra-cluster distributions of , demonstrating that the GMVAE clustering is reasonable, given the GMVAE embedding. Similar results were obtained from the VAE clustering, with slightly broader distributions for the most and least helical states. The TICA clustering resulted in somewhat improved distributions, in the sense that they appear to be mostly unimodal, although some of the distributions appear to be slightly broader.
Figure 13(b) shows the dRMSD values of the projections, where the helicity increases as the dRMSD values decreases. These results are in agreement with the analysis: as the cluster index increases from to , the conformations tend to be more extended. The Supporting information (Figures S12 and S13) contains additional characterization of the static properties of the clusters, which further validate the GMVAE embedding and clustering as a reasonable partitioning of the conformational landscape.
We also characterized the average fraction of helical segments on the N- and C-terminus sides of the peptide: and , repectively (see Section 2.5 for more details). Figure 14 presents the difference of these quantities, , plotted along the GMVAE embedding. Positive values (represented by blue) indicate conformations that contain helical structure on the N-terminus side of the peptide without helical structure on the C-terminus side. Conversely, negative values (represented by red) indicate conformations that contain helical structure on the C-terminus side of the peptide without helical structure on the N-terminus side. Values close to zero correspond to either fully helical or non-helical structures. Although the GMVAE embedding and clustering separate the most distinct structures in the ensemble (coils and full-helicies), some of the clusters (0, 1, 2) encompass partially-helical conformations on both sides of the peptide (see also Figure S15). This is not ideal since kinetic barriers within a cluster will negatively impact the accuracy of a kinetic characterization at the cluster level. However, it appears that this issue may have more to do with the clustering than the embedding itself, since blue- and red-labeled structures appear to be reasonably separated on the landscape.
Similar to the other examples above, we also constructed an MSM directly from the discretized trajectories of GMVAE cluster indices. Although thresholding was applied in the results presented here (practically similar to coring methods for constructing kinetic models ), we found that this procedure had negligible effect on the accuracy of the resulting MSM. As shown in Figure S14, the MSM constructed from the GMVAE clustering displayed significant errors in describing, e.g., the decay of probability out of the helix state. Perhaps this is not so surprising, since coarse-grained MSMs are often only used as a qualitative analysis tool, while higher-resolution kinetic models that characterize configuration space with many microstates are used for quantitative reproduction of simulation kinetics. Thus, to more carefully assess the GMVAE embedding and to more easily compare to the VAE and TICA results, we constructed a higher-resolution MSM by performing -means to define microstates on the landscape (Figure S16). Although the resulting model demonstrates improved accuracy according to the Chapman-Kolmogorov test, the probability decay out of the metastable states occurs on a fast time scale relative to the chosen lag time. This may be indicative of poorly defined dividing surfaces between metastable states. The kinetic models constructed from the VAE and TICA landscapes (Figures S24 and S20, respectively) demonstrate similar quickly decaying probabilities. Although coring procedures could be applied to attempt to fix this problem, it indicates that there are fundamental limitations of all of these landscapes in terms of characterizing the long timescale simulation kinetics. There are several possible reasons for these difficulties, including (i) the limitation of our embeddings to two dimensions, (ii) the limitation of the chosen input features in characterizing kinetically-distinct structures, (iii) the presence of many low-lying barriers along the potential energy landscape of this disordered ensemble, and (iv) the poor sampling of relatively rare transitions to the full helix conformation. Further investigation of this issue is beyond the scope of this initial study of the performance of the GMVAE, and is left for future work.
To investigate the impact of the low sampling of helical structures on the GMVAE embedding, as in the - I simulations presented above, we also considered a second set of simulations which primarily samples helical- and hairpin-like structures, while only rarely sampling fully-coil structures. (Please see the Supporting Information for more details about the differences between the two sets of simulations). In addition to the dihedral angles, normalized pairwise distances between residues that are more than residues apart were included as input features. Figure 15 presents the obtained GMVAE FEL (panel (a)), the corresponding clustering of 6 metastable states (panel (b)), and overlays of five structures that are closest to the cluster centers (panel (c)). The GMVAE embedding demonstrates significant separation of metastable states, relative to the landscape obtained with a standard VAE (Figure 36(a)).
Similar to the previous ensemble ( - I), Figure 16 shows the separation of structures according to (panel (a)), and dRMSD (panel (b)). The VAE and TICA landscapes demonstrate similar trends (Figures S37 and S33, respectively). The intra-cluster distributions are shown in Figure S28. The majority of the fully-helical structures are in cluster 3 and 5, while clusters 0, 1, 2 and 4 contain hairpin-like structures as well as partial helicies. The coil structures are gathered in the bottom-most part of the landscape (in cluster 4), though not separated as a distinct cluster by the GMVAE. The distributions are broader and less unimodal than those determined from the previous set of simulations, although these can be somewhat improved with the alternative clustering scheme on the GMVAE landscape (Figure S32). Similar results are also obtained from the VAE and TICA landscapes (Figures S40 and S36, respectively).
Figure 17 presents the characterization of the N- and C-terminus, partially-helical conformations. In contrast to the - I embedding, the GMVAE embedding and clustering for - II more clearly separates the distinct types of structures. It appears that this difference may be due to the increased sampling of helical structures in - II, although the inclusion of pairwise distances as additional input features may also have played a role. N- and C-terminus partially-helical structures are mostly located in clusters 4 and 2, respectively, while both types of structures can be found to a lesser extent in cluster 5. Although the VAE and TICA landscapes also appear to largely distinguish between distinct partially-helical structures (Figures S37 and S33, resepectively), the GMVAE landscape provides a significantly better clustering of these two distinct sets of conformations.
Despite the improved description of partially-helical structures, the MSM constructed directly from the GMVAE clustering for - II displayed similar discrepancies to the model built for - I (Figure S29). Moreover, the high-resolution MSMs constructed from the GMVAE, VAE, and TICA landscapes (Figures S30, S38, and S34, respectively) displayed very fast decay of probability out of the identified metastable states, as in the - I example.
Variational autoencoders are quickly making an impact in the field of molecular simulations due to the inherent focus of the architecture on retaining the essential features of the system. Control over the topology of the latent space can increase the performance and interpretability of these methods by making a direct connection to the physics of the system through our physical intuition: an ideal free-energy landscape characterizes basins that are well-separated by the largest barriers along the higher-dimensional potential energy landscape. To explicitly enforce such features, we propose a Gaussian mixture model as the prior distribution in the latent space.
The performance of the Gaussian mixture variational autoencoder (GMVAE) was illustrated on two standard toy-model systems and on the standard benchmark alanine dipeptide, as well as on a challenging 15-residue-long disordered peptide. For each example, the GMVAE circumvents the aggregation of points in the latent space characteristic of traditional variational autoencoders. Instead, samples that are structurally distinct are clearly separated, leading to a latent space that displays apparent metastable basins and barriers. The GMVAE introduces a categorical variable that probabilistically assigns samples to a set of underlying clusters, each of which is Gaussian distributed. Thus, the approach combines the commonly distinct tasks of dimensionality reduction and clustering into a unified framework. In the absence of dimensionality reduction, the GMVAE retains the characteristics of the system within the latent space, while providing an accurate assignment between clusters. Remarkably, in the case of limited dimensionality reduction, the GMVAE identifies the inherent clustering of the input data, insensitive to the cluster-number hyperparameter.
Beyond statics, there have been several recent autoencoder architectures aiming at the characterization of molecular kinetics. Several of these methods directly incorporate kinetic information in the loss function, either by reconstructing time-lagged samples or by approximating the dynamical propagator[31, 32, 27, 28, 60, 61]. The interpretability of the latent space is becoming a feature of increasing interest: Hernández et al. recently proposed an approach for identifying the most important input features for determining the one-dimensional latent-space representation within a time-lagged VAE framework , while Wang et al. relied on a linear encoder to interpret the relevant coordinates of interest . Here, we argue that incorporating physical constraints into the architecture helps to construct an interpretable model for the kinetics, even when kinetic information is not used for learning the representation. The GMVAE architecture attempts to better mimic the shape of an ideal free-energy landscape within the latent space. In particular, the presence of barriers that separate metastable clusters determines the relevant kinetic properties through the separation of timescales between intra- and inter-basin transitions.
We report extremely encouraging results for constructing kinetic models from representations learned from static information alone. For the two toy models and for alanine dipeptide, the resulting Markov state models demonstrate excellent properties, as monitored by the implied timescale and Chapman-Kolmogorov (CK) tests. The disordered ensemble of the peptide proves more challenging: the CK test shows discrepancies for the decay of probability out of the longest-lived metastable states. Although higher-resolution MSMs constructed directly from the GMVAE landscape demonstrated an improved description of the simulation kinetics, these models were unable to resolve the longest timescale processes. Comparisons of two distinct peptide ensembles clarified the role that sampling can play in distinguishing distinct partially-helical structures on the GMVAE landscape. It remains unclear to what extent the restriction of our embeddings to two dimensions or the choice of input features prevented the GMVAE (as well as the more standard methods considered) from better describing the simulation kinetics. Moreover, the presence of many low-lying barriers along the potential energy landscape of this disordered ensemble may cause fundamental challenges in obtaining a clear few-metastable-state characterization of the conformational landscape. Thus, we propose that, in conjunction with simpler test systems that clearly assess a method’s performance, such examples are important for significant advancements in data-driven characterizations of molecular simulation trajectories.
While we defer a more detailed investigation of these issues for future work, we highlight the promising performance of the GMVAE demonstrated through our results. First, in the context of static equilibrium properties, the incorporation of the Gaussian mixture model as a prior distribution on the latent space closely links our physical intuition about ideal free-energy landscapes, resulting in an inherently more interpretable latent space. Secondly, our results show encouraging performance when constructing kinetic models from the learned representations—an aspect that is entirely absent in the loss function, representing an independent validation of the procedure.
The authors thank Kiran H. Kanekal and Omar Valsson for critical reading of the manuscript. JFR is grateful to the BiGmax consortium and participants of the BiGmax Big Data Summer School
for insightful discussions. YBV acknowledges foreign collaborative research study support by The Scientific and Technological Research Council of Turkey, TÜBİTAK- BİDEB, under the 2214-A programme. TB acknowledges financial support by the Emmy Noether program of the Deutsche Forschungsgemeinschaft (DFG) and the long program Machine Learning for Physics and the Physics of Learning at the Institute for Pure and Applied Mathematics (IPAM).
Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics.The Journal of chemical physics, 148(24):241703, 2018.
Fuzzy spectral clustering by pcca+: Application to markov state models and data classification.Advances in Data Analysis and Classification, 7(2):147–179, 2013.
This document presents additional results to the main text. X denotes the reconstructions. The sampling operation in the reconstructions (shown in the decoding part of Figure 3), corresponds to taking the means of the Gaussians for simplicity.
The trajectory data is obtained as suggested in , and using the code provided in . transition probability matrix is obtained among the equally-spaced 100 bins in the interval [-1, 1] as follows
where and are the potential energies at the centers of bins i and j, which are defined according to the potential of the form: , and is the normalization factor. The system is initialized randomly, and propagated according to steps in time.
Figure S1 shows the reconstructions in a scatter plot. The line shows the lossless reconstructions.
where is the two-dimensional coordinate, and and are the standard parameters  such that , , , , , . The trajectory data is generated using 30 trajectories of 10000 steps simulated with Brownian dynamics:
where joules, and meters-squared per second, and is a delta-correlated Gaussian process with zero mean.
Figure S3 further demonstrates the ability of the GMVAE to learn a nonlinear manifold that separates the three distinct free-energy basins, compared with time-lagged independent component analysis (TICA), which can only find a linear separatrix for the basins. Figure S3, showing the projections obtained with the GMVAE and TICA, was constructed following , with the colors indicating values of the latent variable while the gray dots correspond to trajectory data.
As the input features, dihedral angles and pairwise distances for heavy atoms that are provided in  for three simulations of length 250 ns each are used. Dihedral angles are transformed to their representaions, and the pairwise distances whose variance are low are removed from the feature set (using kurtosis function from scipy.stats library , with threshold value of 0.03).
We applied TICA to the set of pairwise distances only, followed by a kinetic coarse-graining with the PCCA+ method into 4 metastable states. Figure 4(a) presents the resulting clusters plotted on the Ramachandran plot. Figures 4(b) and 4(c) show the histograms of these metastable states, and the GMVAE clusters, respectively.
We employ a simple physics-based peptide model that was previously used to investigate structural-kinetic relationships in helix-coil transitions [56, 65]. The model employs three attractive interactions, following standard Gō-type models [66, 67, 68, 69]: (i) a native contact (nc) attraction, , employed between pairs of atoms which lie within a certain distance in the native structure, i.e., the -helix, of the peptide, (ii) a desolvation barrier (db) interaction, , also employed between native contacts, and (iii) a hydrophobic (hp) attraction, , employed between all pairs of atoms of the amino acid side chains. We employed the same functional forms as in many previous studies , with a tunable prefactor, , for each of the interactions. The model considered here employed the prefactors , , and , while performing simulations at a temperature of 280 K. In addition to these simple coarse-grained interactions, a standard AA force field, AMBER99sb , is also partially incorporated to model both the steric interactions between all non-hydrogen atoms and also the specific local conformational preferences along the chain.
Molecular dynamics simulations of were performed with the Gromacs 4.5.3 simulation suite  in the constant NVT ensemble, while employing the stochastic dynamics algorithm with a friction coefficient and a time step of . For each model, 100 independent simulations were performed with starting conformations varying from full helix to full coil. Each simulation was performed for , recording the system every . The CG unit of time, , can be determined from the fundamental units of length, mass, and energy of the simulation model, but does not provide any meaningful description of the dynamical processes generated by the model. In this case, ps.
Since the GMVAE method is a probabilistic clustering method, each data point has a probability of assignment to each of the clusters. For a data point , the probability of assignment to each of the clusters has probability values for number of clusters. In the ideal case, all of the probability values except the true cluster is equal to , and the true cluster has a value of . Figure S9 separately shows the histogram of probability distributions of all of the data points for a cluster. For instance, for cluster , the probability distributions are accumulated at probability values and . In other words, with high certainty, cluster is differentiated from the others. None of the data points is assigned to clusters . Note that although the network is initially trained for clusters, it is not possible to separate more than clusters under the specified loss function. This suggests a way to find the inherent number of clusters, i.e., metastable states, provided that is chosen larger than that true value.
Cluster ID’s are obtained after a thresholding step as explained in Section 2.3.1. Figure S10 shows the cluster populations. Cluster -1 indicates the datapoints that are not assigned to any of the clusters.
Figure S11 shows the inter-cluster distributions.
To further characterize the clustering of secondary structures, we separately calculated dRMSDs with respect to three reference structures: helix (hel), hairpin-like (hp), and extended (coil). Figure 11(a) presents both the reference structures (right) and corresponding dRMSD distributions (left). The first and the second small peaks in the dRMSD distribution represent helical conformations, while the peak corresponding to dRMSD values between hints at the presence of the hairpin-like structures. Note that there is an offset in dRMSD values due to (i) the scarcity of the well-defined hairpins in the trajectory data, and (ii) the subjectivity involved in choosing the reference structures. By plotting the two-dimensional free-energy surface along dRMSD and dRMSD, shown in Figure 11(b), the distinct secondary structures can be separated. The conformations with dRMSD values below 1.8 are helical, whereas the minimum in the upper right with dRMSD greater than 4 is comprised of extended structures. The energy minimum in the middle (enclosed in the region with dRMSD values between and , and dRMSD values between and ) contains hairpin-like structures. Figure 11(c) presents inter-cluster free-energy surfaces along dRMSD and dRMSD, generated by considering only conformations within a single cluster.
In addition to and dRMSD, the radius of gyration distribution is also analyzed. measures the mass-weighted deviations from center of mass, and gives an idea on the overall spread and compactness of the molecule, and is calculated as
where is the mass of atom , the coordinates of atom , and is the coordinates of the center of mass. Figures 12(b), 12(a), 12(b), and 12(c) show the heat map of dRMSD, dRMSD, dRMSD, and R on the FEL obtained via the GMVAE, respectively.
As a final characterization of the clustering, we constructed an MSM directly from the discretized trajectories of GMVAE cluster indices. Although thresholding was applied in the results presented here (practically similar to coring methods for constructing kinetic models ), we found that this procedure had negligible effect on the accuracy of the resulting MSM. Figure 13(a) presents implied timescale test. The kinetic model can resolve two of longest characteristic processes and demonstrates reasonable convergence, although there is a small increase in the longest timescale with increasing lag time. This subtle discrepancy already indicates that there may be some issues with the accuracy of the kinetic model. An MSM is constructed at lag time to balance between the convergence of the timescales and the resolution of shorter timescale processes. Figure 13(b) presents the CK test from this model. There are significant errors in the description of probability decay from each of the metastable states (i.e., clusters), especially states 0 and 1. First, we note that coarse-grained MSMs (i.e., MSMs built on a small number of metastable states) are often not expected to be quantitatively accurate due to difficulties in accurately defining the dividing surfaces between states . However, we anticipate that it should be possible to make a more accurate coarse-grained MSM for this particular simulation trajectory. The discrepancies in the model can then originate from two coupled problems: (i) the GMVAE latent space definition places structures close together that are kinetically distinct (i.e., there are hidden barriers) or (ii) the GMVAE clustering fails to identify/separate distinct metastable states. The FEL within the latent space (Figure 10(a)) contains clearly separated basins that are not identified as unique clusters by the GMVAE. In particular, within clusters 0 and 1, there seems to be and separate states, respectively. Figure S11
shows that cluster 0 (1) contains structures with a range of helicities ranging from 0.46-1.0 (0.15-0.69). According to the conventional picture of the helix-coil transition, the overarching kinetics can be described by two timescales: (i) the rate at which a single helical segment is formed and (ii) the elongation rate of helical segments along the chain. By grouping together conformations with a single helical segment and several helical segments, the GMVAE has convoluted these two timescales, resulting in non-Markovianity in the kinetics described on these clusters. To further clarify the source of these errors, we constructed an MSM in the conventional way, directly from the latent space distribution. More specifically, we applied k-means algorithm withcluster centers, and then applied PCCA+ . In order to enable comparison, we continued with the previous number of metastable states (). The CK test for the resulting model with lag time (obtained from the implied timescale test, Figure 15(b)) is presented in Figure 15(c), and demonstrates slightly improved accuracy.
2D TICA projections are obtained at lag time steps.
We also considered an alternative coarse-grained model, with energetic prefactors , , and , while performing simulations at a temperature of 300 K.
2D TICA projections are obtained at lag time steps.