Variational Autoencoder for Dimensionality Reduction of Time-Series
Often the analysis of time-dependent chemical and biophysical systems produces high-dimensional time-series data for which it can be difficult to interpret which individual features are most salient. While recent work from our group and others has demonstrated the utility of time-lagged co-variate models to study such systems, linearity assumptions can limit the compression of inherently nonlinear dynamics into just a few characteristic components. Recent work in the field of deep learning has led to the development of variational autoencoders (VAE), which are able to compress complex datasets into simpler manifolds. We present the use of a time-lagged VAE, or variational dynamics encoder (VDE), to reduce complex, nonlinear processes to a single embedding with high fidelity to the underlying dynamics. We demonstrate how the VDE is able to capture nontrivial dynamics in a variety of examples, including Brownian dynamics and atomistic protein folding. Additionally, we demonstrate a method for analyzing the VDE model, inspired by saliency mapping, to determine what features are selected by the VDE model to describe dynamics. The VDE presents an important step in applying techniques from deep learning to more accurately model and interpret complex biophysics.READ FULL TEXT VIEW PDF
Inference, prediction and control of complex dynamical systems from time...
Variational auto-encoder frameworks have demonstrated success in reducin...
Learning continuous-time stochastic dynamics from sparse or irregular
Variational autoencoders (VAEs) are widely used deep generative models
System identification of complex and nonlinear systems is a central prob...
The increasing efficiency and compactness of deep learning architectures...
We propose a Variational Time Series Feature Extractor (VTSFE), inspired...
Variational Autoencoder for Dimensionality Reduction of Time-Series
. Such simulations can have thousands of degrees of freedom, making it crucial to have meaningful and statistically robust methods to extract underlying dynamical processes.
The dynamics of molecular systems are often represented using the dynamical propagator approach . Given an ensemble of particles at time
distributed in phase space with a given probability distribution, we seek to describe a propagator, an operator that can describe the new distribution of the ensemble, , given some lag time . When
is chosen such that these probabilities are independent of the history of the system, the model is said to be Markovian. Many methods have been developed to compute approximations to the propagator of a molecular system from simulation data, including time-structure-based independent component analysis (tICA)[5, 6, 7] and extensions (kernel tICA , sparse tICA ), Markov State models (MSM) , VAMPnets , soft-max MSMs , and diffusion maps .
Any method for approximating the dynamics of a complex system has two objectives: to adequately represent complexity in the form of model nonlinearity and to be interpretable, that is, to be readily analyzable for feature importance. In Figure 1, we indicate how several commonly-used methods for dimensionality reduction of dynamical systems compare in terms of achieving these two aims. Complexity and interpretability often come at the expense of each other. For instance, kernel methods such as kernel tICA [8, 14] improve the ability to capture nonlinear effects of features in dynamics over linear methods; however, identifying biophysical meaning in coordinates in an implicit kernel space remains a challenge. Conversely, standard tICA and sparse tICA allow for more precisely identifying relevant biophysical features, but the linearity of tICA limits the complexity of dynamics it can represent.
. The term “variational” stems from this stochasticity: the autoencoder is an implementation of variational Bayesian inference with a Gaussian prior, which maximizes the lower bound on the log-likelihood of the observed data.
Recently, the autoencoder framework has been extended to model time-series data [24, 25, 26, 27, 28, 29, 23]. Analysis in these applications typically involves mapping time-series data to latent spaces with the same dimensionality as the length of the initial time-series data and has not focused on approximating a propagator for the time-series data; however, there are a couple of notable exceptions. Doerr and De Fabritis  recently compared a simple autoencoder to other methods for dimensionality reduction of biophysical simulation data. Wehmeyer and Noe introduced a time lag into an autoencoder (TAE) framework to describe dynamics . Interestingly, they demonstrate that in the limit of a single linear hidden layer, the tICA solution can be attained.
In this work, we extend the traditional VAE architecture to approximate a propagator for time-series data in an architecture denoted as a Variational Dynamics Encoder (VDE). This represents the first use of a time lag within a variational autoencoder to our knowledge. Additionally, we introduce a novel “autocorrelation” loss function, which is inspired by the variational approach to conformational dynamics (VAC). We demonstrate that this approach yields models with more explanatory power than linear dimensionality reduction techniques in both the Müller-Brown potential and the folding landscape of the villin headpiece subdomain. We also explore the generative capability of the VDE as a propagator of dynamics and show that, as implemented, it is unable to reliably capture thermodynamics at differing temperatures. Finally, we demonstrate a novel analysis method, inspired by saliency mapping in neural nets for visual classification [31, 32, 33], to lend interpretation to VDE models. This combination of using the VDE with saliency mapping creates a framework that enables nonlinear combinations of features while remaining interpretable.
The architecture of the VDE, as seen in Figure 2, closely resembles that of a VAE; however, the training procedure is slightly modified to suit time-series data [29, 23]. The most significant modification being that featurized data, , at some timepoint, , are fed into the network in order to make a prediction of the state of the system, , at a future timepoint, , where is some user-selected lag time such that the dynamics of the system is Markovian. As with a traditional VAE, the network can be subdivided into three parts: the encoder network; variational layer, ; and the decoder network.
The encoder network is a deep neural network (DNN) with non-linear activation functions and a user-selected number of hidden layers, which eventually bottlenecks into the one-dimensional latent space,. In this way, the encoder network functions as a non-linear dimensionality reduction of . The latent space is then perturbed by Gaussian noise within the -layer, with mean parameter,
, variance parameter,, and arbitrary scaling, , to generate , as described by Kingma and Welling . Finally, the decoder network, also a DNN, mirrors the encoder network in architecture by using to generate , a prediction of how the system will evolve after a duration of .
Once the VDE has been trained, it can be used for both dimensionality reduction and synthetic trajectory generation. During dimensionality reduction, only the encoder network is necessary, which provides a direct mapping of . During trajectory generation, the entire VDE network is needed. An initial set of features, , is fed through the network to generate , the predicted state after a duration of . This can be done iteratively to generate an arbitrarily long trajectory of features exhibiting thermodynamics consistent with that of the original system used during training. In over to overcome the model’s insensitivity to -layer used during training, we recommend that the noise scaling, , is increased such that .
The VDE is quantitatively evaluated by calculating the sum of two loss functions: reconstruction loss and autocorrelation loss:
The first of these two, reconstruction loss, attempts to quantify how well the VDE approximates the state of the system at , given the true state of the system at time [19, 15]. In doing so, we are essentially evalutating the ability of the latent space to approximate the Markovian propagator after a single lag time. This can be done by considering the mean squared error between the predicted propagation, , and the true propagation,
, along with the Kullback–Leibler divergence between latent space priors that generate:
are learnable parameters representing the mean and standard deviation of the Gaussian prior applied to the latent space, respectively. Together, these complementary loss terms capture a trade-off between model complexity and simplicity of the Gaussian prior. Reconstruction loss pushes the model towards having high fidelity to the training data, while the Kullback–Leibler divergence acts as a regularization term to ensure that the latent space behaves as a Gaussian emission.
The autocorrelation loss attempts to optimize the network towards a more complete representation of the long time-scale kinetics observed within time-series. Although minimization of the reconstruction loss has the potential to recover these dynamical processes , we find that in some cases, such as in Section 3.3, it alone is not sufficient. In order to improve model convergence, we borrow from the VAC 
, a specific application of the variational principle from quantum mechanics adapted for Markov modeling and a useful tool for parameter selection. The variational principle states that, in the limit of infinite data, no process can be identified in the data that is slower than the true process. If we interpret the variational principle as the measure of the quality of this approximation, the phase-space decomposition that leads to a linear model with larger leading dynamical eigenvalues is consequently the better phase-space decomposition. In the limit of a single linear decomposition of phase-space, there is only one eigenvalue to consider, which is equivalent to the autocorrelation of :
where and are the sample mean and population mean of the latent space for a particular batch of data, respectively. For linear models, this leads only to a first-order approximation of slowest process; however, by incorporating this into the VDE’s loss function, we take advantage of the deep encoder as a general approximator to the slowest processess found within in our data . Algorithm 1
outlines how these two losses are calculated and used for backpropagation in practice. Note that the data is split into many smaller batches during training, withas input variable and
as the target variable, to take advantage of stochastic gradient descent methods. We also recommend pre-processing features—either via standardization or median-centering and scaling by interquartile ranges—to prevent the reconstruction loss from overpowering the autocorrelation loss.
We first apply the VDE framework to the well-studied 2-D Müller-Brown potential and demonstrate it can adequately describe the dynamics of this simple system. Figure 3 shows results for a) the VDE, b) tICA, and c) PCA. We note that while tICA and PCA both identify the same dominant linear coordinate, representing diffusion from minor to major basin, the VDE generates a non-linear projection that is able to distiguish these basins more clearly, as well as the transition region.
To establish an unbiased assessment of the VDE’s performance compared to tICA or PCA, we constructed MSMs using identical hyperparameters and compared generalized matrix Raleigh quotients (GMRQs) of the slowest process as a scoring metric. The VDE achieves a slightly higher mean GMRQ score () than tICA () or PCA () on held-out data, suggesting that it is better able to represent the slowest timescale of this system.
As VAEs are regarded as a generative model, we consider the relationship between the VDE and the propagator function. When trained on the Müller-Brown potential, with Joules, the VDE is able to generate “fake” trajectories with some similarities in thermodynamics to the original simulations, as seen in Figure 4 (pink curve). Furthermore, when we modulate the effect of the -layer by adjusting the scaling parameter , we are also able to mimic some effects of changing the simulation temperature without having to re-train the VDE. Figure 4b demonstrates that when decreasing (dark blue and purple curves) or increasing (orange and yellow curves) , the VDE is able to adjust barrier heights in a similar fashion to what is observed in simulation, shown in Figure 4a. However, we find that this fidelity to simulation is lacking in transition regions and previously unobserved regions of phase-space, where the VDE does a poor job of recapitulating true thermodynamics.
Also of note is the case of (not shown in Figure 4b), where the VDE behaves essentially as an indicator function, reporting which basin a given frame will eventually diffuse towards in a low temperature simulation. As is increased, the VDE must decide which basin to push the now heat-bathed system towards and more realistic dynamics can be observed. Such behavior may be interpreted as a temperature-dependent propagator, whereby the VDE has learned some of the underlying thermodynamic characteristics of the system; although, there seems to be a strong attraction to certain basins (e.g. region 2) which is not observed in simulation. Because of this attraction, increasing leads to the raising of intermediate basins towards realistic free energies by the VDE rather than the lowering one might expect when raising the temperature of a simulation. We recommend greatly increasing , as described in Section 2.1, when generating synthetic trajectories due to this trend.
We next apply the VDE to pairwise alpha-carbon (C) contacts in order to model the folding process of the villin headpiece subdomain. Here, we aim to assess the quality of the VDE as a dimensionality reduction technique for protein folding by quantifying how well a MSM constructed from VDE-transformed data separates relevant timescales and distinguishes basins within the landscape. With these metrics in mind, the VDE appears to represent the folding landscape well and can even out-perform tICA using similar MSM hyperparameters.
Figure 5a depicts trajectory data projected onto the slowest two tICA processes (tICs) from an optimized tICA model  and colored by the projection onto the VDE latent coordinate. In the optimal tICA model, 2 tICs are needed to capture both the folding and a prominent misfolding process. The first tIC is unable to completely separate unfolded and folded state, whereas the second tIC distinguishes the folded and unfolded state but is unable to distinguish the folded and misfolded state. In contrast, the VDE latent coordinate is able to discriminate between all three states: folded, unfolded, and misfolded. By comparing the free energies of the VDE latent space (Figure 5c) and the first tIC (Figure 5d), we observe that the VDE coordinate has a narrower basin of folding than that of the first tIC, indicating the VDE latent coordinate more sharply resolves the folding basin than the first tIC does.
To test the benefit of using the autocorrelation loss discussed in Section 2.2, we trained models of villin using only the reconstruction loss and no autocorrelation loss. The projection of the optimal model with no autocorrelation loss is portrayed in Figure 5b. In this projection, there is minimal differentiation between different parts of the landscape. This highlights the necessity of incorporating an autocorrelation loss into the VDE loss function.
MSMs for the villin landscape were constructed from both the VDE model and the optimized tICA model. Comparing these models indicates that the VDE model identifies a slower timescale than the tICA model. Figure 5e portrays the timescales of the slowest five processes identified by MSMs built from the VDE projection, our optimized tICA model, and a tICA model built with one tIC component. The timescale of the slowest process in the MSM from the VDE projection is nanoseconds, whereas the timescale of the slowest process in the optimal tICA model is nanoseconds. According to the variational approach to conformational dynamics, as described in Section 2.2, a model with longer timescales should be closer to modeling the true dynamics of the system.
As noted in Figure 1, nonlinear methods for time-series analysis tend to sacrifice model interpretability. Linear tICA provides “loadings” on each input feature for each slow mode. Thus, the absolute magnitude of these loadings can be used to understand holistic protein dynamics at the atomic scale. To make VDEs more interpretable, we designed a novel variant to saliency maps (see Section 5.3) to gain insight into how the network operates and propagates protein configurations at a particular lag time.
To perform the saliency analysis, we computed a median value for the derivative of the residual between villin’s misfolded and folded basin with respect to its input contact features. As shown in Figure 6b, these saliency maps for villin found 5 important C contacts. The 5 contacts (indicated with green lines in Figure 6b), all involve contacts to residues around Asn19. Remarkably, we can also integrate the saliency scores for each atomic feature to infer feature importances at the residue scale. The residue importance Figure 6b-c can be used to potentially design new molecular simulations and biophysical experiments. For example, in the case of villin, our model predicts distances to Asn19 as being critical for movement out of the misfolded partially helical state (Figure 6b). Mutating this residue to a proline or glycine could potentially be used to prevent the system from sampling the misfolded state. A potential drawback for this method is it requires sufficient knowledge of the system to identify a relevant path to investigate the corresponding initial and final conformations. This can be accomplished by either some empirical analysis, clustering, or simply sampling conformations at the minima and maxima of the latent space.
In this work, we have introduced a variational autoencoder to analyze dynamical processes by incorporating a time lag into the traditional autoencoder structure, introducing an autocorrelation loss during training, and leveraging the Gaussian noise introduced into the latent space during training, dimensionality reduction, and synthetic trajectory analysis. Furthermore, we have introduced a saliency mapping approach inspired by advances in deep learning in order to interpret which features contribute to the identified reaction coordinate.
We demonstrate that the VDE is able to outperform state-of-the-art methods, such as tICA, in describing slow dynamics in both the 2-D Müller-Brown potential and protein folding. Using the 2-D system, we probe the generative nature of the VDE, showing that it can generate realistic thermodynamics and has some ability to extrapolate dynamics at temperatures it has not observed. Although this portion of our study is not quantitatively rigorous, we expect that a better understanding of how VDE parameters relate to simulation parameters might lead to using the VDE to cheaply generate reliable data for different simulation conditions and perhaps even protein mutants. In the more complex case of protein folding, we show the utility of the VDE in understanding the conformational landscape of the villin headpiece domain, which is non-trivial due to the prominent misfolded state observed. The latent space of the VDE captures the transitions among misfolded, unfolded, and folded, and a MSM constructed using the VDE projection exhibits a significantly longer timescale for the slowest process than the optimized one constructed from tICA-transformed data. We also showed that we can peer into the “black box” that is the VDE via protein saliency mapping, which provides biophysical insights into the network’s decision-making. For villin, we identify important C contacts that we predict potentially play a role in misfolding-folding transitions. We anticipate that such results will prove useful in experimental design, such as in FRET experiments, to decide how to effectively probe a protein to observe conformational change.
While the VDE shows much promise, there are a few reasons why we cannot recommend it to completely replace previous methods, such as tICA, just yet. First, when training deep autoencoders using a autocorrelation loss (i.e. as inspired by the VAC), there is a noticeable dependence on batch size that arises during training. The autocorrelation, as well as the related variational loss, attempts to calculate global equilibrium statistics, such as the exchange timescale for the slowest process. However, for finite batch sizes, we might only observe a single event in that process within a given batch. This may lead to underestimating the computed statistics since the network has no information about the rest of the dataset. This problem does not arise in tICA or MSMs because timescales and other global statistics are only estimatedafter all the data has already been processed. Another issue with using the autocorrelation loss, as implemented, arises from the reality that many processes can occur with similar timescales. Each of these processes will be assigned highly similar autocorrelations, and thus might lead to volatile training; although, we believe our compound loss function can somewhat attenuate this issue, since the network is designed to keep track of global transition dynamics in addition to fitting to the slow processes.
All in all, VDEs and recent related work[23, 22] herald exciting opportunities for bridging Markov models and deep learning. We believe the expressive power of neural networks provides a natural solution to the choice-of-basis problem that plagues many Markovian analyses, while the strong theoretical underpinnings behind MSMs allow us to select and potentially even validate cross-validate neural architectures [9, 37], ultimately allowing us to address fundamental questions in biophysics.
We generated 10 independent simulations of the 2-D Müller-Brown potential governed by the following equation:
where Joules, meters-squared per second, and is a delta-correlated Gaussian process with zero mean, and is defined as:
where ; ; ; ; ; and as suggested by Müller and Brown . Using the Euler-Maruyama method for numerical integration and a time step of 0.1, we produced ten unique trajectories with
time steps, saved every 100 steps. The initial positions were sampled via a uniform distribution over the box:.
VDEs for the Müller-Brown potential were trained with a lag time of 10 time steps; 3 hidden layers with 256 nodes each; the Swish activation function ; -value of ; batch size of 100; dropout rate of 30%; and a learning rate of . We note that these parameters were not optimized using automated hyperparameter selection. Gradient descent was performed with the Adam optimizer 
. Models were trained for 50 epochs, at which point the losses were observed to be converged. Prior to training, trajectories were preprocessed by subtracting their overall median values and scaling by inter-quartile ranges.
In constructing MSMs for the Müller-Brown potential, the scaled trajectories were then subject to dimensionality reduction using principal component analysis (PCA) , time-structure based independent component analysis (tICA) [5, 6, 7], and the pre-trained VDE, in order to generate one-dimensional representations of the system’s dynamics. We then partitioned the representations into twelves clusters using the mini-batch -means algorithm [41, 42]. Finally, the clusters were used to construct a maximum-likelihood estimated (MLE) reversible Markov state model (MSM) 
with one timescale. A lag time of 10 time steps was chosen for both MSM construction and dimensionality reduction, as the resulting models provided optimal convergence of implied timescales. The MSMs were then evaluated 100 randomly seeded hold-out datasets to generate unbiased GMRQ scores and standard errors. All trajectory generation and analyses were performed with MSMBuilder and MSMExplorer .
Finally, in order to generate “fake” trajectories using the VDE, we randomly sampled initial positions via a uniform distribution, as described above, and iteratively propagated these coordinates through the VDE for 1,000 steps, equivalent to 10,000 integrator steps. This was done for five scaling values, , evenly sampled in logspace between and to understand how the -layer affects propagation.
We demonstrate the utility of the VDE method in characterizing the folding landscape of villin headpiece domain (pdb: 2f4k), a widely-studied 35-residue fast-folding protein, referred to henceforth as villin. Simulation data for villin was generated by Lindorff-Larsen et al.. The simulation length was is
s and is strided at 2 ns for analysis. Ccontacts were used for featurization [46, 35]. VDEs for villin were trained with a lag time of 44 ns, selected to be the same as that in the optimal tICA model. Other than expanding the number of hidden layers nodes to 1024, the training procedure was identical to that of Section 5.1. The VDE was compared to an optimized tICA model for villin, as featurized with C contacts, that was identified via hyperparameter optimization . Husic et al. have indicated that C contacts are a useful featurization for representing folding processes , hence the selection of this featurization. In this model, a tICA lag time of 44 ns, 4 tICA components, and kinetic mapping  were selected according to hyperparameter optimization. To construct MSMs on tICA-transformed and VDE-transformed data, analogous steps as for the Müller potential in Section 5.1 were performed. Mini-batch -means clustering was performed with clusters for both sets of data. This was the optimal cluster number identified for tICA, and hyperparameter searching showed little influence of cluster number on MSMs from VDE-transformed data. MSMs for both tICA- and VDE- transformed models were constructed with 50 ns lag time. To obtain error estimates for MSM equilibrium populations and MSM timescales, 100 rounds of bootstrapping were performed over the original set of trajectories. The resulting ranges of values were used for error bars.
Saliency maps [31, 32, 33] were originally proposed for looking at spatial support for varied classification problems. For image data, they find spatial features that a network looks for during classification, i.e., by asking how much does any individual pixel contribute to the final prediction. This is done by back-propagating from the desired class score through the network and into the image pixels. Similar to tICA loadings, the magnitude of the derivative can then be used to gauge feature importance per output class. An alternative closely related method, namely guided back-propagation, only propagates the positive derivatives through the network.
Saliency maps were designed for classification algorithms and thus needed to be modified for our application (Algorithm 2). Briefly, we first generate a faux two-step trajectory starting from a random protein conformation, for instance, a misfolded state, and going to the desired protein conformation, for instance, the folded state. The misfolded state is propagated through the network, and the residual to the folded state (Figure 6a) is propagated back to obtain loadings on individual distances. Ideally, this is done for a large number of possible misfolded to folded transitions to obtain robust saliency maps. The median values for each feature across all of these maps can then be integrated to obtain residue level statistics or rank ordered to find important features.It is worth noting that our method is different from classical saliency scoring whereby only the desired class label score is propagated backwards.
We note that both the VDE’s noise parameter and the autocorrelation loss should be set to 0 for consistent results and numerical stability. We also recommend computing the saliency scores multiple times across many configurations and averaging out the results. Lastly, we note that the protein saliency maps can be used in a variety of different protein deep learning algorithms, including VAMPNets  and TAE .
CXH, HKWS, MMS, and BEH performed analysis and wrote the manuscript. CXH, HKWS, MMS, BEH, and VSP conceived of the methodology and edited the manuscript. VSP supervised the project.
We would like to thank J. Chodera, P. Eastman, E. Feinberg, and R. Sharma for insightful discussions. We thank A. Peck for help copy-editing. We acknowledge funding from NIH grants U19 AI109662 and 2R01GM062868 for their support of this work. CXH and HKWS acknowledge support from NSF GRFP (DGE-114747). M.M.S would like to acknowledge support from the National Science Foundation grant NSF-MCB-0954714. This work used the XStream computational resource, supported by the National Science Foundation Major Research Instrumentation program (ACI-1429830), as well as the Sherlock cluster, maintained by the Stanford Research Computing Center.
VSP is a consultant and SAB member of Schrodinger, LLC and Globavir, sits on the Board of Directors of Apeel Inc, Freenome Inc, Omada Health, Patient Ping, Rigetti Computing, and is a General Partner at Andreessen Horowitz.
A deep learning framework for financial time series using stacked autoencoders and long-short term memory.PloS one 12, e0180944 (2017).
Osprey: Hyperparameter optimization for machine learning.J. Open Source Software 1, 00034 (2016).
Web-scale k-means clustering.In Proceedings of the 19th international conference on World wide web - WWW ’10, 1177 (ACM Press, New York, New York, USA, 2010).
MSMExplorer: Data visualizations for biomolecular dynamics.J. Open Source Software 2 (2017).