Capabilities and Limitations of Time-lagged Autoencoders for Slow Mode Discovery in Dynamical Systems

by   Wei Chen, et al.
The University of Chicago

Time-lagged autoencoders (TAEs) have been proposed as a deep learning regression-based approach to the discovery of slow modes in dynamical systems. However, a rigorous analysis of nonlinear TAEs remains lacking. In this work, we discuss the capabilities and limitations of TAEs through both theoretical and numerical analyses. Theoretically, we derive bounds for nonlinear TAE performance in slow mode discovery and show that in general TAEs learn a mixture of slow and maximum variance modes. Numerically, we illustrate cases where TAEs can and cannot correctly identify the leading slowest mode in two example systems: a 2D "Washington beltway" potential and the alanine dipeptide molecule in explicit water. We also compare the TAE results with those obtained using state-free reversible VAMPnets (SRVs) as a variational-based neural network approach for slow modes discovery, and show that SRVs can correctly discover slow modes where TAEs fail.



There are no comments yet.


page 5


Inference of modes for linear stochastic processes

For dynamical systems that can be modelled as asymptotically stable line...

Learning Deep Neural Network Representations for Koopman Operators of Nonlinear Dynamical Systems

The Koopman operator has recently garnered much attention for its value ...

Extraction of Discrete Spectra Modes from Video Data Using a Deep Convolutional Koopman Network

Recent deep learning extensions in Koopman theory have enabled compact, ...

Deep Learning Enhanced Dynamic Mode Decomposition

Koopman operator theory shows how nonlinear dynamical systems can be rep...

iLQR for Piecewise-Smooth Hybrid Dynamical Systems

Trajectory optimization is a popular strategy for planning trajectories ...

Discovery of slow variables in a class of multiscale stochastic systems via neural networks

Finding a reduction of complex, high-dimensional dynamics to its essenti...

Dynamical Kinds and their Discovery

We demonstrate the possibility of classifying causal systems into kinds ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


Estimation of the slow (i.e., maximally autocorrelated) collective modes of a dynamical system from trajectory data is an important topic in dynamical systems theory in understanding, predicting, and controlling long-time system evolution Andrew et al. (2013); Mardt et al. (2018); Pathak et al. (2018); Ye et al. (2015); Giannakis and Majda (2012); Korda and Mezić (2018a); Sharma et al. (2016); Korda and Mezić (2018b); Koopman (1931); Mezić (2005). In the context of molecular dynamics, identification of the leading slow modes is of great value in illuminating conformational mechanisms, constructing long-time kinetic models, and guiding enhanced sampling techniques Noé and Nuske (2013); Nüske et al. (2014); Schütte et al. (2001); Prinz et al. (2011); Schwantes and Pande (2015, 2013); Pande et al. (2010); Chodera and Noé (2014, 2014); Pérez-Hernández et al. (2013); Schwantes and Pande (2015); Harrigan and Pande (2017); Mardt et al. (2018); Hernández et al. (2018); Sultan et al. (2018); Wehmeyer and Noé (2018)

. Many machine learning models have been applied to learn slow modes from molecular trajectory data, some falling into the category of more traditional techniques, including time-lagged independent component analysis (TICA)

Pérez-Hernández et al. (2013); Noé and Nuske (2013); Nüske et al. (2014); Noé and Clementi (2015); Noé et al. (2016); Pérez-Hernández and Noé (2016); Schwantes and Pande (2013); Klus et al. (2018); Husic and Pande (2018), kernel TICA Schwantes and Pande (2015); Harrigan and Pande (2017), Markov state models (MSMs) Prinz et al. (2011); Pande et al. (2010); Schwantes and Pande (2013); Husic and Pande (2018); Trendelkamp-Schroer et al. (2015); Sultan and Pande (2017); Mittal and Shukla (2018); Harrigan et al. (2017); Wehmeyer et al. (2018); Scherer et al. (2018), while others employ more recently-developed deep learning models, including time-lagged autoencoders (TAEs) Wehmeyer and Noé (2018), variational dynamics encoders (VDEs) Wayment-Steele and Pande (2018); Hernández et al. (2017); Sultan et al. (2018), variational approach for Markov processes nets (VAMPnets) Mardt et al. (2018), and state-free reversible VAMPnets (SRVs) Chen et al. (2019)

. These approaches all employ variants of deep neural networks, but differ in the details of their architecture and loss function: TAEs are a regression approach that minimize time-lagged reconstruction loss, VAMPnets and SRVs are variational approaches that maximize autocorrelation of the slow modes, and VDEs can be conceived as a mixture of the regression and variational approaches. Although there are existing theoretical guarantees of slow modes discovery for variational approaches

Chen et al. (2019); Noé and Nuske (2013); Nüske et al. (2014); Schütte et al. (2001), similar theoretical guarantees for regression approaches is currently limited to linear cases Wehmeyer and Noé (2018). Specifically, linear TAEs are known to be equivalent to time-lagged canonical correlation analysis, and closely related to TICA and kinetic maps Wehmeyer and Noé (2018); Wu and Noé (2017); Hotelling (1936); Noé and Clementi (2015); Pérez-Hernández et al. (2013). In this work, we aim to fill this gap by presenting a theoretical and numerical analysis of the capabilities and limitations of TAEs as a nonlinear regression approach for slow mode discovery.

Results and Discussion

Consider a trajectory of a dynamical system where is a system configuration, or a derived featurization of the configuration, at time . We define the slowest mode for a given lag time as the functional mapping that maximizes autocorrelation for a lag time ,


where is the mean-free slow mode and

is its variance. This definition is closely related to the dominant eigenfunction of the transfer operator

Chen et al. (2019); Noé and Nuske (2013); Nüske et al. (2014); Schütte et al. (2001). A TAE seeks to estimate the slowest mode by training a time-lagged autoencoder to encode a system configuration at time into a low-dimensional latent space , and then decode this latent space embedding to reconstruct the system configuration at time . The operational principle is that minimizing the reconstruction loss at a lag time promotes discovery of slow modes within the latent space. In the following sections, we define under what conditions TAEs are able to correctly learn the slowest mode and when they will fail to do so. To simplify our discussion, we restrict our analysis to recovery of the leading slowest mode of the system. A schematic of a fully-connected feedforward TAE with a 1D latent space is presented in Fig. 1.

Figure 1: Schematic of a time-lagged autoencoder (TAE). The input configuration at time is fed into an encoder network to generate the latent space encoding . This encoding is passed into a decoder network to generate output which aims to reconstruct the configuration at a later time

. Training is performed by backpropagation and is terminated when the loss

is minimized. Image constructed using code downloaded from with permission of the author Kjell Magne Fauske.

.1 Linear time-lagged autoencoders (TAEs) can learn the slowest mode by employing whitened features

Consider a sufficiently long 2D trajectory for a stationary process such that the mean and variance do not change over time and the two components and are mean-free and mutually independent. Let the autocorrelation and the variance for component be and , respectively. Let be the latent variable, where is the encoder mapping for TAE, and be the reconstructed time-lagged output, where is the decoder. The TAE seeks to find the encoding and decoding functional mappings and to minimize the time-lagged reconstruction loss,


Let us assume that , which defines to be a slower component than . If (denoting that is a bijection of

, which in the linear case is a non-trivial linear transformation), the reconstructed output

should be a linear function of given by,


where and are constants, and the second component is 0 since does not contain information about . The corresponding time-lagged reconstruction loss is given by,

The optimal reconstruction coefficients are given by minimization of Eq. .1 with respect to and ,


and the corresponding minimal loss is given by,


where we employed the substitution . We identify the first term as the “propagation loss”, which increases as autocorrelation decreases and therefore generally increases with lag time . We identify the second term as the “irreducible capacity loss”, which reflects the fact that the one-dimensional latent variable does not contain any information about component , and is independent of lag time. Eq. 6 can be rearranged as,


where is the time-independent total variance of configurations.

If we now consider the case that by an analogous analysis the loss is given by,


From Eq. .1 and Eq. .1, we see that in the time-lagged reconstruction loss there are two contributing factors: variance and autocorrelation. By construction , so it is the objective of the TAE to learn as the slowest mode. However, if is sufficiently large compared to such that,




and the TAE loss is minimized by learning .

We also consider whether it is possible that the optimal linear mode is actually a mixed linear mode where and . It can be shown by an analogous analysis, and recalling that the two components are independent and mean free, that the minimal loss is,


and that this expression reduces to Eq. .1 for and Eq. .1 for as expected.

Using to eliminate , the minimal loss can be simplified to,


which is a monotonic function with respect to . Recalling that all variances and squared autocorrelations constrained to non-negative values, if then the loss is minimized for and , whereas if then the loss is minimized for and . Accordingly, except for the case , the loss is globally minimized by learning one of the pure component modes, the optimum is not a mixture of the two modes, and the analysis presented for the two pure modes is sufficient and complete for the case of a linear TAE with independent components.

This theoretical development demonstrates that a linear TAE is not guaranteed to find the slowest mode in the event that the associated variance of a faster mode is sufficiently large such that its erroneous identification leads to a smaller reconstruction loss. A straightforward solution to this issue is to apply a whitening transformation Wehmeyer and Noé (2018) to the input data such that , and any linear combination with has unit variance. By eliminating the variance of the learned mode as a discriminating feature within the loss functions Eq. .1 and Eq. .1, learning is performed exclusively on the basis of autocorrelation, and the linear TAE can correctly learn the slowest mode by minimizing the reconstruction loss. A rigorous proof that linear TAEs employing whitened features is equivalent to TICA for reversible process and can correctly identify the slowest mode is presented in Ref. Wehmeyer and Noé (2018).

.2 Nonlinear TAEs cannot equalize the variance explained within the input features and so cannot be assured to learn the slowest mode

We now proceed to perform a similar analysis for nonlinear TAEs. To aid in our discussion, we define and introduce the “variance explained” by the latent variable as,


which measures how much variance in the feature space can be explained by . The idea of this concept is as follows. In a standard (non-time-lagged) autoencoder, the optimal reconstruction is and the variance of the outputs is , which explains part of the variance in the feature space while leaving unexplained. In analogy to the linear case, it may be shown that for a long trajectory generated by a stationary process with finite states, the optimal loss for nonlinear TAE is approximately


where is the nonlinear generalization of autocorrelation and is the total variance of the input features. Here is the “propagation loss” and is the “irreducible capacity loss” for the nonlinear case. The details of the proof and related concepts can be found in Section .1 of the Appendix.

From Eq. 14, we see that both variance explained and the autocorrelation contribute to the TAE loss as in the linear case. If a faster mode has much larger variance explained than the true slowest mode, it is possible that nonlinear TAE would learn this faster mode in the latent encoding to minimize the loss, and in this case the nonlinear TAE fails to learn the correct slowest mode.

We demonstrate this idea in the context of “Washington beltway” potential, which consists two circular potential valleys at 0 , separated by a circular barrier of 4 (Fig. 2). The expression for the potential is given by


where . To simulate a particle moving in this potential, we conduct a Markov state model (MSM) simulation to generate a trajectory following the procedure detailed in Ref. Chen et al. (2019). Specifically, we split

into 20-by-200 evenly spaced bins, and run a 5,000,000 step MSM simulation according to transition probabilities from bin

to bin ,


where is the normalization factor that ensures .

Figure 2: Contour plot of the Washington beltway potential, which consists two circular potential valleys at 0 separated by a circular barrier of 4 .

Given the analytical expression for the potential, the slow modes are analytically calculable by computing eigenvectors of the MSM transition matrix. The first six leading slow modes with their implied timescales are presented in

Fig. 3. As expected, the slowest mode corresponds to transitions over the potential barrier radial direction (-direction). Transitions around the circular potential in the polar direction (-direction) appear as a degenerate pair within the second and third slowest modes with an order of magnitude faster timescale. Higher-order modes correspond to mixed - transitions and appear as degenerate pairs due to the circular symmetry in .

Figure 3: First six leading slow modes of the Washington beltway potential with corresponding timescales marked above each subplot. The slowest mode ( = 104,565) corresponds to transitions in the direction over the circular barrier separating the two circular valleys. The next two slowest modes form a degenerate pair ( = = 4678) corresponding to transitions around the circular valleys in the direction. The higher order modes correspond to mixed - transitions and appear as degenerate pairs due to the circular symmetry in .

We now analyze the MSM trajectory using a nonlinear TAE with a [2-50-50-1-50-50-2] architecture and tanh activation functions for encoding and decoding layers and linear activation functions for input/output/latent layers. We supply to the input and output layers

and pairs, respectively, employing a lag time of = 3000 steps. The data are naturally mean-free due to the topology of the potential landscape and are pre-whitened to normalize their variance. In Fig. 4a, we show the contour plot of the latent variable for the system. Clearly, the TAE did not correctly identify theoretical slowest mode, which should be along direction. Instead, since -direction has much larger variance explained than -direction, the TAE is biased towards identifying as slower mode in order to minimize time-lagged reconstruction loss. Therefore the learned latent variable is actually the mixture of and with most contribution coming from .

Figure 4: Slowest mode discovered by TAE and SRV. (a) The TAE incorrectly identifies as the slowest mode since the direction has much larger variance explained and therefore much larger contribution to the TAE loss. (b) The SRV successfully discovers direction as the slowest mode.

To be quantitative, we compare the TAE loss employing and as the latent variable. Due to symmetry, when (denoting that is a bijection of ), the average configuration given is , so and from Eq. 14. Conversely if , the average configuration given is , so and . For , learning as the slow mode results in a lower time-lagged reconstruction loss than learning the true slow mode . We numerically estimate from the simulation trajectory , and , from which we compute and then the estimated loss . The actual training loss is computed to be , showing that the nonlinear TAE approximately learns as the slowest mode. As a side note, if we use an optimal encoding corresponding to a bijective encoding of the feature space (see last part of Section .1 of the Appendix for details), the minimal possible loss is , which implies that is very close to the optimal encoding.

Can we somehow transform the input features to equalize the variance explained? This operation would serve the same purpose as the whitening transformation in the linear case to eliminate the variance explained as a discriminating factor in the time-lagged reconstruction loss and force the TAE to identify the slow mode based on (generalized) autocorrelation alone. We first note that such a transform is not always possible even if we know the theoretical slow modes. For instance, in our Washington beltway potential example, we cannot equalize and since the intrinsic circular symmetry assures that . What about other systems with slow modes that are possible to be theoretically whitened? We note that for all but the most trivial systems we do not know the nonlinear slow modes a priori and so there is no way to equalize the variance explained. Even if we are able to identify putative slow modes using nonlinear dimensionality reduction techniques, if these modes are identified on the basis of anything other than slowness (e.g., variance) then equalization of the variance explained can still lead to incorrect results. If they are identified only on the basis of slowness then we have already obtained the slowest mode and there is no need to apply nonlinear TAE. In short, there is no general procedure to correctly equalize the variance explained within the input features and therefore no way to guarantee that nonlinear TAEs will correctly discover the slowest mode.

.3 State-free reversible VAMPnets (SRVs) correctly identify nonlinear slow modes since the explained variance does not appear in the loss function

We now demonstrate that in contrast to TAEs, state-free reversible VAMPnets (SRVs) can correctly identify the slowest mode using the same input features. Given a trajectory , SRVs employ an artificial neural network to simultaneously discover the optimal nonlinear featurization of the input data , where is the component of the encoder output, and the linear combination of these features that maximizes the squared sum of autocorrelation of the slow modes. Full details of SRVs are presented in Ref. Chen et al. (2019). A schematic of an SRV with =1 corresponding to a 1D latent space embedding is presented in Fig. 5. The corresponding loss function for the 1D SRV is,

Figure 5: Schematic diagram of a 1D state-free reversible VAMPnet (SRV). A pair of input configurations are fed into twin fully-connected feedforward neural network lobes with shared architectures and weights to generate network outputs and . The neural network is trained with backpropagation to minimize the negative autocorrelation . Image constructed using code downloaded from with permission of the author Kjell Magne Fauske.

We apply a SRV model with [2-50-50-1] architecture and tanh activation functions for all layers except input/output layers (an analogous design to the TAE above) to the whitened MSM trajectory on the Washington beltway potential. In Fig. 4b, we show the contour plot of the learned SRV mode that correctly identifies transitions in as the slowest mode of the system. The reason why the SRV correctly identifies slowest mode is that the loss function is exactly equivalent to maximizing the autocorrelation of the learned mode with no contribution from variance explained. Accordingly, we generally recommend to use a SRV rather than a TAE for estimation of the slowest mode. Moreover, we note that when it comes to higher-order slow modes discovery through multidimensional latent variables, it is not recommended to use TAEs since there is no constraint that different components of the latent variable are orthogonal to each other, and it is therefore possible that each component becomes mixture of many slow modes. In SRVs, however, the orthogonality constraints are satisfied naturally in the variational optimization procedure and it can discover a hierarchy of slow orthogonal modes Chen et al. (2019).

.4 The “slow” mode learned by TAEs can be controlled by feature engineering

Following the ideas developed above, we now show that we can design features of a system to mislead nonlinear TAEs to learn a desired mode as the slowest mode simply by assuring that the target mode has much larger variance explained than the true slowest mode. To show this, we employ molecular dynamics simulations of alanine dipeptide as 22-atom peptide that serves as the “fruit fly” for testing new numerical methods in molecular systems (Fig. 6).

Figure 6: Molecular structure of alanine dipeptide with the two dominant backbone dihedral angles and annotated. Image rendered using VMD Humphrey et al. (1996).

It is well known from extensive prior study that the slowest mode for alanine dipeptide corresponds to transitions in the backbone dihedral angle Chen et al. (2019); Trendelkamp-Schroer et al. (2015). Is it possible to design a feature set such that a TAE is misled into learning the backbone dihedral angle as the slowest mode? To do so, consider 2D features given by,


where . The idea is to map to a ring such that encodes polar angle direction while encodes radial direction such that has much larger variance explained than . Scatter plots of the engineered feature space colored by two dihedral angles and are presented in Fig. 7.

Figure 7: Input 2D features for alanine dipeptide transformed from two dihedral angles and . (a) is encoded in radial direction while (b) is encoded in polar angle direction.

We run molecular dynamics simulation for alanine dipeptide in explicit solvent to generate a trajectory of 2,000 ns saving frames every 2 ps to generate a 1,000,000-frame trajectory. Then we use a TAE with [2-50-50-1-50-50-2] architecture and tanh activation functions for encoding and decoding layers and linear activation functions for input/output/latent layers to learn over the mean-free whitened trajectory data with features described above. We show the Ramachandran plot colored by the slowest mode discovered by TAE in Fig. 8a, which verifies that we successfully misled the TAE to learn as the slowest mode. Again, the SRV employing an analogous architecture has no difficulty in correctly identifying the slowest mode (Fig. 8b), which is consistent with the ground truth results of a state-of-the-art Markov state model trained in Ref. Chen et al. (2019) (Fig. 8c).

Figure 8: Slowest modes discovered by (a) TAE, (b) SRV with the features given by Eq. .4 and (c) the ground truth slowest mode learned by a state-of-the-art MSM. Misleading feature engineering causes the TAE to fail to discover the slowest mode whereas the SRV does so correctly using the same input features.

Similarly if we swap and and define our features according to,


then the TAE learns as the slowest mode (Fig. 9a), which is closer to the ground truth MSM results (Fig. 9c), and again the SRV has no difficulty identifying slowest mode with these features (Fig. 9b)

Figure 9: Slowest modes discovered by (a) TAE, (b) SRV with the features given by Eq. .4 and (c) the ground truth slowest mode learned by a state-of-the-art MSM. In this case, favorable feature engineering allows the TAE to correctly learn as slowest mode. Again, the SRV again correctly recovers the slowest mode from the same input features.

.5 The 1D TAE structure can be modified to explicitly equalize the variance explained within the latent space to render it equivalent to a 1D SRV

To resolve the variance explained issue for nonlinear TAE, we can modify its structure. Instead of employing an autoencoding architecture to minimize the time-lagged reconstruction loss, we can instead employ an encoder-only architecture and optimize the time-lagged loss for the “whitened” encoder output. This modification in architecture replaces the standard TAE loss given by Eq. .1 with the modified TAE loss given by,


where is the 1D encoder output for input at time , the variance in the denominator is used to correctly “whiten” the learned slow mode such that the variance (or “variance explained”) does not play a role in learning of the slowest mode.

How does the modified TAE architecture and loss function relate to the 1D SRV? Following a similar derivation to Eq. .1, we have,


where is the autocorrelation for . Therefore Eq. 20 becomes


which, up to a trivial affine transformation, is equivalent to the SRV loss given by Eq. 17. Accordingly, the modified nonlinear TAE with loss given by Eq. 22 is equivalent to a 1D SRV and can correctly identify slow modes without the misleading influence of variance explained.

.6 Variational dynamics encoders (VDEs) learn mixtures of the slowest mode and the maximum variance mode

Variational dynamics encoders (VDEs) employ a variational autoencoder architecture with a loss function comprising both reconstruction loss for inputs/outputs and the autocorrelation loss for the 1D latent variable . The loss function of VDE can be written as Hernández et al. (2018),


where is the reconstructed output, is the time-lagged reconstruction loss, is the autocorrelation loss for , and

is a linear mixing parameter. If we ignore the Kullback-Leibler divergence term

, which measures the similarity of the encoded probability distribution of

to a Gaussian distribution, can be considered a form of regularization, and goes to zero in the case of well-trained variational autoencoders

Doersch (2016), then the loss is exactly the mixture of 1D SRV loss and 1D TAE loss. Since SRVs learn the slowest mode, and TAEs learn a mixture of the slowest mode and the maximum variance mode, in general VDEs also learn a mixture of the slowest mode and the maximum variance mode. As was the case for TAEs, in any applications where we aim to find the slow modes it is not recommended to include terms associated with the explained variance (here within the reconstruction loss) and it is therefore not recommended to use VDEs for this goal.


The TAE and SRV neural networks were constructed in Python using the Keras

Chollet deep learning libraries and training performed on an NVIDIA GeForce GTX 1080 GPU card. The MSM simulations of particle motion over the “Washington beltway” potential were conducted in Python. Simulations of alanine dipeptide in water were conducted using the OpenMM 7.3 simulation suite Eastman et al. (2012, 2017) employing the Amber99sb-ILDN forcefield for the biomolecule Lindorff-Larsen et al. (2010) and TIP3P forcefield for water Jorgensen et al. (1983). The temperature and pressure were maintained at = 300 K and = 1 bar using Andersen thermostat Andersen (1980) and = 1 atm using a Monte-Carlo barostatChow and Ferguson (1995); Åqvist et al. (2004). Lennard-Jones interactions were smoothly switched to zero at a cuttoff of 1.4 nm cutoff, and Coulombic interactions were treated by particle-mesh Ewald Essmann et al. (1995) with a real space cutoff of 1.4 nm and a reciprocal space grid spacing of 0.12 nm.


In this work, we present a theoretical analysis of the capabilities and limitations of TAEs in slow mode discovery. We show that linear TAEs correctly learn the slowest linear mode with whitened features, while nonlinear TAEs cannot be assured of discovering the slowest nonlinear mode since it is not in general possible to perform an equivalent nonlinear “whitening” of the input features to equalize their variance explained. We prove the theoretical bounds for time-lagged reconstruction loss for nonlinear TAE and demonstrate that a faster nonlinear mode could be erroneously identified as the slowest mode if it has significantly higher variance explained. We validate our theoretical analysis in applications to a 2D “Washington beltway” potential and in molecular simulations of alanine dipeptide. We also show how the 1D nonlinear TAE network structure can be modified to become equivalent to the 1D SRV to remove the misleading influence of variance explained and permit variable discrimination exclusively on the basis of autocorrelation. We also show that 1D VDEs mix the loss functions of TAEs and SRVs and therefore also suffer from the misleading influence of variance explained. Accordingly, SRVs serve as a more appropriate tool than TAEs or VDEs for the discovery of slow modes. If variance explained of the modes is also of interest, then TAEs or VDEs can prove useful.


This material is based upon work supported by the National Science Foundation under Grant No. CHE-1841805. H.S. acknowledges support from the Molecular Software Sciences Institute (MolSSI) Software Fellows program (NSF grant ACI-1547580)Krylov et al. (2018); Wilkins-Diehr and Crawford (2018).


  • Andrew et al. (2013) G. Andrew, R. Arora, J. Bilmes,  and K. Livescu, in Proceedings of the 30th International Conference on Machine Learning (PMLR), Vol. 28 (2013) pp. 1247–1255.
  • Mardt et al. (2018) A. Mardt, L. Pasquali, H. Wu,  and F. Noé, Nature Communications 9, 5 (2018).
  • Pathak et al. (2018) J. Pathak, B. Hunt, M. Girvan, Z. Lu,  and E. Ott, Physical Review Letters 120, 024102 (2018).
  • Ye et al. (2015) H. Ye, R. J. Beamish, S. M. Glaser, S. C. H. Grant, C.-H. Hsieh, L. J. Richards, J. T. Schnute,  and G. Sugihara, Proceedings of the National Academy of Sciences 112, E1569 (2015).
  • Giannakis and Majda (2012) D. Giannakis and A. J. Majda, Proceedings of the National Academy of Sciences 109, 2222 (2012).
  • Korda and Mezić (2018a) M. Korda and I. Mezić, Journal of Nonlinear Science 28, 687 (2018a).
  • Sharma et al. (2016) A. S. Sharma, I. Mezić,  and B. J. McKeon, Physical Review Fluids 1, 032402 (2016).
  • Korda and Mezić (2018b) M. Korda and I. Mezić, Automatica 93, 149 (2018b).
  • Koopman (1931) B. O. Koopman, Proceedings of the National Academy of Sciences 17, 315 (1931).
  • Mezić (2005) I. Mezić, Nonlinear Dynamics 41, 309 (2005).
  • Noé and Nuske (2013) F. Noé and F. Nuske, Multiscale Modeling & Simulation 11, 635 (2013).
  • Nüske et al. (2014) F. Nüske, B. G. Keller, G. Pérez-Hernández, A. S. J. S. Mey,  and F. Noé, Journal of Chemical Theory and Computation 10, 1739 (2014).
  • Schütte et al. (2001) C. Schütte, W. Huisinga,  and P. Deuflhard, in Ergodic Theory, Analysis, and Efficient Simulation of Dynamical Systems (Springer, 2001) pp. 191–223.
  • Prinz et al. (2011) J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte,  and F. Noé, The Journal of Chemical Physics 134, 174105 (2011).
  • Schwantes and Pande (2015) C. R. Schwantes and V. S. Pande, Journal of Chemical Theory and Computation 11, 600 (2015).
  • Schwantes and Pande (2013) C. R. Schwantes and V. S. Pande, Journal of Chemical Theory and Computation 9, 2000 (2013).
  • Pande et al. (2010) V. S. Pande, K. Beauchamp,  and G. R. Bowman, Methods 52, 99 (2010).
  • Chodera and Noé (2014) J. D. Chodera and F. Noé, Current Opinion in Structural Biology 25, 135 (2014).
  • Pérez-Hernández et al. (2013) G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis,  and F. Noé, The Journal of Chemical Physics 139, 015102 (2013).
  • Harrigan and Pande (2017) M. P. Harrigan and V. S. Pande, bioRxiv preprint  (2017).
  • Hernández et al. (2018) C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic,  and V. S. Pande, Physical Review E 97, 062412 (2018).
  • Sultan et al. (2018) M. M. Sultan, H. K. Wayment-Steele,  and V. S. Pande, Journal of Chemical Theory and Computation 14, 1887 (2018).
  • Wehmeyer and Noé (2018) C. Wehmeyer and F. Noé, The Journal of Chemical Physics 148, 241703 (2018).
  • Noé and Clementi (2015) F. Noé and C. Clementi, Journal of Chemical Theory and Computation 11, 5002 (2015).
  • Noé et al. (2016) F. Noé, R. Banisch,  and C. Clementi, Journal of Chemical Theory and Computation 12, 5620 (2016).
  • Pérez-Hernández and Noé (2016) G. Pérez-Hernández and F. Noé, Journal of Chemical Theory and Computation 12, 6118 (2016).
  • Klus et al. (2018) S. Klus, F. Nüske, P. Koltai, H. Wu, I. Kevrekidis, C. Schütte,  and F. Noé, Journal of Nonlinear Science 28, 985 (2018).
  • Husic and Pande (2018) B. E. Husic and V. S. Pande, Journal of the American Chemical Society 140, 2386 (2018).
  • Trendelkamp-Schroer et al. (2015) B. Trendelkamp-Schroer, H. Wu, F. Paul,  and F. Noé, The Journal of Chemical Physics 143, 11B601_1 (2015).
  • Sultan and Pande (2017) M. M. Sultan and V. S. Pande, The Journal of Physical Chemistry B  (2017).
  • Mittal and Shukla (2018) S. Mittal and D. Shukla, Molecular Simulation 44, 891 (2018).
  • Harrigan et al. (2017) M. P. Harrigan, M. M. Sultan, C. X. Hernández, B. E. Husic, P. Eastman, C. R. Schwantes, K. A. Beauchamp, R. T. McGibbon,  and V. S. Pande, Biophysical Journal 112, 10 (2017).
  • Wehmeyer et al. (2018) C. Wehmeyer, M. K. Scherer, T. Hempel, B. E. Husic, S. Olsson,  and F. Noé, Living Journal of Computational Molecular Science 1, 5965 (2018).
  • Scherer et al. (2018) M. K. Scherer, B. E. Husic, M. Hoffmann, F. Paul, H. Wu,  and F. Noé, arXiv preprint arXiv:1811.11714  (2018).
  • Wayment-Steele and Pande (2018) H. K. Wayment-Steele and V. S. Pande, arXiv preprint arXiv:1803.06449  (2018).
  • Hernández et al. (2017) C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic,  and V. S. Pande, arXiv preprint arXiv:1711.08576  (2017).
  • Chen et al. (2019) W. Chen, H. Sidky,  and A. L. Ferguson, The Journal of Chemical Physics arXiv:1902.03336  (in press, 2019).
  • Wu and Noé (2017) H. Wu and F. Noé, arXiv preprint arXiv:1707.04659  (2017).
  • Hotelling (1936) H. Hotelling, Biometrika 28, 321 (1936).
  • Humphrey et al. (1996) W. Humphrey, A. Dalke,  and K. Schulten, Journal of Molecular Graphics 14, 33 (1996).
  • Doersch (2016) C. Doersch, arXiv preprint arXiv:1606.05908  (2016).
  • (42) F. Chollet, “Keras,”
  • Eastman et al. (2012) P. Eastman, M. S. Friedrichs, J. D. Chodera, R. J. Radmer, C. M. Bruns, J. P. Ku, K. A. Beauchamp, T. J. Lane, L.-P. Wang,  and D. Shukla, Journal of Chemical Theory and Computation 9, 461 (2012).
  • Eastman et al. (2017) P. Eastman, J. Swails, J. D. Chodera, R. T. McGibbon, Y. Zhao, K. A. Beauchamp, L.-P. Wang, A. C. Simmonett, M. P. Harrigan,  and C. D. Stern, PLOS Computational Biology 13, e1005659 (2017).
  • Lindorff-Larsen et al. (2010) K. Lindorff-Larsen, S. Piana, K. Palmo, P. Maragakis, J. L. Klepeis, R. O. Dror,  and D. E. Shaw, Proteins: Structure, Function, and Bioinformatics 78, 1950 (2010).
  • Jorgensen et al. (1983) W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey,  and M. L. Klein, The Journal of Chemical Physics 79, 926 (1983).
  • Andersen (1980) H. C. Andersen, The Journal of Chemical Physics 72, 2384 (1980).
  • Chow and Ferguson (1995) K.-H. Chow and D. M. Ferguson, Computer Physics Communications 91, 283 (1995).
  • Åqvist et al. (2004) J. Åqvist, P. Wennerström, M. Nervall, S. Bjelic,  and B. O. Brandsdal, Chemical Physics Letters 384, 288 (2004).
  • Essmann et al. (1995) U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee,  and L. G. Pedersen, The Journal of Chemical Physics 103, 8577 (1995).
  • Krylov et al. (2018) A. Krylov, T. L. Windus, T. Barnes, E. Marin-Rimoldi, J. A. Nash, B. Pritchard, D. G. Smith, D. Altarawy, P. Saxe, C. Clementi, et al., The Journal of Chemical Physics 149, 180901 (2018).
  • Wilkins-Diehr and Crawford (2018) N. Wilkins-Diehr and T. D. Crawford, Computing in Science & Engineering 20, 26 (2018).
  • Hassoun (1995) M. H. Hassoun, Fundamentals of Artificial Neural Networks (MIT press, 1995).
  • Chen and Chen (1995) T. Chen and H. Chen, IEEE Transactions on Neural Networks 6, 911 (1995).


.1 Theoretical bounds for time-lagged autoencoder reconstruction loss

We present a derivation of the theoretical bounds on the time-lagged autoencoder (TAE) reconstruction loss for a stationary process with finite states. We show that this leads to the approximate expression Eq. 14 in the main text, and that the approximation can be made arbitrarily accurate by employing sufficiently large neural network architectures. Let be the mean-free featurization of the system at time in a finite trajectory that therefore comprises a finite number of states. The process is stationary and the trajectory is long enough such that for any ,

where is the total variance of featurized configurations.

Let be the encoded embedding of , where represents the encoding mapping. Let be the decoded output of , where represents the decoding mapping. The time-lagged reconstruction loss can be written as,


We define the expected evolution after lag time given latent space encoding as,


which can be viewed as the “optimal reconstructed output” given . Note that when , denotes the average system featurization corresponding to a particular latent-space encoding .

To quantify how much variance information of is captured by , we define the variance explained,


which measures the variance of the feature average given the encoding and is consistent with our definition Eq. 13 in the main text.

Now Eq. 25 becomes,

Since we have finite number of possible values, the number of values should also be finite. Therefore we can split the summation over the N-frame trajectory into two parts (sum-splitting trick): first summing over all frames with the same value, then summing over all values. Considering the third term in Eq. .1,

Therefore Eq. .1 becomes,

We can apply the same sum-splitting idea to the third term of Eq. .1, where is total number of frames, and is the number of frames with encoding equal to ,


Therefore Eq. .1 becomes,

If we define the “generalized autocorrelation” as,


then the lower bound of the TAE reconstruction loss is given by,


Now we consider how good our lower bound is. Due to the universal approximation theorem Hassoun (1995); Chen and Chen (1995), for any , there exists a finite size decoder neural network such that the following inequality holds,


which means can be made arbitrarily close to the “optimal reconstructed output” as given in Eq. 26 by employing a sufficiently large neural network.

Therefore Eq. .1 becomes


This indicates that the lower bound is actually quite tight, and it is a relatively good approximation of the TAE loss, yielding,


corresponding to Eq. 14 in the main text.

What are the optimal encoding and the corresponding minimal possible loss for the TAE? From Eq. .1 we see that if , the optimal encoding should maximize . We define a finite set that includes all configurations mapped to the same encoding value under encoding mapping . If is a bijective encoding of , then is,


where is the number of frames with configuration and is the configuration corresponding to . For any encoding , we have,