I Introduction
The spectral analysis of transfer operators such as the Perron–Frobenius or Koopman operator is by now a wellestablished technique in molecular conformation analysis Schuette1999 ; pande2010everything ; msm_milestoning ; CN14
. The goal is to estimate—typically from long molecular dynamics trajectories—the slow conformational changes of molecules. These slow transitions are critical for a better understanding of the functioning of peptides and proteins. Transfer operators propagate probability densities or observables. Since these operators are infinitedimensional, they are typically projected onto a space spanned by a set of predefined basis functions. The integrals required for this projection can be estimated from training data, resulting in methods such as
extended dynamic mode decomposition (EDMD) WKR15 ; KKS16 or the variational approach of conformation dynamics (VAC) NoNu13 ; NKPMN14 . For a comparison of these methods, see Ref. KNKWKSN18, . A similar framework has been introduced by the machine learning community. Probability densities as well as conditional probability densities can be embedded into a reproducing kernel Hilbert space (RKHS) MFSS16 . These embedded probability densities are then mapped forward by a conditional mean embedding operator, which can be viewed as an embedded Perron–Frobenius operator KSM17 . An approximation of this operator in the RKHS can be estimated from training data, leading to methods that are closely related to EDMD and its variants. In the same way, this approach can be extended to the conventional Perron–Frobenius and Koopman operator, under the assumption that the probability densities or observables are functions in the RKHS KSM17 .In this paper, we will derive different kernelbased empirical estimates of transfer operators. There are basically two ways to obtain a finitedimensional approximation of such operators: The first one explicitly uses the feature map representation of the kernel, the second is based on kernel evaluations only. The advantage of the former is that the size of the resulting eigenvalue problem depends only on the size of the feature space, but not on the size of the training data set (this corresponds to EDMD or VAC). However, this approach can in general not be applied to the typically highdimensional systems prevalent in molecular dynamics due to the curse of dimensionality and furthermore requires an explicit feature space representation, i.e., an explicit basis of the approximation space. For the kernelbased variant, the size of the eigenvalue problem is independent of the number of basis functions—and thus allows for implicitly infinitedimensional feature spaces—, but depends on the size of the training data set (this corresponds to kernel EDMD or kernel TICA). Kernelbased methods thus promise increased performance and accuracy in transfer operatorbased conformation analysis. The framework presented within this work allows us to derive both approaches (classical and kernelbased) as different approximations of kernel transfer operators and thus subsumes the aforementioned methods.
We will show that the eigenvalue problem for a given transfer operator can be transformed by rewriting eigenfunctions in terms of the integral operator associated with the kernel. The collocationbased discretization of this modified problem formulation then directly results in purely kernelbased methods, without having to
kernelize existing methods. (By kernelizewe mean formally rewriting an algorithm in terms of kernel evaluations only.) Furthermore, we show that for stochastic systems it is possible to circumvent the drawback that the amount of usable training data is limited by the maximum size of the eigenvalue problem that can be solved numerically using averaging techniques, which can result in more accurate and smoother eigenfunctions. We will apply the derived algorithms to realistic molecular dynamics data and analyze the alanine dipeptide and the protein NTL9. The conformations obtained for NTL9 are comparable to results computed using deep neural networks, see Ref.
Noe_VAMPnets2018,, and demonstrate that our approach is competitive with stateoftheart deep learning methods.
The remainder of this paper is structured as follows: In Section II, we will briefly introduce transfer operators, reproducing kernel Hilbert spaces, and the integral operator associated with the kernel. We will then show in Section III that methods to estimate eigenfunctions of transfer operators from training data can be derived using the integral operator associated with the kernel. Section IV illustrates that it is possible to obtain accurate eigenfunction approximations even for highdimensional molecular dynamics problems. Future work and open questions are discussed in Section V. The appendix contains detailed derivations of the methods, which are based on the Mercer feature space representation of a kernel, and proofs. Moreover, we will define kernel transfer operators and compare our proposed algorithms with other wellknown methods such as kernel EDMD and kernel TICA, which have been proposed in Refs. WRK15, ; SP15, .
Ii Prerequisites
In what follows, we will introduce transfer operators, kernels, and reproducing kernel Hilbert spaces. The notation used throughout the manuscript is summarized in Table 1.
number of snapshots  

number of basis functions  
kernel  
integral operator associated with  
eigenvalues and eigenfunctions of the integral operator  
Koopman and Perron–Frobenius operator  
kernel Koopman and Perron–Frobenius operator  
eigenvalues and eigenfunctions of transfer operators 
ii.1 Transfer operators
We briefly motivate the usefulness of transfer operators for conformational metastability analysis. For a more detailed introduction, we refer to Refs. LaMa94, ; KKS16, .
Let the molecular process be given by a stochastic process on some highdimensional state space , for instance, the space of Cartesian coordinates of all atoms of the system. One can imagine the system as a random walk in some highdimension potential energy landscape, i.e., could be described by an overdamped Langevin equation (see Example II.1), but we do not require this explicitly.
As a stochastic process, the pointwise evolution of is formally described by the system’s transition density function , which gives the probability to find the system at a point after some lag time , given that it started in at time 0. Using , we can express the evolution of arbitrary observables and densities under the dynamics using socalled transfer operators. Consider some observable of the system, and the value of this observable at some point . Then the expected value of this observable of the system that started in and evolved to time is given by
The operator is called the Koopman operator of the system. In addition to the lag time , it depends on the initial measure , which in our case will be either the Lebesgue measure or the system’s invariant measure (also known as Gibbs measure). This will lead to slightly different interpretations of the evolved observables (more on that later) and allow us to utilize data points sampled in different ways in the algorithms.
Similarly, we can describe the evolution of densities: Assume the initial distribution of the system is given by the density . Then the distribution of the evolved system at time is given by
where is called the Perron–Frobenius operator of the system. In the special case of being a deterministic system, i.e., with some invertible flow map , we get the simplified definitions
Even though describing the system’s dynamics by operators may seem overly abstract compared to for example an SDE or the associated transport equations, this formalism has proven useful for metastability analysis. In fact, the basis of a multitude of computational methods is the realization that all information about the longterm behavior of the dynamics is contained within certain eigenfunctions of the transfer operators. Consider for instance the eigenproblem of the Koopman operator
(1) 
where the solution is a function in a certain Hilbert space. Under mild conditions, all eigenvalues are real and bounded from above by 1. Crucially, eigenfunctions corresponding to eigenvalues close to 1 reveal the location of metastable sets in SS13 ; DJ99 ; KNKWKSN18 . Analogue results hold for the Perron Frobenius eigenvalue problem
(2) 
Example II.1.
We consider the process described by the overdamped Langevin equation, i.e., the stochastic differential equation (SDE)
with the quadruplewell potential and the inverse temperature . Here,
describes a standard Wiener process (white noise). The potential on the domain
is illustrated in Figure 1 (a).As demonstrated in Figure 1 (b)–(d), the location of the metastable sets in can be identified by the sign structure of the dominant eigenfunctions. This can be done algorithmically by applying a spectral clustering
algorithm such as PCCA+ to the eigenfunctions. The end goal is the construction of a Markov State Model (MSM), in which the main metastable states form the states of a Markov chain
SS13 . In this example, this model would be entirely described by a transition rate matrix. This is a vastly reduced model compared to the original twodimensinal SDE, and still preserve the statistics of the longtime metastable transitions.This system will be revisited in Section IV, where it will be analyzed with the newly derived kernelbased methods. Moreover, the application of PCCA+ to approximated eigenfunctions will be demonstrated in Sections IV.2 and IV.3.
Galerkin approximation of transfer operators
Formally, the operators and are defined on the function space , which is an infinitedimensional Hilbert space with inner product . The solution to the above eigenproblems are therefore again functions in . Thus, in order to be numerically solvable, the problems need to be discretized
, i.e., approximated by finite, vectorvalued eigenproblems. We demonstrate this with the aid of the Koopman eigenproblem (
1). For (2), the procedure works in a similar fashion.Consider a set of linearly independent functions , and all the functions that can be expressed by linear combinations of the these functions:
Each such function is uniquely defined by its coefficient vector . Together, these functions form an dimensional subspace of , called the Galerkin space, and denoted by in what follows.
If is a “rich” approximation space—particularly, if its dimension is high enough—it can reasonably be assumed that the eigenproblem (1) can be approximately solved by an element from :
(3) 
There now exist various ways to construct from (3) a welldefined vector–valued eigenproblem of the form
(4) 
for the coefficient vector of . Conceptually, however, the most important step for discretizing the eigenproblem has already been completed, namely by assuming that its (approximate) solution lies in the finitedimensional function space . Thus, the greatest influence on the final approximation error is the choice of the finite basis .
For example, dividing regularly into boxes,
, the characteristic functions
for can be used as Galerkin basis. In this case the matrix in (4) is the transition matrix between the boxes. Its elements describe the probability of the system to end up at in box after starting in , which can be estimated from simulation data.
Other commonlyused types of Galerkin basis functions include polynomials or trigonometric polynomials over up to a certain degree. The approximation quality strongly depends on the number of basis functions, determined for instance by the resolution of the box partition or the maximum degree of the polynomials. In particular, the Galerkin and similar approaches suffer from the curse of dimensionality, the phenomenon that the size of the basis needs to grow exponentially with the dimension in order to maintain a given approximation error. This renders the Galerkin approach prohibitively expensive for highdimensional problems. We will see below how kernel methods—by implicitly defining bases for the solution space—can mitigate these problems.
Remark II.2.
With the Galerkin approach the derivation of a vectorvalued eigenvalue problem (4) is quite straightforward once an explicit basis of the Galerkin space is given. However, in order for the kernelbased approach with its implicit basis to show its advantages, we need to keep working with the original operator formulation of the eigenproblems (1) and (2) for now. This will also enable us to systematically separate approximation errors that arise from restricting the solution space from errors that are due to finite sampling of the dynamics. In the end, however, we will again obtain a databased matrix approximation of the original eigenproblem.
ii.2 Kernels and their approximation spaces
Moving away from dynamical systems and their transfer operators for a moment, we introduce kernels and their reproducing kernel Hilbert spaces. This introduction is based on Refs.
Schoe01, ; StCh08, .Let be a positive definite function, called the kernel. Associated with each kernel is a unique space of functions , called the reproducing kernel Hilbert space (RKHS), that will serve as an analogue to the previouslyintroduced Galerkin space . However, the elements of in general cannot be written by a finite combination of predefined basis functions. Instead, the defining property of is that each can be linearly combined by evaluations of the kernel:
It is important to note that the set of evaluation points , as well as the number of elements of the index set , depends on the function . Thus, for each element of , an uncountable infinite number of “basis functions” are available to express it. It is therefore no surprise that the approximation space can be incomparably “richer” than the Galerkin approximation space. However, how rich the space ultimately is depends on the choice of the kernel.
Example II.3.
Examples of commonlyused kernels are:

[label=(), wide, labelwidth=!, labelindent=0pt]

Polynomial kernel of degree :
with . It can be shown that in this case is identical to the Galerkin space of polynomials over up to degree .

Gaussian kernel:
where is a parameter. This important kernel induces an infinitedimensional RKHS that is dense in StCh08 , i.e., a very rich approximation space.

Given functions , with , then
defines a kernel and is the Galerkin space spanned by the basis
A kernel can be interpreted as a similarity measure. In point (iii) in the example above, this similarity measure is just the standard Euclidean inner product between the dimensional features and of and .
However, the kernel is typically not defined by a collection of explicitly given features. Rather, each kernel implicitly defines a possibly infinite collection of features. Thus, by choosing the kernel as some similarity measure that is “relevant” to the molecular system at hand, one can formulate and solve the eigenvalue problem in the possibly infinitedimensional feature space with highly expressive features, without having to explicitly construct, compute, and discretize those features.
For example, the kernel may be based on comparing a large collection of unrelated but independently relevant observables at the points and , drawn from a database, or measure an appropriate distance between possibly highdimensional or unwieldy representations of and , for which however distances can more easily be computed. In Section IV.3, we will see how to construct a kernel from a distance measure based on the contact map representation of proteins. Moreover, as kernels do not necessarily require vectorial input data, a kernel based on the graph representation of a molecule can also be definedBorgwardt2005 ; Vishwanathan2010 .
Integral operators
In order to formulate eigenproblems in instead of the original space , we require the following transformation map:
The operator is called the integral operator. Although it looks like a transfer operator, these concepts are unrelated—there is no dynamics in the definition of
. Applying this integral operator to a function can be viewed as a transformation, similar to a Fourier transform. It is the infinitedimensional equivalent of a basis change.
For technical reasons, it is advantageous to think of the image space of not as but again as . We therefore additionally define the embedding operator
where is the identification of a function with its equivalence class in . The relationship is illustrated in the following diagram:
Moreover, every function is integrable and, in what follows, we will consider as a subset of , although this is technically not entirely correct since contains functions and equivalence classes.
Now, instead of discretizing the transfer operator eigenvalue problem directly, we will first rewrite it in terms of transformed eigenfunctions. A discretization of this reformulated eigenvalue problem will finally result in algorithms that require only evaluations of the kernel in given data points.
Iii Kernelbased discretization of eigenvalue problems
In this section, we derive algorithms for the numerical solution of transfer operator eigenvalue problems in reproducing kernel Hilbert spaces that rely only on kernel evaluations in sampling data. Summarized, the main advantages of this approach are:

[wide, labelwidth=!, labelindent=0pt]

Depending on the chosen kernel, the approximation space can in principle be infinitedimensional and dense in , allowing for accurate approximations of transfer operators and their eigenfunctions.

The eigenfunction approximation error will primarily depend on the number of data points (and the right choice of the kernel) instead of the system dimension. This makes the approach applicable even to highdimensional problems without triggering the curse of dimensionality.

Instead of explicitly constructing (highdimensional) feature spaces, kernels based on chemically relevant distance measures can be used that implicitly work in these spaces.
The derivations of the new algorithms are surprisingly simple once we assume the following:

[wide, labelwidth=!, labelindent=0pt]

The eigenfunctions are in . That is, we consider transfer operators restricted to . This is similar to the Galerkin approach, where we seek the solution to the eigenproblems only in a finitedimensional Galerkin space , with the difference that the RKHS can still be infinitedimensional. Depending on the kernel, this can be a strong restriction, e.g., if is a polynomial kernel of low order (whose RKHS is of low finite dimension), or a mild restriction, e.g., if is the Gaussian kernel since the resulting RKHS is then infinitedimensional and dense in .

For any eigenfunction , we can find a function such that . That is, is the embedding of into the RKHS. This technical requirement can in fact be shown to always be fulfilled under mild conditions, see Lemma A.6 in the appendix.
iii.1 Koopman operator for deterministic systems
We initially consider deterministic dynamical systems, i.e., . In order to obtain kernelbased methods, we will first rewrite the eigenvalue problem (1) based on the above assumptions. We now want to find functions such that
(5) 
and then set to obtain an eigenfunction. Using the definition of the Koopman operator for deterministic systems we get for the left hand side
With this the eigenproblem (5) becomes
(6) 
which has to be satisfied for all points .
We assume now that we have training data , , where the are distributed and . Typically, the data is given either by many short simulations or few long simulations of the system. In a first approximation step, we now relax the transformed eigenproblem (6), by requiring that it only needs to be satisfied in the data points:
for . Finally, in a second approximation step, we estimate the integrals also from the training data by Monte Carlo quadrature, i.e.,
for . With the Gram matrix and timelagged Gram matrix defined by and , this can be written in matrix form as
where . This matrix eigenvalue problem can now be solved numerically.
Discretizing a functionvalued equation by requiring it to hold only at specific evaluation points is called a collocation method in the numerical analysis of differential equations. To again obtain a solution that is defined in all of , we also replace the integral by its estimate
Here, can be seen as a rowvector of functions. In the machine learning community, is also called feature matrix, although it is technically not a matrix.
[backgroundcolor=boxback,hidealllines=true]
Algorithm III.1.
The proposed algorithm for approximating eigenfunctions of the Koopman operator can be summarized as:

Select a kernel and compute the Gram matrices and .

Solve the eigenvalue problem .

Set as the approximation to the original eigenproblem (1).
If one is only interested in evaluated at the data points, i.e., , it can be obtained by since is the th row of .
Remark III.2.
The matrix might be singular or close to singular so that the inverse does not exist or is illconditioned. (For the Gaussian kernel, is guaranteed to be regular provided that all points , are distinct, but it might still be illconditioned.) Thus, for reasons of numerical stability, the above eigenproblem is often replaced by its regularized version
(7) 
This corresponds to Tikhonov regularization and reduces overfitting, see Ref. Schoe01, .
Algorithm III.1 is identical to the kernel EDMD formulation proposed in Ref. WRK15, , derived for explicitly given finitedimensional feature spaces by kernelizing conventional EDMD. In contrast, we directly derived a solution to the eigenproblem in the RKHS and its approximation from data. Additionally, a derivation based on empirical estimates of kernel covariance and crosscovariance operators is described in Ref. KSM17, . More details about the feature space and relationships with the aforementioned methods can be found in Appendix A.
iii.2 Perron–Frobenius operator for deterministic systems
A collocationbased approximation of the Perron–Frobenius eigenproblem (2) for deterministic systems can be derived in a similar fashion. It can be shown that
(8) 
We again set , which results in . In order for this equation to be satisfied, the expression on the left must be a function in since we assume to be in . Applying to both sides leads to transformed functions in , we are basically just changing the coefficients of the series expansion, see Lemma A.6. This results in the following transformed, yet equivalent eigenvalue problem: We seek functions such that
and set . Exploiting (8), where , we can rewrite this as
which we discretize in the same way as above—collocation in , , and Monte Carlo approximation of the integrals—and obtain the discrete eigenvalue problem
(9) 
Here, . From this, an eigenfunction approximation can again be obtained by setting . Assuming that is regular, this can be simplified by substituting .
[backgroundcolor=boxback,hidealllines=true]
Algorithm III.3.
An eigenfunction of the Perron–Frobenius operator can be approximated as follows:

Select a kernel and compute the Gram matrices and .

Solve the eigenvalue problem .

Set as the approximation to the original eigenproblem (2).
If we are only interested in the eigenfunctions evaluated at the training data points, denoted again by , this leads to since the evaluated in all points results in as shown above. This is consistent with the algorithm for computing eigenfunctions of the Perron–Frobenius operator proposed in Ref. KSM17, . To avoid overfitting, is typically regularized as before.
iii.3 Extension to stochastic systems
In practice, the methods introduced above can be applied to stochastic dynamical systems as well, by heuristically treating data pairs
that are in fact realizations of a stochastic dynamics as generated by a deterministic system. In that situation, however, a large number of data pairs are required in order to compensate for the variance in the stochastic dynamics. The number
of data points that can be taken into account is however limited due to memory constraints given by the Gram matrices. To address this, we derive outcomeaveraged Gram matrices, denoted by and , that take into account the stochasticity of the system in a more efficient way.For the stochastic case, the Koopman operator applied to the kernel takes the form
The integral can be approximated by Monte Carlo quadrature
(10) 
where the are endpoints of independent realizations of the dynamics all starting in .
As in Section III.1, we discretize (5) by Monte Carlo quadrature and demand it to hold only in data points , to obtain
Inserting the new approximation (10) of the stochastic Koopman operator leads to
where the , are independent realizations of the dynamics starting in . With the matrix
(11) 
called the averaged Gram matrix, this can be written in matrix form as
For the Perron–Frobenius operator, the procedure can be repeated in a similar fashion, which then leads to the matrix eigenproblem
with the averaged Gram matrix . Thus, we can again use the Algorithms III.1 and III.3 defined above, with the only difference that we replace the timelagged Gram matrices by the corresponding averaged counterparts. If we choose , we obtain the standard algorithms as a special case, also for stochastic systems.
iii.4 Approximation from timeseries data
In order to compute the averaged Gram matrices (11), many trajectories of length per test point must be generated, which is straightforward for simple systems, but might be cumbersome or infeasible for complex molecular systems where the data is in general only given in form of a long trajectory. In this situation, multiple realization per starting point are not available. We thus propose a simple heuristic method to utilize such trajectory data. The practicality of the method compared to “standard” kernel EDMD will be evaluated experimentally in Sections IV.1 and IV.2.
Let—in addition to the test points —a much larger data set and its timeevolution be given. Typically, these data sets come from a long equilibrated trajectory. Define the trajectoryaveraged Gram matrix by
(12) 
with a certain weight function . That way, not only trajectories that start exactly at , i.e., data pairs , contribute to the th entry of the Gram matrix, but also data pairs where lies near . What constitutes “near” is defined by the weight function , which should decrease with increasing distance and take its maximum for . Throughout the remainder of the paper, we will use the Gaussian weight function
where is a normalization factor. The bandwidth —similar to the parameter for the Gaussian kernel in Example II.3—controls the influence of surrounding points and has to be chosen problemdependently. Too large values result in a high contribution from dynamically unrelated points far away, whereas too small values may underutilize valuable dynamical information from points close by, decreasing the efficiency of the method.
Also note that this rather adhoc approach requires the computation of the pairwise distances between the test points and trajectory points , which may substantially increase the computational requirement compared to the standard algorithm. The usage of a cutoff radius or a locallysupported weight functions would help mitigate this.
iii.5 Additional results
For the derivations above, we assumed that the eigenfunctions are elements of the RKHS , which is in general not the case. For a polynomial kernel, for instance, the assumption would imply that all eigenfunctions can be written as polynomials, which is clearly not valid and introduces an approximation error. In Appendix A.4, we introduce the notion of kernel transfer operators, which can be regarded as approximations of transfer operators in the respective RKHS. Existing methods such as EDMD and VAC and their kernelbased counterparts can be easily interpreted as discretizations of kernel transfer operators. This is described in Appendices A.5 and A.6, respectively. Furthermore, kernel transfer operators allow for a detailed error analysis with respect to the choice of the kernel, which, however, is beyond the scope of this work.
Iv Applications
We will now demonstrate the application of the introduced algorithms to molecular dynamics problems. Similar examples, analyzed with the aid of related datadriven methods, can be found in Refs. SP13, ; NKPMN14, ; SP15, ; KNKWKSN18, ; Noe_VAMPnets2018, , for instance.
We will refer to the algorithms derived in Sections III.1 and III.2 as kernel EDMD for the Koopman and PerronFrobenius operator, respectively, the stochastic extension proposed in Section III.3 as stochastic kernel EDMD and the trajectorybased extension from Section III.4 as trajectoryaveraged kernel EDMD.
iv.1 Quadruplewell problem
Here we revisit the system from Example II.1. We want to compute the eigenfunctions of the Perron–Frobenius operator associated with this system using the various methods derived in the previous section.
Uniform distribution of test points
First, we choose a box discretization of and use the centers of the boxes as our test points. That is,
is sampled from a uniform distribution which corresponds to computing eigenfunctions of the Perron–Frobenius operator
with respect to the Lebesgue measure. As a side remark, this is equivalent to computing eigenfunctions of the Koopman operator with respect to the Gibbs measure, so the results in this section should be compared to the eigenfunctions in Figure 1 (d). We select the lag time and choose the Gaussian kernel with and regularization parameter (see below).We first apply the standard Algorithm III.3, i.e., assemble the Gram matrix using only one evolved point for each test point
. The resulting eigenvalues and normalized eigenfunctions (computed at all grid points and interpolated in between) are shown in Figure
2 (a). The eigenfunctions clearly reveal the expected four metastable sets. Between the fourth and fifth eigenvalue, there is a spectral gap.(a)  

(b) 
However, recalling that the analytic eigenfunction is supposed to be identical to the invariant density , it is clear that the approximation quality is quite low. We thus replace the standard Gram matrix by the averaged Gram matrix with realizations of the dynamics per test point. The resulting eigenfunctions can be seen in Figure 2 (b). The invariant density is visually indistinguishable from . The dependency of the error, , on is shown in Figure 3. The descend is explained by the better MonteCarlo approximation of the expectation value (10).
Test points from one long trajectory
In the same way, the eigenfunctions of the Perron–Frobenius operator with respect to the invariant density can be estimated using Algorithm III.3. To this end, we compute one long trajectory and downsample the data to obtain the same lag time as before resulting in a data set containing again 2500 data points. The results are shown in Figure 4 (a). The difference is that the eigenfunctions are not weighted by the equilibrium density anymore as it was the case in Figure 2. Now, the eigenfunctions are almost constant within the four quadrants with sharp transitions between the positive and negative regions. Note that the eigenfunctions of the Perron–Frobenius operator with respect to the Gibbs measure and the eigenfunctions of the Koopman operator with respect to the Lebesgue measure are identical, so Figure 4 should be compared to Figure 1 (c).
(a)  

(b) 
While the approximation quality appears to be adequate (the metastable sets are clearly distinguishable by the sign structure of the ), we observe some noise in the supposedly constant quadrants. Hence, we attempt to reduce this noise by using more dynamic information about the stochastic process instead of just the 2500 data pairs . We thus compute the trajectoryaveraged Gram matrix (12) using a long trajectory of frames. The distance parameter in the weight function was chosen as . The result is shown in Figure 4 (b). We observe a small improvement of the results. The dependency of the deviation of the first eigenfunction from the constant function on the trajectory length is shown in in Figure 5. The stagnation of the error can be delayed by increasing the number of test points.
iv.2 Alanine dipeptide
We now use kernel EDMD to analyze the global dynamics of the alanine dipeptide, see Figure 6 (a). It is well known that the global conformational shape of the peptide can be described by the values of only two dihedral angles located in the central backbone . A Ramachandran plot, Figure 6 (b), identifies four combinations of these angles that are metastable.
(a)  (b) 
The molecule consists of atoms, including hydrogen. A global analysis of the corresponding dimensional systems by conventional gridbased methods such as EDMD is thus out of the question. Instead, we apply the kernel EDMD method for the transfer operator to make the effort conditional on the number of snapshots instead of the number of basis functions. The number of test points was chosen to be , with starting points chosen from a 40 ns long trajectory that was generated with the Gromacs molecular dynamics softwareGromacs . Thus, the sampling measure is the Gibbs measure. The lag time was defined to be ps. In order to make best use of the remaining trajectory data, we employ the trajectoryaveraged kernel EDMD method detailed in Section III.4 on a frame subsample of the trajectory to assemble the Gram matrix.
We again use the Gaussian kernel. An appropriate choice of the parameter values of (kernel bandwidth) and (distance parameter in trajectory averaging) is critical for the performance of the method. While there have been investigations on the optimal choice of the kernel bandwidthSinger2006 , we are unaware of good strategies for choosing a priori. We thus experimentally examine the influence of the two parameters in more detail. To this end, we postprocess for different
combinations the computed eigenvectors with the PCCA+algorithm
DeuWe05 ; Roeblitz2013 ; Web18 in order to find four maximally metastable membership functions, i.e., functions , that assign each test point its probability to belong to one of four metastable states. This is also called a fuzzy clustering of the test points. For an indepth introduction to PCCA+ and the relation between membership functions and transfer operator eigenfunctions, we refer to Ref. Roeblitz2013, .Good membership functions are “unambiguous”, that is, each membership function assigns each test point a membership value close to either zero or one. The objective function
in what follows called the crispness of the fuzzy clustering, reflects that and can be thus used as a quality metric of the membership functions and their originating eigenfunctions.
The value of for different combinations of and is shown in Figure 7. No regularization has been used to avoid too many parameter dependencies. We observe a maximum value of at , . A spectral gap is present after the fourth eigenvalue. The associated four membership functions projected onto the space are shown in Figure 8 (a) and clearly indicate the four expected metastable sets. A discrete (“hard”) clustering can be constructed by assigning each test point the index of the membership function of maximal value and is shown in Figure 8 (b). These four clusters would then form the four states of an MSM.
As a remark, the amount of noise in both the fuzzy and hard clustering can be significantly reduced by applying just a slight regularization, e.g., , to the eigenproblem. Assembling the trajectoryaveraged Gram matrix took s on an eightcore desktop workstation, with the computation of the distances between the test points and trajectory points being the main contributor.
In comparison, the standard kernel EDMD algorithm without trajectory averaging, i.e., using only the test points and single realizations to assemble the Gram matrix, did not result in interpretable eigen and membership functions when using the same kernel parameter and no regularization. However, increasing the kernel parameter to and using regularization with parameter again results in a spectral gap after and membership functions that clearly indicate the metastable sets (Figure 8 (c)). Although the crispness value is lower than for the trajectory–averaged case (), the discrete clustering is less noisy, as the broader kernel and regularization have a smoothing effect on the eigenvectors. Moreover, assembling the standard Gram matrix is significantly faster, taking only s.
In conclusion, also for realistic MD problems, the heuristic trajectory averaging method results in a measurable improvement of the (fuzzy) spectral clustering if the associated influence parameter can be chosen appropriately, in addition to the kernel bandwidth . However, if one is interested only in a discrete clustering of the states, for instance to subsequently build an MSM, the standard kernel EDMD method with increased kernel and smoothing parameter yields comparable results at significantly reduced computing cost.
(a)  (c)  
(b)  (d) 
iv.3 Ntl9
Finally, we apply (standard) kernel EDMD to the fastfolding protein NTL9. The molecule consists of 301 heavy atoms (624 atoms overall), divided into 40 residues. As the source of dynamical data we utilize an allatom simulation of total length 1.11 ms that was performed on the Anton supercomputer Shaw_QuickFolding2011 . In order to best capture the system’s internal dynamics, we use a Gaussian kernel based on the contact map distance between snapshots, i.e., the kernel
(13) 
where is the contact map of . The contact map is the matrix containing the distances of each pair of backbone residues (i.e., the minimum distance between the atoms in the respective residues). Thus, using this kernel is equivalent to converting the data into a dimensional state space, which cannot be discretized by traditional gridbased Galerkin methods.
To obtain the data matrix , the trajectory is subsampled with a time step of 100 ns. Likewise, the matrix is subsampled from the trajectory, only with an offset of 50 ns from . The rows of can thus be interpreted as endpoints of single realizations of length with the starting points being the rows of . The contact maps are computed using the Gromacs softwareGromacs .
We assemble the Gram matrices and for different values of and solve the regularized eigenvalue problem (7) with regularization parameter . Figure 9 (a) shows the 15 dominant eigenvalues for varying values of . According to the variational approach of conformational dynamics, higher eigenvalues are in general associated with a better approximation of the original eigenproblem, as discretizations always underestimate the original eigenvalues NoNu13 . Thus, the mean value of the dominant eigenvalue can be used as an alternative objective function for choosing the optimal kernel bandwidth. For NTL9, the maximum is observed at . Thus, we choose this bandwidth for the further analysis.
(a)  (b) 

A spectral gap can be observed after the second eigenvalue (Figure 9 (b)), indicating that the data set can be divided into two metastable sets. Indeed, clustering the two dominant eigenvectors into two sets using the PCCA+ algorithm as described in the previous section leads to the two conformations shown in Figure 10 (a). The conformations coincide very well with the folded and unfolded states identified in Ref. Noe_VAMPnets2018, , where the same data set was analyzed using deep learning methods. We thus investigate whether we can also reproduce the second hierarchy of metastable conformations identified in Ref. Noe_VAMPnets2018, by analyzing the remaining computed eigenfunctions. The result of clustering the eigenvectors into five states using the PCCA+ algorithm can be seen in Figure 10 (b). Four of the five conformations identified in Ref. Noe_VAMPnets2018, could be accurately reproduced by the kernel EDMD algorithm together with PCCA+. The absence of the fifth conformation (named “Intermediate” in Ref. Noe_VAMPnets2018, ) can be explained by the fact that our method uses a factor of ten less data points and that the missing conformation is small (0.6% of overall states), thus easy to undersample.
(a) Two state clustering
Folded (88.0%)  Unfolded (12.0%) 
(b) Five state clustering
Folded 1 (82.2%)  Folded 2 (6.8%)  Unfolded 1 (8.3%)  Unfolded 2 (2.6%)  Misfolded (0.06%) 

V Conclusion
We have shown how the kernelbased discretization of transfer operator eigendecomposition problems leads to efficient and purely datadriven methods for the analysis of molecular conformation dynamics. The derived methods are strongly related to other wellknown approaches such as kernel EDMD, kernel TICA, and empirical estimates of kernel transfer operators (as well as conditional mean embeddings). Furthermore, we have proposed an extension to stochastic dynamical systems that is based on averaged Gram matrices and shown that these techniques lead to more accurate spectral decompositions than previous kernelbased approaches. The methods have been applied to realistic molecular dynamics simulations and compared with deep learning techniques, illustrating that kernelbased methods are able to tackle highdimensional problems for which classical discretization methods fail. Moreover, these methods can also be applied to nonstandard representation of molecules, such as contact maps or graph representations of proteins. This enables the user to define similarity measures tailored to specific problems by exploiting domain knowledge about the system.
One possibility to extend the approach presented here would be to combine it with deep kernel learningwilson2016deep
or other machine learning techniques. Both deep learning and RKHSbased methods are very powerful and their combination might help mitigate the curse of dimensionality typically associated with the approximation of transfer operators pertaining to highdimensional systems. For nonequilibrium systems, a singular value decomposition might be advantageous. The SVD of empirical estimates of kernel transfer operators (and more general empirical RKHS operators) was recently proposed in Ref.
MSKS18, and might have additional applications such as lowrank approximation of operators or computing pseudoinverses of operators. Moreover, Gaussian processes might be beneficial to include also uncertainties in these algorithms.Acknowledgements
This research has been partially funded by Deutsche Forschungsgemeinschaft (DFG) through grant CRC 1114 (Projects B03 and B06). We would like to thank D. E. Shaw for the NTL9 data set.
References
 [1] C. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard. A Direct Approach to Conformational Dynamics Based on Hybrid Monte Carlo. J. Comput. Phys., 151(1):146–168, 1999.
 [2] V. S. Pande, K. Beauchamp, and G. R. Bowman. Everything you wanted to know about Markov State Models but were afraid to ask. Methods, 52(1):99–105, 2010.
 [3] C. Schütte, F. Noé, J. Lu, M. Sarich, and E. VandenEijnden. Markov state models based on milestoning. J. Chem. Phys., 134(20):204105, 2011.
 [4] J. D. Chodera and F. Noé. Markov state models of biomolecular conformational dynamics. Curr. Opin. Struct. Biol., 25:135–144, 2014.
 [5] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley. A datadriven approximation of the Koopman operator: Extending dynamic mode decomposition. Journal of Nonlinear Science, 25(6):1307–1346, 2015.
 [6] S. Klus, P. Koltai, and C. Schütte. On the numerical approximation of the Perron–Frobenius and Koopman operator. Journal of Computational Dynamics, 3(1):51–79, 2016.
 [7] F. Noé and F. Nüske. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Modeling & Simulation, 11(2):635–655, 2013.
 [8] F. Nüske, B. G. Keller, G. PérezHernández, A. S. J. S. Mey, and F. Noé. Variational approach to molecular kinetics. Journal of Chemical Theory and Computation, 10(4):1739–1752, 2014.
 [9] S. Klus, F. Nüske, P. Koltai, H. Wu, I. Kevrekidis, Ch. Schütte, and F. Noé. Datadriven model reduction and transfer operator approximation. Journal of Nonlinear Science, 2018.
 [10] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1–2):1–141, 2017.
 [11] S. Klus, I. Schuster, and K. Muandet. Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces. ArXiv eprints, 2017.
 [12] A. Mardt, L. Pasquali, H. Wu, and F. Noé. VAMPnets for deep learning of molecular kinetics. Nature Communications, 9, 2018.
 [13] M. O. Williams, C. W. Rowley, and I. G. Kevrekidis. A kernelbased method for datadriven Koopman spectral analysis. Journal of Computational Dynamics, 2(2):247–265, 2015.
 [14] C. R. Schwantes and V. S. Pande. Modeling molecular kinetics with tICA and the kernel trick. Journal of Chemical Theory and Computation, 11(2):600–608, 2015.
 [15] A. Lasota and M. C. Mackey. Chaos, fractals, and noise: Stochastic aspects of dynamics, volume 97 of Applied Mathematical Sciences. Springer, 2nd edition, 1994.
 [16] C. Schütte and M. Sarich. Metastability and Markov State Models in Molecular Dynamics: Modeling, Analysis, Algorithmic Approaches. Number 24 in Courant Lecture Notes. American Mathematical Society, 2013.
 [17] M. Dellnitz and O. Junge. On the approximation of complicated dynamical behavior. SIAM Journal on Numerical Analysis, 36(2):491–515, 1999.

[18]
B. Schölkopf and A. J. Smola.
Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond
. MIT press, Cambridge, USA, 2001.  [19] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
 [20] K. M. Borgwardt, C. S. Ong, S. Schönauer, S.V.N. Vishwanathan, A. J. Smola, and H.P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
 [21] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.
 [22] C. R. Schwantes and V. S. Pande. Improvements in Markov State Model construction reveal many nonnative interactions in the folding of NTL9. Journal of Chemical Theory and Computation, 9:2000–2009, 2013.
 [23] H.J.C. Berendsen, D. van der Spoel, and R. van Drunen. Gromacs: A messagepassing parallel molecular dynamics implementation. Computer Physics Communications, 91(1):43 – 56, 1995.
 [24] A. Singer. From graph to manifold laplacian: The convergence rate. Applied and Computational Harmonic Analysis, 21(1):128 – 134, 2006. Special Issue: Diffusion Maps and Wavelets.

[25]
P. Deuflhard and M. Weber.
Robust Perron cluster analysis in conformation dynamics.
Linear Algebra and its Applications, 398:161 – 184, 2005. Special Issue on Matrices and Mathematical Biology.  [26] S. Röblitz and M. Weber. Fuzzy spectral clustering by PCCA+: application to markov state models and data classification. Advances in Data Analysis and Classification, 7(2):147–179, Jun 2013.
 [27] M. Weber. Implications of PCCA+ in molecular simulation. Computation, 6(1), 2018.
 [28] K. LindorffLarsen, S. Piana, R. O. Dror, and D. E. Shaw. How fastfolding proteins fold. Science, 334(6055):517–520, 2011.
 [29] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
 [30] M. Mollenhauer, I. Schuster, S. Klus, and C. Schütte. Singular value decomposition of operators on reproducing kernel Hilbert spaces. ArXiv eprints, 2018.
 [31] C. Baker. Joint measures and crosscovariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

[32]
K. Fukumizu, F. R. Bach, and M. I. Jordan.
Dimensionality reduction for supervised learning with Reproducing Kernel Hilbert Spaces.
Journal of Machine Learning Research, 5:73–99, 2004.  [33] M. Korda and I. Mezić. On convergence of Extended Dynamic Mode Decomposition to the Koopman operator. ArXiv eprints, 2017.
Appendix A Detailed derivations and relationships with other methods
We will now present an alternative derivation of the operators derived in Section III and highlight relationships with other wellknown techniques to approximate eigenfunctions of transfer operators, see also Refs. 9, 11. To this end, we first introduce kernel transfer operators. Different discretizations of these operators then result in methods such as EDMD or VAC and their kernelbased counterparts. Markov State Models (MSMs), in turn, can be seen as a special case of EDMD where the basis contains indicator functions for a partition of the state space.
a.1 Mercer feature space
For the following proofs, we will need the Mercer feature space representation of the kernel, which can be constructed by considering the eigenvalues and eigenfunctions of the integral operator associated with the kernel . The derivation of this representation is mainly based on Ref. 19.
There exist an at most countable orthonormal system of eigenfunctions and positive eigenvalues of such that
for any . We assume the eigenvalues to be sorted in nonincreasing order. For the sake of simplicity, let in what follows.
Theorem A.1 (Mercer’s theorem).
For , it holds that
where—for infinitedimensional feature spaces—the convergence is absolute and uniform [18].
We can then construct the corresponding RKHS explicitly.
Theorem A.2.
Let and , with and . Define
then with this inner product is the RKHS associated with the kernel .
Note that the kernel defined above indeed has the reproducing property: Given a function , we obtain using the definition of the inner product for
We define the projection of a function onto the reproducing kernel Hilbert space as follows:
Definition A.3 (Orthogonal projection).
Given a function , the orthogonal projection onto is defined as
a.2 Covariance operators
Let be the Mercer feature space representation associated with the kernel . Instead of writing , we will also use the shorter form , even though might be a (countably) infinite set.
Definition A.4 (Covariance operators[31, 10]).
Given . The covariance operator and the crosscovariance operator are defined as
where
and
and
denotes the tensor product.
For vectorvalued functions, the Koopman operator is defined to act componentwise. In Refs. 10, 11, the covariance operator is defined with respect to the joint probability measure. However, note that due to the definition of the Koopman operator this is equivalent since . Given training data , , as above, we define
The empirical estimates of the matrix representations of the operators are given by
Remark A.5.
Note that for the feature matrix in Section III we used the canonical feature map , whereas we now use the Mercer feature map , where is an element in a potentially infinitedimensional vector space. As long as our algorithms rely only on kernel evaluations, it does not matter which feature space we consider since for both and the resulting Gram matrices are identical.
a.3 Properties of the integral operator
The following lemma summarizes useful properties of the different operators.
Lemma A.6.
It holds that:

[label=()]

for .

for .

Any function can be written as with provided that .
Proof.
The properties follow directly from the series expansion of .
(i) The matrix is a diagonal matrix whose entries are the eigenvalues of the integral operator since
so that .
(ii) and is a mapping from to .
(iii) For , define , where .
∎
It follows in particular that for .
Remark A.7.
For infinitedimensional feature spaces, the assumption will not be satisfied for all . By setting for , where is fixed, the equation can, however, be satisfied approximately. In practice, since only finitely many data points are considered, we can always construct a finitedimensional feature map (e.g., the datadependent kernel map, see Ref. 18).
a.4 Kernel transfer operators
Kernel transfer operators, which were introduced in Ref. 11, can be regarded as approximations of transfer operators in RKHSs. The underlying assumption is that both the densities or observables and the densities or observables propagated by the Perron–Frobenius or Koopman operator, respectively, are functions in the RKHS , which, in general, is not the case. Empirical estimates of these operators result, under certain conditions, in methods such as EDMD [5] or VAC [8] or their kernelbased counterparts [13, 14]. Let us start with a brief description of kernel transfer operators, see also Ref. 11. The derivation here is slightly different in that we directly use the Mercer feature space representation. The goal is to illustrate connections with Galerkin approximations.
Proposition A.8.
Let with and , then
Proof.
We show the second part, the first one follows analogously:
Similarly,
where we used the definition of the inner product in in the last step. ∎
Proposition A.9.
Assuming that for all , it holds that .
The proof—formulated in terms of expectation values rather than transfer operators—can be found in Ref. 32. The assumption corresponds to being an invariant subspace of the Koopman operator. If is invertible, we can immediately define the kernel Koopman operator as . Otherwise, the regularized version is used. An analogous result can be obtained for the Perron–Frobenius operator.
Proposition A.10.
Assuming that for all , we obtain .
Proof.
A similar result was shown in Ref. 11. It follows that we can define the kernel Perron–Frobenius operator as , where regularization might be required as above. In general, however, will not be invariant under the action of the Koopman or Perron–Frobenius operator.
a.5 EDMD and VAC
In what follows, we will analyze the convergence properties of the empirical estimates of the kernel transfer operators, given by
and
It is important to remark that these operator approximations can only be computed numerically if the feature space of the kernel is finitedimensional. However, the resulting eigenvalue problems can be reformulated in such a way that only kernel evaluations are required. The feature space representation is then never computed explicitly, but only implicitly through the kernel. Assuming that the feature space is finitedimensional, we obtain the following algorithm to estimate eigenfunctions of transfer operators, where denotes the Hermitian transpose.
[backgroundcolor=boxback,hidealllines=true]
Algorithm A.11.
In order to approximate eigenfunctions of the Koopman operator:

Select a set of basis functions and compute the matrices and .

Solve the eigenvalue problem .

Set .
For the Perron–Frobenius operator, it suffices to replace by as described above. These empirical estimates are equivalent to VAC and EDMD, see also Ref. 11. That is, in the infinitedata limit, we obtain the Galerkin approximation of the corresponding operator in the RKHS , cf. Refs. 5, 6. Note that for EDMD does not necessarily have to be the Mercer feature space representation of the kernel. The Mercer feature space is orthogonal with respect to the inner product in so that the empirical estimate of the covariance matrix converges to and its inverse is simply . The convergence of EDMD to the Koopman operator for , where is the dimension of the state space, i.e., the cardinality of the index set , was first studied in Ref. 33.
The equivalence of the above algorithm and the methods derived in Section III—from a linear algebra point of view—was shown in Ref. 13, where EDMD for the Koopman operator was kernelized to obtain kernel EDMD. Similarly, equivalence of the different empirical RKHS operator eigenvalue problems—from a functional analytic point of view—was shown in Ref. 30.
a.6 Kernel EDMD and kernel TICA
In Ref. 13, kernel EDMD for the Koopman operator was derived by kernelizing standard EDMD. A different derivation based on kernel transfer operators and embedded kernel transfer operators was presented in Ref. 11. For , which corresponds to no averaging to compute timelagged Gram matrices, the methods derived in Section III are identical to (extensions of) kernel EDMD. The main difference is that in Section III we directly discretized the transfer operator eigenvalue problem. This allows for exploiting the stochasticity of the system to improve the numerical stability of kernelbased approaches for molecular dynamics problems and other stochastic dynamical systems.
Similarly, our approach is also related to kernel TICA, which was proposed in Ref. 14. However, for the derivation of kernel TICA the system is assumed to be reversible. In order to obtain a real spectrum, the trajectory and timereversed trajectory are used for the approximation. This can be accomplished by defining feature matrices and and applying the kernelbased methods to this new data set, which doubles the size of the resulting eigenvalue problem. Furthermore, their derivation is based on a variational principle and leads to a slightly different problem, which—using our notation—can be written as
where is the augmented Gram matrix given by
Note that for the augmented Gram matrices
Thus, this is equivalent to (9) for the timereversed system. As before, for large enough , the regularization term guarantees that the matrix on the right will be positive definite.