One of the long-standing issues in modeling dynamical systems is model error arises from incomplete understanding of the physics. The progress in tacking this problem goes under different names depending on the scientific fields. In applied mathematics and engineering sciences, some of these approaches are known as the reduced-order modeling, which ultimate goal is to derive effective models from the first principle, assuming that the full dynamics is known. They include the Mori-Zwanzig formalism [42, 49, 50] and its approximations [4, 5, 11, 15, 23]; the averaging/homogenization when there are apparent scale separation between the relevant and irrelevant variables [9, 40, 41, 47]. In domain sciences, various methods for subgrid-scale parameterization
were proposed to handle the same problem that arise in applications such as material science, molecular dynamics, climate dynamics, just to name a few. They include the Markov chain type modeling[6, 21]; stochastic parameterization [1, 6, 16, 27, 30, 32, 35, 48]; superparameterization in cloud modeling [12, 22, 34] and in combustion problems [18, 19]; Direct-Interaction Approximation (DIA) for parameterizing sub-grid scale processes in isotropic turbulence  and its extensions , for modeling non-Markovian memory in inhomogeneous turbulence over topography. We should point out that this list is incomplete and these approaches share some commonality despite being developed independently and have different implementation details. Namely, the key unifying theme in these aforementioned methods is the parametric modeling assumption with a specific class of functions/distributions and with finite number of parameters.
In this paper, we consider a nonparametric modeling framework to compensate for the missing dynamical components. In our setup, suppose that the underlying full dynamics is an ergodic system of Itô diffusion with relevant components and irrelevant components . The objective is to predict the evolution of and its statistics, given only the component of the dynamics,
and a historical data set . In (1), and denote the component of the drift and diffusion terms, respectively, and denotes the standard Wiener process.
While the core of the problem is similar to those considered in the reduced-order modeling framework, the fact that we have no knowledge of the full dynamics prohibits us to derive an effective equation from the first principle. Motivated by the practical applications where the underlying dynamics are not fully understood, instead, we will use the available historical data to reconstruct the missing dynamical components. We should point out that the restriction of knowing historical measurement of the irrelevant component, , can be relaxed in some cases. When is the only available measurement, one can use for e.g., likelihood maximum estimate [25, 31] in the deterministic case or an adaptive Bayesian filtering  (when is constant and the training data is noisy) to extract the “identifiable" components of . By identifiable components, we refer to variables that depend on that appear in and , as we shall see in our numerical examples. Abusing the notation, we will denote as the identifiable components. We will clarify this notion in our numerical examples.
Our approach is to construct a nonparametric representation for conditional density from the pair of historical time series with time lag , where we have defined with and for some integer . When , has only components (similarly for , has only components). Given a discrete approximation of this transition density, the proposed reduced-order model (or closure/parameterization) is given by,
where . Notice that if the component is slow and the missing component is fast with scale gap denoted by a small parameter , the construction is essentially the effective dynamics deduced by the averaging theory [20, 26, 43] when is replaced by the invariant density of the fast dynamics for a fixed , if such density exists. In this specific situation (fast-slow system), by setting , that is , our approach effectively closes the dynamics by averaging over , where denotes the invariant density of the full dynamics. We will show that averaging over is consistent with averaging over up to order . In general case where there are no separation of scales, the choice of will be problem dependent. When and/or are strictly positive, we refer to the conditional density as a non-Markovian transition density. In this situation, the predictive skill of certain statistics will depend on specific choices of . For example, in linear and Gaussian case without scale gap, we will show the existence of a non-Markovian transition density which allows (2) to accurately estimate one-point and two-point statistics of the -components of the full dynamics, regardless of the time scale gaps.
The main idea in this paper is to consider a nonparametric representation for
using the theory of kernel embedding of conditional distribution, which was introduced in the machine learning community[45, 46]. In a nutshell, the kernel embedding of conditional distribution represents the density with functions of the reproducing kernel Hilbert space (RKHS), . When is equipped with an orthonormal basis of appropriate space, any can be represented as , where the coefficients in will be pre-computed using the historical data points. In this paper, we will consider parametric orthonormal basis functions such as the Hermite polynomials for low-dimensional as well as the proper orthogonal decomposition (POD) modes for high-dimensional . In the latter case, we shall see that the resulting integral terms in (2
) is a parametric model that is well-known, namely the linear non-autonomous autoregressive models. In general, the form of the parametric closure model depends on the ansatz ofas a function of . We should point out that one can also leave it entirely nonparametric by using the data-driven basis functions constructed by the diffusion maps algorithm as in [3, 17]. While this is ideal, the construction of such basis functions requires an elaborate computational effort and is limited to problems with intrinsically low-dimensional . In addition to constructing the basis, the main computational cost arises when we need to evaluate the estimated basis functions on new points for future-time prediction. Given these constraints, we will not explore the data-driven nonparametric basis in this paper.
The remaining of the paper is organized as follows. In Section 2, we briefly review the theory of kernel embedding of conditional distribution for estimating using an orthogonal basis representation and discuss the proposed closure models in detail. In Section 3, we provide an intuition for choosing the density by discussing missing dynamics in a linear and Gaussian dynamics with and without temporal scale gaps. In Section 4, we numerically demonstrate the proposed approaches on two nonlinear high-dimensional test problems, where in the first example is low-dimensional and in the second example, is very high-dimensional. In Section 5, we conclude the paper with a brief summary and discussion. We include an Appendix that shows the consistency of the proposed approach in estimating autocovariance functions in linear and Gaussian case without scale gap.
2 A nonparametric formulation of modeling missing dynamics
In this section, we first review the kernel embedding of conditional formulation introduced in [45, 46], formulated using orthonormal basis of appropriate square-integrable function spaces as in [3, 17]. Subsequently, we present the proposed nonparametric modeling approach for missing dynamics.
2.1 Kernel embedding of conditional distribution
Let be a compact set and define
to be a kernel. The eigenfunctions
corresponding to eigenvalueof the following integral operator,
form an orthonormal basis of and the kernel can be written as,
Let be a reproducing kernel Hilbert spaces (RKHS) induced by such kernel, that is, subspace of with the reproducing property corresponding to inner product defined as , where and . Then for any and , we can represent
with basis of . Analogously, we define be a RKHS for functions of , which can be represented by orthonormal basis .
be random variables onand , respectively, with conditional distribution . The theory of kernel embedding of conditional distribution [45, 46], implemented using the bases above [3, 17] can be described as follows. The conditional density function , where , can be represented as,
where the components of the expansion are given as,
and the expectations
are taken with respect to the sampling densities of the training dataset. Notice that this representation can be understood as a linear regression in infinite-dimensional spaces with respect to basis functionsand . The representation in (4) is nonparametric in the sense that we do not assume any particular distribution for the density.
Given pairs of data , where , we can estimate these coefficients via Monte-Carlo averages:
We should point out that if the weight in is the sampling density of the data in , since is orthonormal under the corresponding inner product, then
is an identity matrix. While representation on this Hilbert space is desirable, finding the corresponding orthonormal basis for high-dimensionalis computationally challenging. In addition to constructing the basis, the main computational cost arises when we need to evaluate the estimated basis functions on new points for future-time prediction as shown in the next section. To avoid these expensive computations, we will adopt simpler basis functions, namely the Hermite polynomial basis for low-dimensional and the proper orthogonal decomposition (POD) basis for high-dimensional .
2.2 Modeling the missing dynamics
In the discussion below, we will just focus on the expectation of (the calculation for the expectation of will be similar). In our formulation, we set the weight in the Hilbert space to the the sampling density of the data in . In particular, substituting (4) into (7), we obtain,
can be pre-computed. In this derivation, the second line is due to Monte-Carlo average using data , the fourth line is using (5), and the last line is due to the truncation in the summation of the index up to order , and using the fact that,
whenever is orthonormal in , where the weight is exactly the sampling density of . Since the resulting coefficients in (9) are independent to , in practice, we only need to choose the basis .
Notice that the resulting averaged quantity in (8) arises from the proposed nonparametric formulation in (4) is a parametric model, where the parametric ansatz is determined by how depends on . For example, when is low-dimensional, we will consider Hermite polynomial basis functions for in a numerical example in Section 4.1. In this case, the resulting parametric model is a polynomial of degree and the coefficients in are directly estimated via the kernel embedding formula.
We should point out that when we use Hermite polynomial basis, we set the weight to be Gaussian with empirical mean and covariance determined empirically from the training data . In our numerics, we also employ a regularization replacing in (9), with small parameter to compensate for the conditional density that is not in (as suggested in [45, 46]). Basically, this regularization is the penalty of not building the appropriate RKHSs that respect the sampling distribution and geometry of the data.
For high-dimensional , we will consider using the proper orthogonal decomposition (POD) as a basis for . Conceptually, this choice of basis corresponds to using an empirical covariance as the kernel in (3) (see e.g., Chapter 5 of  for more detail discussion). Computationally, define a matrix , where the th row consists the training data and , such that its row sum is zero. In this case, the basis functions will be defined as the th column of the orthonormal matrix ,
is the singular value decomposition (SVD). These basis functions are called the Proper Orthogonal Decomposition (POD) modes or a discrete version of the Karhunen-Loève basis expansion (see e.g., Chapter 5 of).
From the orthonormality of , we have such that Eq. (8) can be further simplified,
where we used basis functions. Suppose that , then Eq. (11) can be equivalently rewritten in a matrix form as,
The formula in (13) is exactly a linear regression between observations and . This means that the nonparametric RKHS representation reduces to the parametric linear regression when POD basis are used to represent functions defined on the space. In the case where , , the resulting closure model in (13) is nothing but a linear autoregressive model for variable with a linear non-autonomous variable .
While the POD representation is convenient for high dimensional problems, we should point out these basis functions may not be adequate for systems with nonlinear and/or non-Gaussian nature. In fact, we will show in Section 4.2 that the POD basis representation is not sufficient to recover the missing terms in a nonlinear system even when the invariant density is close to Gaussian. In this case, we will find that an additional noise term can be used to compensate for the residual space (orthogonal to POD).
3 A linear and Gaussian example
In this section, we provide an intuitive argument for the choice of conditional density function in compensating the missing dynamical terms as proposed in (2). Specifically, we will build our intuition for choosing variables by studying the missing dynamics in an analytically tractable Gaussian linear problem with and without temporal scale gaps. That is, we consider a linear multi-scale dynamical model,
for a slow variable and a fast variable . Here, and are independent Wiener processes. The parameters and the eigenvalues of the matrix
are strictly negative, to assure the existence of a unique invariant joint density . The parameter characterizes the time-scale separation between variables and . Moreover, we assume the coefficient
to assure that the leading-order slow dynamics supports an invariant measure . In the limit of , the leading-order dynamics,
with as defined in (16
), is obtained by averaging the slow component of the vector field,, with respect to the invariant density of the fast dynamics in (15) for a fixed . For this simple example, it is clear that . The effective equation in (17) is deduced using the averaging theory [20, 26, 43], which basic idea is to approximate the density of the full dynamics as,
where denotes the evolution density corresponding to leading-order dynamics. First, we should point out that when the fast dynamics in (15) is not available, we have no information about the invariant density and we also cannot generate samples of this density. In our example above, is not computable since and are unknown.
The proposed model for the closure is motivated by the following observation. Taking in (18), the invariant density of the full dynamics can be approximated by that of the leading-order dynamic up to order-, that is, . Therefore,
This equation basically suggests that we can approximate with the following conditional density . For the simple linear example in (14)-(15), one can solve the Lyapunov equation of the full system for the equilibrium covariance matrix and deduce
. Expanding the mean and variance statistics in terms of, we obtain
which means that the order- expansion error in (19) is in the sense of the mean and variance.
which means the proposed closure obtained by averaging over is consistent (up to order- error) with the reduced model obtained from the classical averaging theory. In general, such an analytical expression will not be available and we will approximate the conditional density, , by applying the kernel embedding of conditional density theory discussed in the previous section on the training data set . In this case, it is clear that is the natural choice. In the remainder of this paper, we will refer to this closure model as the “RKHS ".
When there is no scale gap, i.e., is large, the approximation via the averaging theory is not valid and therefore averaging over will not work. In this case, let us consider such that the closure model is an average over a non-Markovian conditional density function . That is,
where the conditional average is evaluated at new data point for time lag interval , resulted from integration of this model. Since the random variables of and of are both Gaussian with mean zero and covariance,
we can deduce that
When the covariance components and are empirically estimated from the training data, notice that (22) is identical to the conditional expectation with respect to the kernel embedding conditional density formulated using the POD basis in (13). More importantly, one can analytically show that the autocovariance function (ACV) of the proposed non-Markovian model in (20) with agrees with the ACV of the component of the full model (see Appendix A for the detailed proof of this statement). The consistency of the ACV prediction as well as the closure in (22) with the RKHS formulation in (13) justifies the choice of when is large. In the numerics below, we will verify the robustness of the non-Markovian closure model resulted from this choice of in terms of the short-time prediction skill and ACVs for any .
In Figure 1, we compare the proposed closure model in (22) , which we will refer to as “RKHS ", with the standard averaging model in (17) with and the RKHS . In this numerical simulation, we build the closure models using simulated data at discrete time step . When is small, one can observe the pathwise convergence of solutions of the full model (14)-(15) to those of the closure models [Fig. 1(a)]. For small , the ACVs of all the closure models are in good agreement with the ACV of the full model [Fig. 1(b)]. These results agree with the invariant manifold theory the small regime . However, when is large, the predictions and ACVs become quite different among the three closure models [Figs. 1(c) and (d)]. In term of short-time predictions, the closure model (20) with memory terms provides a slightly better RMSE than the other two closure models [Fig. 1(c)]. In term of long-time statistics, only the closure model (20) with long memory terms produces an accurate approximation of the ACV, whereas the other two closure models do not [Fig. 1(d)]. This ACV consistency can be verified explicitly as we mentioned before (see Appendix A).
The analysis over this simple example shows that the proposed modeling framework using the kernel embedding of conditional density formulation provides accurate short-time predictions and consistent long-term statistical recoveries in the limit of the memory length . This consistency is robust whether the underlying full system has or does not have any temporal scale gap. Using this result as a guideline, a natural extension for compensating missing components in nonlinear systems is to consider , that allows for the missing dynamical components to also depend on the history of in addition to that of . In practice, the key parameters which will be determined case-by-case are the memory length, and , as we shall see in the nonlinear examples in the next section.
4 Nonlinear examples
In this section, we study the short-time predictions and long-time statistical properties of two nonlinear examples: the Lorenz-96 (L96) model  with a short memory effect and the truncated Burgers-Hopf (TBH) model [37, 33, 36, 39] with a long memory effect.
4.1 Two-layer Lorenz-96 model
Consider the two-layer Lorenz-96 (L96) model ,
for and , where each relevant variable is coupled to irrelevant variables , and
The indices of the variables and are cyclic, . The parameters are taken to be , , , , and . The parameter characterizes the time scale separation between variables and . In this example, we will show results of a small regime and a large (the large regime was studied in [6, 7, 27, 32]). We integrate the full L96 model using a 4th-order Runge-Kutta method for time units with a time step . We observe the trajectories of the variables every 10 time steps, that is, the observation time step is and the dataset contains observation points.
In the following numerical simulations, we compare the proposed closure RKHS model with the deterministic parametric formulation suggested by Wilk’s method . In particular, the Wilk’s deterministic parameterization scheme is a closure model obtained by fitting the data with the following polynomial,
We should point out that if we are restricted to only observing , then are the identifiable components that can be extracted, for e.g., using a likelihood maximum estimate [25, 31, 48] or an adaptive Bayesian filtering , as we pointed out in the introduction. The key point is that we cannot extract the detail components if the fast dynamical components in (23) is unknown and, in fact, we are not interested to construct the closure by averaging over conditional density that depends directly on . Instead, we will consider a closure model based on averaging over the following conditional density for small , where and . For the large regime, we will consider . While conditioning to other variables (e.g., spatial neighbors of or or longer temporal history) can be considered, we do not find any meaningful improvement over the results that are presented below. These densities will be constructed using the kernel embedding discussed in Section 2 for each ; connecting to the notation in the previous section, and is either or . Since these densities have low-dimensional conditional variables, we will represent the kernel embedding formula in (4) using the Hermite polynomials.
To validate the proposed approach, we compare the long-time statistics and short-time predictions of the -components of the full model and the closure models. Particularly, we compare several standard long-time statistical quantities as in [6, 31]:
For the PDFs, ACFs, and CCFs, we plot the average over all . For small , we only show the results for the PDFs and ACFs. For short-time predictions, we calculate the root-mean-square error (RMSE) and the anomaly correlation (ANCR), where the RMSE measures the difference between the true trajectory and the forecast trajectory whereas the ANCR measures the correlation between them . The definitions of RMSE and ANCR are the same as those in Ref. . We take the average using the data from different ensembles, each starting from a different initial state over five time units.
We first study the small regime of the L96 model. Figure 2(a) displays the scatter plot of vs. for the full L96 model, the polynomial fit (25) for the deterministic parametrization of (Wilks’s method), and the expectation using the RKHS representation (method referred to as the RKHS ). For long-time statistics, one can see from Figs. 2(b) and 2(c) that the PDFs and ACFs for can be well reproduced by both closure models. For short-time predictions, one can see from Fig. 2(d) that the RKHS can provide a better approximation of the trajectory than the Wilks’s deterministic parametrization scheme. These results can be expected due to the validity of the classical averaging theory on dynamical systems with time-scale separation (small regime) .
We now study the L96 model for the large regime in which there is no significant time-scale separation between the relevant variables and irrelevant variables . By comparing Fig. 2(a) and 3(a), one can see that the patterns of the scatter plots differ substantially between the small and large regimes. Specifically, the scatter plot for the large regime is much broader than that for the small regime. This indicates that when is small, the irrelevant (fast) variable significantly relies on the relevant (slow) variable. When becomes large, such dependence of irrelevant variable on the relevant variable reduces.
For large , one can observe from Fig. 3(a) that the RKHS representation of can nearly reproduce the scatter plot of the full model, whereas the Wilks’s deterministic parametrization scheme and the RKHS representation cannot. The PDFs for of the full model can be reproduced by all the closure models [Fig. 3(b)]. For the other long-time statistics, ACFs, CCFs, mean wave amplitudes, and wave variances can be well reproduced only by the closure model with the transition kernel [Figs. 3(c)(d)(e)(f)]. Notice also the significant improvement in terms of short-time prediction (smaller RMSE and higher ANCR) over the Wilks’s method and the RKHS as shown in Fig. 4.
To determine the reliability of the ensemble forecasts, we also calculate the rank histograms from ensemble integrations . A rank histogram is obtained by repeatedly tallying the rank of the true observation relative to the sorted -member ensemble . We use the same method as in . For every initial state , we do integrations of the closure model over the lead time units starting from the. We sort the values for for each grid point and time from the ensemble members and the full L96 model. Figure 5 displays the rank histograms for all the closure models with at lead time . An ideal rank histogram is flat. One can see that the rank histogram by the RKHS is close to be flat, whereas rank histograms by Wilks’s deterministic parametrization scheme and the RKHS exhibit U-shape distributions. Therefore, the closure model with performs better than the other two closure models.
4.2 The truncated Burgers-Hopf (TBH) model
This model is a Galerkin truncation of the inviscid Burgers equation on Fourier modes and we should point out that the dynamics of the truncated system is totally different from the inviscid Burgers equation. Particularly, the TBH exhibits intrinsic stochastic dynamics with ergodic behavior in a large deterministic system [37, 33, 36, 39]. We are interested in estimating the TBH model’s first Fourier mode given only the dynamical component of this mode,
where denotes the second Fourier mode and represents the forcing component obtained by subtracting from the right hand side of Eq. (26), that is,
While and may be identifiable from observing alone, in our experiment below, we assume that we are given the data set of . We should point out that this model has an equipartition energy, that is, all of the Fourier modes in TBH have the same variances, and the first Fourier mode (which is of our interest) possesses the longest autocorrelation time and the largest statistical memory , which makes this example a tough test problem.
To compensate for the missing dynamics in (27), we substitute the irrelevant variables and with their conditional expectations. The transition kernel is , where the irrelevant variable is one of , and the relevant variable is one of such that and are both real parts or both imaginary parts. In particular, we construct , , etc. For the force , an additional Gaussian noise term is added to compensate for the residual space. Since the conditional states are high-dimensional (when are large), the conditional expectations over these transition densities are constructed using the RKHS formulation using a POD basis as in (13).