1. Introduction
Spatiotemporal processes occur throughout the scientific disciplines, with widespread applications in areas such as physics, economics, climate science, and ecology [Wikle2011, Wikle2019]. Analyzing such data requires modeling both spatial and temporal dynamics. For processes that occur in nature, the spatiotemporal interaction is typically governed by a dynamic equation. In many cases the underlying dynamics are known a priori (e.g. from the laws of physics or economics) and a practitioner, or an algorithm, searches for a solution that optimally matches the spatiotemporal data. Learning exact solutions of dynamic equations is thus a key goal of spatiotemporal modeling. This article introduces a new kernel-based learning theory for spatiotemporal data analysis under dynamic-equation constraints. The constraints are implemented in terms of positive definite kernels that solve the dynamic equation exactly. This confers the two main advantages of our method: 1) domain constraints are covered in terms of initial and boundary conditions, and 2) the spatial and temporal dynamics of a process are modeled exactly. Positive definite kernels play a fundamental role as building blocks of kernel regression and density estimation in nonparametric statistics [Silverman1986] and of support vector machines [SS2002] in machine learning. The representer theorem can be seen as the cardinal result in the application of kernel-based methods. It states that the minimizer of an empirical risk functional can be expanded in terms of kernels evaluated at the data samples alone. On the technical side, the main innovation behind our framework lies in the introduction of a dynamic risk-minimization framework and a corresponding representer theorem that applies in the spatiotemporal setting.
Formally, the present article investigates the learning of a time-dependent process $f(t,x)$ from samples observed at different times $t_j$, $j = 1, \dots, T$. The main underlying assumption is that the function can be expanded as a convergent series of time-dependent positive definite kernels
(1.1) $f(t,x) \;=\; \sum_{n} a_n\, K(t, x, y_n)$
with time-independent coefficients $a_n$. Functions of the form (1.1) occur commonly as solutions of time-dependent partial differential equations (PDEs). Two flavors of the learning problem are studied here:

Spatiotemporal Kernel Regression: Suppose that at times $t_1, \dots, t_T$ noisy data samples are taken from a function that evolves according to (1.1). What is the optimal fit to the total sample among the space of solutions?

Spatiotemporal Kernel Density Estimation:
Samples are taken from an evolving density at times $t_1, \dots, t_T$. In this case (1.1) reflects a distributional embedding [Smola2007], see below. We construct a kernel estimator for the evolving density that accounts for all samples simultaneously.
Our guiding application is the learning of solutions of time-dependent diffusion equations of the form
(1.2) $\partial_t f \;=\; \mathcal{L} f,$
where $\mathcal{L}$ is a (self-adjoint) operator of Fokker–Planck type, see below for details. Here $f$ represents either a regression function or a density. This is motivated by the ubiquitous occurrence of Itô diffusion processes throughout the scientific disciplines, see e.g. [Allen2010, Risken1989, Brigo2001]. The simplest instance of (1.2) is the heat equation
(1.3) $\partial_t f \;=\; \tfrac{1}{2}\,\partial_x^2 f,$
which describes the density of ordinary Brownian motion. Its solutions can be written in the form (1.1) with Gaussian kernels $K(t,x,y) = \frac{1}{\sqrt{2\pi t}}\, e^{-(x-y)^2/(2t)}.$
Other common examples include the geometric Brownian motion as a model for stock prices and the Hull–White model (Ornstein–Uhlenbeck process) for interest rates in economics.
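For the heat-equation case just described, the following minimal sketch (with hypothetical nodes and coefficients; the heat equation is taken in the form $\partial_t f = \tfrac12 \partial_x^2 f$) checks numerically that a finite Gaussian-kernel expansion of the form (1.1) solves the PDE:

```python
import numpy as np

def heat_kernel(t, x, y):
    # Gaussian kernel: fundamental solution of the heat equation f_t = 0.5 * f_xx
    return np.exp(-(x - y) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

# hypothetical expansion nodes y_n and time-independent coefficients a_n
nodes = np.array([-1.0, 0.0, 0.5])
coeffs = np.array([0.3, 1.0, -0.4])

def f(t, x):
    # finite expansion of the form (1.1): sum of time-dependent kernels
    return float(np.sum(coeffs * heat_kernel(t, x, nodes)))

# finite-difference check of the heat equation at an arbitrary point (t, x)
t0, x0, h = 1.0, 0.2, 1e-3
dfdt = (f(t0 + h, x0) - f(t0 - h, x0)) / (2 * h)
d2fdx2 = (f(t0, x0 + h) - 2 * f(t0, x0) + f(t0, x0 - h)) / h ** 2
residual = abs(dfdt - 0.5 * d2fdx2)
print(residual)  # close to zero
```

Because each kernel solves the PDE and the coefficients are time-independent, any such linear combination solves it as well.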
In the context of PDEs, initial and boundary conditions play a crucial role. For regression they can lead to overdetermined learning problems, i.e. no solution of the PDE complies simultaneously with boundary conditions and measured data. In practice this can occur as a result of noise. In such cases our method can be applied to balance between mismatches in boundary conditions and measurement errors. For illustration Figure 1.1 shows an example of an overdetermined spatiotemporal regression with solutions of the heat equation.
Background on spatiotemporal models: A common method to account for the temporal discrepancy of data is inverse distance weighting (IDW) [Wikle2011, Wikle2019]. The main idea is that values at unknown points in time are computed as weighted averages of the available data snapshots, where higher weight is given to temporally closer data (this is known as ’Shepard’s method’ in spatial data analysis; the name ’Tobler’s law’ is common also for spatiotemporal problems [Miller2004] and stems from geography). This form of IDW can be viewed as a spatiotemporal kernel method, where the weights are given by inverse powers of the spatiotemporal distance to the data points.
Thus a ’straightforward’ approach to spatiotemporal modeling is to follow the IDW paradigm by incorporating the time coordinates as part of the observations into a given kernel. In reality, however, the evolution of a system is much less ’symmetric’ with respect to space and time and is usually governed by a dynamic equation. This article introduces spatiotemporal modeling where a time-dependent kernel satisfies a PDE exactly, as an alternative to the IDW approach. Hierarchical dynamic spatiotemporal models [Wikle2010] provide a framework that can capture a much wider variety of nonseparable models than kernel IDW, but our kernels exactly follow the system’s dynamic equation. The key advantages of our method are 1) that domain constraints are covered naturally in terms of initial and boundary conditions and 2) that it allows for an exact modeling of the spatial and temporal dynamics of a process. This is of particular interest for prediction, e.g. in the presence of data drifts that are captured within Fokker–Planck dynamics. Diffusion-based kernel density estimation [Botev2010] employs the Fokker–Planck equation as a resource for the construction of kernels under domain constraints and pilot density estimates. It appears natural to study such kernels also in the context of spatiotemporal modeling. Hence, the first main contribution of this article lies in the observation that spatiotemporal interdependence is in many cases captured much better via Fokker–Planck dynamics with a time-dependent kernel than via a fixed multivariate kernel with optimized bandwidth. Figure 1.2 shows an example of spatiotemporal prediction comparing the IDW and PDE approaches, see Example 11 for details. Note the key feature of the new PDE predictor: high error levels in the training data, (A)–(C), are ’smoothed’ out to provide an accurate prediction (D) that is compliant with the underlying heat equation.
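For contrast with the PDE approach developed below, a minimal Shepard-type IDW predictor over space-time can be sketched as follows (the combined space-time distance and the power parameter are illustrative choices, not prescribed by the text):

```python
import numpy as np

def idw_predict(t, x, data, power=2.0):
    """Inverse distance weighting over a combined space-time distance
    (Shepard's method); data is a list of (time, location, value) triples."""
    weights, values = [], []
    for t_i, x_i, y_i in data:
        d = np.hypot(t - t_i, x - x_i)
        if d == 0.0:
            return y_i  # exact hit on a data point
        weights.append(d ** (-power))
        values.append(y_i)
    weights = np.array(weights)
    return float(weights @ np.array(values) / weights.sum())

data = [(0.0, 0.0, 1.0), (0.0, 2.0, 3.0), (1.0, 1.0, 2.0)]
print(idw_predict(0.5, 1.0, data))
```

The predictor treats time as just another coordinate; it encodes no dynamics, which is exactly the limitation the PDE kernels address.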
Background on PDEs: Kernel expansion techniques are commonly employed as a meshless (as opposed to finite difference) method for the numerical approximation of PDE solutions [Belytschko1996]. For time-independent PDEs, kernel discretization represents the trial function entirely as a sum of ’data nodes’. This approach follows a classical line of [Trefftz1926] to use trial functions that satisfy a PDE exactly. For time-dependent PDEs, meshless kernel-based methods usually rely on a time-independent spatial sum representation whose coefficients explicitly depend on time. In contrast, the article at hand expands solutions as a series of time-dependent kernels that solve (1.2) exactly (following more rigidly the lines of [Trefftz1926]). While this adds complexity to the structure of the kernels, it conveniently eliminates the time integration from solving the PDE and allows for a natural representer theorem. For an introduction to kernel methods with a focus on meshless PDE solutions and machine learning see [Schaback2006]. Our work is inspired by the recent article [Hon2015], which studies the numerical approximation of solutions of (1.2) when $\mathcal{L}$ is a spatial, elliptic operator. The important representation of solutions in the form (1.1) is introduced there as a method for meshless approximation. While [Hon2015] focuses on kernel-based interpolation of initial conditions, here the technique is extended to learn from spatiotemporal data. Thus the second main contribution of the present article lies in the recognition that the representation (1.1) is particularly suited for statistics and machine-learning applications with spatiotemporal samples, and in the derivation of a dynamic representer theorem for spatiotemporal models.
1.1. Recap on statistical learning
We recapitulate some concepts of supervised learning. For details we refer the reader to the survey article [Luxburg2008] and references therein. Let $\mathcal{F}$ be a set of functions. Given a fixed but unknown probability distribution $P$ on $X \times Y$, the risk of $f \in \mathcal{F}$ is
$R(f) \;=\; \int L\big(y, f(x)\big)\, dP(x,y),$
where the loss function $L$ measures the discrepancy between $f(x)$ and $y$ and, for simplicity, is assumed to be convex in what follows. The risk measures by how much (accumulated over $P$) the learned responses deviate from the ground truth. The goal of learning is to identify a function that has as small a risk as possible within the constraints imposed by $\mathcal{F}$,
$\min_{f \in \mathcal{F}} R(f).$
In practice, it is impossible to determine the minimizer exactly, at least for the following reasons:
(1) Without structural constraints on $\mathcal{F}$, the minimization is intractable.
(2) $P$ is not known.
As a consequence of (1), an a priori structure is commonly imposed on $\mathcal{F}$. Regarding point (2), a data sample $(x_i, y_i)_{i=1}^{N}$ is taken i.i.d. from $P$. Given the sample, a natural approach is to minimize the empirical risk
(1.4) $R_{\mathrm{emp}}(f) \;=\; \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big).$
The classical representer theorem asserts that if this minimization is performed over a reproducing kernel Hilbert space (RKHS), then any optimal function is a finite sum of kernels evaluated at the sampled data points [Wahba1999, Christmann2008]. This reduces the infinite-dimensional optimization problem to an $N$-dimensional subspace, making it accessible to numerical optimization. For completeness, we state the classical representer theorem.
Theorem 1 (Classical representer theorem).
Suppose that $K$ is a symmetric, positive definite kernel on a set $X$. Let $\mathcal{H}$ denote the RKHS spanned by convergent sums of kernels. Let $(x_i, y_i)_{i=1}^{N}$ denote a data sample taken from $X \times Y$. Then
$\min_{f \in \mathcal{H}} R_{\mathrm{emp}}(f) \;=\; \min_{c_1, \dots, c_N} R_{\mathrm{emp}}(f_c),$
where
$f_c \;=\; \sum_{i=1}^{N} c_i\, K(\cdot, x_i).$
Remark 2.
In what follows we assume, without saying so explicitly, that series of the form (1.1) are convergent in the norm of the underlying Hilbert space. This is justified by the fact that we only consider such series within the native (i.e. Moore–Aronszajn) RKHS of the kernel [Schaback2006]. Similarly, series involving time-dependent kernels converge for fixed $t$ within the respective RKHS.
The article at hand introduces a time parameter to risk minimization. This involves a generalization of the empirical risk functional (1.4) and an extension of Thm. 1 beyond the setting of a single RKHS.
For completeness we mention the important topic of regularization. To prevent overfitting, a ’penalty term’ (regularizer) is added to the empirical risk functional. This has been introduced in [Kimeldorf1971] and extended to regularizers that are strictly increasing functions of the RKHS norm in [Schoelkopf2001]. The article [Argyriou2009] investigates a matrix-valued setup and derives conditions for a respective representer theorem. In the interest of space we do not cover regularization here, although regularized versions of our main theorems 5, 8 and 12 clearly hold.
1.2. Recap on the FokkerPlanck equation
We recapitulate some standard results from the theory of the Fokker–Planck equation. Details and derivations can be found in the monograph [Risken1989]. The Fokker–Planck equation plays a central role in the theory of stochastic processes: a normalized solution of the equation represents the probability density of a random motion that follows a stochastic process of Itô type. The simplest physical setting is that of a particle undergoing a diffusion process; the simplest economic setting is that of a random market price of a financial asset. To set the stage, suppose a one-dimensional Itô process
$dX_t \;=\; \mu(X_t)\, dt + \sigma(X_t)\, dW_t$
with time-independent drift $\mu$ and diffusion $\sigma$ (Itô processes with time-dependent drift or diffusion require a more general discussion, although some of our findings still apply). The Fokker–Planck equation for the probability density $p(t,x)$ of the random variable $X_t$ is
$\partial_t p \;=\; -\partial_x\big(\mu(x)\, p\big) + \tfrac{1}{2}\,\partial_x^2\big(\sigma^2(x)\, p\big).$
For brevity it is common to introduce the Fokker–Planck operator
$\mathcal{L}\,\cdot \;=\; -\partial_x\big(\mu(x)\, \cdot\big) + \tfrac{1}{2}\,\partial_x^2\big(\sigma^2(x)\, \cdot\big),$
writing the differential equation in the form (1.2). An important role in the study of boundary conditions of this PDE is played by the occurrence of a stationary density $p_{\mathrm{st}}$, i.e. $\mathcal{L}\, p_{\mathrm{st}} = 0$.
If a stationary solution exists then it can be written in the form
$p_{\mathrm{st}}(x) \;=\; Z^{-1} e^{-\Phi(x)}$
with a potential $\Phi$ determined by the drift and diffusion coefficients and a normalization constant $Z$. The potential is a natural point to incorporate boundary conditions of Dirichlet type: if the potential jumps to an infinite value at the endpoints then the density is confined to an interval. More generally, if the potential satisfies appropriate growth conditions then there is a unique stationary density of the above form. In this situation it is common to introduce the scalar product
$\langle f, g \rangle_\Phi \;=\; \int f(x)\, g(x)\, e^{\Phi(x)}\, dx.$
The respective Hilbert function space is denoted $L^2(e^\Phi)$. A quick computation (detailed in [Risken1989, Section 5.4]) shows that $\mathcal{L}$ is self-adjoint with respect to $\langle \cdot, \cdot \rangle_\Phi$ under various types of boundary conditions. Hence, our working assumption is:
(A) The operator $\mathcal{L}$ is a negative, self-adjoint operator on the Hilbert space $L^2(e^\Phi)$ for a certain potential $\Phi$. (The assumption of negativity can be relaxed: if positive eigenvalues are present, a time horizon is fixed for solutions of the PDE, and the requirement is that the eigenfunction expansion of the kernel (1.5) is convergent at that horizon.)
In what follows only (A) itself is relevant rather than its exact origin. Conditions that are sufficient for (A) are described in the literature, see [Risken1989, Section 5.4] for an introductory treatment. The spectrum of $\mathcal{L}$ can be discrete or continuous or both; a spectral decomposition follows from the spectral theorem. For convenience we focus on the discrete spectrum, but the discussion can be generalized using standard tools of functional analysis. In this situation, a complete orthonormal basis $(\psi_n)$ (with respect to $\langle \cdot, \cdot \rangle_\Phi$) diagonalizes $\mathcal{L}$,
$\mathcal{L}\, \psi_n \;=\; \lambda_n\, \psi_n, \qquad \lambda_n \le 0.$
We introduce the kernel by the formula
(1.5) $K(t, x, y) \;=\; \sum_n e^{\lambda_n t}\, \psi_n(x)\, \psi_n(y)$
and notice that it solves (1.2) exactly. Notice also that the completeness relation for the eigenfunctions reads $\sum_n \psi_n(x)\, \psi_n(y)\, e^{\Phi(y)} = \delta(x - y)$, i.e. the kernel is nothing but the Green’s function of the Fokker–Planck equation. One can obtain the general solution of the PDE with initial condition $f(0, \cdot) = f_0$ as a convolution integral
(1.6) $f(t, x) \;=\; \int K(t, x, y)\, f_0(y)\, e^{\Phi(y)}\, dy.$
This representation is of the same form as the one discussed in [Hon2015] for the example of the heat equation.
Lemma 3.
For any $t > 0$, $K(t, \cdot, \cdot)$ is a symmetric, positive-definite kernel, and for any fixed $t$ the kernel generates a unique RKHS $\mathcal{H}_t$.
The simple proof is shown in the appendix.
Example 4.
(Ornstein–Uhlenbeck process) To illustrate the framework, consider the Ornstein–Uhlenbeck (OU) process, which plays an important role in interest rate models in financial mathematics [Brigo2001]. For fixed $\theta, \sigma > 0$, the zero-mean OU process is characterized by the stochastic differential equation
$dX_t \;=\; -\theta X_t\, dt + \sigma\, dW_t.$
The process is mean-reverting in the sense that if $X_t$ is below zero then the drift term is positive and if $X_t$ is above zero then the drift term is negative; $\theta$ determines the speed of mean reversion. The associated Fokker–Planck equation is
$\partial_t p \;=\; \partial_x\big(\theta x\, p\big) + \tfrac{\sigma^2}{2}\,\partial_x^2 p,$
i.e. $\mu(x) = -\theta x$ with constant diffusion $\sigma$. The eigenvalues are $\lambda_n = -n\theta$, $n = 0, 1, 2, \dots$, with eigenfunctions proportional to $p_{\mathrm{st}}(x)\, H_n(x/s)$, where $s^2 = \sigma^2/(2\theta)$. Here $H_n$ denote the Hermite polynomials. The stationary density corresponds to the eigenvalue $\lambda_0 = 0$, that is
$p_{\mathrm{st}}(x) \;=\; \frac{1}{\sqrt{2\pi s^2}}\, e^{-x^2/(2 s^2)}.$
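The stationary density of the OU process can be checked numerically. The sketch below (with hypothetical parameter values $\theta = 1$, $\sigma^2 = 2$, so that the stationary law is standard normal) verifies by finite differences that the Fokker–Planck operator annihilates the stationary density:

```python
import numpy as np

theta, sigma = 1.0, np.sqrt(2.0)  # hypothetical OU parameters; s^2 = sigma^2/(2*theta) = 1

def p_st(x):
    # stationary OU density: normal with variance sigma^2 / (2 * theta)
    s2 = sigma ** 2 / (2 * theta)
    return np.exp(-x ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def fp_operator(p, x, h=1e-4):
    # L p = d/dx(theta * x * p) + (sigma^2 / 2) * d^2 p / dx^2, via central differences
    drift = lambda z: theta * z * p(z)
    d_drift = (drift(x + h) - drift(x - h)) / (2 * h)
    d2p = (p(x + h) - 2 * p(x) + p(x - h)) / h ** 2
    return d_drift + (sigma ** 2 / 2) * d2p

print(abs(fp_operator(p_st, 0.7)))  # ≈ 0: the stationary density solves L p = 0
```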
1.3. Recap on kernel density estimation
Given independent realizations $x_1, \dots, x_N$ from an unknown continuous probability density function $p$ and a positive kernel $K$, the kernel density estimator is
(1.7) $\hat{p}(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} K(x, x_i).$
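As a reference implementation of (1.7), the following sketch uses a Gaussian kernel with a bandwidth parameter (the sample and the bandwidth value are illustrative choices):

```python
import numpy as np

def kde(samples, x, bandwidth):
    # equal-weight Gaussian kernel density estimator, cf. (1.7)
    u = (x - samples[:, None]) / bandwidth
    k = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    return k.sum(axis=0) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(size=500)          # illustrative sample
grid = np.linspace(-3.0, 3.0, 61)
est = kde(samples, grid, bandwidth=0.4)
mass = est.sum() * (grid[1] - grid[0])  # Riemann sum of the estimate
print(mass)  # close to 1
```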
The methodology developed here relies on the kernel embedding of distributions of [Smola2007]. Given a kernel $K$ on $X$, the distributional embedding of a density $p$ into the RKHS of $K$ is
$p \;\mapsto\; \int K(\cdot, y)\, p(y)\, dy.$
On the technical side it should be mentioned that the integral is assumed to exist and that the embedding is assumed to be injective. A detailed discussion of the technical backbone can be found in [Smola2007, Steinwart2002]. The embedding allows one to operate on distributions using Hilbert space concepts such as inner products and linear projections. For us it serves two purposes.
1) It is the entrance point to interpret density estimation within the risk minimization framework.
2) In the context of evolving densities and timedependent kernels, the embedding naturally mirrors the structural properties of the PDE that governs the evolution.
The kernel mean estimator (KME) is the minimizer of the empirical risk functional of the embedding (compare (1.4)) [Muandet2014, STC2004],
(1.8) $\min_{g \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \big\| K(\cdot, x_i) - g \big\|_{\mathcal{H}}^2,$
where $\| \cdot \|_{\mathcal{H}}$ denotes the norm of the RKHS. The representer theorem, Thm. 1, applies and yields a representation of the KME in the form of an ordinary kernel density estimator. In the absence of a regularizer, equal weights are optimal [STC2004, Prop. 5.2], which yields (1.7).
2. Results
2.1. Supervised learning of time-dependent functions
Let noisy data samples be taken at different times $t_j$, $j = 1, \dots, T$, and let $D$ denote the full data sample over all times. Let $f$ be a time-dependent function to be estimated via kernel regression from a set of admissible functions $\mathcal{F}$, using the information contained in $D$. We define the empirical risk of $f$ over $D$ as an equally weighted sum of the empirical risks at the individual times,
(2.1) $R_{\mathrm{emp}}(f) \;=\; \frac{1}{T} \sum_{j=1}^{T} \frac{1}{N_j} \sum_{i=1}^{N_j} L\big(y_i^{(j)}, f(t_j, x_i^{(j)})\big).$
This generalizes (1.4) to the case $T > 1$ but can also be viewed as a special case in the sense that $D$ is interpreted as the underlying data sample. We begin by studying the situation when samples are taken from a countable set, assuming that $\mathcal{F}$ is composed of functions of the form (1.1), with symmetric and positive kernels $K(t, \cdot, \cdot)$.
Theorem 5 (Time-dependent representer theorem).
Suppose that $K(t, \cdot, \cdot)$ is a symmetric, positive definite kernel on a set $X$ for every $t > 0$. Let
$\mathcal{F} \;=\; \Big\{\, f(t,x) = \sum_n a_n\, K(t, x, y_n) \,\Big\}$
and suppose that at times $t_j$, $j = 1, \dots, T$, data samples $(x_i^{(j)}, y_i^{(j)})_{i=1}^{N_j}$ are measured. Then
$\min_{f \in \mathcal{F}} R_{\mathrm{emp}}(f) \;=\; \min_{c} R_{\mathrm{emp}}(f_c),$
where
$f_c(t,x) \;=\; \sum_{j=1}^{T} \sum_{i=1}^{N_j} c_i^{(j)}\, K(t, x, x_i^{(j)}).$
The point is that, although it is not obvious how an RKHS structure arises from time-dependent kernels simultaneously over all times, a representer theorem still applies. The consequence is that the optimization task is feasible over the infinite set $\mathcal{F}$, i.e. it is sufficient to minimize over a finite list of coefficients. The proof is presented in the appendix.
Remark 6.
In case the loss function is the squared loss, the kernel regression problem can be solved explicitly in terms of the Moore–Penrose pseudoinverse. Let $y$ and $c$ denote the vectors of measured values and expansion coefficients, respectively, and let $G$ be the matrix of kernel evaluations at the sample points. The system of linear equations
$G c \;=\; y$
can have no solution, a unique solution, or infinitely many solutions. In either case the best match of smallest norm is given by
$c \;=\; G^{+} y,$
where $G^{+}$ denotes the Moore–Penrose pseudoinverse of $G$.
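The three cases can be illustrated with small systems (the matrices below are arbitrary illustrations, not taken from the text): for a consistent underdetermined system the pseudoinverse returns the exact solution of smallest norm, while for an inconsistent system it returns the least-squares compromise:

```python
import numpy as np

# underdetermined but consistent: infinitely many exact solutions
K = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 1.0])
c = np.linalg.pinv(K) @ y      # minimum-norm solution among all exact solutions
print(K @ c)                   # reproduces y exactly

# overdetermined and inconsistent: no exact solution exists
K2 = np.array([[1.0],
               [1.0]])
y2 = np.array([0.0, 1.0])
c2 = np.linalg.pinv(K2) @ y2   # least-squares compromise
print(c2)                      # [0.5]
```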
Example 7.
(Minimal example) Suppose measurement outcomes
are obtained at times and from a function of the form . Suppose the 2-norm is used to measure the loss. We have the following representation for the matrix
where and . Computing the pseudoinverse using appropriate software yields the optimal coefficients .
The theorem assumes a representation of the candidate functions in terms of infinite series (corresponding to a countable set of expansion centers). The convolution integral representation (1.6) of PDE solutions is a continuous version of (1.1). The techniques for formalizing this generalization are, of course, well established; [Christmann2008, Section 5] contains a rigorous discussion. To apply Thm. 5 to the Fokker–Planck equation we represent its solutions in terms of the convolution integral.
Theorem 8.
Let $\mathcal{F}$ be the set of solutions of the Fokker–Planck equation with given drift and diffusion coefficients and boundary conditions such that $\mathcal{L}$ satisfies assumption (A). Suppose that at times $t_j$, $j = 1, \dots, T$, data samples $(x_i^{(j)}, y_i^{(j)})_{i=1}^{N_j}$ are measured. Then
$\min_{f \in \mathcal{F}} R_{\mathrm{emp}}(f) \;=\; \min_{c} R_{\mathrm{emp}}(f_c),$
where
$f_c(t,x) \;=\; \sum_{j=1}^{T} \sum_{i=1}^{N_j} c_i^{(j)}\, K(t, x, x_i^{(j)}).$
The theorem links two different optimization exercises related to the PDE. On the one hand, there is the original optimization, which asks to minimize the empirical risk over solutions of the Fokker–Planck equation. On the other hand, there is the optimization over expansions in the kernels, i.e. over a set of coefficients. By the Green’s function property of the kernels, the coefficients reflect the initial conditions. In other words, the optimization in Thm. 8 is, de facto, an optimization over initial conditions. If the PDE is given with a priori initial conditions then its solution is unique and the optimization is, in principle, trivial. However, in practice initial conditions can be furnished with small errors (e.g. from a measurement process). Such initial conditions can be naturally incorporated into the setting of Thm. 8 by adding a data sample at $t = 0$ to $D$. In this case the empirical risk decomposes as
(2.2) $R_{\mathrm{emp}}(f) \;=\; R_{\mathrm{emp}}^{t=0}(f) + R_{\mathrm{emp}}^{t>0}(f),$
the first term collecting the initial-condition sample and the second the later measurements, and Thm. 8 is applied to the second term.
Example 9.
(Heat Equation with Initial Conditions) Suppose the temperature of a metal rod is measured at times , . The initial condition is interpreted as a soft condition (and represented as a measurement at ). Temperature sensors are placed at equal distances over the interval . It is assumed that the temperature evolves according to the heat equation and the norm measures the loss. For illustration we suppose a measurement error of the form at and . Measurement outcomes and the optimal match solution obtained by minimizing (2.2) are shown in Figure 2.1.
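A sketch of this fitting procedure in code (all numbers hypothetical: a rod with 9 sensors, a soft initial snapshot at $t = 0.1$, a second noisy snapshot at $t = 0.6$, Gaussian heat kernels as in (1.3), and the pseudoinverse of Remark 6):

```python
import numpy as np

def heat_kernel(t, x, y):
    return np.exp(-(x - y) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

rod = np.linspace(-2.0, 2.0, 9)        # sensor positions (hypothetical)
t_init, t_later = 0.1, 0.6             # soft initial condition and a later snapshot

truth = lambda t, x: heat_kernel(t, x, 0.0)  # assumed ground-truth evolution
rng = np.random.default_rng(1)
y_init = truth(t_init, rod) + 0.02 * rng.normal(size=rod.size)
y_later = truth(t_later, rod) + 0.02 * rng.normal(size=rod.size)

# stack both snapshots into one regression problem, cf. the decomposition (2.2)
times = np.concatenate([np.full(rod.size, t_init), np.full(rod.size, t_later)])
locs = np.concatenate([rod, rod])
ys = np.concatenate([y_init, y_later])

# design matrix of kernel evaluations; minimum-norm least-squares coefficients
G = heat_kernel(times[:, None], locs[:, None], rod[None, :])
coeffs = np.linalg.pinv(G) @ ys
f_hat = lambda t, x: float(heat_kernel(t, x, rod) @ coeffs)
print(f_hat(0.35, 0.0))  # estimate at a time between the two snapshots
```

Because every basis function solves the heat equation, the fitted function is itself an exact solution, interpolating consistently between the two snapshots.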
Boundary conditions can be treated in a similar vein. Hard boundary conditions imply that the kernels comply with the boundary conditions exactly. If a certain level of error can be tolerated, Thm. 8 can be used to trade off between regression error and boundary conditions.
Example 10.
(Heat Equation with hard Dirichlet Boundary Conditions) Suppose, as before, that temperature sensors are placed at equal distances over a metal rod and suppose measurements occur at times , , . Suppose the rod is restricted to an interval and has temperature at . As before, the temperature evolves according to the heat equation and the norm measures the loss. For illustration it is assumed that measurement outcomes can be described by the functions at , by at and by at , see Figure 1.1. We apply Thm. 8 to identify the optimal match solution to the heat equation with Dirichlet conditions. Straightforward computations show that
Figure 1.1 shows the optimal match solution obtained from Thm. 8 using the pseudoinverse.
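A hedged sketch of such a Dirichlet-compliant kernel (assuming the heat equation in the form $\partial_t f = \tfrac12 \partial_x^2 f$ on the unit interval, so that the eigenfunctions in (1.5) are sines): every truncation of the eigen-expansion vanishes at the boundary and solves the PDE, hence any expansion (1.1) built from it respects the hard boundary conditions automatically:

```python
import numpy as np

def dirichlet_kernel(t, x, y, n_terms=200):
    # eigen-expansion of the heat kernel for 0.5 * d^2/dx^2 on [0, 1]
    # with zero (Dirichlet) boundary values, cf. (1.5)
    n = np.arange(1, n_terms + 1)
    lam = -0.5 * (n * np.pi) ** 2
    return float(np.sum(2.0 * np.exp(lam * t)
                        * np.sin(n * np.pi * x) * np.sin(n * np.pi * y)))

# the kernel vanishes at both ends of the rod for every t > 0
print(dirichlet_kernel(0.1, 0.0, 0.5), dirichlet_kernel(0.1, 1.0, 0.5))

# finite-difference check that the kernel solves the heat equation in the interior
t0, x0, h = 0.2, 0.3, 1e-3
dKdt = (dirichlet_kernel(t0 + h, x0, 0.5) - dirichlet_kernel(t0 - h, x0, 0.5)) / (2 * h)
d2Kdx2 = (dirichlet_kernel(t0, x0 + h, 0.5) - 2 * dirichlet_kernel(t0, x0, 0.5)
          + dirichlet_kernel(t0, x0 - h, 0.5)) / h ** 2
print(abs(dKdt - 0.5 * d2Kdx2))  # close to zero
```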
Under real-world dynamics the introduced framework can be expected to perform better in terms of prediction than agnostic techniques like IDW. This is illustrated by the following example.
Example 11.
(Prediction under hard Dirichlet Boundary Conditions) Suppose that, as in Example 10, temperature measurements are taken at times , , from a metal rod with hard Dirichlet boundary conditions. Our goal is to predict the temperature at a future point in time given high levels of measurement error in the individual samples. The latter is modeled by the function and added to each measurement. Two predictors are set up for comparison. The first is computed from Thm. 8. The second is the ordinary Gaussian kernel predictor, where the time coordinates have been included as part of the observation. The bandwidth has been calibrated via cross-validation. Figure 1.2 shows graphs of matches to data and the respective predictions. The PDE predictor averages measurement error at different times to produce an accurate estimate of the exact evolution.
2.2. Learning a time-dependent density
Suppose that a time-dependent density $p(t, \cdot)$ evolves according to (1.2) and that samples are taken from it at times $t_j$, $j = 1, \dots, T$. We construct a kernel density estimator that simultaneously accounts for all samples while retaining consistency with (1.2). To this aim we embed the initial density $p_0$ into the RKHS of the Fokker–Planck kernels, representing it as
$p(t, x) \;=\; \int K(t, x, y)\, p_0(y)\, e^{\Phi(y)}\, dy.$
Notice that this convolution integral automatically solves the Fokker–Planck equation. Thus the embedding lifts the evolution of the density to the level of the RKHS and the regression theory developed above applies. We generalize the KME by choosing the minimizer of
(2.3) $\frac{1}{T} \sum_{j=1}^{T} \frac{1}{N_j} \sum_{i=1}^{N_j} \big\| K(t_j, \cdot, x_i^{(j)}) - p(t_j, \cdot) \big\|_{\mathcal{H}_{t_j}}^2$
as a “simultaneous density estimator”. This functional arises from the time-dependent empirical risk (2.1) in the same way as the KME risk functional (1.8) arises from the ordinary (time-independent) empirical risk (1.4).
Theorem 12 (Time-dependent representer theorem for density estimation).
Let $\mathcal{F}$ be the set of densities that solve the Fokker–Planck equation with given drift and diffusion coefficients and boundary conditions such that $\mathcal{L}$ satisfies assumption (A). Suppose that at times $t_j$, $j = 1, \dots, T$, samples $(x_i^{(j)})_{i=1}^{N_j}$ are taken from the density. Then the minimizer of (2.3) over $\mathcal{F}$ has the form
$\hat{p}(t, x) \;=\; \sum_{j=1}^{T} \sum_{i=1}^{N_j} c_i^{(j)}\, K(t, x, x_i^{(j)}).$
Thm. 12 is an immediate consequence of Thm. 8, which is applied to (2.3) to demonstrate that the minimizer is a finite sum of kernels.
Remark 13.
Similarly to the regression case, the optimization can be solved explicitly in terms of the pseudoinverse. First suppose , let be the matrix whose entries are and let . Let denote the vectors of entries and . A quick computation shows that
Of course the minimum is attained at the equal weights mentioned before. This reasoning applies mutatis mutandis when : the minimum of
where now , , and has entries corresponding to and else, can be computed using the pseudoinverse of .
As before, the optimization over all elements of $\mathcal{F}$ is an optimization over the possible choices of initial density. The optimal solution corresponds to a list of coefficients and reflects the initial conditions. Given these initial conditions, the solution of the PDE is unique.
A common situation in kernel density estimation is that the domain of the data is known in advance, say $[0,1]$. This leads to the topic of boundary conditions. The homogeneous Neumann boundary conditions
$\partial_x K(t, x, y)\big|_{x=0} \;=\; \partial_x K(t, x, y)\big|_{x=1} \;=\; 0$
ensure that no probability mass leaves the interval, which entails $\int_0^1 K(t, x, y)\, dx = 1$ for all $t > 0$.
Example 14.
(Learning an evolving density on $[0,1]$) Suppose that two samples, each of size , are taken at times and from a density that evolves according to the heat equation with homogeneous Neumann boundary conditions on $[0,1]$. For illustration we suppose that at time the density is the beta density and we construct an estimator for the density at that accounts for both samples.
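A sketch of a Neumann-compliant kernel for this setting (assuming the heat equation in the form $\partial_t f = \tfrac12 \partial_x^2 f$ on $[0,1]$, so that the eigenfunctions in (1.5) are cosines; the truncation length is an arbitrary choice): the kernel conserves probability mass on the interval, so any density estimator built from it remains normalized:

```python
import numpy as np

def neumann_kernel(t, x, y, n_terms=100):
    # eigen-expansion for 0.5 * d^2/dx^2 on [0, 1] with homogeneous
    # Neumann conditions: constant mode plus cosine modes, cf. (1.5)
    n = np.arange(1, n_terms + 1)
    lam = -0.5 * (n * np.pi) ** 2
    return 1.0 + float(np.sum(2.0 * np.exp(lam * t)
                              * np.cos(n * np.pi * x) * np.cos(n * np.pi * y)))

# mass conservation: the kernel integrates to 1 in x for every y and t > 0
grid = np.linspace(0.0, 1.0, 2001)
vals = np.array([neumann_kernel(0.05, x, 0.3) for x in grid])
mass = vals.sum() * (grid[1] - grid[0])
print(mass)  # close to 1
```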
3. Conclusion
Time-dependent PDEs occur in countless situations throughout the scientific disciplines. Kernel methods are commonly employed in PDE theory [Schaback2006], but the existing techniques mostly focus on the time-independent context. Similarly, static kernel methods are common in spatiotemporal modeling and fall under the general IDW paradigm. Models that accurately take account of the system dynamics are much less common. The article at hand introduces a new kernel-based paradigm for spatiotemporal modeling based on time-dependent kernels that realize the Fokker–Planck dynamics. A respective kernel-based learning theory for time-dependent PDEs is introduced and a representer theorem is provided for the application of dynamic kernel techniques.
Our kernel density estimators are closely related to the well-known diffusion estimator of [Botev2010]. The latter uses the Fokker–Planck equation as a resource for the construction of estimators under prior information, such as domain constraints and a pilot density estimate. On the theoretical side we expect applications of our method in the construction of diffusion estimators from multiple correlated samples along the lines of [Botev2010].
4. Appendix
Proof of Lemma 3.
Symmetry of the kernel, i.e. $K(t, x, y) = K(t, y, x)$, follows directly by plugging into (1.5). To show positive-definiteness, take points $x_1, \dots, x_m$ and coefficients $c_1, \dots, c_m$ and compute
$\sum_{k,l=1}^{m} c_k c_l\, K(t, x_k, x_l) \;=\; \sum_n e^{\lambda_n t} \Big( \sum_{k=1}^{m} c_k\, \psi_n(x_k) \Big)^2 \;\ge\; 0,$
which is nonnegative because $e^{\lambda_n t} > 0$. ∎
Proof of Thm. 5.
By definition $f \in \mathcal{F}$, i.e. $f(t, x) = \sum_n a_n\, K(t, x, y_n)$.
Fix $t_j$ and let, by the Moore–Aronszajn theorem, $\mathcal{H}_{t_j}$ be the unique reproducing kernel Hilbert space with kernel $K(t_j, \cdot, \cdot)$. Evaluating $f$ at $t_j$ simply gives the function $f(t_j, \cdot)$. We view (for fixed $t_j$) this function as an element of $\mathcal{H}_{t_j}$; to emphasize this we write $f_{t_j} = f(t_j, \cdot)$. Let $P_j$ be the projector in $\mathcal{H}_{t_j}$ onto
$\operatorname{span}\big\{ K(t_j, \cdot, x_i^{(j)}) : i = 1, \dots, N_j \big\}$
and let $P_j^{\perp}$ be the projector onto the orthogonal complement. For any function $g$ contained in the orthogonal complement, the reproducing property of $\mathcal{H}_{t_j}$ gives
$g(x_i^{(j)}) \;=\; \big\langle g, K(t_j, \cdot, x_i^{(j)}) \big\rangle \;=\; 0.$
This implies that $P_j^{\perp} f_{t_j}$ does not contribute to the empirical risk at time $t_j$. As a consequence we find that