# An exact kernel framework for spatio-temporal dynamics

A kernel-based framework for spatio-temporal data analysis is introduced that applies in situations when the underlying system dynamics are governed by a dynamic equation. The key ingredient is a representer theorem that involves time-dependent kernels. Such kernels occur commonly in the expansion of solutions of partial differential equations. The representer theorem is applied to find among all solutions of a dynamic equation the one that minimizes the error with given spatio-temporal samples. This is motivated by the fact that very often a differential equation is given a priori (e.g. by the laws of physics) and a practitioner seeks the best solution that is compatible with her noisy measurements. Our guiding example is the Fokker-Planck equation, which describes the evolution of density in stochastic diffusion processes. A regression and density estimation framework is introduced for spatio-temporal modeling under Fokker-Planck dynamics with initial and boundary conditions.

## Authors

• 3 publications
• 10 publications
• 2 publications
11/16/2021

### Non-separable Spatio-temporal Graph Kernels via SPDEs

Gaussian processes (GPs) provide a principled and direct approach for in...
09/20/2021

### Deep Spatio-temporal Sparse Decomposition for Trend Prediction and Anomaly Detection in Cardiac Electrical Conduction

Electrical conduction among cardiac tissue is commonly modeled with part...
05/30/2020

### A Spatio-Temporal Kernel Density Estimation Framework for Predictive Crime Hotspot Mapping and Evaluation

Predictive hotspot mapping plays a critical role in hotspot policing. Ex...
12/19/2019

### Temporal Normalizing Flows

Analyzing and interpreting time-dependent stochastic data requires accur...
11/17/2018

### Deep Learning of Turbulent Scalar Mixing

Based on recent developments in physics-informed deep learning and deep ...
11/20/2018

### A Hierarchical Spatio-Temporal Statistical Model for Physical Systems

In this paper, we extend and analyze a Bayesian hierarchical spatio-temp...
06/13/2018

### A general framework for SPDE-based stationary random fields

This paper presents theoretical advances in the application of the Stoch...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Spatio-temporal processes occur throughout the scientific disciplines with wide-spread applications in areas such as physics, economy, climate, ecology,… [Wikle2011, Wikle2019]. The respective data analysis requires the modeling of spatial and temporal dynamics. For processes that occur in nature, the spatio-temporal interaction is typically governed by a dynamic equation. In many cases the underlying dynamics are known a priori (e.g. from the laws of physics or economy) and a practitioner/ an algorithm searches for a solution that optimally matches spatio-temporal data. Learning exact solutions of dynamic equations is thus a key goal of spatio-temporal modeling. This article introduces a new kernel-based learning theory that applies to spatio-temporal data analysis under dynamic equation constraints. The constraints are implemented in terms of positive definite kernels that solve the dynamic equation exactly. This confers the main advantages of our method 1) coverage of domain constraints in terms of initial and boundary conditions and 2) exact modeling of the spatial and temporal dynamics of a process. Positive definite kernels play a fundamental role as building blocks of kernel-regression and density estimation in non-parametric statistics [Silverman1986]

and of support vector machines

[SS2002]

in machine learning. The representer theorem can be seen as the cardinal result in the application of kernel-based methods. It states that the minimizer of an empirical risk functional can be expanded in terms of kernels evaluated at data samples alone. On the technical side the main innovation behind our framework lies in the introduction of a dynamic risk-minimization framework and a respective representer theorem that applies in the spatio-temporal setting.

Formally, the present article investigates the learning of a time-dependent process from samples observed at different times , . The main underlying assumption is that the function can be expanded in terms of a convergent series of time-dependent positive kernels

 (1.1) f(x,t)=∞∑μ=1aμKt(x,xμ),

with time-independent coefficient . Functions of the form (1.1) occur commonly as solutions of time-dependent partial differential equations (PDEs). Two flavors of the learning problem are studied here:

1. Spatio-temporal Kernel Regression: Suppose at times noisy data samples are taken from a function that evolves according to (1.1). What is the optimal fit to the total sample among the space of solutions?

2. Spatio-temporal Kernel Density estimation:

Samples are taken from an evolving density at times . In this case (1.1) reflects a distributional embedding [Smola2007], see below. We construct a kernel estimator for the evolving density that takes account of all samples simultaneously.

Our guiding application is the learning of solutions of time-dependent diffusion equations of the form

 (1.2) Lu(x,t)=∂tu(x,t),

where is a (self-adjoint) operator of Fokker-Planck type, see below for details. represents either a regression function or density. This is motivated by the ubiquitous occurrence of Itô diffusion processes throughout scientific disciplines, see e.g. [Allen2010, Risken1989, Brigo2001]. The simplest instance of (1.2) is the heat equation

 (1.3) ∂xxu(x,t)=∂tu(x,t),

which describes the density of the ordinary Brownian motion. Its solutions can be written in the form (1.1) with Gaussian kernels

 KGausst(x,xi)=1√2πte−(x−xi)22t.

Other common examples include the geometric Brownian motion as a model for stock prices and the Hull-White model (Ornstein-Uhlenbeck process) for interest rates in economy.

In the context of PDEs, initial and boundary conditions play a crucial role. For regression they can lead to over-determined learning problems, i.e. no solution of the PDE complies simultaneously with boundary conditions and measured data. In practice this can occur as a result of noise. In such cases our method can be applied to balance between mismatches in boundary conditions and measurement errors. For illustration Figure 1.1 shows an example of an over-determined spatio-temporal regression with solutions of the heat equation.

Background on spatio-temporal models: A common method to account for the temporal discrepancy of data is inverse distance weighting (IDW) [Wikle2011, Wikle2019]. The main idea is that values at unknown points in time are computed as weighted averages of the available data snap-shots, where higher weight is given to temporally closer data111This is known as ’Shepard’s method’ in spatial data analysis. Another name, ’Tobler’s law’, is common also for spatio-temporal problems [Miller2004] and stems from geography.. This form of IDW can be viewed as a spatio-temporal kernel method, where the weights are given by

 w(ik)(jl)=K((x(k)i,tk),(x(l)j,tl)).

Thus a ’straight-forward’ approach to spatio-temporal modeling is to follow the IDW paradigm by incorporating time-coordinates as part of the observations into a given kernel. In reality, however, the evolution of a system is much less ’symmetric’ with respect to space and time and usually governed by a dynamic equation. This article introduces spatio-temporal modeling, where a time-dependent kernel satisfies a PDE exactly, as an alternative to the IDW approach. Finally, hierarchical dynamic spatio-temporal models [Wikle2010] provide a framework that can capture a much wider variety of non-separable models than kernel IDW but our kernels exactly follow the systems dynamic equation. The key advantages of our method are 1) that domain constraints are covered naturally in terms of initial and boundary conditions and 2) that it allows for an exact modeling of the spatial and temporal dynamics of a process. This is of particular interest for prediction, e.g. in the presence of data drifts that are captured within Fokker-Planck dynamics. Diffusion-based kernel density estimation [Botev2010] employs the Fokker-Planck equation as a resource for the construction of kernels under domain constrains and pilot density estimates. It appears natural to study such kernels also in the context of spatio-temporal modeling. Hence, the first main contribution of this article lies in the observation that spatio-temporal inter-dependence in many cases is much better captured via Fokker-Planck dynamics with time-dependent kernel than a fixed multivariate kernel with optimized bandwidth. Figure 1.2 shows an example of spatio-temporal prediction comparing IDW and PDE approaches, see Example 11 for details. Remark the key feature of the new PDE predictor, that high error levels in training data, (A)–(C), are ’smoothed’ out to provide an accurate prediction (D) that is compliant with the underlying heat equation.

Background on PDEs: Kernel expansion techniques are commonly employed as a meshless (as opposed to finite difference) method for the numerical approximation of PDE solutions [Belytschko1996]. For time-independent PDEs, kernel discretization represents the trial function entirely as a sum of ’data-nodes’. This approach follows a classical line of [Trefftz1926] to use trial functions that satisfy a PDE exactly. For time-dependent PDEs, meshless kernel-based methods usually rely on a time-independent spatial sum representation, but the coefficients explicitly depend on time. In contrast the article at hand expands solutions as a series of time-dependent kernels that solve (1.1) exactly (following more rigidly the lines of [Trefftz1926]). While this adds complexity to the structure of kernels it conveniently eliminates the time integration from solving the PDE and allows for a natural representer theorem. For an introduction to kernel methods with focus on meshless PDE solutions and machine learning see [Schaback2006]. Our work is inspired by the recent article [Hon2015], which studies the numerical approximation of solutions of (1.2) when is a spatial, elliptic operator. The important representation of solutions in the form (1.1) is introduced there as a method for meshless approximation. While [Hon2015]

focuses on kernel-based interpolation of initial conditions, here, the technique is extended to learn from spatio-temporal data. Thus the second main contribution of the present article lies in the recognition that the representation (

1.1) is particularly suited for statistics and machine-learning applications with spatio-temporal samples and in the derivation of a dynamic representer theorem for spatio-temporal models.

### 1.1. Recap on statistical learning

We recapitulate some concepts of supervised learning. For details we refer the reader to the survey article

[Luxburg2008] and references therein. Let be a set of functions

. Given a fixed but unknown probability distribution

on , the risk of is

 R(f):=∫X×Yloss(f(x),y)dP(x,y).

measures the discrepancy between and for simplicity is assumed to be convex in what follows. Risk measures by how much (accumulated over ) the learned responses deviate from the ground truth . The goal of learning is to identify that has as small as possible risk within the constraints imposed by ,

 f∗=argminf∈FR(f).

In practice, it is impossible to determine at least for the following reasons:

(1) Without structural constraints on , the minimization is intractable.

(2) is not known.

As a consequence of an a priori structure is commonly imposed on . Regarding point , a data sample is taken i.i.d. from . Given the sample, a natural approach is to minimizes the empirical risk

 (1.4) RSemp(f):=1NN∑i=1loss(f(xi),yi).

The classical representer theorem asserts that if this minimization is performed over a reproducing kernel Hilbert space (RKHS), then any optimal function is a finite sum of kernels evaluated at sampled data points [Wahba1999, Christmann2008]. This reduces the infinite-dimensional optimization problem to an -dimensional sub-space making it accessible to numerical optimization. For completeness, the classical representer theorem asserts the following.

###### Theorem 1 (Classical representer theorem).

Suppose that is a symmetric, positive definite kernel on a set . Let denote the RKHS spanned by convergent sums of kernels. Let denote a data sample taken from . Then

 inff∈HRSemp(f)=minf∈HrepRSemp(f),

where

 Hrep={f=f(x) | f(x)=N∑μ=1aμK(x,xμ)}.
###### Remark 2.

In what follows we assume, without saying explicitly, that series of the form are convergent in the norm of the Hilbert space. This is justified by the fact that we only consider within its native (i.e. Moore-Aronszajn) RKHS [Schaback2006]. Similarly series involving time-dependent kernels converge for fixed within the respective RKHS.

The article at hand introduces a time parameter to risk minimization. This involves a generalization of the empirical risk functional (1.4) and an extension of Thm. 1 beyond the setting of a single RKHS.

For completeness we mention the important topic of regularization. To prevent overfitting, a ’penalty term’ (regularizer) is added to the empirical risk functional. This has been introduced in [Kimeldorf1971] and extended to regularizers of the form with strictly increasing in [Schoelkopf2001]. The article [Argyriou2009] investigates a matrix-valued setup and derives conditions for a respective representer theorem. In the interest of space we do not cover regularization here, although regularized versions of our main theorems 58 and 12 clearly hold.

### 1.2. Recap on the Fokker-Planck equation

We recapitulate some standard results from the theory of the Fokker-Planck equation. Details and derivations can be found in the monograph [Risken1989]

. The Fokker-Planck equation plays a central role in the theory of stochastic processes: a normalized solution of the equation represents the probability density of a random motion that follows a stochastic process of Itô type. The simplest physical setting is that of a particle undergoing a diffusion process. The simplest economic setting is that of a random market price of a financial asset. To set the stage, suppose a one-dimensional Itô process

 dXt=μ(Xt)dt+σ(Xt)dWt

with time-independent222Itô processes with time-dependent drift or diffusion require a more general discussion although some of our findings still apply. drift and diffusion . The Fokker-Planck equation for the probability density

of the random variable

is

 ∂tρ(x,t)=−∂x[μρ(x,t)]+∂xx[Dρ(x,t)].

For brevity it is common to introduce the Fokker-Planck operator

 L:=−∂xμ+∂xxD

writing the differential equation in the form (1.2). An important role in the study of boundary conditions of this PDE is played by the occurrence of a stationary density , i.e.

 limt→∞ρ(x,t)=ρfix(x).

If a stationary solution exists then it can be written in the form

 ρfix(x)=Ce−Φ(x),

with a potential of the form and a normalization constant . The potential is a natural point to incorporate boundary conditions of Dirichlet type: If the potential jumps to an infinite value at then the density is confined to an interval . More generally, if the potential satisfies appropriate growth conditions then there is a unique stationary density of the above form. In this situation it is common to introduce a scalar product

 ⟨f|g⟩eΦ=∫f(x)g(x)eΦ(x)dx.

The respective Hilbert function space is . A quick computation (that is detailed out in [Risken1989, Section 5.4]) shows that is self-adjoint with respect to under various types of boundary conditions. Hence, our working assumption is:

(A) The operator is a negative333

The assumption of negativity can be relaxed. If positive eigenvalues are present, a time horizon

is fixed for solutions of the PDE. The requirement is that the eigenfunction expansion of the kernel (

1.5) is convergent at . self-adjoint operator on the Hilbert space for a certain potential .

In what follows only (A) itself is relevant rather than its exact origin. Conditions that are sufficient for (A) are described in the literature, see [Risken1989, Section 5.4] for an introductory treatment. The spectrum of can be discrete or continuous or both, a spectral decomposition follows from the spectral theorem. For convenience we focus on the discrete spectrum, but the discussion can be generalized using standard tools of functional analysis. In this situation, a complete orthonormal basis (wrt. ) diagonalizes ,

 Lφn=−λnφn,λn≥0,n=1,2,...

We introduce the kernel by the formula

 (1.5) KFPt(x,x′)=∞∑n=1e−λntφn(x)φTn(x′)

and notice that it solves (1.2) exactly. Notice also that the completeness relation for the eigenfunctions reads , i.e. the kernel is nothing but the Green’s function of the Fokker-Planck equation. One can obtain the general solution of the PDE with initial conditions as a convolution integral

 (1.6) f(x,t)=∫Xg(x′)KFPt(x,x′)dx′.

The representation for is of the same form as discussed in [Hon2015] for the example of the heat equation.

###### Lemma 3.

For any is a symmetric, positive-definite kernel on and for any fixed the kernel generates a unique RKHS .

The simple proof is shown in the appendix.

###### Example 4.

(Ornstein-Uhlenbeck process) To illustrate the framework, consider the Ornstein-Ulenbeck (OU) process, which plays an important role in interest rate models in financial Mathematics [Brigo2001]. For fixed , the zero-mean OU process is characterized by the stochastic differential equation

 dXt=−θXtdt+σdWt.

The process is mean-reverting in the sense that if then the drift term is positive and if then the drift term is negative. determines the speed of mean reversion. The associated Fokker-Planck equation is

 ∂tρ=θ∂x(xρ)+σ22∂xxρ,

i.e. . The eigenvalues are , with eigenfunctions

 φn(x)=1√√2πn!e−12(x/√σ22θ)2Hn(x/√σ22θ).

Here denote the Hermite polynomials. The stationary density corresponds to the eigenvalue , that is

### 1.3. Recap on kernel density estimation

Given independent realizations

from an unknown continuous probability density function

on and a positive kernel , the kernel density estimator is

 (1.7) ^ρ(x)=1NN∑i=1K(x,Xi).

The methodology developed here relies on the kernel embedding of distributions of [Smola2007]. Given a kernel on the distributional embedding of a density into the RKHS of is

 [ε(ρ)](x)=∫XK(x,x′)ρ(x′)dx′∈HK.

On the technical side it should be mentioned that the integral is assumed to exist and that the embedding is assumed to be injective. A detailed discussion of the technical backbone can be found in [Smola2007, Steinwart2002]. The embedding allows one to operate on distributions using Hilbert space concepts such as inner products and linear projections. For us it serves two purposes.

1) It is the entrance point to interpret density estimation within the risk minimization framework.

2) In the context of evolving densities and time-dependent kernels, the embedding naturally mirrors the structural properties of the PDE that governs the evolution.

The kernel mean estimator (KME) is the minimizer of the empirical risk functional (compare (1.4)) of the embedding [Muandet2014, STC2004],

 (1.8) RKME(ε)=1NN∑i=1||ε−K(⋅,xi)||2RKHS,

where denotes the norm of the RKHS. The representer theorem, Thm. 1, applies and yields a representation of the KME in the form of an ordinary kernel density estimator . In the absence of a regularizer, equal weights are optimal [STC2004, Prop. 5.2], which yields (1.7).

## 2. Results

### 2.1. Supervised Learning of time-dependent functions

Let noisy data samples be taken at different times , and let denote the full data sample over all times. Let be a time-dependent function to be estimated via kernel regression from a set of admissible functions , using information contained in . We define the empirical risk of over as an equally weighted sum of the empirical risks of over ,

 (2.1) RSemp(f) =1TT∑k=1RSkemp(f(⋅,tk)).

This generalizes (1.4) to the case that but can also be viewed as a special case in the sense that is interpreted as the underlying data sample. We begin by studying the situation when samples are taken from a countable set assuming that is composed of functions of the form , with symmetric and positive kernels .

###### Theorem 5 (Time-dependent representer theorem).

Suppose that is a symmetric, positive definite kernel on a set for every . Let

 G={f=f(x,t) | f(x,t)=∞∑μ=1aμKt(x,xμ)}

and suppose that at times , , data samples are measured and that . Then

 inff∈GRSemp(f)=minf∈GrepRSemp(f),

where

 Grep={f=f(x,t) | f(x,t)=T∑k=1N∑i=1ai,kKt(x,x(k)i)}.

The point is that, although it is not obvious how an RKHS structure arises from time-dependent kernels simultaneously over , still a representer theorem applies. The consequence is that the optimization task is feasible over the infinite set , i.e. it is sufficient to minimize over a list of parameters . The proof is presented in the appendix.

###### Remark 6.

In case the loss function is

 loss(f(x),y)=(f(x)−y)2,

the kernel regression problem can be solved explicitly in terms of the Moore-Penrose pseudo inverse. Let

denote the vectors of entries

and and let be the matrix whose entries are . The system of linear equations

 →y=K⋅→a,

can have no, a unique, or an infinite number of solutions. In either case the best match in terms of smallest -norm is given by

 →a∗=K+⋅→y,

where denotes the Moore-Penrose pseudo-inverse of .

###### Example 7.

(Minimal example) Suppose measurements outcomes

 S1 ={(x(1)1=1,y(1)1=1),(x(1)2=2,y(1)2=1)}, S2 ={(x(2)1=1,y(2)1=1),(x(2)2=2,y(2)2=2)}

are obtained at times and from a function of the form . Suppose the 2-norm is used to measure loss. We have the following representation for the matrix

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝Kt1(x(1)1,x(1)1)Kt1(x(1)1,x(1)2)...Kt1(x(1)1,x(2)2)Kt1(x(1)2,x(1)1)Kt1(x(1)2,x(1)2)...Kt1(x(1)2,x(2)2)Kt2(x(2)1,x(1)1)Kt2(x(2)1,x(1)2)...Kt2(x(2)1,x(2)2)Kt2(x(2)2,x(1)1)Kt2(x(2)2,x(1)2)...Kt2(x(2)2,x(2)2)⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠

where and . Computing using appropriate software yields the optimal coefficients .

The theorem assumes a representation of candidate functions in terms of infinite series (corresponding to a countable set ). The convolution integral representation (1.6) of PDE solutions is a continuous version of (1.1). The techniques for formalizing this generalization are, of course, well established. [Christmann2008, Section 5] contains a rigorous discussion. To apply Thm. 5 to the Fokker-Planck equation we represent its solution in terms of the convolution integral.

###### Theorem 8.

Let be the set of solutions of the Fokker-Planck equation with

 L=−∂xμ+∂xxD,

and and boundary conditions such that satisfies assumption (A). Suppose that at times , , data samples are measured. Then

 inff∈FRSemp(f)=minf∈FrepRSemp(f),

where

 Frep={f=f(x,t)|f(x,t)=T∑k=1N∑i=1ai,kKFPt(x,x(k)i)}.

The theorem links two different optimization exercises related to the PDE. On the one hand, there is the original optimization, which asks to minimize the empirical risk over solutions of the Fokker-Planck equation. On the other hand, there is the optimization of over functions i.e. over a set of coefficients . By the Green’s function property of the kernels, the coefficients reflect the initial conditions . In other words the optimization in Thm. 8 is, de facto, an optimization over initial conditions. If the PDE is given with a priori initial conditions of the form then its solution is unique and the optimization is, in principle, trivial. However, in practice initial conditions can be furnished with small errors (e.g. from a measurement process). Such initial conditions can be naturally incorporated into the setting of Thm. 8 by adding a data sample of the form to . In this case the empirical risk decomposes as

 (2.2) N∑i=1loss(g(x(0)i),y(0)i)+T∑k=1N∑i=1loss(f(x(k)i,tk),y(k)i)

and Thm. 8 is applied to the second term.

###### Example 9.

(Heat Equation with Initial Conditions) Suppose the temperature of a metal rod is measured are times , . The initial condition is interpreted as a soft condition (and represented as a measurement at ). temperature sensors are placed at equal distances over the interval . It is assumed that temperature evolves according to the heat equation and the -norm measures loss. For illustration we suppose a measurement error of the form at and . Measurement outcomes and the optimal match solution obtained by minimizing (2.2) are shown in Figure 2.1.

Boundary conditions can be treated in similar vein. Hard boundary conditions imply that the kernels comply with the boundary conditions exactly. If a certain level of error can be tolerated, Thm. 8 can be used to trade off between regression error and boundary conditions.

###### Example 10.

(Heat Equation with hard Dirichlet Boundary Conditions) Suppose, as before, that temperature sensors are equally placed over a metal rod and suppose measurements occur at times , , . Suppose the rod is restricted to an interval and has temperature at . As before temperature evolves according to the heat equation and the -norm measures loss. For illustration it is assumed that measurement outcomes can be described by the functions at , by at and by at , see Figure 1.1. We apply Thm. 8 to identify the optimal match solution to the heat equation with Dirichlet conditions. Straight forward computations show that

 φn=√2sin(nπx), λn=n2π2,
 KDirichBoundt(x,x′)=2∑n≥0e−n2π2tsin(nπx)sin(nπx′).

Figure 1.1 shows the optimal match solution obtained from Thm. 8 using the pseudo-inverse.

In the real world dynamics the introduced framework can be expected to perform better in terms of prediction than agnostic techniques like IDW. This is illustrated by the following example.

###### Example 11.

(Prediction under hard Dirichlet Boundary Conditions) Suppose that as in Example 10 temperature measurements are taken at times , , from a metal rod with hard Dirichlet boundary conditions. Our goal is to predict the temperature at the future point in time given high levels of measurement error in the individual samples. The latter is modeled by the function and added to each measurement. Two predictors are set up for comparison. The first is computed from Thm. 8. The second is the ordinary Gaussian kernel predictor , where the time-coordinates have been included as part of the observation. The bandwidth has been calibrated via cross validation to . Figure 1.1 shows graphs of matches to data and respective predictions. The PDE predictor averages measurement error at different times to produce an accurate estimate of the exact evolution.

### 2.2. Learning a time-dependent density

Suppose that a time-dependent density evolves according to (1.2) and that samples are taken from at times , . We construct a kernel density estimator that simultaneously takes account of all samples while retaining consistency with (1.2). To this aim we embed the initial density into the RKHS of the Fokker-Planck kernels , representing it as

 [ε(ρ)](x,t)=∫XKFPt(x,x′)ρ(x′)dx′∈HKt.

Notice that this convolution integral automatically solves the Fokker-Planck equation. Thus the embedding lifts the evolution of the density to the level of the RKHS and the developed regression theory applies. We generalize the KME by choosing the minimizer of

 (2.3) RSKME(ε)=1TT∑k=1RS(k)KME(ε(⋅,tk))

as a “simultaneous density estimator”. This functional arises from the time-dependent empirical risk (2.1) in the same way as the KME risk functional (1.8) arises from the ordinary (time-independent) empirical risk (1.4).

###### Theorem 12 (Time-dependent representer theorem for density estimation).

Let be the set of densities that solve the Fokker-Planck equation with

 L=−∂xμ+∂xxD,

and and boundary conditions such that satisfies assumption (A). Suppose that at times , , samples are taken from the density . Then

 infρ∈PRSKME(ε(ρ))=minε∈PrepRSKME(ε),

where

 Prep={ε=ε(x,t)|ε(x,t)=T∑k=1N∑i=1βi,kKFPt(x,x(k)i)}.

Thm. 12 is an immediate consequence of Thm. 8, which is applied to (2.3) to demonstrate that the minimizer is as a finite sum of kernels.

###### Remark 13.

Similar to regression, the optimization can be solved explicitly in terms of the pseudo-inverse. First suppose , let be the matrix whose entries are and let . Let denote vectors of entries and . A quick computation shows that

 minε∈PrepRSKME(ε) =min→β(→βTK→β−2∗→1/NTK→β) =min→β||L→1/N−L→β||22.

Of course the minimum is as mentioned before. This reasoning applies mutatis mutandis when : The minimum of

 min→βT∑k=1||Ltk→1/Nk−Ltk→β||22,

where now , , and has entries corresponding to and else, can be computed using the pseudo-inverse of .

As before the optimization over all elements of is an optimization over possible choices of initial density. The optimal solution corresponds to a list of coefficients and reflects the initial conditions . Given these initial conditions the solution of the PDE is unique.

A common situation in kernel density estimation is that the domain of data is known in advance, say . This leads to the topic of boundary conditions. The homogeneous Neumann boundary conditions

 ∂xρ(x,t)|x=0=∂xρ(x,t)|x=1=0

ensure that , which entails

 ∫Xρ(x,t)dx=∫Xρ(x,0)dx=1.
###### Example 14.

(Learning an evolving density on ) Suppose that two samples each of size are taken at times and from a density that evolves according to the heat equation with homogeneous Neumann boundary conditions on . For illustration we suppose that at time the density is the beta density and we construct an estimator for the density at that accounts for both samples.

The kernel of the heat equation under Neumann boundary conditions is

 KNeumBoundt(x,x′)=1+2∑n>0e−n2π2tcos(nπx)cos(nπx′).

The individual estimators for samples at times and are and , where . A combined estimator for the evolving density with coefficients is provided by Thm. 12, see Figure 2.2.

## 3. Conclusion

Time-dependent PDEs occur in countless situations throughout scientific disciplines. Kernel methods are commonly employed in PDE theory [Schaback2006] but the existing techniques mostly focus on the time-independent context. Similarly, static kernel methods are common in spatio-temporal modeling and fall under the general IDW paradigm. Models that accurately take account of the system dynamics are much less common. The article at hand introduces a new kernel-based paradigm for spatio-temporal modeling based on time-dependent kernels that realize the Fokker-Planck dynamics. A respective kernel-based learning theory for time-dependent PDEs is introduced and a representer theorem is provided for the application of dynamic kernel techniques.

Our kernel density estimators are closely related to the famous diffusion-estimator of [Botev2010]. The latter uses the Fokker-Planck equation as a resource for the construction of estimators under prior information, such as domain constrains and a pilot density estimate. On the theoretical side we expect applications of our method in the construction of diffusion estimators from multiple correlated samples along the lines of [Botev2010].

## 4. Appendix

###### Proof of Lemma 3.

Symmetry of the kernel, i.e. follows directly by plugging in. To show positive-definiteness set and compute which is positive-definite because . ∎

###### Proof of Thm. 5.

By definition , i.e.

 inff∈GRSemp(f)≤minf∈GrepRSemp(f).

Fix and let, by the Moore-Aronszajn theorem, be the unique reproducing kernel Hilbert space with kernel . Evaluating at simply gives . We view (for fixed ) the function as an element of . To emphasize this we introduce the notation , i.e. . Let be the projector in on

 K=span{Ktk(x,x(k)i)}i=1,...,N

and let be the projector onto the orthogonal complement . For any function , which is contained in the orthogonal complement of the reproducing property of gives

 0=⟨g(x)|Ktk(x,x(k)i)⟩Aronszajn=g(x(k)i).

This implies As a consequence we find that