# State Space representation of non-stationary Gaussian Processes

The state space (SS) representation of Gaussian processes (GP) has recently gained a lot of interest. The main reason is that it allows to compute GPs based inferences in O(n), where n is the number of observations. This implementation makes GPs suitable for Big Data. For this reason, it is important to provide a SS representation of the most important kernels used in machine learning. The aim of this paper is to show how to exploit the transient behaviour of SS models to map non-stationary kernels to SS models.

## Authors

• 18 publications
• 24 publications
05/24/2019

### Sequential Gaussian Processes for Online Learning of Nonstationary Functions

Many machine learning problems can be framed in the context of estimatin...
03/18/2021

### Data-Driven Wireless Communication Using Gaussian Processes

Data-driven paradigms are well-known and salient demands of future wirel...
12/25/2019

### Scalable Gaussian Process Regression for Kernels with a Non-Stationary Phase

The application of Gaussian processes (GPs) to large data sets is limite...
07/17/2018

### Mixed-Stationary Gaussian Process for Flexible Non-Stationary Modeling of Spatial Outcomes

Gaussian processes (GPs) are commonplace in spatial statistics. Although...
03/11/2020

### General linear-time inference for Gaussian Processes on one dimension

Gaussian Processes (GPs) provide a powerful probabilistic framework for ...
10/28/2020

### Hierarchical Gaussian Processes with Wasserstein-2 Kernels

We investigate the usefulness of Wasserstein-2 kernels in the context of...
06/18/2020

### Infinite attention: NNGP and NTK for deep attention networks

There is a growing amount of literature on the relationship between wide...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

In machine learning, Gaussian Processes (GP) are commonly used modelling tools for Bayesian non-parametric inference (O’Hagan and Kingman, 1978; Neal, 1998; MacKay, 1998; Rasmussen and Williams, 2006; Rasmussen, 2011; Gelman et al., 2013). For instance in GP regression,

, the aim is to estimate

from (noisy) observation . A natural Bayesian way to approach this problem is to place a prior on and use the observations to compute the posterior of . Since is a function, the GP is a natural prior distribution for (MacKay, 1998; Rasmussen and Williams, 2006). A GP, denoted as , is completely defined by its mean function (usually assumed to be zero) and covariance function (CF) (also called kernel)

, which depends on a vector of hyperparameters

. By suitably choosing the kernel function, we can make GPs very flexible and convenient modelling tools. However, a drawback with GPs is that the direct computation of the posterior of is computationally demanding. The computational cost is cubic, , in the number of observations. This makes GPs unsuitable for Big Data. Several general sparse approximation schemes have been proposed for this problem, (see for instance Quiñonero-Candela and Rasmussen (2005) and (Rasmussen and Williams, 2006, Ch. 8)).

In the case the function is defined on , i.e., , computational savings can be made by converting the GP into State Space (SS) form and make inference using Kalman filtering. Note that the case is particularly important because it includes time-series analysis ( is time). The connection between GPs and SS models is well known for some basic kernels and recently it has gained a lot of interest. Certain classes of stationary CFs can be directly converted into state space models by representing their spectral densities as rational functions (Särkkä and Hartikainen, 2012; Sarkka et al., 2013; Solin and Särkkä, 2014). Moreover, an explicit link between periodic (non-stationary) CFs and SS models has also been derived in (Solin and Särkkä, 2014).

The connection between the SS representation of GPs and GPs models used in machine learning is important for many reasons. First, in machine learning, it has been shown that GPs, with a suitable choice of the kernel, are universal function approximators (Rasmussen and Williams, 2006; Williams, 1997). Moreover, GPs can also be used for classification, with only a slight modification. Second, inferences in SS models can be computed efficiently by processing the observations sequentially. This means that the computational cost of inference in the SS representation of GPs is . Third, SS models represent GPs through Stochastic Differential Equations (SDE). They return a model that directly explains the time-series and not only fits or predicts it. It is well known that SDEs are basic modelling tools in econometrics, physics etc.. Hence, if we are able to map the most important kernels used in machine learning to the SS representation we can “kill three birds with one stone”, i.e., we can have an explanatory model with a universal function approximation property at a cost of . This is the aim of this paper. In particular, the goal is to extend the work (Särkkä and Hartikainen, 2012; Sarkka et al., 2013) by providing a SS representation of the most important kernels used in machine learning.

In particular, we will show that non-stationary kernels can be mapped into SS models by considering the transient

behaviour. It is well known that the time response of linear SS models is always the superposition of an initial-condition part and a driven part. The response due to initial-conditions is often ignored, because it vanishes in the stationary case (it is transient). However, for non-stationary systems, the transient never vanishes and, thus, it determines the behaviour of the system. Even with a zero initial condition, we can have a transient behaviour due to the driven part. We will show that by taking into account the transient, we can map the linear regression, periodic and spline kernel to SS models. Moreover, we will also study the transient behaviour for stationary systems to show that in this case it vanishes. To reconcile these two cases, we will make use of the Laplace transform that is able to account for the transient behaviour. This is a difference w.r.t. the work by

Särkkä and Hartikainen (2012); Sarkka et al. (2013)

where they employed the Fourier transform. Then we will show how to map the neural networks kernels to SS models. For this purpose we will use linear time-variant SS, that are intrinsically non-stationary. Finally, by means of simulations we will show the effectiveness of the proposed approach and the computational advantages by applying it to long time-series. In this work, for lack of space, we will assume that the reader is familiar with the machine learning representation of GPs and we will only discuss the SS representation.

## 2. State Space model

Let us consider the following stochastic linear time-variant (LTV) state space model (Jazwinski, 2007)

 {df(t)=F(t)f(t)dt+L(t)dw(t),y(tk)=C(tk)f(tk), (1)

where is the (stochastic) state vector, is the observation at time , is a one-dimensional Wiener process with intensity and are known time-variant matrices of appropriate dimensions. We further assume that the initial state and are independent for each . It is well know that the solution of the stochastic differential equation in (1) is (see for instance Jazwinski (2007)):

 f(tk)=ψ(tk,t0)f(t0)+tk∫t0ψ(tk,τ)L(τ)dw(τ), (2)

with is the state transition matrix, which is obtained as a matrix exponential.111The matrix exponential is .

###### Proposition 1.

Assume that , then the vector of observations

is Gaussian distributed with zero mean and covariance matrix whose elements are given by:

 E[y(ti)y(tj)]==C(ti)ψ(ti,t0)E[f(t0)fT(t0)](C(tj)ψ(tj,t0))T+min(ti,tj)∫t0h(ti,u)h(tj,u)q(u)du (3)

where we have exploited the fact that and defined ( is called impulse response).

The proof of this proposition is well known (see for instance Jazwinski (2007)), but we have reported the derivations of this proposition (and next propositions/theorems) in appendix for the convenience of the reader. From (3), it is evident that, given the time-varying matrices , the CF of a LTV system is completely defined by: (i) the covariance of the initial condition ; (ii) the CF of the noise .

In case the SS model is Linear Time-Invariant (LTI), i.e., , we can use the Laplace transform to derive (2) and (3) using only algebraic computations. The Laplace transform of a function , defined for , is:

 x(s)=∞∫0e−stx(t)dt,

where the parameter is the complex number , with and denoting the imaginary unit. The Laplace transform exists provided that the above integral is finite. The values of for which the Laplace transform exists are called the Region Of Convergence (ROC) of the Laplace transform. By using the Laplace transform, we can rewrite the differential equation in (1) in an algebraic form:

 sf(s)−f(t0)=Ff(s)+Lw(s), (4)

where are the Laplace transforms of .222By defining the Laplace transform (4), we are wrongly considering as a deterministic input. We use this notation only for convenience, but then we define the correct inverse Laplace transform in (6). Since , we have

 y(s) =C(sI−F)−1f(t0)+H(s)w(s), (5)

where is called the transfer function of the linear time-invariant (LTI) SS model and

is the identity matrix. Since a product in the Laplace domain corresponds to a convolution in time, it follows that

 y(tk)=L−1(C(sI−F)−1)f(t0)+∫tkt0h(tk−u)dw(u), (6)

where denotes the inverse Laplace transform. By computing , we obtain again (3). The output of both LTV and LTI systems is clearly completely defined by the SS matrices, the initial condition and the stochastic forcing term . The aim of the next sections is to show that by suitably choosing these three components we can obtain SS models whose CF coincides with the main kernels used in GPs.

## 3. Non-stationary CFs defined by LTI SS without forcing term

In this section, we will show that two important CFs used in GPs can be obtained by two LTI SS models without stochastic forcing term. Their output is therefore completely determined by the initial conditions and, thus, the CF they define is non-stationary (it depends on ).

### 3.1. Linear regression kernel

Assume without loss of generality that ,

 F= [0100], L=[01], C= [10], (7)

333Since , the matrix is completely superfluous. We have introduced it only because we will use this model later. and so there is not forcing term (). This corresponds to the following LTI SS model:

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩df1dt(t)=f2(t),df2dt(t)=0,y(tk)=f1(tk). (8)

From the equations of this SS model, since we derive that (it is the function of interest) and is its derivative. By computing , we have that

 ψ(t,t0)=[1t−t001],

and, therefore, .

###### Proposition 2.

Consider the SS model in (8) and assume that is Gaussian distributed with zero mean and covariance and . From (2), we have that is Gaussian distributed with zero mean and CF

 E[y(ti)y(tj)]=σ21+σ22titj,

for each , which corresponds to the linear regression CF (Rasmussen and Williams, 2006, Sec. 4.2.2).

This connection between SS and linear regression is well-known, here we have shown how to derive the CF. Higher order linear regression CFs can be obtained by considering and

 F= [0m−1Im−100Tm−1], L=[0Tm−11] C= [0m−11], (9)

where is the zero vector.

### 3.2. Periodic Kernel

Consider the following LTI SS:

 F= [0ωk−ωk0],C=[10], (10)

with for all . By computing , we have that

 ψ(t,t0)=[  cos(ωk(t−t0))sin(ωk(t−t0))−sin(ωk(t−t0))cos(ωk(t−t0))],

and, so

 y(ti) =Cψ(ti,t0)f(t0) =cos(ωk(t−t0))f1(t0)+sin(ωk(t−t0))f2(t0) =akcos(ωk(t−t0))+bksin(ωk(t−t0)), (11)

where we have written and .

###### Proposition 3.

Consider the SS model in (10) and assume that

are Gaussian distributed with zero mean, variances

, and uncorrelated . From (2), we have that is Gaussian distributed with zero mean and CF

 E[y(t1)y(t2)] =E[a2k]cos(ωk(t1−t0))cos(ωk(t2−t0)) +E[b2k]sin(ωk(t1−t0))sin(ωk(t2−t0)). =E[a2k]cos(ωk(t2−t1))

for each , where last equality holds if .

This SS model defines a periodic CF. However, it is evident (see in particular (11)) that it can only represent sinusoidal type periodic functions. However, we know that any periodic function in , that is integrable, can be approximated by Fourier series:

 f(t) ≈J∑k=1[akcos(2πktpe)+bksin(2πktpe)], (12)

where are the coefficients of the Fourier series, which depend on , and is the order of the approximation.444We can also include the constant term () in the series. Therefore, we can approximate any periodic function by a sum of SS models of type (10) with . For instance, for , we can consider the SS model

 F= ⎡⎢ ⎢ ⎢⎣0ω100−ω1000000ω200−ω20⎤⎥ ⎥ ⎥⎦,C=[1010]. (13)

Its transfer function is a diagonal block matrix with blocks

 [  cos(ωk(t−t0))sin(ωk(t−t0))−sin(ωk(t−t0))cos(ωk(t−t0))],

for . Hence, from (3), we have that

 E[y(ti)y(tj)]=2∑k=1E[a2k]cos(ωk(ti−t0))cos(ωk(tj−t0))+E[b2k]sin(ωk(ti−t0))sin(ωk(tj−t0)), (14)

where we have assumed that is a block diagonal matrix with blocks

 [E[a2k]00E[b2k]], (15)

for . In Fourier series, the function is known and so the coefficients can be computed based on . In GP regression, we do not know and so we do not know . We must estimate these coefficients from data.555When the period is unknown, we can estimate it from data. This is the reason we have assumed a prior distribution on these coefficients. This prior is completely defined by the variances for . If we further assume that then we have only parameters to specify, the . We can further assume that all the parameters are functions of a single parameter ; and penalize high order frequencies. For instance, if we choose , it can be shown that

 E[y(ti)y(tj)]=J∑k=0q2kcos(2πkpe(tj−ti)) J→∞−−−→exp⎛⎝−2ℓ2sin(π(tj−ti)pe)2⎞⎠,

which is the periodic CF used in GPs (Rasmussen and Williams, 2006, Sec. 4.2.3). This result has been derived by Solin and Särkkä (2014), here we have highlighted more extensively the transient analysis and the connection with the Fourier series.

## 4. Non-stationary CFs defined by LTI SS with zero initial conditions

In this section, we will show that an important CF used in GPs can obtained by a LTI SS model with stochastic forcing term and zero initial conditions.

### 4.1. Spline kernel

We derive the CF for the cubic smoothing splines. Consider the LTI SS model in (7), but this time assume that , and , then

 y(ti) =ti∫0Cψ(ti,τ)L(τ)dw(τ)=ti∫0(ti−τ)dw(τ)

and so, from (2), .

###### Proposition 4.

Consider the LTI SS model in (7), but this time assume that and . Then, from (2), we derive that is Gaussian distributed with zero mean and CF

 E[y(ti)y(tj)] =|ti−tj|min(ti,tj)22+min(ti,tj)33, (17)

for each , which is the kernel for the cubic smoothing splines (Rasmussen and Williams, 2006, Sec. 6.3.1).

Higher order splines can be obtained considering SS models as in (9). The connection between SS and smoothing splines has been first derived in Kohn and Ansley (1987). Here, we have highlighted the connection with GPs and computed the CF (17).

## 5. Stationary CFs defined by LTI SS models

A stationary CF is a CF that only depends on . In this section we will present SS models whose CF corresponds to stationary kernels used in GPs. Since stationary CFs satisfy for , i.e., they are even functions defined on , it is convenient to introduce the bilateral Laplace transform:

 B(f(τ))=∞∫−∞e−sτf(τ)dτ,

that is defined for functions that take values in . When the ROC includes the imaginary axis, then for , the bilateral Laplace transform reduces to the Fourier transform. Assume that the CF is stationary. The Wiener-Khintchine theorem (Chatfield, 2013) states that the CF is completely defined by its Fourier transform (its bilateral Laplace transform computed for ) (when it exists) and vice versa:

 k(τ)=∫RSy(ιω)e2πιωτdω,  Sy(ιω)=∫Rk(τ)e−2πιωτdτ,

where is the Fourier transform of also called the spectral density.

###### Theorem 1 (Representation theorem).

Define and assume that that is a (proper) rational function of . Then there exists a stable LTI system with the impulse response such that

 y(t)=∫t−∞h(t−u)dw(u) (18)

where is a stationary process with uncorrelated increments and spectral density . In the Laplace domain this can be written as

 Sy(s)=H(s)H(−s)q, (19)

where is a rational function whose denominator has all roots with negative real parts, has no roots with positive real part and is a positive real constant.666Since is a real even functions, the zeroes and poles are symmetric w.r.t. the real axis and mirrored in the imaginary axis.

This is standard result for LTI systems. By interpreting as the Bilateral Laplace transform of the noise (), then Equation (19) relates the spectral functions of the output of a SS model described by the transfer function with that of the input (the noise ) through the transfer function of the SS model . We can use (19) to derive the SS model that is related to . These are the steps: (i) find the zeroes and poles of ; (ii) take all the poles with real negative part and zeroes with non-positive real part; (iii) decompose as

 H(s)=∏i(s−zi)∏i(s−pi)=b0sn+b1sn−1+⋯+bn−1s+bnsn+a1sn−1+⋯+an−1s+an,

and derive the (observable canonical) LTI SS model:

 F =⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣010⋯0001⋯0⋮⋮0⋱000⋯⋮1−an−an−1−an−2⋯−a1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,L=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣r1r2⋮rn−1rn⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ C =[100⋯0], (20)

where

 ⎡⎢ ⎢ ⎢ ⎢⎣r0r1⋮rn⎤⎥ ⎥ ⎥ ⎥⎦=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣1a11⋯⋱an⋯a11⎤⎥ ⎥ ⎥ ⎥ ⎥⎦−1⎡⎢ ⎢ ⎢ ⎢⎣b0b1⋮bn⎤⎥ ⎥ ⎥ ⎥⎦

The above LTI system has CF .

### 5.1. Matérn kernel for ν=3/2

Let us consider again the stationary Matérn kernel for , i.e., for (Rasmussen and Williams, 2006, Sec. 4.2.1). The bilateral Laplace transform of is

 B(R(τ))=1(λ−s)2(λ+s)2,

which is a rational function. We can then apply Theorem 1 and, in this case, and . Using (20) we can derive the SS model:

 F= [01−λ2−2λ], L=[01] C= [10], (21)

. Its impulse response is for .

###### Proposition 5.

Consider (21) and assume that . Then, from (2), we obtain that is Gaussian distributed with zero mean and CF

 E[y(ti)y(tj)]=e−λ|tj−ti|(1+λ|tj−ti|)4λ3 −e−λ(−2t0+ti+tj)(1+λ(−2t0+ti+tj)+λ2(t0−ti)(t0−tj))4λ3

for each .

It can be observed that for , the second term goes to zero and and we obtain the Matérn CFs for and . Other Matérn CFs for can be obtained via LTI SS models in a similar way. This connection between LTI SS models and Matérn kernel has been firstly discussed in Sarkka et al. (2013). Here, we have reported the CF for finite (non-stationary case) and shown that the CF becomes stationary for .

### 5.2. Square Exponential kernel

Let us now consider the stationary square exponential kernel for (Rasmussen and Williams, 2006, Sec. 4.2.1). Its bilateral Laplace transform is , whose ROC is and, hence, . is not a rational function, so we cannot directly apply Theorem 1. However, it is a real-valued positive function (it is positive and non-zero) and it is analytic, so we can approximate with its Taylor expansion in zero and obtain:

 Sy(s) ≈=√2πℓd!2d1d!2d+d!2d−1ℓ2s2+⋯+ℓ2ds2d. (22)

We can find by determining the roots of the polynomial at the denominator that have negative real part. This can be done numerically, but the next result is useful.

###### Theorem 2.

The roots of the denominator in (22) can be obtained by computing the roots for and then dividing them for . This gives , while .

This result is useful because it allows us to compute off-line the solutions of (22) even when the hyperparameter is unknown. Let us consider for instance the case

 Sy(s)≈8√2πℓ1ω4ℓ4+4ω2ℓ2+8.

The roots of the denominator for are .777By dividing them for we obtain the roots for . Hence, the roots corresponding to the stable part are and so

 H(s)=1(s+a)2+(b)2,

with and , which can be represented by the following LTI SS model:

 F= [01−(a2+b2)−2a], L=[01] C= [10], (23)

with . The inverse Laplace transform of is .

###### Proposition 6.

Consider (23) and assume that . Then, from (2), we obtain that is Gaussian distributed with zero mean and CF

 E[y(ti)y(tj)] =8√2π min(ti,tj)∫t0exp(−a(ti−u))sin(b(ti−u))b ⋅exp(−a(tj−u))sin(b(tj−u))bdu

for each . For , it only depends on .

This connection between LTI SS models and squared exponential kernels has been derived in (Sarkka et al., 2013) and in (Sarkka and Piché, 2014) they have studied the convergence property of the Taylor series. Here, we have derived the new result in Theorem 2, computed the CF for a finite (non-stationary case) and shown that the CF becomes stationary for (see appendix for the proof).

## 6. LTV systems

Up to now, we have only worked with LTI SS models. Hereafter, we will show that moving from LTI to LTV allows us to map two fundamental non-stationary kernels (Williams, 1997) to SS models.

### 6.1. Non-stationary SE kernel

Let us consider this CF:

 E[y(t1)y(t2)] =e−12σ2mt2ie−12σ2s(ti−tj)2e−12σ2mt2j (24)

where are hyper-parameters. This CF is clearly non-stationary. However, its central term is the square exponential CF and, therefore, it can be approximated by a LTV SS that is a simple variant of the LTI SS that defines the square exponential CF. For instance, for this LTV SS model is:

 F= [01−(a2+b2)−2a], L=[01] C= [e−t2/20], (25)

The impulse response is for and the CF

 E[y(ti)y(tj)] =8√2πe−t21/2e−t22/2 min(ti,tj)∫t0exp(−a(ti−u))sin(b(ti−u))b ⋅exp(−a(tj−u))sin(b(tj−u))bdu

This is the approximation. For , the integral term depends only on . This CF is used in neural networks research and the connection with GPs has been discussed by Williams (1997).

### 6.2. LTV system defining the neural network kernel

Let us consider the following LTV SS model

 F= [0e−tTΣ1/2r00],C=[10], (26)

with , , and is a definite positive definite matrix. Its transfer function is

 ϕ(ti,t0)=⎡⎣1√π(erf(tiTΣ1/2r)−erf(tT0Σ1/2r))2r1σ101⎤⎦.

Hence, we have that

 y(ti)=Cψ(ti,t0)f(t0)=f1(t0)+√π(erf(tiTΣ1/2r)−erf(tT0Σ1/2r))2r1σ1f2(t0) (27)

and

 E[y(ti)y(tj)]==C(ti)ψ(ti,t0)E[f(t0)fT(t0)](C(tj)ψ(tj,t0))T=erf(tTiΣ1/2r)erf(tTjΣ1/2r), (28)

where we have assumed that is Gaussian distributed with zero mean and covariance matrix . Therefore, from (27), it is evident that (26) models erf like functions. We can add expressivity to the model as follows

 F= ⎡⎢ ⎢ ⎢ ⎢⎣0e−tTΣ1/2r1000000000e−tTΣ1/2r20000⎤⎥ ⎥ ⎥ ⎥⎦, C= [1010]/√2.

Its transfer function is a diagonal block matrix with elements

 ⎡⎣1√π(erf(tiTΣ1/2rk)−erf(tT0Σ1/2rk))2r1kσ101⎤⎦,

for . Hence, we have that

 y(ti)=1√22∑k=1f1k(t0)+√π(% erf(tiTΣ1/2rk)−erf(tT0Σ1/2rk))2r1kσ1f2k(t0), (29)

where