In machine learning, Gaussian Processes (GPs) are commonly used modelling tools for Bayesian non-parametric inference (O’Hagan and Kingman, 1978; Neal, 1998; MacKay, 1998; Rasmussen and Williams, 2006; Rasmussen, 2011; Gelman et al., 2013). For instance, in GP regression the aim is to estimate an unknown function $f$ from (noisy) observations $y$. A natural Bayesian way to approach this problem is to place a prior on $f$ and use the observations to compute the posterior of $f$. Since $f$ is a function, the GP is a natural prior distribution for $f$ (MacKay, 1998; Rasmussen and Williams, 2006). A GP is completely defined by its mean function (usually assumed to be zero) and its covariance function (CF), also called kernel. By suitably choosing the kernel function, we can make GPs very flexible and convenient modelling tools. However, a drawback of GPs is that the direct computation of the posterior of $f$ is computationally demanding: the cost is cubic, $O(n^3)$, in the number of observations $n$. This makes GPs unsuitable for Big Data. Several general sparse approximation schemes have been proposed for this problem (see, for instance, Quiñonero-Candela and Rasmussen (2005) and Rasmussen and Williams (2006, Ch. 8)).
When the function $f$ is defined on the real line, computational savings can be made by converting the GP into State Space (SS) form and performing inference using Kalman filtering. Note that this case is particularly important because it includes time-series analysis (the input is time). The connection between GPs and SS models is well known for some basic kernels and has recently gained a lot of interest. Certain classes of stationary CFs can be directly converted into state space models by representing their spectral densities as rational functions (Särkkä and Hartikainen, 2012; Sarkka et al., 2013; Solin and Särkkä, 2014). Moreover, an explicit link between periodic (non-stationary) CFs and SS models has also been derived by Solin and Särkkä (2014).
The connection between the SS representation of GPs and the GP models used in machine learning is important for many reasons. First, in machine learning, it has been shown that GPs, with a suitable choice of the kernel, are universal function approximators (Rasmussen and Williams, 2006; Williams, 1997). Moreover, GPs can also be used for classification, with only a slight modification. Second, inferences in SS models can be computed efficiently by processing the observations sequentially. This means that the computational cost of inference in the SS representation of GPs is $O(n)$. Third, SS models represent GPs through Stochastic Differential Equations (SDEs). They return a model that directly explains the time-series and not only fits or predicts it. It is well known that SDEs are basic modelling tools in econometrics, physics, etc. Hence, if we are able to map the most important kernels used in machine learning to the SS representation we can “kill three birds with one stone”, i.e., we can have an explanatory model with a universal function approximation property at a cost of $O(n)$. This is the aim of this paper. In particular, the goal is to extend the work of Särkkä and Hartikainen (2012) and Sarkka et al. (2013) by providing an SS representation of the most important kernels used in machine learning.
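As an illustration of the sequential processing mentioned above, the following minimal sketch computes a GP log marginal likelihood with a scalar Kalman filter, in a single pass over the observations. It is not the general model of this paper: it assumes the exponential kernel $k(\tau) = \sigma^2 e^{-\lambda |\tau|}$, whose SS form is a scalar Ornstein-Uhlenbeck process, and Gaussian observation noise with variance `r`. The function name `ou_kalman_loglik` and all parameter names are hypothetical.

```python
import math

def ou_kalman_loglik(ts, ys, lam, sigma2, r):
    """Log marginal likelihood of y_k = f(t_k) + noise, one pass over the data.

    Assumes f has the CF sigma2 * exp(-lam * |tau|) and noise variance r.
    ts must be sorted in increasing order.
    """
    m, p = 0.0, sigma2            # stationary prior mean and variance
    loglik = 0.0
    prev_t = ts[0]
    for t, y in zip(ts, ys):
        # Predict: exact discretisation of the scalar SDE over the time gap.
        a = math.exp(-lam * (t - prev_t))
        m, p = a * m, a * a * p + sigma2 * (1.0 - a * a)
        # Accumulate the Gaussian predictive log-density of the new observation.
        s = p + r
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + (y - m) ** 2 / s)
        # Update: condition the state on the observation.
        gain = p / s
        m, p = m + gain * (y - m), (1.0 - gain) * p
        prev_t = t
    return loglik
```

Each observation costs a constant number of operations, so the total cost grows linearly with the number of observations, whereas the direct computation requires factorising the full covariance matrix.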
In particular, we will show that non-stationary kernels can be mapped into SS models by considering the transient behaviour. It is well known that the time response of linear SS models is always the superposition of an initial-condition part and a driven part. The response due to initial conditions is often ignored, because it vanishes in the stationary case (it is transient). However, for non-stationary systems, the transient never vanishes and, thus, it determines the behaviour of the system. Even with a zero initial condition, we can have a transient behaviour due to the driven part. We will show that, by taking into account the transient, we can map the linear regression, periodic and spline kernels to SS models. Moreover, we will also study the transient behaviour of stationary systems to show that in this case it vanishes. To reconcile these two cases, we will make use of the Laplace transform, which is able to account for the transient behaviour. This differs from the work of Särkkä and Hartikainen (2012) and Sarkka et al. (2013), who employed the Fourier transform. Then we will show how to map the neural network kernels to SS models. For this purpose we will use linear time-variant SS models, which are intrinsically non-stationary. Finally, by means of simulations we will show the effectiveness of the proposed approach and its computational advantages by applying it to long time-series. In this work, for lack of space, we will assume that the reader is familiar with the machine learning representation of GPs and we will only discuss the SS representation.
2. State Space model
Let us consider the following stochastic linear time-variant (LTV) state space model (Jazwinski, 2007)
where is the (stochastic) state vector, is the observation at time , is a one-dimensional Wiener process with intensity and are known time-variant matrices of appropriate dimensions. We further assume that the initial state and are independent for each . It is well known that the solution of the stochastic differential equation in (1) is (see, for instance, Jazwinski (2007)):
where is the state transition matrix, which is obtained as a matrix exponential. (The matrix exponential of a square matrix $A$ is $e^{A} = \sum_{k=0}^{\infty} A^{k}/k!$.)
Assume that , then the vector of observations
is Gaussian distributed with zero mean and covariance matrix whose elements are given by:
where we have exploited the fact that and defined ( is called impulse response).
The proof of this proposition is well known (see, for instance, Jazwinski (2007)), but we have reported the derivations of this proposition (and of the next propositions/theorems) in the appendix for the convenience of the reader. From (3), it is evident that, given the time-varying matrices , the CF of an LTV system is completely defined by: (i) the covariance of the initial condition ; (ii) the CF of the noise .
In case the SS model is Linear Time-Invariant (LTI), i.e., , we can use the Laplace transform to derive (2) and (3) using only algebraic computations. The Laplace transform of a function , defined for , is:
where the parameter is the complex number , with and denoting the imaginary unit. The Laplace transform exists provided that the above integral is finite. The values of for which the Laplace transform exists are called the Region Of Convergence (ROC) of the Laplace transform. By using the Laplace transform, we can rewrite the differential equation in (1) in an algebraic form:
where are the Laplace transforms of . (By defining the Laplace transform (4), we are improperly treating as a deterministic input. We use this notation only for convenience, and then define the correct inverse Laplace transform in (6).) Since , we have
where is called the transfer function of the linear time-invariant (LTI) SS model and
is the identity matrix. Since a product in the Laplace domain corresponds to a convolution in time, it follows that
where denotes the inverse Laplace transform. By computing , we obtain again (3). The output of both LTV and LTI systems is clearly completely defined by the SS matrices, the initial condition and the stochastic forcing term . The aim of the next sections is to show that by suitably choosing these three components we can obtain SS models whose CF coincides with the main kernels used in GPs.
3. Non-stationary CFs defined by LTI SS without forcing term
In this section, we will show that two important CFs used in GPs can be obtained by two LTI SS models without stochastic forcing term. Their output is therefore completely determined by the initial conditions and, thus, the CF they define is non-stationary (it depends on ).
3.1. Linear regression kernel
Assume without loss of generality that , and so there is no forcing term (). (Since , the matrix is completely superfluous. We have introduced it only because we will use this model later.) This corresponds to the following LTI SS model:
From the equations of this SS model, since we derive that (it is the function of interest) and is its derivative. By computing , we have that
and, therefore, .
This connection between SS models and linear regression is well known; here we have shown how to derive the CF. Higher order linear regression CFs can be obtained by considering and
where is the zero vector.
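The derivation of the linear regression CF can be checked numerically with a short sketch. It assumes a two-dimensional state made of the function and its derivative, a system matrix `[[0, 1], [0, 0]]`, an observation matrix `[1, 0]`, a zero initial time and no forcing term; the names `transition`, `linear_cf` and `P0` (the initial-state covariance) are hypothetical. Because the system matrix is nilpotent, its matrix exponential is exactly `[[1, t], [0, 1]]`.

```python
def transition(t):
    """State transition matrix exp(F t) for F = [[0, 1], [0, 0]].

    F is nilpotent (F @ F = 0), so the exponential series stops at I + F t.
    """
    return [[1.0, t], [0.0, 1.0]]

def linear_cf(t1, t2, P0):
    """Cov[y(t1), y(t2)] when only the random initial state drives the output.

    With C = [1, 0], the product C exp(F t) is the first row of exp(F t).
    """
    row1 = transition(t1)[0]
    row2 = transition(t2)[0]
    return sum(row1[i] * P0[i][j] * row2[j]
               for i in range(2) for j in range(2))

# With a diagonal P0 = diag(a, b) the CF reduces to a + b * t1 * t2.
P0 = [[2.0, 0.0], [0.0, 3.0]]
value = linear_cf(1.5, 4.0, P0)
```

With a diagonal initial covariance, the result is the familiar "constant plus product" form of the linear regression kernel; off-diagonal entries of `P0` add a term proportional to `t1 + t2`.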
3.2. Periodic Kernel
Consider the following LTI SS:
with for all . By computing , we have that
where we have written and .
This SS model defines a periodic CF. However, it is evident (see in particular (11)) that it can only represent sinusoidal-type periodic functions. On the other hand, we know that any periodic function in , that is integrable, can be approximated by a Fourier series:
where are the coefficients of the Fourier series, which depend on , and is the order of the approximation. (We can also include the constant term () in the series.) Therefore, we can approximate any periodic function by a sum of SS models of type (10) with . For instance, for , we can consider the SS model
Its transfer function is a diagonal block matrix with blocks
for . Hence, from (3), we have that
where we have assumed that is a block diagonal matrix with blocks
for . In a Fourier series, the function is known and so the coefficients can be computed based on . In GP regression, we do not know and so we do not know ; we must estimate these coefficients from data. (When the period is unknown, we can estimate it from data.) This is the reason we have assumed a prior distribution on these coefficients. This prior is completely defined by the variances for . If we further assume that then we have only parameters to specify, the . We can further assume that all the parameters are functions of a single parameter , and penalize high-order frequencies. For instance, if we choose , it can be shown that
which is the periodic CF used in GPs (Rasmussen and Williams, 2006, Sec. 4.2.3). This result has been derived by Solin and Särkkä (2014); here we have highlighted more extensively the transient analysis and the connection with the Fourier series.
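The convergence of the sum of harmonic CFs to the periodic CF can be illustrated numerically. The sketch below assumes the periodic kernel in the form $\exp(-2\sin^2(\omega\tau/2)/\ell^2)$ and uses the standard identity $e^{a\cos\theta} = I_0(a) + 2\sum_{j\ge 1} I_j(a)\cos(j\theta)$, where $I_j$ is the modified Bessel function of the first kind. Whether these are exactly the coefficients chosen in the text is an assumption; the function names are hypothetical. Each term of the series is the CF of one harmonic block whose initial state has an isotropic covariance equal to the corresponding coefficient.

```python
import math

def bessel_i(j, a, terms=40):
    """Modified Bessel function of the first kind I_j(a), via its power series."""
    return sum((a / 2.0) ** (2 * m + j)
               / (math.factorial(m) * math.factorial(m + j))
               for m in range(terms))

def periodic_cf_series(tau, omega, ell, order):
    """Truncated sum of harmonic CFs: q_0^2 + sum_j q_j^2 cos(j * omega * tau)."""
    a = 1.0 / ell ** 2
    total = bessel_i(0, a)
    for j in range(1, order + 1):
        total += 2.0 * bessel_i(j, a) * math.cos(j * omega * tau)
    return math.exp(-a) * total

def periodic_cf_exact(tau, omega, ell):
    """Periodic kernel exp(-2 sin^2(omega * tau / 2) / ell^2)."""
    return math.exp(-2.0 * math.sin(omega * tau / 2.0) ** 2 / ell ** 2)
```

The coefficients decay quickly with the harmonic index when the length-scale is not too small, so a handful of harmonic blocks is usually enough; smaller length-scales require a higher order.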
4. Non-stationary CFs defined by LTI SS with zero initial conditions
In this section, we will show that an important CF used in GPs can be obtained by an LTI SS model with a stochastic forcing term and zero initial conditions.
4.1. Spline kernel
We derive the CF for the cubic smoothing splines. Consider the LTI SS model in (7), but this time assume that , and , then
and so, from (2), .
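As a numerical check of the spline case, the sketch below assumes that the model is a doubly integrated white noise started at time zero with zero initial state, so that the impulse response is $h(t) = t$ and the CF is the noise intensity times $\int_0^{\min(t,t')} (t-u)(t'-u)\,du$. The closed form used for comparison is the cubic spline kernel $\min^3/3 + |t-t'|\min^2/2$. The function names and the intensity parameter `q` are hypothetical.

```python
def spline_cf_closed_form(t1, t2, q=1.0):
    """Cubic spline CF: q * (m^3 / 3 + |t1 - t2| * m^2 / 2), with m = min(t1, t2)."""
    m = min(t1, t2)
    return q * (m ** 3 / 3.0 + abs(t1 - t2) * m ** 2 / 2.0)

def spline_cf_numeric(t1, t2, q=1.0, steps=2000):
    """Driven-part CF q * integral_0^min (t1 - u)(t2 - u) du, by the midpoint rule."""
    m = min(t1, t2)
    du = m / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du          # midpoint of the i-th sub-interval
        total += (t1 - u) * (t2 - u)
    return q * du * total
```

The CF depends on both time arguments and not only on their difference, which is the non-stationarity produced by the driven part alone, with no contribution from the initial condition.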
5. Stationary CFs defined by LTI SS models
A stationary CF is a CF that only depends on . In this section we will present SS models whose CF corresponds to stationary kernels used in GPs. Since stationary CFs satisfy for , i.e., they are even functions defined on , it is convenient to introduce the bilateral Laplace transform:
that is defined for functions that take values in . When the ROC includes the imaginary axis, then for , the bilateral Laplace transform reduces to the Fourier transform. Assume that the CF is stationary. The Wiener-Khintchine theorem (Chatfield, 2013) states that the CF is completely defined by its Fourier transform (its bilateral Laplace transform computed for ) (when it exists) and vice versa:
where is the Fourier transform of also called the spectral density.
Theorem 1 (Representation theorem).
Define and assume that is a (proper) rational function of . Then there exists a stable LTI system with the impulse response such that
where is a stationary process with uncorrelated increments and spectral density . In the Laplace domain this can be written as
where is a rational function whose denominator has all roots with negative real parts, has no roots with positive real part and is a positive real constant. (Since is a real even function, the zeroes and poles are symmetric w.r.t. the real axis and mirrored in the imaginary axis.)
This is a standard result for LTI systems. By interpreting as the bilateral Laplace transform of the noise (), Equation (19) relates the spectral function of the output of an SS model described by the transfer function with that of the input (the noise ) through the transfer function of the SS model . We can use (19) to derive the SS model that is related to . These are the steps: (i) find the zeroes and poles of ; (ii) take all the poles with negative real part and the zeroes with non-positive real part; (iii) decompose as
and derive the (observable canonical) LTI SS model:
The above LTI system has CF .
5.1. Matérn kernel for
Let us consider again the stationary Matérn kernel for , i.e., for (Rasmussen and Williams, 2006, Sec. 4.2.1). The bilateral Laplace transform of is
. Its impulse response is for .
It can be observed that for , the second term goes to zero and we obtain the Matérn CFs for and . Other Matérn CFs for can be obtained via LTI SS models in a similar way. This connection between LTI SS models and the Matérn kernel was first discussed in Sarkka et al. (2013). Here, we have reported the CF for finite (non-stationary case) and shown that the CF becomes stationary for .
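The vanishing transient can be made concrete in the simplest case. The sketch below assumes the exponential (Matérn with smoothness 1/2) kernel $\sigma^2 e^{-\lambda|\tau|}$, realised by the scalar model $dx = -\lambda x\,dt + dw$ with noise intensity $2\lambda\sigma^2$, started at time zero with initial variance `p0`. The function name `ou_cf` and the parameter names are hypothetical.

```python
import math

def ou_cf(t1, t2, lam, sigma2, p0=0.0):
    """CF of dx = -lam * x dt + dw started at time 0 with Var[x(0)] = p0.

    The noise intensity is assumed to be 2 * lam * sigma2.
    """
    # Initial-condition part: decays as both times grow.
    initial = p0 * math.exp(-lam * (t1 + t2))
    # Driven part: a stationary term minus a transient term.
    driven = sigma2 * (math.exp(-lam * abs(t1 - t2))
                       - math.exp(-lam * (t1 + t2)))
    return initial + driven

stationary = 2.0 * math.exp(-1.0)             # sigma2 * exp(-lam * 1), sigma2 = 2
late = ou_cf(50.0, 51.0, lam=1.0, sigma2=2.0)  # transient has died out
early = ou_cf(0.1, 1.1, lam=1.0, sigma2=2.0)   # transient still visible
```

For large times the transient term is negligible and the CF depends only on the time difference. Choosing `p0` equal to `sigma2` cancels the transient exactly, which corresponds to starting the process in its stationary regime.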
5.2. Square Exponential kernel
Let us now consider the stationary squared exponential kernel for (Rasmussen and Williams, 2006, Sec. 4.2.1). Its bilateral Laplace transform is , whose ROC is and, hence, . is not a rational function, so we cannot directly apply Theorem 1. However, it is a real-valued positive function (it is positive and non-zero) and it is analytic, so we can approximate with its Taylor expansion around zero and obtain:
We can find by determining the roots of the polynomial at the denominator that have negative real part. This can be done numerically, but the next result is useful.
The roots of the denominator in (22) can be obtained by computing the roots for and then dividing them by . This gives , while .
This result is useful because it allows us to compute off-line the solutions of (22) even when the hyperparameter is unknown. Let us consider for instance the case
The roots of the denominator for are . (By dividing them by we obtain the roots for .) Hence, the roots corresponding to the stable part are and so
with and , which can be represented by the following LTI SS model:
with . The inverse Laplace transform of is .
This connection between LTI SS models and squared exponential kernels has been derived in (Sarkka et al., 2013), and Sarkka and Piché (2014) have studied the convergence properties of the Taylor series. Here, we have derived the new result in Theorem 2, computed the CF for a finite (non-stationary case) and shown that the CF becomes stationary for (see the appendix for the proof).
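The root-scaling property can be illustrated with a small sketch. It assumes a spectral density proportional to $\exp(-\ell^2\omega^2/2)$ and a second-order Taylor expansion of its reciprocal, so that, after substituting $\omega^2 = -s^2$, the denominator is $1 - \ell^2 s^2/2 + \ell^4 s^4/8$. Both the parametrisation and the order are assumptions made for illustration, and the function names are hypothetical. Writing $z = \ell^2 s^2/2$, the denominator becomes $1 - z + z^2/2$, whose roots are $z = 1 \pm i$.

```python
import cmath

def se_denominator(s, ell):
    """Second-order Taylor denominator 1 - z + z^2 / 2 with z = (ell * s)^2 / 2."""
    z = (ell * s) ** 2 / 2.0
    return 1.0 - z + z ** 2 / 2.0

def se_stable_roots(ell):
    """Roots of the denominator with negative real part (poles of the stable factor)."""
    roots = []
    for z in (1 + 1j, 1 - 1j):
        # cmath.sqrt returns the root with non-negative real part.
        r = cmath.sqrt(2.0 * z) / ell
        roots.append(-r if r.real > 0 else r)
    return roots

unit_roots = se_stable_roots(1.0)
scaled_roots = se_stable_roots(2.0)
```

Since the denominator depends on the product of the length-scale and the Laplace variable only, the roots for a given length-scale are the unit-length-scale roots divided by that length-scale. They can therefore be computed once and rescaled whenever the hyperparameter changes.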
6. LTV systems
Up to now, we have only worked with LTI SS models. Hereafter, we will show that moving from LTI to LTV allows us to map two fundamental non-stationary kernels (Williams, 1997) to SS models.
6.1. Non-stationary SE kernel
Let us consider this CF:
where are hyper-parameters. This CF is clearly non-stationary. However, its central term is the squared exponential CF and, therefore, it can be approximated by an LTV SS model that is a simple variant of the LTI SS model that defines the squared exponential CF. For instance, for this LTV SS model is:
The impulse response is for and the CF
This is the approximation. For , the integral term depends only on . This CF is used in neural networks research and the connection with GPs has been discussed by Williams (1997).
6.2. LTV system defining the neural network kernel
Let us consider the following LTV SS model
with , , and is a positive definite matrix. Its transfer function is
Hence, we have that
where we have assumed that is Gaussian distributed with zero mean and covariance matrix . Therefore, from (27), it is evident that (26) models erf-like functions. We can add expressivity to the model as follows
Its transfer function is a diagonal block matrix with elements
for . Hence, we have that