V-Splines and Bayes Estimate

by   Zhanglong Cao, et al.

Smoothing splines can be thought of as the posterior mean of a Gaussian process regression in a certain limit. By constructing a reproducing kernel Hilbert space with an appropriate inner product, the Bayesian form of the V-spline is derived when the penalty term is a fixed constant instead of a function. An extension to the usual generalized cross-validation formula is utilized to find the optimal V-spline parameters.



There are no comments yet.


page 1

page 2

page 3

page 4


Adaptive Smoothing V-Spline for Trajectory Reconstruction

Trajectory reconstruction is the process of inferring the path of a movi...

Constrained Optimal Smoothing and Bayesian Estimation

In this paper, we extend the correspondence between Bayesian estimation ...

Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Interpolation

It is common to model a deterministic response function, such as the out...

Gaussian Process Regression in the Flat Limit

Gaussian process (GP) regression is a fundamental tool in Bayesian stati...

The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy

The popular cubic smoothing spline estimate of a regression function ari...

Symplectic Gaussian Process Regression of Hamiltonian Flow Maps

We present an approach to construct appropriate and efficient emulators ...

On a phase transition in general order spline regression

In the Gaussian sequence model Y= θ_0 + ε in R^n, we study the fundament...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

A Hilbert space is a real or complex inner product space with respect to the distance function induced by the inner product [Dieudonné, 2013]. In particular, the Hilbert space is the set of square integrable functions , where all functions satisfy


with an inner product .

Consider a regression problem with observations modeled as , , where are i.i.d. Gaussian noise and . The classic nonparametric or semi-parametric regression is a function that minimizes the following penalized sum of squared functional


where the first term is the lack of fit of to the data. The parameter in the second term is a fixed smoothing parameter controlling the trade-off between over-fitting and bias [Hastie et al., 2009]. The minimizer of the above equation resides in an -dimensional space and the computation in multivariate settings is generally of the order [Kim and Gu, 2004]. Schoenberg [1964] shows that a piecewise polynomial smoothing spline of degree provides an aesthetically satisfying method for estimating if

cannot be interpolated exactly by some polynomial of degree less than

. For instance, when , a piecewise cubic smoothing spline provides a powerful tool to estimate the above nonparametric function, in which the penalty term is [Hastie and Tibshirani, 1990].

Further, Kimeldorf and Wahba [1971, 1970] explore the corresponding between smoothing spline and Bayesian estimation. Actually, Wahba [1978] shows that a Bayesian version of this problem is to take a Gaussian process prior on with being a zero-mean Gaussian process whose

th derivative is scaled white noise,

[Speckman and Sun, 2003]. The extended Bayes estimates with a “partially diffuse” prior is exactly the same as the spline solution. Heckman and Woodroofe [1991]

show that if prior distribution of the vector

is unknown but lies in a known class , the estimator is found by minimizing the . Branson et al. [2017]

propose a Gaussian process regression method that acts as a Bayesian analog to local linear regression for sharp regression discontinuity designs. It is no doubt that one of the attractive features of the Bayesian approach is that, in principle, one can solve virtually any statistical decision or inference problem. Particularly, one can provide an accuracy assessment for

using posterior probability regions

[Cox, 1993].

The problem of choosing the smoothing parameter is ubiquitous in curve estimation, and there are two different philosophical approaches to this question. The first one is to regard the free choice of smoothing parameter as an advantageous feature of the procedure. The other one is to find the parameter automatically by the data [Green and Silverman, 1993]. We prefer the latter one, to use data to train our model and to find the best parameters. The most well-known method is cross-validation.

Assuming that mean of the random errors is zero, the true regression curve has the property that, if an observation is taken away at a point , the value is the best predictor of in terms of returning a least value of .

Now, focus on an observation at point as being a new observation by omitting it from the set of data, which are used to estimate . Denote by the estimated function from the remaining data, where is the smoothing parameter. Then is the minimizer of


and can be quantified by the cross-validation score function


The basis idea of the cross-validation is to choose the value of that minimizes [Green and Silverman, 1993].

An efficient way to calculate the cross-validation score is introduced by Green and Silverman [1993]. Because of the value of the smoothing spline depending linearly on the data , define the matrix , which is a map vector of observed values to predicted value . Then we have


Based on the correspondence between nonparametric regression and Bayesian estimation, Craven and Wahba [1978] propose a generalized cross-validation estimate for the minimizer . The estimated is the minimizer of the function where the trace of matrix in (5) is incorporated. It is also possible to establish an optimal convergence property for the estimator when the number of observations in a fixed interval tends to infinity [Wecker and Ansley, 1983]. A highly efficient algorithm to optimize generalized cross-validation and generalized maximum likelihood scores with multiple smoothing parameters via the Newton method was proposed by Gu and Wahba [1991]. This algorithm can also be applied to maximum likelihood estimation and restricted maximum likelihood estimation. The behavior of the optimal regularization parameter in different regularization methods was investigated by Wahba and Wang [1990].

In this paper, we prove that the V-spline, which incorporates both and but penalizes excessive in the penalty term, can be estimated by a Bayesian approach in a certain reproducing kernel Hilbert space. An extended GCV is used to find the optimal parameters for the V-spline.

2 Polynomial Smoothing Splines on as Bayes Estimates

A polynomial smoothing spline of degree is a piecewise polynomial of the same degree on each interval , , and the first derivatives are continuous at the joined points. For instance, when , a piecewise cubic smoothing spline is a special case of the polynomial smoothing spline providing a powerful tool to estimate the objective function (2) in the space , where the penalty term is [Hastie and Tibshirani, 1990, Wang, 1998]. If a general space is equipped with an appropriate inner product, it can be made as a reproducing kernel Hilbert space.

2.1 Polynomial Smoothing Spline

A spline is a numeric function that is piecewise-defined by polynomial functions, which possesses a high degree of smoothness at the places where the polynomial pieces connect (known as knots) [Judd, 1998, Chen, 2017]. Suppose we are given a set of paired data on the interval , satisfying . A piecewise polynomial function can be obtained by dividing the interval into contiguous intervals and represented by a separate polynomial on each interval. For any continuous , it can be represented in a linear combination of basis functions as , where are coefficients [Ellis et al., 2009]. It is just like every vector in a vector space can be represented as a linear combination of basis vectors.

A smoothing polynomial spline is uniquely the smoothest function that achieves a given degree of fidelity to a particular data set [Whittaker, 1922]. In deed, the minimizer of function (2) is the curve estimate over all spline functions with continuous derivatives fitting observed data in the space . In fact, the representer theorem [Kimeldorf and Wahba, 1971] tells us that the function can be represented in the form , a linear combination of a positive-definite real-valued kernel

at each point. This is true for any arbitrary loss function

[Rudin, 2005].

Further, Wahba [1978] proves that if has the prior distribution which is the same as the distribution of the stochastic process on ,


where , is the integrated Wiener process, then the polynomial spline is the minimizer of the objective function (2) having the property that


with , where is expectation over the posterior distribution of with the prior (6). corresponds to the “diffuse” prior on .

2.2 Reproducing Kernel Hilbert Space in

Any has a standard Taylor expansion, which is


where . With an inner product


the representer is


It is easy to prove that is a non-negative reproducing kernel, by which . Additionally, for .

Before moving on to further steps, we are now introducing the following two theorems.

Theorem 1.

[Aronszajn, 1950] Suppose is a symmetric, positive definite kernel on a set . Then, there is a unique Hilbert space of functions on for which is a reproducing kernel.

Theorem 2.

[Gu, 2013] If the reproducing kernel of a space on domain can be decomposed into , where and are both non-negative definite, , for , and , for , then the spaces and corresponding respectively to and

form a tensor sum decomposition of

. Conversely, if and are both nonnegative definite and , then has a reproducing kernel .

According to Theorem 1, the Hilbert space associated with can be constructed as containing all finite linear combinations of the form , and their limits under the norm induced by the inner product . As for Theorem 2, it is easy to verify that corresponds to the space of polynomials with an inner product and corresponds to the orthogonal complement of , that is with an inner product .

Given a set of sampling points, any has the following form


where is a set of basis functions of space and is the representer in [Wang, 2011].

Additionally, the coefficients and might be changed when different and are used, but the function estimate remains the same regardless of the choices of and [Gu, 2013].

2.3 Polynomial Smoothing Splines as Bayes Estimates

Because it is possible to interpret the smoothing spline regression estimator as a Bayes estimate when the mean function is given an improper prior distribution [Wahba, 1990, Berlinet and Thomas-Agnan, 2011]. Therefore, one can find that the posterior mean of on with a vague improper prior is the polynomial smoothing spline of the objective function (2).

Consider on , with and having independent Gaussian priors with zero means and covariances satisfying


where and are from (10

). Because of the observations are normally distributed as

, then the joint distribution for

and is normal with mean of zero and covariance matrix of

where , , and . Consequently, the posterior is


where and . Furthermore, by denoting , Gu [2013] gives that, when , the posterior mean is in the form with coefficient vectors

Theorem 3.

[Gu, 2013] The polynomial smoothing spline of (2) is the posterior mean of , where diffuses in span and has a Gaussian process prior with mean zero and a covariance function


for .

Remark: Equation (12) can be obtained from equation (6) if we assume . Therefore the limit of indicates a diffuse prior for the coefficients .

2.4 Gaussian Process Regression

Gaussian processes are the extension of multivariate Gaussian to infinite-sized collections of real value variables, any finite number of which have a joint Gaussian distribution

[Rasmussen and Williams, 2006]

. Gaussian process regression is a probability distribution over functions. It is fully defined by its mean

and covariance function as


where and are two variables. A function distributed as such is denoted in form of


Usually the mean function is assumed to be zero everywhere.

Given a set of input variables for function and the output with i.i.d. Gaussian noise

of variance

, we can use the above definition to predict the value of the function at a particular input . As the noisy observations becoming


where is a Kronecker delta which is one if and only if and zero otherwise, the joint distribution of the observed outputs and the estimated output according to prior is


The posterior distribution over the predicted value is obtained by conditioning on the observed data




Therefore it can seen that the Bayesian estimation of a smoothing spline is a special format of Gaussian process regression with diffuse prior and the covariance matrix .

3 V-Splines and Bayes Estimate

3.1 V-Splines

In a nonparametric regression, consider paired time series points , , , such that , is the position information and indicates its velocity. As in [Silverman, 1985] and [Donoho et al., 1995], we use a positive penalty function in the following objective function rather than a constant in (3).

Given function and , define the objective function


where is the parameter that weights the residuals between and . We make a simple assumption that is a piecewise constant and adopts a constant value on interval for .

Theorem 4.

For , the objective function is minimized by a cubic spline that is unique and linear outside the knots.

A further minimizer of (26) is named V-spline, coming from the incorporation with velocity information and applications on vehicle and vessel tracking. It is the solution to the objective function (26), where an extra term for and an extra parameter are incorporated. The penalty parameter is a function varying on different domains. If is constant and , the V-spline degenerates to a conventional cubic smoothing spline consisting of a set of given basis functions.

However, the Bayes estimate for a polynomial smoothing spline requires a constant penalty parameter. For this constraint, it is assumed that stays the same on each subinterval in and named the solution “trivial V-spline”. In this section, we still use “V-spline” for sake of simplicity.

3.2 Reproducing Kernel Hilbert Space

The space is a set of functions whose th derivatives are square integrable on the domain . For a V-spline, it only requires . In fact, its second derivative is piecewise linear but is not necessarily continuous at the knots. Besides, if and only if is constant and , the second derivative is piecewise linear and continuous at the knots. Here we are introducing the space

in which the second derivative of any function is not necessarily continuous.

Given a sequence of paired data , the the minimizer of


in the space is a V-spline. Equipped with an appropriate inner product


the space is made a reproducing kernel Hilbert space. In fact, the representer is


It can be seen that , and . The two terms of the reproducing kernel , where


are both non-negative definite themselves.

According to Theorem 2, can correspond the space of polynomials with an inner product , and corresponds the orthogonal complement of


with inner product . Thus, and are two subspaces of the , and the reproducing kernel is .

Define a new notation . Obviously . Additionally, we have , and . Then, for any , we have


It can be seen that the first term , and the space spanned by the second term , denoted as , is a subspace of , and . Given the sample points , in equation (27) and noting that the space


is a closed linear subspace of . Then, we have a new space . Thus, the two new sub spaces in are and .

For any , it can be written as


where and , , are coefficients, and . Thus, by substituting to the equation (27), it can be written as


Because of orthogonality, , , . By denoting that

and noting that , , and , where , the above equation (36) can be written as


Note that only appears in the third term and is minimized at . Hence, a V-spline resides in the space of finite dimension. Thus, the solution to (27) is computed via the minimization of the first three terms in (37) with respect to , and .

3.3 Posterior of Bayes Estimates

In a general process, we know that , where is a covariance matrix. However, we are more interested in given measurements, which is


where is a Gaussian process prior. In fact, the covariance matrix is associated to the inner product .

Observing and , , the joint distribution of and is normal with mean zero and a covariance matrix can be found by the following