# Propriety of the reference posterior distribution in Gaussian Process regression

In a seminal article, Berger, De Oliveira and Sansó (2001) compare several objective prior distributions for the parameters of Gaussian Process regression models with isotropic correlation kernel. The reference prior distribution stands out among them insofar as it always leads to a proper posterior. They prove this result for rough correlation kernels: Spherical, Exponential with power q < 2, Matérn with smoothness ν < 1. This paper provides a proof for smooth correlation kernels: Exponential with power q = 2, Matérn with smoothness ν ⩾ 1, Rational Quadratic.

## Authors

Joseph Muré

## 1 Introduction

In a very influential paper, BDOS01 pioneered the field of Objective Bayesian analysis of spatial models. Previous works [DOKS97, Ste99] had noted that commonly used noninformative priors sometimes failed to yield proper posteriors, but BDOS01 were the first to thoroughly investigate the issue. Among several prior distributions – truncated priors, vague priors, the Jeffreys-rule prior and the independence Jeffreys prior – they showed that the reference prior [Ber05] is the most satisfying choice for a default prior distribution. This is due in no small part to the fact that, in the wide variety of cases studied by BDOS01, it systematically yields a proper posterior distribution. In this article, we complete their proof of this property.

Section 2 describes the Gaussian Process models studied by BDOS01. Section 3 shows that the proof of reference posterior propriety provided by BDOS01 only applies to those with rough correlation kernels – Spherical, Exponential with power q < 2, Matérn with smoothness ν < 1. Section 4 contains the core of this paper: a proof of Theorem 9, which asserts that the reference prior leads to a proper posterior for models with smoother correlation kernels – Exponential with power q = 2, Matérn with smoothness ν ⩾ 1, Rational Quadratic.

The rest of the introduction illustrates the significance of the reference prior yielding a proper posterior.

For smooth one-dimensional parametric families, the reference prior coincides with the Jeffreys-rule prior [CB94]. For finite-dimensional smooth parametric families, the reference prior algorithm requires the user to define groups of dimensions of the parameter and rank them. The reference prior is then defined iteratively:

1. Compute the Jeffreys-rule prior on the lowest-ranking group of dimensions conditionally to all others.

2. Average the likelihood function over this prior.

3. Compute the Jeffreys-rule prior (based on the integrated likelihood function) on the second-lowest-ranking group of dimensions conditionally to all higher-ranking dimensions.

4. Average the integrated likelihood function over this second prior.

5. Continue the process until the Jeffreys-rule prior on the highest-ranking group of dimensions has been computed.

6. The reference prior is defined as the product of all successively computed priors.
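In the two-group case relevant here, with a lower-ranked group θ₁ and a higher-ranked group θ₂, the steps above can be sketched as follows. The Fisher-information notation I₁₁ and Ĩ is ours, introduced for this illustration only:

```latex
% Step 1: conditional Jeffreys-rule prior on the lower-ranked group
\pi(\theta_1 \mid \theta_2) \propto \sqrt{\det I_{11}(\theta_1, \theta_2)}
% Step 2: average the likelihood over this prior
\tilde{L}(y \mid \theta_2) = \int L(y \mid \theta_1, \theta_2)\,
    \pi(\theta_1 \mid \theta_2)\, \mathrm{d}\theta_1
% Steps 3--5: Jeffreys-rule prior based on the integrated likelihood
\pi(\theta_2) \propto \sqrt{\det \tilde{I}(\theta_2)}
% Step 6: the reference prior is the product of the computed priors
\pi(\theta_1, \theta_2) = \pi(\theta_1 \mid \theta_2)\, \pi(\theta_2)
```

When the conditional Jeffreys-rule prior is improper, the averaging in Step 2 requires care; this is the difficulty discussed below.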

In the cases studied by BDOS01, the dimensions of the parameter can be put in two natural categories: “location” and “covariance”, the latter being higher-ranking because if they were known, the model would be trivialized.

The process entails, however, a significant difficulty, which arises because the successively computed Jeffreys-rule priors are often improper. We do not consider this difficulty here, but it is touched upon in BDOS01 and more thoroughly discussed in RSH12. It can be avoided by using “asymptotic marginalization” [BB92] instead of “exact marginalization”, but BDOS01 found that the resulting “approximate” reference prior, unlike the “true” reference prior, does not always yield a proper posterior in the studied cases.

Because it is so difficult to obtain a satisfying default prior distribution which consistently yields a proper posterior, it is important to ascertain that the reference prior actually does. Indeed, a vast literature [paulo05, RSH12, KP12, RSS13, GWB18] builds upon BDOS01’s result and depends on it.

## 2 Setting

BDOS01 consider models of Gaussian Process regression, also known as Universal Kriging, with isotropic autocorrelation kernels. Because isotropy is key, define ‖·‖ as the usual Euclidean norm if applied to a vector and as the Frobenius norm if applied to a matrix.

In Universal Kriging, an unknown mapping from a spatial domain D (D ⊂ ℝ^r for some positive integer r) to ℝ is assumed to be a realization of a Gaussian process Y. The mean function m of the Gaussian process is assumed to belong to some known vector space F of dimension p. If p is non-zero, once a basis (f_1, …, f_p) of F has been set, m can be parametrized by β ∈ ℝ^p such that m = ∑_{j=1}^{p} β_j f_j.

Y − m is assumed in the model to be an isotropic Gaussian process based on an autocorrelation kernel K_θ parametrized by θ > 0. K_θ is a mapping such that for any positive integer n and any collection of distinct points x^(1), …, x^(n) within D, the symmetric matrix with (i, j)-th element K_θ(‖x^(i) − x^(j)‖) is a positive definite correlation matrix. Necessarily, K_θ(0) = 1.

The autocovariance function of the Gaussian process Y is (x, x′) ↦ σ² K_θ(‖x − x′‖), where K_θ is the autocorrelation kernel parametrized by θ, making σ² the variance of Y(x) for all x ∈ D.

Fix a positive integer n and a collection of distinct points x^(1), …, x^(n) ∈ D. Let this collection be the design set, i.e. the set of points where Y is observed. (Y(x^(1)), …, Y(x^(n)))⊤ is a Gaussian vector with mean vector (m(x^(1)), …, m(x^(n)))⊤ and covariance matrix σ²Σ_θ, where Σ_θ denotes the n × n matrix with (i, j)-th element K_θ(‖x^(i) − x^(j)‖). Table 1 provides the definition of several correlation kernels.

If p is non-zero, let H denote the n × p matrix with (i, j)-th element f_j(x^(i)). [Note: if p = 0, then we adopt the convention that any term involving H can be ignored.] Then (m(x^(1)), …, m(x^(n)))⊤ = Hβ. Denote by y the observed value of the random vector (Y(x^(1)), …, Y(x^(n)))⊤. The likelihood function of the parameter triplet (β, σ², θ) has the following expression:

$$L(y \mid \beta, \sigma^2, \theta) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{n}{2}} \left|\Sigma_\theta\right|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma^2}\,(y - H\beta)^\top \Sigma_\theta^{-1} (y - H\beta)\right\}. \tag{1}$$
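As an illustration, the log of the likelihood (1) can be evaluated numerically as follows. This sketch is ours, not the paper's; the design set, mean basis and parameter values are arbitrary assumptions:

```python
import numpy as np

def log_likelihood(y, H, beta, sigma2, Sigma_theta):
    """Log of the Universal Kriging likelihood in Equation (1)."""
    n = len(y)
    r = y - H @ beta                              # residual y - H beta
    _, logdet = np.linalg.slogdet(Sigma_theta)    # log |Sigma_theta|
    quad = r @ np.linalg.solve(Sigma_theta, r)    # r^T Sigma_theta^{-1} r
    return -0.5 * (n * np.log(2 * np.pi * sigma2) + logdet + quad / sigma2)

# Toy design set in R^1, constant mean function, Squared Exponential kernel.
x = np.array([0.0, 0.5, 1.0])
Sigma = np.exp(-((x[:, None] - x[None, :]) / 1.0) ** 2)
H = np.ones((3, 1))
ll = log_likelihood(np.array([0.1, -0.2, 0.3]), H, np.array([0.0]), 1.0, Sigma)
```

For this configuration the quadratic form is positive and the correlation matrix is well conditioned, so `ll` is a finite negative number.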

In order for the model to be identifiable, assume that p < n and that H has rank p.

BDOS01 derive the reference prior corresponding to the parameter ordering where (β, σ²) is grouped together and ranked below θ [if p = 0, β is meaningless, so σ² alone is ranked below θ]. One can see [RSH12] that the reference prior corresponding to the ordering where β and σ² are placed in separate groups is the same.

To express it conveniently, denote by Q_θ the matrix I_n − H(H⊤Σ_θ⁻¹H)⁻¹H⊤Σ_θ⁻¹ [if p = 0, Q_θ = I_n]. Also fix W, an n × (n − p) matrix such that W⊤W = I_{n−p} and H⊤W is the null matrix: W's columns form an orthonormal basis of the orthogonal complement of the subspace of ℝⁿ spanned by the columns of H [if p = 0, fix W as an orthogonal matrix, for instance W = I_n].

###### Proposition 1.

The reference prior with this ordering is π(β, σ², θ) ∝ π(θ)/σ², where

$$\pi(\theta) \propto \sqrt{\operatorname{Tr}\left[\left\{\left(\frac{d}{d\theta}\Sigma_\theta\right)\Sigma_\theta^{-1} Q_\theta\right\}^2\right] - \frac{1}{n-p}\left[\operatorname{Tr}\left\{\left(\frac{d}{d\theta}\Sigma_\theta\right)\Sigma_\theta^{-1} Q_\theta\right\}\right]^2}. \tag{2}$$

Denoting by Σ^W_θ the matrix W⊤Σ_θW, π(θ) can also be written as:

$$\pi(\theta) \propto \sqrt{\operatorname{Tr}\left[\left\{\left(\frac{d}{d\theta}\Sigma^W_\theta\right)(\Sigma^W_\theta)^{-1}\right\}^2\right] - \frac{1}{n-p}\left[\operatorname{Tr}\left\{\left(\frac{d}{d\theta}\Sigma^W_\theta\right)(\Sigma^W_\theta)^{-1}\right\}\right]^2}. \tag{3}$$
###### Proof.

The first assertion is from RSH12. The second is a consequence of Lemma 10. ∎
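The agreement between expressions (2) and (3) can be checked numerically. The sketch below is our own illustration, not the paper's: it uses a Squared Exponential kernel on a one-dimensional design set, a finite-difference approximation of dΣ_θ/dθ, and a QR factorization to build W; the quantity computed is the term under each square root, up to the common proportionality constant.

```python
import numpy as np

def corr(x, theta):
    """Squared Exponential correlation matrix of the design set x."""
    d = x[:, None] - x[None, :]
    return np.exp(-((d / theta) ** 2))

def prior_terms(x, H, theta, eps=1e-6):
    """Terms under the square roots of (2) and (3), up to a constant."""
    n, p = H.shape
    S = corr(x, theta)
    dS = (corr(x, theta + eps) - corr(x, theta - eps)) / (2 * eps)
    Sinv = np.linalg.inv(S)
    Q = np.eye(n) - H @ np.linalg.inv(H.T @ Sinv @ H) @ H.T @ Sinv
    M2 = dS @ Sinv @ Q
    val2 = np.trace(M2 @ M2) - np.trace(M2) ** 2 / (n - p)   # Equation (2)
    # W: orthonormal basis of the orthogonal complement of span(H)
    W = np.linalg.qr(H, mode="complete")[0][:, p:]
    M3 = (W.T @ dS @ W) @ np.linalg.inv(W.T @ S @ W)
    val3 = np.trace(M3 @ M3) - np.trace(M3) ** 2 / (n - p)   # Equation (3)
    return val2, val3

x = np.array([0.0, 0.3, 0.7, 1.2, 2.0])
H = np.ones((5, 1))
v2, v3 = prior_terms(x, H, theta=0.8)
print(np.isclose(v2, v3))  # True: the two expressions agree
```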

###### Proposition 2.

If p ⩾ 1, after marginalizing β and σ² out, we have

$$L(y \mid \theta) = \iint \frac{L(y \mid \beta, \sigma^2, \theta)}{\sigma^2}\, d\beta\, d\sigma^2 = \left(\frac{2\pi^{\frac{n-p}{2}}}{\Gamma\left(\frac{n-p}{2}\right)}\right)^{-1} \frac{\left|\Sigma_\theta^{-1}\right|^{\frac{1}{2}} \left|H^\top \Sigma_\theta^{-1} H\right|^{-\frac{1}{2}}}{\left(y^\top \Sigma_\theta^{-1} Q_\theta\, y\right)^{\frac{n-p}{2}}}. \tag{4}$$

Alternatively, the integrated likelihood with p ⩾ 1 can also be written

$$L(y \mid \theta) = \left(\frac{2\pi^{\frac{n-p}{2}}}{\Gamma\left(\frac{n-p}{2}\right)}\right)^{-1} \frac{\left|H^\top H\right|^{\frac{1}{2}} \left|W^\top \Sigma_\theta W\right|^{-\frac{1}{2}}}{\left(y^\top W (W^\top \Sigma_\theta W)^{-1} W^\top y\right)^{\frac{n-p}{2}}}. \tag{5}$$

If p = 0, the integrated likelihood is simply

$$L(y \mid \theta) = \left(\frac{2\pi^{\frac{n}{2}}}{\Gamma\left(\frac{n}{2}\right)}\right)^{-1} \frac{\left|\Sigma_\theta^{-1}\right|^{\frac{1}{2}}}{\left(y^\top \Sigma_\theta^{-1} y\right)^{\frac{n}{2}}}. \tag{6}$$
###### Proof.

The result for p = 0 and the first result for p ⩾ 1 are from BDOS01. Lemma 10 yields that

$$W(W^\top \Sigma_\theta W)^{-1} W^\top = \Sigma_\theta^{-1} Q_\theta. \tag{7}$$

So all that remains to be proved is that |Σ_θ| |H⊤Σ_θ⁻¹H| = |H⊤H| |W⊤Σ_θW|. Choose P, an n × p matrix with columns forming an orthonormal basis of the subspace of ℝⁿ spanned by the columns of H. (P W) is therefore an orthogonal matrix, so |Σ_θ| = |(P W)⊤Σ_θ(P W)|. Using Schur's complement, we have

$$|\Sigma_\theta| = \left|W^\top \Sigma_\theta W\right| \left|P^\top \Sigma_\theta \left(I_n - W(W^\top \Sigma_\theta W)^{-1} W^\top \Sigma_\theta\right) P\right|. \tag{8}$$

Lemma 10 again yields the result. ∎

## 3 Smoothness of the correlation kernel

Lemma 2 of BDOS01 requires that the correlation kernel and the design set should be such that Σ_θ = 1_n 1_n⊤ + v(θ){D + R(θ)}, where 1_n is the vector with entries all equal to 1, v is a real-valued function such that v(θ) → 0 when θ → +∞, D is a fixed nonsingular matrix and R is a mapping from (0, +∞) to the set of real n × n matrices such that ‖R(θ)‖ → 0 when θ → +∞.

What makes this assumption restrictive is the condition that D should be nonsingular, because it holds for rough correlation kernels only. For instance, as was noted by paulo05, it does not hold for the Squared Exponential correlation kernel.

For a given correlation kernel K_θ, D is typically proportional to the matrix with entries ‖x^(i) − x^(j)‖^q, where q depends on the smoothness of the correlation kernel but should in any case belong to the interval (0, 2]. This is because 1 − K_θ(‖x^(i) − x^(j)‖) is equivalent to a constant times v(θ)‖x^(i) − x^(j)‖^q when θ → +∞.

Sch37 gives the following result (Theorem 4 in the original paper):

###### Theorem 3.

If 0 < q < 2, the quadratic form ∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j ‖x^(i) − x^(j)‖^q is nonsingular and its canonical representation contains one positive and n − 1 negative squares.

This means that if the correlation kernel is rough enough to have q < 2, the assumption that D is nonsingular is reasonable.

###### Corollary 4.

The matrix with entries ‖x^(i) − x^(j)‖^q with q ∈ (0, 2) is nonsingular and has one positive eigenvalue and n − 1 negative eigenvalues.

The picture is dramatically different when the correlation kernel is smooth enough to have q = 2. This happens as soon as K_θ is twice continuously differentiable. Gow85's Theorem 6 implies the following results:

###### Theorem 5.

If r′ is the dimension of the smallest Euclidean subspace containing all points in the design set, then the matrix with entries ‖x^(i) − x^(j)‖² has rank:

1. r′ + 1 (one positive eigenvalue, r′ negative eigenvalues, any other eigenvalue null) if all points in the design set lie on the surface of a hypersphere of this subspace;

2. r′ + 2 (one positive eigenvalue, r′ + 1 negative eigenvalues, any other eigenvalue null) otherwise.

###### Corollary 6.

The matrix with entries ‖x^(i) − x^(j)‖² has rank lower than or equal to r + 2.

For all practical purposes, n is much greater than r + 2, so the matrix with entries ‖x^(i) − x^(j)‖² is singular when q = 2.
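Theorem 5 and Corollary 6 are easy to check numerically. The following sketch (ours, with an arbitrary random design set) draws n = 8 points in general position in the plane, so that r′ = 2 and the points almost surely do not lie on a common circle; the squared-distance matrix then has rank r′ + 2 = 4 with exactly one positive eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = rng.standard_normal((n, 2))                       # design set in R^2
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # entries ||x_i - x_j||^2
eig = np.linalg.eigvalsh(D2)
tol = 1e-9 * np.abs(eig).max()
rank = int((np.abs(eig) > tol).sum())
n_pos = int((eig > tol).sum())
print(rank, n_pos)  # rank r' + 2 = 4 with one positive eigenvalue (a.s.)
```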

Let us review the values of q for the correlation kernels listed in Table 1. Matérn correlation kernels [Mat86, HS93] with smoothness parameter ν have q = min(2ν, 2): thus q < 2 for ν < 1, but q = 2 for ν ⩾ 1. Spherical correlation kernels [Wac95] have q = 1. Power Exponential kernels [DOKS97] have q equal to their power. This means that all Power Exponential kernels except the Squared Exponential correlation kernel have q < 2. In particular, the Exponential kernel (which is also the Matérn kernel with smoothness ν = 1/2) has q = 1, but the Squared Exponential kernel has q = 2. Rational Quadratic kernels [Yag87] have q = 2. For easy reference, the review is summarized in Table 2.

The above review justifies the claim in the abstract that the Squared Exponential kernel, Matérn kernels with smoothness ν ⩾ 1 and Rational Quadratic kernels require a proof of the reference posterior's propriety.

## 4 Propriety of the reference posterior distribution

BDOS01 show that the reference posterior distribution on β and σ² conditionally to θ is proper.

In this section, we prove that the joint reference posterior distribution of (β, σ², θ) is proper for Matérn kernels with smoothness ν ⩾ 1, Rational Quadratic kernels and the Squared Exponential kernel.

###### Proposition 7.

For Matérn kernels with smoothness ν ⩾ 1, for Rational Quadratic kernels with parameter ν and for the Squared Exponential kernel, the “marginal” reference prior distribution π(θ) defined by Proposition 1 has the following behavior.

1. When θ → 0,

$$\pi(\theta) = \begin{cases} o(1) & \text{for Matérn kernels and the Squared Exponential kernel;}\\ O(\theta^{\nu-1}) & \text{for Rational Quadratic kernels.} \end{cases} \tag{9}$$

2. When θ → +∞,

$$\pi(\theta) = \begin{cases} O(\theta^{-1}) & \text{for Matérn kernels;}\\ o(1) & \text{for Rational Quadratic kernels;}\\ O(\theta) & \text{for the Squared Exponential kernel.} \end{cases} \tag{10}$$
###### Proof.

Denoting by K_θ any of these kernels, θ ↦ Σ_θ is continuously differentiable.

If K_θ is Squared Exponential, ‖(d/dθ)Σ_θ‖ → 0 when θ → 0. This also holds if K_θ is Matérn with smoothness ν ⩾ 1 (see AS64 9.6.28. and 9.7.2.). If K_θ is Rational Quadratic with parameter ν, ‖(d/dθ)Σ_θ‖ = O(θ^{ν−1}) when θ → 0.

Moreover, Σ_θ converges to I_n when θ → 0, so its inverse does too. The first assertion follows from these facts.

The second assertion is proved by combining Lemma 12 with Lemma 20/21/22 for Matérn, Rational Quadratic and Squared Exponential kernels respectively. ∎

Let v_1(θ) ⩽ ⋯ ⩽ v_{n−p}(θ) be the ordered eigenvalues of W⊤Σ_θW.

###### Lemma 8.

For Rational Quadratic and Squared Exponential kernels and for Matérn kernels with smoothness ν ⩾ 1, there exists a hyperplane 𝒟 of ℝⁿ such that for every y ∈ ℝⁿ ∖ 𝒟, when θ → +∞:

$$\left(y^\top W (W^\top \Sigma_\theta W)^{-1} W^\top y\right)^{-1} = O\!\left(v_{n-p}(\theta)\right). \tag{11}$$

The proof of this lemma can be found in Appendix D. Combined with Equation (5), it implies that if the observation vector y belongs to ℝⁿ ∖ 𝒟, then

$$L(y \mid \theta)^2 = \prod_{i=1}^{n-p} \frac{O\!\left(v_{n-p}(\theta)\right)}{v_i(\theta)} = O(1) \quad \text{when } \theta \to +\infty. \tag{12}$$

In the following, when y belongs to ℝⁿ ∖ 𝒟, we write that “y looks nondegenerate”. This terminology relies on the intuition that if the observation were to take some values within 𝒟, it would be better explained by a degenerate Gaussian model. The most compelling example is that of a constant observation vector, for which the Kriging model would be grossly inappropriate.

###### Theorem 9.

For Matérn kernels with smoothness ν ⩾ 1, for Rational Quadratic kernels and for the Squared Exponential kernel, regardless of the design set and of the mean function space, if y looks nondegenerate, then the reference posterior distribution is proper.

###### Proof.

The first assertion of Proposition 7 implies the reference prior is integrable in the neighborhood of 0. Furthermore, L(y|θ) = O(1) when θ → 0, so the reference posterior is integrable in the neighborhood of 0 as well.

All that remains to be proved is therefore that the reference posterior is integrable in the neighborhood of +∞. In the following θ → +∞, so we rely on the asymptotic expansion of Σ_θ, which is detailed in Appendix D.

The proof is somewhat trickier for Matérn kernels with integer smoothness, so we tackle this case at the end. Until further notice, assume the kernel is Rational Quadratic, Squared Exponential or Matérn with noninteger smoothness ν.

For Rational Quadratic and Squared Exponential (resp. Matérn with noninteger smoothness parameter ν) kernels, Appendix D.1 (resp. Appendix D.2) shows how W⊤Σ_θW can be decomposed as

$$W^\top \Sigma_\theta W = g(\theta)\left\{W^\top D W + g^\star(\theta)\, W^\top D^\star W + R_g(\theta)\right\}, \tag{13}$$

where

• g is a positive differentiable function;

• g⋆ is a differentiable function with g⋆(θ) → 0 when θ → +∞ (its expression, like g's, depends on the kernel and is given in the corresponding appendix);

• R_g is a differentiable mapping from (0, +∞) to the set of symmetric (n − p) × (n − p) matrices such that ‖R_g(θ)‖ → 0 when θ → +∞;

• D and D⋆ are both fixed symmetric matrices;

• W⊤DW is non-null;

• either W⊤D⋆W is non-null or W⊤DW is nonsingular.

Lemma 15 implies that one of the following is true:

1. When θ is large enough, W⊤DW + g⋆(θ)W⊤D⋆W is nonsingular. This case can be further decomposed in the following subcases:

    1. W⊤DW is nonsingular;

    2. W⊤DW is singular, but W⊤DW + g⋆(θ)W⊤D⋆W is nonsingular when θ is large enough.

2. The vector space Ker(W⊤DW) ∩ Ker(W⊤D⋆W) is non-trivial.

Let us differentiate W⊤Σ_θW:

$$\frac{d}{d\theta} W^\top \Sigma_\theta W = \frac{g'(\theta)}{g(\theta)}\, W^\top \Sigma_\theta W + g(\theta)\left(g^{\star\prime}(\theta)\, W^\top D^\star W + \frac{d}{d\theta} R_g(\theta)\right). \tag{14}$$

We can show that . This is due to Equation (72) for Rational Quadratic and Squared Exponential kernels, and to Equation (75) for Matérn kernels with noninteger smoothness.

Lemma 11 shows that, in Equation (3), (d/dθ)W⊤Σ_θW can be replaced by g(θ)(g⋆′(θ)W⊤D⋆W + (d/dθ)R_g(θ)):

$$\pi(\theta) \propto \sqrt{\operatorname{Tr}\left[\left\{g(\theta)\left(g^{\star\prime}(\theta)\, W^\top D^\star W + \frac{d}{d\theta} R_g(\theta)\right)(W^\top \Sigma_\theta W)^{-1}\right\}^2\right] - \frac{1}{n-p}\left[\operatorname{Tr}\left\{g(\theta)\left(g^{\star\prime}(\theta)\, W^\top D^\star W + \frac{d}{d\theta} R_g(\theta)\right)(W^\top \Sigma_\theta W)^{-1}\right\}\right]^2}. \tag{15}$$

We have π(θ) = O(w̃(θ)), where

$$\tilde{w}(\theta) := \sqrt{\operatorname{Tr}\left[\left\{g(\theta)\left(g^{\star\prime}(\theta)\, W^\top D^\star W + \frac{d}{d\theta} R_g(\theta)\right)(W^\top \Sigma_\theta W)^{-1}\right\}^2\right]}. \tag{16}$$

A specific asymptotic analysis is required in each case. This study is conducted in Appendix E. We summarize the results in Table 3.

The posterior distribution resulting from the reference prior is proper in all cases.

Matérn kernels with integer smoothness are dealt with in Appendix E.2.

## 5 Conclusion

The main result of this work is Theorem 9, which ensures that the reference prior leads to a proper posterior distribution for a large class of smooth kernels. This class contains the Squared Exponential correlation kernel as well as the important Matérn family [Ste99] with smoothness parameter ν ⩾ 1. Rational Quadratic kernels, whose usage is less widespread, are also included within this class.

BDOS01 proved this result for a class of rough correlation kernels. This class includes the complementary part of the Matérn family – kernels with smoothness parameter ν < 1 – as well as all other Power Exponential kernels. Spherical kernels, which are mostly used in the field of geostatistics, also belong to this class.

Combining Theorem 9 with the results from BDOS01, one can appreciate how versatile the reference prior is, insofar as it is able to adapt to very different correlation kernels and always leads to a proper posterior. No ad hoc technique is required to derive usable inference, so this approach seems to be flawless from a Bayesian point of view when no explicit prior information is available. Even when explicit prior information is available, following DM07, it can be used to derive Maximum A Posteriori (MAP) estimates or Highest Posterior Density (HPD) sets that are invariant under reparametrization.

## Acknowledgements

The author would like to thank his PhD advisor Professor Josselin Garnier (École Polytechnique, Centre de Mathématiques Appliquées) for his guidance, Loic Le Gratiet (EDF R&D, Chatou) and Anne Dutfoy (EDF R&D, Saclay) for their advice and helpful suggestions. The author acknowledges the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-13-MONU-0005 (project CHORUS).

## Appendix A Algebraic facts

###### Lemma 10.

Let a and b be positive integers and let Σ be a nonsingular symmetric (a + b) × (a + b) matrix. Then, for any (a + b) × a matrix A with rank a and any (a + b) × b matrix B with rank b such that A⊤B is the null matrix,

$$B(B^\top \Sigma B)^{-1} B^\top = \Sigma^{-1}\left(I_{a+b} - A(A^\top \Sigma^{-1} A)^{-1} A^\top \Sigma^{-1}\right). \tag{17}$$

###### Proof.

Notice that both matrices have the same kernel, namely the subspace of ℝ^{a+b} spanned by the columns of A. Indeed, because B has full column rank and Σ is nonsingular, the left matrix has the same kernel as B⊤. Besides, the a-dimensional subspace of ℝ^{a+b} spanned by the columns of A is included in this kernel. So because the rank of B⊤ is b, its kernel has dimension a and the inclusion is an equality.

Similarly, because Σ is nonsingular, the right matrix has the same kernel as I_{a+b} − A(A⊤Σ⁻¹A)⁻¹A⊤Σ⁻¹. Moreover, because the image of A(A⊤Σ⁻¹A)⁻¹A⊤Σ⁻¹ is included within the image of A, its dimension is lower than or equal to a. The image of I_{a+b} on the other hand has dimension a + b, so the image of I_{a+b} − A(A⊤Σ⁻¹A)⁻¹A⊤Σ⁻¹ has dimension greater than or equal to b and therefore its kernel has dimension lower than or equal to a. Now, a simple computation shows that the a-dimensional subspace of ℝ^{a+b} spanned by the columns of A is included in the kernel, so it is in fact equal to the kernel.

Besides, for any z ∈ ℝ^b,

$$B(B^\top \Sigma B)^{-1} B^\top (\Sigma B z) = Bz; \tag{18}$$
$$\Sigma^{-1}\left(I_{a+b} - A(A^\top \Sigma^{-1} A)^{-1} A^\top \Sigma^{-1}\right)(\Sigma B z) = Bz. \tag{19}$$

So both matrices act the same way on the subspace spanned by the columns of ΣB, which is supplementary to their common kernel, hence the equality. ∎
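Lemma 10 lends itself to a quick numerical sanity check. The sketch below is ours: it builds a random positive definite Σ, a random full-rank A, and a B whose columns span the orthogonal complement of A's columns:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2, 3
m = a + b
G = rng.standard_normal((m, m))
Sigma = G @ G.T + m * np.eye(m)   # symmetric positive definite, hence nonsingular
A = rng.standard_normal((m, a))   # rank a almost surely
# Columns of B: orthonormal basis of the null space of A^T, so A^T B = 0.
B = np.linalg.svd(A.T)[2][a:].T
lhs = B @ np.linalg.inv(B.T @ Sigma @ B) @ B.T
Sinv = np.linalg.inv(Sigma)
rhs = Sinv @ (np.eye(m) - A @ np.linalg.inv(A.T @ Sinv @ A) @ A.T @ Sinv)
print(np.allclose(lhs, rhs))  # True: both sides of (17) coincide
```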

###### Lemma 11.

Let m be a positive integer, Σ be a nonsingular m × m matrix, and A and B be m × m matrices. If there exists a real number t such that

$$A = t\,\Sigma + B, \tag{20}$$

then

$$\operatorname{Tr}\left[\{A\Sigma^{-1}\}^2\right] - \frac{1}{m}\left[\operatorname{Tr}\{A\Sigma^{-1}\}\right]^2 = \operatorname{Tr}\left[\{B\Sigma^{-1}\}^2\right] - \frac{1}{m}\left[\operatorname{Tr}\{B\Sigma^{-1}\}\right]^2. \tag{21}$$
###### Proof.

The lemma follows from a direct calculation:

$$\operatorname{Tr}[A\Sigma^{-1}] = \operatorname{Tr}[B\Sigma^{-1}] + tm, \tag{22}$$
$$\operatorname{Tr}\left[\{A\Sigma^{-1}\}^2\right] = \operatorname{Tr}\left[\{B\Sigma^{-1}\}^2\right] + 2t\operatorname{Tr}[B\Sigma^{-1}] + t^2 m. \tag{23}$$

∎
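Lemma 11 can likewise be checked numerically; in this sketch (ours), the value of t, the dimension m and the matrices are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
m, t = 4, 0.7
G = rng.standard_normal((m, m))
Sigma = G @ G.T + m * np.eye(m)   # nonsingular
B = rng.standard_normal((m, m))
A = t * Sigma + B                 # hypothesis (20)

def centered_trace(M):
    """Tr[{M Sigma^-1}^2] - Tr[M Sigma^-1]^2 / m."""
    X = M @ np.linalg.inv(Sigma)
    return np.trace(X @ X) - np.trace(X) ** 2 / m

print(np.isclose(centered_trace(A), centered_trace(B)))  # True, as (21) states
```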

###### Lemma 12.

Let a < m be positive integers, Σ be an m × m symmetric positive definite matrix, Σ′ be an m × m symmetric matrix and A be an m × a matrix with rank a. Denote by Q the matrix I_m − A(A⊤Σ⁻¹A)⁻¹A⊤Σ⁻¹. Then, if there exist t ∈ ℝ and u ∈ ℝ such that the matrix F := Σ′ + uΣ is positive semi-definite and verifies x⊤Fx ⩽ (t/2) x⊤Σx for all x ∈ ℝ^m, then

$$\sqrt{\operatorname{Tr}\left[(\Sigma'\Sigma^{-1}Q)^2\right] - \frac{1}{m-a}\operatorname{Tr}\left[\Sigma'\Sigma^{-1}Q\right]^2} \;\leqslant\; (m-a)\,\frac{t}{2}. \tag{24}$$

###### Proof.

Let B be an m × (m − a) matrix with rank m − a such that A⊤B is the null matrix. Such a matrix B can for instance be constructed by computing a Singular Value Decomposition (SVD) of A: A = USV⊤. In this decomposition, U and V are orthogonal matrices of size m × m and a × a respectively, and S is an m × a matrix whose only non-null entries are on the main diagonal. Therefore the last m − a rows of S are filled with zeros. So define B as the matrix formed by the last m − a columns of U.

By applying Lemma 10, we obtain that B(B⊤ΣB)⁻¹B⊤ = Σ⁻¹Q.

Because of the properties of the trace, this implies

$$\operatorname{Tr}[\Sigma'\Sigma^{-1}Q] = \operatorname{Tr}\left[B^\top \Sigma' B (B^\top \Sigma B)^{-1}\right], \tag{25}$$
$$\operatorname{Tr}\left[(\Sigma'\Sigma^{-1}Q)^2\right] = \operatorname{Tr}\left[\{B^\top \Sigma' B (B^\top \Sigma B)^{-1}\}^2\right]. \tag{26}$$

Similarly, we have

$$\operatorname{Tr}[F\Sigma^{-1}Q] = \operatorname{Tr}\left[B^\top F B (B^\top \Sigma B)^{-1}\right], \tag{27}$$
$$\operatorname{Tr}\left[(F\Sigma^{-1}Q)^2\right] = \operatorname{Tr}\left[\{B^\top F B (B^\top \Sigma B)^{-1}\}^2\right]. \tag{28}$$

Because B⊤FB − B⊤Σ′B is proportional to B⊤ΣB, Lemma 11 implies

$$\operatorname{Tr}\left[\{B^\top \Sigma' B(B^\top\Sigma B)^{-1}\}^2\right] - \frac{1}{m-a}\operatorname{Tr}\left[B^\top\Sigma' B(B^\top\Sigma B)^{-1}\right]^2 = \operatorname{Tr}\left[\{B^\top F B(B^\top\Sigma B)^{-1}\}^2\right] - \frac{1}{m-a}\operatorname{Tr}\left[B^\top F B(B^\top\Sigma B)^{-1}\right]^2. \tag{29}$$

Combining the 5 equations above yields

$$\operatorname{Tr}\left[(\Sigma'\Sigma^{-1}Q)^2\right] - \frac{1}{m-a}\operatorname{Tr}\left[\Sigma'\Sigma^{-1}Q\right]^2 = \operatorname{Tr}\left[(F\Sigma^{-1}Q)^2\right] - \frac{1}{m-a}\operatorname{Tr}\left[F\Sigma^{-1}Q\right]^2. \tag{30}$$

An elementary computation shows that Q⊤Σ⁻¹Q = Σ⁻¹Q.

Consider the Cholesky decomposition Σ = LL⊤. Then Σ⁻¹Q = (L⁻¹Q)⊤L⁻¹Q.

$$\operatorname{Tr}\left[(F\Sigma^{-1}Q)^2\right] \leqslant \operatorname{Tr}\left[L^{-1}QFQ^\top(L^{-1})^\top\right]^2 = \operatorname{Tr}\left[F\Sigma^{-1}Q\right]^2. \tag{31}$$

The inequality holds because L⁻¹QFQ⊤(L⁻¹)⊤ is a symmetric positive semi-definite matrix.

Let (ξ¹, …, ξ^m) be a basis of unit eigenvectors of Σ⁻¹Q such that for every integer i > m − a, ξ^i belongs to the kernel of Σ⁻¹Q. Indeed, because Σ is nonsingular, this kernel has the same dimension as the kernel of Q: a.

Denoting by (s_1, …, s_m) the family of the eigenvalues corresponding to the family of eigenvectors (ξ¹, …, ξ^m), we have for every integer i ⩽ m − a

$$(\xi^i)^\top \Sigma\, \xi^i = s_i^{-2}\left\{(\xi^i)^\top Q^\top \Sigma^{-1}\right\}\Sigma\left\{\Sigma^{-1}Q\,\xi^i\right\} = s_i^{-2}\,(\xi^i)^\top Q^\top \Sigma^{-1} Q\,\xi^i = s_i^{-2}\,(\xi^i)^\top \Sigma^{-1}Q\,\xi^i = s_i^{-1}. \tag{32}$$

This implies the third equality below:

$$\operatorname{Tr}[F\Sigma^{-1}Q] = \sum_{i=1}^{m}(\xi^i)^\top F\Sigma^{-1}Q\,\xi^i = \sum_{i=1}^{m-a} s_i\,(\xi^i)^\top F\,\xi^i = \sum_{i=1}^{m-a} \frac{(\xi^i)^\top F\,\xi^i}{(\xi^i)^\top \Sigma\,\xi^i} \leqslant (m-a)\,\frac{t}{2}. \tag{33}$$

Equations (30), (31) and (33) yield the result. ∎

### A.1 Power series

###### Lemma 13.

Let (A_k)_{k⩾0} be a sequence of matrices of the same size. If ∑_{k=0}^{+∞} A_k exists and its kernel is the trivial vector space, then there exists a nonnegative integer N such that ⋂_{k=0}^{N} Ker(A_k) is the trivial vector space.

###### Proof.

Assume the sum ∑_{k=0}^{+∞} A_k exists and its kernel is the trivial vector space. Consider the sequence (d_N)_{N⩾0} where, for every nonnegative integer N, d_N is the dimension of ⋂_{k=0}^{N} Ker(A_k). (d_N)_{N⩾0} is a nonincreasing sequence of nonnegative integers, so it is convergent. If its limit is strictly greater than 0, then for every nonnegative integer N, there exists a unit vector x_N that belongs to ⋂_{k=0}^{N} Ker(A_k). Because the unit sphere is compact, there exists an increasing mapping φ : ℕ → ℕ such that the subsequence (x_{φ(N)})_{N⩾0} converges to a limit x_∞ such that ‖x_∞‖ = 1. Besides, for every pair of nonnegative integers k ⩽ N, x_{φ(N)} belongs to Ker(A_k). Given this set is closed, the limit x_∞ also belongs to Ker(A_k). So for every nonnegative integer k, A_k x_∞ = 0 and therefore (∑_{k=0}^{+∞} A_k) x_∞ = 0. So x_∞ can only be the null vector, which is absurd since ‖x_∞‖ = 1. We deduce from this contradiction that the limit of the sequence of integers (d_N)_{N⩾0} is 0. Therefore there exists a nonnegative integer N such that d_N = 0. ∎

## Appendix B Maclaurin series

The lemmas in this subsection deal with the following setting.

Let m be a positive integer and let M be a continuous mapping from [0, +∞) to the set of m × m real matrices. Assume M admits the following Maclaurin series:

$$M(t) = \sum_{k=0}^{N} a_k(t)\, A_k + B(t). \tag{34}$$

In the above expression, N is a nonnegative integer and for every k ∈ {0, …, N}:

1. a_k is a continuous real-valued mapping such that for all t > 0, a_k(t) ≠ 0;

2. for every nonnegative integer k ⩽ N − 1, a_{k+1}(t) = o(a_k(t)) when t → 0;

3. A_k is a non-null symmetric matrix.

B is a continuous mapping such that for every t ⩾ 0, B(t) is a symmetric matrix and, when t → 0, ‖B(t)‖ = o(a_N(t)).

###### Lemma 14.

Consider (34). If ⋂_{k=0}^{N} Ker(A_k) is the trivial vector space and if there exists ε > 0 such that for all t ∈ (0, ε), M(t) is nonsingular, then when t → 0, ‖M(t)⁻¹‖ = O(a_N(t)⁻¹).

###### Proof.

Assume that ⋂_{k=0}^{N} Ker(A_k) is the trivial vector space and that there exists ε > 0 such that for all t ∈ (0, ε), M(t) is a nonsingular matrix.

If N = 0, then A_0 is nonsingular and the conclusion is trivial.

If N ⩾ 1, we may assume without loss of generality that Ker(A_0) is a nontrivial vector space, otherwise we could replace N by 0 and B(t) by ∑_{k=1}^{N} a_k(t) A_k + B(t) for all t.

Let c be the codimension of Ker(A_0). Let U be an m × (m − c) matrix whose columns form an orthonormal basis of Ker(A_0), and let V be an m × c matrix whose columns form an orthonormal basis of its orthogonal complement. Then (V U) is an orthogonal matrix. For all t, let us replace M(t) by (V U)⊤M(t)(V U). Because (V U) is an orthogonal matrix, the Frobenius norm of M(t)⁻¹ is unchanged. Naturally, for all k, A_k is replaced by (V U)⊤A_k(V U) and for every t, B(t) is replaced by