In a very influential paper, BDOS01 pioneered the field of Objective Bayesian analysis of spatial models. Previous works [DOKS97, Ste99] had noted that commonly used noninformative priors sometimes failed to yield proper posteriors, but BDOS01 were the first to thoroughly investigate the issue. Among several prior distributions – truncated priors, vague priors, the Jeffreys-rule prior and the independence Jeffreys prior – they showed that the reference prior [Ber05] is the most satisfying choice for a default prior distribution. This is due in no small part to the fact that, in the wide variety of cases studied by BDOS01, it systematically yields a proper posterior distribution. In this article, we complete their proof of this property.
Section 2 describes the Gaussian Process models studied by BDOS01. Section 3 shows that the proof of reference posterior propriety provided by BDOS01 only applies to those with rough correlation kernels – Spherical, Exponential with power p < 2, Matérn with smoothness ν < 1. Section 4 contains the core of this paper: a proof of Theorem 9, which asserts that the reference prior leads to a proper posterior for models with smoother correlation kernels – Exponential with power p = 2 (Squared Exponential), Matérn with smoothness ν ≥ 1, Rational Quadratic.
The rest of the introduction illustrates the significance of the reference prior yielding a proper posterior.
For smooth one-dimensional parametric families, the reference prior coincides with the Jeffreys-rule prior [CB94]. For finite-dimensional smooth parametric families, the reference prior algorithm requires the user to define groups of dimensions of the parameter and rank them. The reference prior is then defined iteratively:
Compute the Jeffreys-rule prior on the lowest-ranking group of dimensions, conditional on all others.
Average the likelihood function over this prior.
Compute the Jeffreys-rule prior (based on the integrated likelihood function) on the second-lowest-ranking group of dimensions, conditional on all higher-ranking dimensions.
Average the integrated likelihood function over this second prior.
Continue the process until the Jeffreys-rule prior on the highest-ranking group of dimensions has been computed.
The reference prior is defined as the product of all successively computed priors.
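For two groups, the recursion above takes a compact form. The display below is a generic sketch in notation of our own choosing (θ₁ for the lower-ranking group, θ₂ for the higher-ranking one, I for Fisher information); it is not copied from the paper:

```latex
% Step 1: conditional Jeffreys-rule prior on the lower-ranking group
\pi(\theta_1 \mid \theta_2) \propto \sqrt{\det I_{\theta_1}(\theta_1, \theta_2)}
% Step 2: integrated likelihood
\tilde{L}(\theta_2 ; y) = \int L(\theta_1, \theta_2 ; y)\,
    \pi(\theta_1 \mid \theta_2)\, \mathrm{d}\theta_1
% Step 3: Jeffreys-rule prior based on the integrated likelihood
\pi(\theta_2) \propto \sqrt{\det \tilde{I}(\theta_2)}
% Step 4: the reference prior is the product of the successive priors
\pi(\theta_1, \theta_2) = \pi(\theta_1 \mid \theta_2)\, \pi(\theta_2)
```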
In the cases studied by BDOS01, the dimensions of the parameter can be put in two natural categories: “location” and “covariance”, the latter being higher-ranking because if they were known, the model would be trivialized.
The process, however, entails a significant difficulty, which arises because the successively computed Jeffreys-rule priors are often improper. We do not consider this difficulty here; it is touched upon in BDOS01 and more thoroughly discussed in RSH12. It can be avoided by using “asymptotic marginalization” [BB92] instead of “exact marginalization”, but BDOS01 found that the resulting “approximate” reference prior, unlike the “true” reference prior, does not always yield a proper posterior in the studied cases.
Because it is so difficult to obtain a satisfying default prior distribution which consistently yields a proper posterior, it is important to ascertain that the reference prior actually does. Indeed, a vast literature [paulo05, RSH12, KP12, RSS13, GWB18] builds upon BDOS01’s result and depends on it.
BDOS01 consider models of Gaussian Process regression, also known as Universal Kriging, with isotropic autocorrelation kernels. Because isotropy is key, define
as the usual Euclidean norm if applied to a vector and as the Frobenius norm if applied to a matrix. In Universal Kriging, an unknown mapping from a spatial domain to is assumed to be a realization of a Gaussian process . The mean function of the Gaussian process is assumed to belong to some known vector space of dimension . If is non-zero, once a basis of has been set, can be parametrized by such that .
is assumed in the model to be an isotropic Gaussian process based on an autocorrelation kernel . is a mapping such that for any positive integer and any collection of distinct points within , the symmetric matrix with -th element is a positive definite correlation matrix. Necessarily, .
The autocovariance function of the Gaussian process is , where is the autocorrelation kernel parametrized by and defined by , making
the variance of for all .
Fix and a collection of distinct points . Let this collection be the design set, i.e. the set of points where is observed. is a Gaussian vector with mean vector and covariance matrix , where denotes the matrix with -th element . Table 1 provides the definition of several correlation kernels.
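Since the entries of Table 1 are not reproduced in this extract, the sketch below gives standard textbook forms for these isotropic kernel families, written as functions of the distance t and a correlation length theta. The names, parametrizations and the helper `correlation_matrix` are our own and may differ from those of Table 1:

```python
import numpy as np

# Standard isotropic correlation kernels, written as functions of the
# distance t >= 0 and a correlation length theta > 0 (textbook forms;
# the paper's Table 1 may use a different parametrization).

def power_exponential(t, theta, p):
    # p in (0, 2]; p = 1 is the Exponential kernel,
    # p = 2 the Squared Exponential kernel
    return np.exp(-(t / theta) ** p)

def matern_32(t, theta):
    # Matern kernel with smoothness 3/2
    a = np.sqrt(3.0) * t / theta
    return (1.0 + a) * np.exp(-a)

def rational_quadratic(t, theta, a):
    # a > 0 is the shape parameter
    return (1.0 + (t / theta) ** 2) ** (-a)

def spherical(t, theta):
    # compactly supported; positive definite in dimension <= 3
    h = np.minimum(t / theta, 1.0)
    return 1.0 - 1.5 * h + 0.5 * h ** 3

def correlation_matrix(X, kernel):
    """Correlation matrix of an isotropic kernel over a design set X (n x d)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return kernel(D)
```

Each function equals 1 at t = 0 and decreases with distance, and for distinct design points the resulting correlation matrix is symmetric positive definite, as required of an autocorrelation kernel.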
If is non-zero, let denote the matrix with -th element . [Note: if , then we adopt the convention that any term involving can be ignored.] Then . Denote by the observed value of the random vector . The likelihood function of the parameter triplet has the following expression:
In order for the model to be identifiable, assume that and that has rank .
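The likelihood expression above is the density of a multivariate Gaussian vector with mean Hβ and covariance σ²Σ (in generic notation). As an illustration, the following sketch evaluates its logarithm through a Cholesky factorization; the function name and arguments are our own, not the paper's:

```python
import numpy as np

def kriging_loglik(y, H, Sigma, beta, sigma2):
    """Log-likelihood of y ~ N(H beta, sigma2 * Sigma).

    y: observation vector (n,), H: design matrix (n, p),
    Sigma: correlation matrix (n, n), beta: (p,), sigma2 > 0.
    """
    n = y.shape[0]
    r = y - H @ beta                       # residual vector
    L = np.linalg.cholesky(sigma2 * Sigma) # lower-triangular factor
    alpha = np.linalg.solve(L, r)          # L^{-1} r
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + alpha @ alpha)
```

The Cholesky route avoids forming the inverse covariance matrix explicitly and is the usual numerically stable way to evaluate this density.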
BDOS01 derive the reference prior corresponding to the parameter ordering [if , is meaningless, so the ordering is ]. One can see [RSH12] that the reference prior corresponding to the ordering [if , ] is the same.
To express it conveniently, denote by the matrix [if , ]. Also fix , an matrix such that and is the null matrix. Its columns form an orthonormal basis of the orthogonal complement of the subspace of spanned by the columns of [if , fix
as an orthogonal matrix, for instance].
The reference prior with ordering is , where
Denoting by , can also be written as:
The first assertion is from RSH12. The second is a consequence of Lemma 10. ∎
If , after marginalizing and out, we have
Alternatively, the integrated likelihood with can also be written
If , the integrated likelihood is simply
The result for and the first result for are from BDOS01. Lemma 10 yields that
So all that remains to be proved is that . Choose an matrix with columns forming an orthonormal basis of the subspace of spanned by the columns of . is therefore an orthogonal matrix, so . Using the Schur complement, we have
Lemma 10 again yields the result. ∎
3 Smoothness of the correlation kernel
Lemma 2 of BDOS01 requires that the correlation kernel and design set be such that , where is the vector with entries all equal to 1, is a real-valued function such that , is a fixed nonsingular matrix and is a mapping from to the set of real matrices such that .
What makes this assumption restrictive is the condition that should be nonsingular, because it holds for rough correlation kernels only. For instance, as was noted by paulo05, it does not hold for the Squared Exponential correlation kernel.
For a given correlation kernel , is typically a matrix proportional to the matrix with entries , where depends on the smoothness of the correlation kernel but should in any case belong to the interval . This is because is equivalent to a constant times when .
Sch37 gives the following result (Theorem 4 in the original paper):
If , the quadratic form is nonsingular and its canonical representation contains one positive and negative squares.
This means that if the correlation kernel is rough enough to have , the assumption that is nonsingular is reasonable.
The matrix with entries with is nonsingular and has one positive eigenvalue and
is nonsingular and has one positive eigenvalue and negative eigenvalues.
The picture is dramatically different when the correlation kernel is smooth enough to have . This happens as soon as is twice continuously differentiable. Gow85’s Theorem 6 implies the following results:
If is the dimension of , the smallest Euclidean subspace containing all points in the design set, then the matrix with entries has rank:
(one positive eigenvalue, negative eigenvalues, any other eigenvalue null) if all points in the design set lie on the surface of a hypersphere of ;
(one positive eigenvalue, negative eigenvalues, any other eigenvalue null) otherwise.
The matrix with entries has rank at most .
For all practical purposes, is much greater than , so the matrix is singular when .
Let us review the values of the exponent α (the power of the distance governing the behavior of the kernel near 0) for the correlation kernels listed in Table 1. Matérn correlation kernels [Mat86] [HS93] with smoothness parameter ν have α = min(2ν, 2), thus α = 2ν < 2 for ν < 1, but α = 2 for ν ≥ 1. Spherical correlation kernels [Wac95] have α = 1. Power Exponential kernels [DOKS97] have α equal to their power. This means that all Power Exponential kernels except the Squared Exponential correlation kernel have α < 2. In particular, the Exponential kernel (which is also the Matérn kernel with smoothness ν = 1/2) has α = 1, but the Squared Exponential kernel has α = 2. Rational Quadratic kernels [Yag87] have α = 2. For easy reference, the review is summarized in Table 2.
The above review justifies the claim in the abstract that the Squared Exponential kernel, Matérn kernels with smoothness ν ≥ 1 and Rational Quadratic kernels require a proof of the reference posterior’s propriety.
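The contrast described in this section is easy to observe numerically. For design points in the plane (d = 2), the matrix of squared distances, which corresponds to the smooth case, has rank at most d + 2 = 4, while the matrix of plain distances, corresponding to the rough case, is nonsingular. A minimal check (our own construction, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))  # 8 generic design points in R^2, so d = 2

# Pairwise Euclidean distance matrix
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Rough case: entries ||x_i - x_j||^1 give a nonsingular matrix
rank_linear = np.linalg.matrix_rank(D)

# Smooth case: entries ||x_i - x_j||^2 expand as
# ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j, so the rank is at most d + 2 = 4
rank_squared = np.linalg.matrix_rank(D ** 2)
```

The algebraic identity in the comment explains the rank collapse: the squared-distance matrix is a sum of two rank-one terms and a rank-d Gram term, regardless of the number of design points.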
Table 2 (excerpt): Power Exponential ( ) – yes.
4 Propriety of the reference posterior distribution
BDOS01 show that the reference posterior distribution on and conditional on is proper.
In this section, we prove that the joint reference posterior distribution is proper for Matérn kernels with smoothness ν ≥ 1, Rational Quadratic kernels and the Squared Exponential kernel.
For Matérn kernels with smoothness , for Rational Quadratic kernels with parameter and for the Squared Exponential kernel, the “marginal” reference prior distribution defined by Proposition 1 has the following behavior.
Denoting any of these kernels by , is continuously differentiable.
If is Squared Exponential, . This also holds if is Matérn with smoothness (see AS64, 9.6.28 and 9.7.2). If is Rational Quadratic with parameter , .
Moreover, converges to when , so its inverse does too. The first assertion follows from these facts.
Let be the ordered eigenvalues of .
For Rational Quadratic and Squared Exponential kernels and for Matérn kernels with smoothness , there exists a hyperplane of such that for every , when :
In the following, when belongs to , we write that “ looks nondegenerate”. This terminology relies on the intuition that if the observation were to take some values within , it would be better explained by a degenerate Gaussian model. The most compelling example is that of a constant observation vector, for which the Kriging model would be grossly inappropriate.
For Matérn kernels with noninteger smoothness , for Rational Quadratic kernels and for the Squared Exponential kernel, regardless of the design set and of the mean function space, if looks nondegenerate, then the reference posterior distribution is proper.
The first assertion of Proposition 7 implies the reference prior is integrable in the neighborhood of 0. Furthermore, when , so the reference posterior is integrable in the neighborhood of 0 as well.
All that remains to be proved is therefore that the reference posterior is integrable in the neighborhood of . In the following , so we rely on the asymptotic expansion of , which is detailed in Appendix D.
The proof is somewhat trickier for Matérn kernels with integer smoothness, so we tackle this case at the end. Until further notice, assume the kernel is Rational Quadratic, Squared Exponential or Matérn with noninteger smoothness .
is a differentiable function;
with (actually, if the kernel is Rational Quadratic or Squared Exponential, );
is a differentiable mapping from to such that ;
and are both fixed symmetric matrices;
either is non-null or is nonsingular.
Lemma 15 implies that one of the following is true:
When is large enough, is nonsingular. This case can be further decomposed in the following subcases:
is singular, but is nonsingular when is large enough.
The vector space is non-trivial.
Let us differentiate :
We have , where
A specific asymptotic analysis is required in each case. This study is conducted in Appendix E. We summarize the results in Table 3.
Table 3 (kernels covered in each case):
Case 1.(a) – Matérn ( ), RQ, SE
Case 1.(b) – Matérn ( ), RQ, SE
Case 2. – RQ, SE (usual case)
Case 2. – RQ, SE (special case)
The posterior distribution resulting from the reference prior is proper in all cases.
Matérn kernels with integer smoothness are dealt with in Appendix E.2.
The main result of this work is Theorem 9, which ensures that the reference prior leads to a proper posterior distribution for a large class of smooth kernels. This class contains the Squared Exponential correlation kernel as well as the important Matérn family [Ste99] with smoothness parameter . Rational Quadratic kernels, whose usage is less widespread, are also included in this class.
BDOS01 proved this result for a class of rough correlation kernels. This class includes the complementary set of the Matérn family – kernels with smoothness parameter – as well as all other Power Exponential kernels. Spherical kernels, which are mostly used in the field of geostatistics, also belong to this class.
Combining Theorem 9 with the results from BDOS01, one can appreciate how versatile the reference prior is, insofar as it adapts to very different correlation kernels and always leads to a proper posterior. No ad hoc technique is required to derive usable inference, so this approach seems flawless from a Bayesian point of view when no explicit prior information is available. Even when explicit prior information is available, following DM07
The author would like to thank his PhD advisor Professor Josselin Garnier (École Polytechnique, Centre de Mathématiques Appliquées) for his guidance, and Loïc Le Gratiet (EDF R&D, Chatou) and Anne Dutfoy (EDF R&D, Saclay) for their advice and helpful suggestions. The author acknowledges the support of the French Agence Nationale de la Recherche (ANR), under grant ANR-13-MONU-0005 (project CHORUS).
Appendix A Algebraic facts
Let and be positive integers and let be a nonsingular symmetric matrix. Then, for any matrix with rank and any matrix with rank such that is the null matrix,
Notice that both matrices have the same kernel, namely the subspace of spanned by . Indeed, because has full column rank and is nonsingular, the left matrix has the same kernel as . Besides, the -dimensional subspace of spanned by is included in this kernel. So because the rank of is , its kernel has dimension and the inclusion is an equality.
Similarly, because is nonsingular, the right matrix has the same kernel as . Moreover, because the image of is included within the image of , its dimension is at most . The image of on the other hand has dimension , so the image of has dimension at least and therefore its kernel has dimension at most . Now, a simple computation shows that the -dimensional subspace of spanned by is included in the kernel, so it is in fact equal to the kernel.
Besides, for any ,
So both matrices act the same way on the subspace spanned by , which is supplementary to their common kernel, hence the equality. ∎
Let be a positive integer, be a nonsingular matrix, and and be matrices. If there exists a real number such that
The lemma follows from a direct calculation:
Let be positive integers, be an symmetric positive definite matrix, be an symmetric matrix and be an matrix with rank . Denote by the matrix . Then, if there exist and such that the matrix is positive semi-definite and verifies , then
Let be an matrix with rank such that is the null matrix. Such a matrix
can for instance be constructed by computing a Singular Value Decomposition (SVD) of: . In this decomposition, and are orthogonal matrices of size and respectively, and is an matrix whose only non-null entries are on the main diagonal. Therefore the last rows of are filled with zeros. So define as the matrix formed by the last columns of .
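The construction described above can be carried out in a few lines. The helper below (a hypothetical name of our own) performs the SVD and returns the last columns of the left orthogonal factor, which are orthonormal and orthogonal to the column space of the input:

```python
import numpy as np

def complement_basis(H):
    """Orthonormal basis of the orthogonal complement of the column
    space of H, an n x p matrix with full column rank p < n.

    From the SVD H = U S V^T, the last n - p columns of U are
    orthonormal and satisfy U[:, p:].T @ H = 0.
    """
    n, p = H.shape
    U, _, _ = np.linalg.svd(H, full_matrices=True)
    return U[:, p:]
```

Any matrix whose columns form an orthonormal basis of this complement would do; the SVD route is simply a numerically robust way to obtain one.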
By applying Lemma 10, we obtain that .
Because of the properties of the trace, this implies
Similarly, we have
Because , Lemma 11 implies
Combining the 5 equations above yields
An elementary computation shows that .
Consider the Cholesky decomposition . Then .
The inequality holds because is a symmetric positive semi-definite matrix.
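The inequality appears to rest on the standard fact that the trace of a product of two symmetric positive semi-definite matrices is nonnegative: writing A = LLᵀ with L a Cholesky factor gives tr(AB) = tr(LᵀBL), the trace of a positive semi-definite matrix. A quick numerical illustration under that reading (the matrices A and B are our own example):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
N = rng.standard_normal((5, 3))
A = M @ M.T   # symmetric positive definite (generically)
B = N @ N.T   # symmetric positive semi-definite, rank 3

L = np.linalg.cholesky(A)            # A = L L^T
t_direct = np.trace(A @ B)
t_cyclic = np.trace(L.T @ B @ L)     # trace of a PSD matrix, hence >= 0
```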
be a basis of unit eigenvectors of such that for every integer , belongs to the kernel of . Indeed, because , this kernel has the same dimension as the kernel of : .
Denoting by the family of the eigenvalues corresponding to the family of eigenvectors , we have for every integer and
This implies the third equality below:
A.1 Power series
Let be a sequence of matrices of the same size. If exists and its kernel is the trivial vector space, then there exists a nonnegative integer such that is the trivial vector space.
Assume the sum exists and its kernel is the trivial vector space. Consider the sequence where for every nonnegative integer , is the dimension of . is a nonincreasing sequence of nonnegative integers, so it is convergent. If its limit is strictly greater than 0, then for every nonnegative integer , there exists a unit vector that belongs to . Because the unit sphere is compact, there exists an increasing mapping such that the subsequence converges to a limit such that . Besides, for every pair of nonnegative integers , . Since this set is closed, the limit also belongs to . So for every nonnegative integer , and therefore . So can only be the null vector, which is absurd since . We deduce from this contradiction that the limit of the sequence of integers is 0. Therefore there exists a nonnegative integer such that . ∎
Appendix B Maclaurin series
The lemmas in this subsection deal with the following setting.
Let be a positive integer and let be a continuous mapping from to , the set of matrices. Assume admits the following Maclaurin series:
In the above expression, is a nonnegative integer and for every :
is a continuous mapping such that for all , ;
for every nonnegative integer , when ;
is a non-null symmetric matrix.
is a continuous mapping such that for every , is a symmetric matrix and when , .
Consider (34). If is the trivial vector space and if there exists such that for all is nonsingular, then when , .
Assume that is the trivial vector space and that there exists such that for all , is a nonsingular matrix.
If , then is nonsingular and the conclusion is trivial.
If , we may assume without loss of generality that is a nontrivial vector space, otherwise we could replace by and by for all .
Let be the codimension of . Let be an matrix whose columns form an orthonormal basis of , and let be an matrix whose columns form an orthonormal basis of its orthogonal complement. Then is an orthogonal matrix. For all , let us replace by . Because is an orthogonal matrix, the Frobenius norm of is unchanged. Naturally, for all , is replaced by and for every , is replaced by