Posterior Convergence of Gaussian and General Stochastic Process Regression Under Possible Misspecifications

October 24, 2018 · Debashis Chatterjee et al.

In this article, we investigate posterior convergence in nonparametric regression models where the unknown regression function is modeled by some appropriate stochastic process. In this regard, we consider two setups. The first setup is based on Gaussian processes, where the covariates are either random or non-random and the noise may be either normally or double-exponentially distributed. In the second setup, we assume that the underlying regression function is modeled by some reasonably smooth, but unspecified, stochastic process satisfying reasonable conditions. The distribution of the noise is also left unspecified, but assumed to be thick-tailed. As in previous studies of the same problems, we do not assume that the truth lies in the postulated parameter space, thus explicitly allowing the possibility of misspecification. We exploit the general results of Shalizi (2009) for our purpose and establish not only posterior consistency, but also the rate at which the posterior probabilities converge, which turns out to be the Kullback-Leibler divergence rate. We also investigate the more familiar posterior convergence rates. Interestingly, we show that the posterior predictive distribution can accurately approximate the best possible predictive distribution, in the sense that the Hellinger distance, as well as the total variation distance, between the two distributions can tend to zero in spite of misspecification.


1 Introduction

In statistics, either frequentist or Bayesian, nonparametric regression plays a very significant role. The frequentist nonparametric literature, however, is substantially larger than its Bayesian counterpart. Here we cite the books Schimek (2013), Härdle et al. (2012), Efromovich (2008), Takezawa (2006), Wu and Zhang (2006), Eubank (1999), Green and Silverman (1993) and Härdle (1990), among a large number of books on frequentist nonparametric regression. The Bayesian nonparametric literature, which is relatively young but flourishing in recent times (see, for example, Ghosal and van der Vaart (2017), Müller et al. (2015), Dey et al. (2012), Hjort et al. (2010), Ghosh and Ramamoorthi (2003)), offers much broader scope for interesting and innovative research.

The importance of Gaussian processes in nonparametric statistical modeling, particularly in the Bayesian context, is undeniable. They are widely used in density estimation (Lenk (1988), Lenk (1991), Lenk (2003)), nonparametric regression (Rasmussen and Williams (2006)), spatial data modeling (Cressie (1993), Banerjee et al. (2014)), machine learning (Rasmussen and Williams (2006)) and emulation of computer models (Santner et al. (2003)), to name a few areas. Although applications of Gaussian processes have received and continue to receive much attention, in recent years there seems to be a growing interest among researchers in the theoretical properties of approaches based on Gaussian processes. Specifically, investigation of posterior convergence of Gaussian process based approaches has turned out to be an important undertaking. In this respect, contributions have been made by Choi and Schervish (2007), van der Vaart and van Zanten (2008), van der Vaart and van Zanten (2009), van der Vaart and van Zanten (2011), Knapik et al. (2011), Vollmer (2013), Yang et al. (2018), and Knapik and Salomond (2018). Choi and Schervish (2007) address posterior consistency in Gaussian process regression, while the others also attempt to provide the rates of posterior convergence. However, the rates are so far computed under the assumption that the error distribution is normal and the error variance is either known, or if unknown, can be given a prior, but on a compact support bounded away from zero.

General priors for the regression function or thick-tailed noise distributions seem to have received less attention. Asymptotic theory for such frameworks is even rarer, Choi (2009) being an important exception. As far as we are aware, rates of convergence are not available for nonparametric regression with a general stochastic process prior on the regression function and thick-tailed noise distributions. Another important issue that seems to have received less attention in the literature is the case of misspecified models. We are not aware of any published asymptotic theory pertaining to misspecification in nonparametric regression, for either Gaussian or non-Gaussian processes with either normal or non-normal errors.

In this article, we consider both Gaussian and general stochastic process regression under the same setups as Choi and Schervish (2007) and Choi (2009), respectively, assuming that the covariates may be either random or non-random. For the Gaussian process setup we consider both normal and double-exponential distributions for the error, with unknown error variance. In the general context, we assume non-Gaussian noise with an unknown scale parameter supported on the entire positive part of the real line. Based on the general theory of posterior convergence provided in Shalizi (2009), we establish posterior convergence results for both setups. We allow the case of misspecified models, that is, the case where the true regression function and the true error variance need not even be supported by the prior. Our approach also enables us to show that the relevant posterior probabilities converge at the Kullback-Leibler (KL) divergence rate, and that the posterior convergence rate with respect to the KL-divergence is just slower than $n^{-1}$, $n$ being the number of observations. We further show that even in the case of misspecification, the posterior predictive distribution can approximate the best possible predictive distribution adequately, in the sense that the Hellinger distance, as well as the total variation distance, between the two distributions can tend to zero. In Section 1.1 we provide a brief overview and intuitive explanation of the main assumptions and results of Shalizi, which we exploit in this article. The details are provided in Section S-1 of the supplement. The results of Shalizi are based on seven assumptions, which we refer to as (S1) – (S7) in this article.

1.1 A briefing of the main results of Shalizi

Let $\mathbf{X}_n=\left\{X_1,\ldots,X_n\right\}$ denote the data, and let $f_{\theta}(\mathbf{X}_n)$ and $f_{\theta_0}(\mathbf{X}_n)$ denote the observed and the true likelihoods respectively, under the given value of the parameter $\theta$ and the true parameter $\theta_0$. We assume that $\theta\in\Theta$, where $\Theta$ is the (often infinite-dimensional) parameter space. However, we do not assume that $\theta_0\in\Theta$, thus allowing misspecification. The key ingredient associated with Shalizi's approach to proving convergence of the posterior distribution of $\theta$ is to show that the asymptotic equipartition property holds. To elucidate, let us consider the following likelihood ratio:

$$R_n(\theta)=\frac{f_{\theta}(\mathbf{X}_n)}{f_{\theta_0}(\mathbf{X}_n)}.$$

Then, to say that for each $\theta\in\Theta$, the generalized or relative asymptotic equipartition property holds, we mean

$$\lim_{n\rightarrow\infty}\frac{1}{n}\log R_n(\theta)=-h(\theta)\qquad(1.1)$$

almost surely, where $h(\theta)$ is the KL-divergence rate given by

$$h(\theta)=\lim_{n\rightarrow\infty}\frac{1}{n}E_{\theta_0}\left(\log\frac{f_{\theta_0}(\mathbf{X}_n)}{f_{\theta}(\mathbf{X}_n)}\right),$$

provided that it exists (possibly being infinite), where $E_{\theta_0}$ denotes expectation with respect to the true model. For $A\subseteq\Theta$, let

$$h(A)=\operatorname*{ess\,inf}_{\theta\in A}h(\theta);\qquad J(\theta)=h(\theta)-h(\Theta);\qquad J(A)=\operatorname*{ess\,inf}_{\theta\in A}J(\theta).$$

Thus, $h(A)$ can be roughly interpreted as the minimum KL-divergence between the postulated and the true model over the set $A$. If $h(\Theta)>0$, this indicates model misspecification. However, as we shall show, model misspecification need not always imply that $h(\Theta)>0$. For $A\subseteq\Theta$, $h(A)\geq h(\Theta)$, so that $J(A)\geq 0$.
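As a toy illustration of these quantities (our own example, not taken from the paper), suppose the data are i.i.d. $N(\mu_0,1)$ with $\mu_0=0$, while the postulated models are $N(\mu,1)$ with $\mu$ restricted to the misspecified parameter space $\Theta=[1,2]$. Then the KL-divergence rate reduces to the per-observation KL-divergence,

$$h(\mu)=\frac{(\mu-\mu_0)^2}{2}=\frac{\mu^2}{2},\qquad h(\Theta)=\inf_{\mu\in[1,2]}\frac{\mu^2}{2}=\frac{1}{2}>0,\qquad J(\mu)=\frac{\mu^2-1}{2},$$

so the positive value of $h(\Theta)$ records the misspecification, and $J$ measures how much worse a particular $\mu$ is than the best element of $\Theta$.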

As regards the prior $\pi$, it is required to construct an appropriate sequence of sieves $\mathcal{G}_n$ such that $\mathcal{G}_n\rightarrow\Theta$ and $\pi\left(\mathcal{G}_n^c\right)\leq\alpha\exp(-\beta n)$, for some $\alpha>0$ and $\beta>2h(\Theta)$.

With the above notions, verification of (1.1) along with several other technical conditions ensures that for any $A\subseteq\Theta$ such that $\pi(A)>0$,

$$\pi(A\mid\mathbf{X}_n)\rightarrow 0\qquad(1.2)$$

almost surely, provided that $h(A)>h(\Theta)$, and

$$\frac{1}{n}\log\pi(A\mid\mathbf{X}_n)\rightarrow -J(A),\qquad(1.3)$$

where $\pi(\cdot\mid\mathbf{X}_n)$ denotes the posterior distribution of $\theta$ given $\mathbf{X}_n$. With respect to (1.2), note that $h(A)>h(\Theta)$ implies positive KL-divergence in $A$, even if $h(\Theta)=0$. In other words, $A$ is a set on which the postulated model fails to capture the true model in terms of the KL-divergence. Hence, expectedly, the posterior probability of that set converges to zero. The result (1.3) asserts that the rate at which the posterior probability of $A$ converges to zero is about $\exp\left(-nJ(A)\right)$. From the above results it is clear that the posterior concentrates on sets of the form $N_{\varepsilon}=\left\{\theta:h(\theta)\leq h(\Theta)+\varepsilon\right\}$, for any $\varepsilon>0$.

As regards the rate of posterior convergence, let $N_{\varepsilon_n}=\left\{\theta:h(\theta)\leq h(\Theta)+\varepsilon_n\right\}$, where $\varepsilon_n\rightarrow 0$ is such that $n\varepsilon_n\rightarrow\infty$. Then under an additional technical assumption it holds, almost surely, that

$$\pi\left(N_{\varepsilon_n}\mid\mathbf{X}_n\right)\rightarrow 1.\qquad(1.4)$$

Moreover, it was shown by Shalizi that the squares of the Hellinger and the total variation distances between the posterior predictive distribution and the best possible predictive distribution under the truth are asymptotically almost surely bounded above by quantities proportional to $h(\Theta)$. In other words, if $h(\Theta)=0$, then this entails very accurate approximation of the true predictive distribution by the posterior predictive distribution.
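To see the equipartition property and the resulting posterior concentration in action, the following minimal numerical sketch (a toy i.i.d. normal example of ours, not the regression models studied in this paper) checks that $\frac{1}{n}\log R_n(\theta)$ approaches $-h(\theta)$ and that the posterior mass of a set $A$ with $h(A)>h(\Theta)$ becomes negligible:

```python
import numpy as np

# Toy illustration of Shalizi's framework (not the paper's Gaussian process model).
# Data are i.i.d. N(0, 1); the postulated family is N(mu, 1) with mu restricted to a
# misspecified parameter grid on [1, 2], so h(mu) = mu^2 / 2 and h(Theta) = 1/2.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(0.0, 1.0, size=n)

mu_grid = np.linspace(1.0, 2.0, 101)                   # misspecified parameter space
prior = np.full(mu_grid.size, 1.0 / mu_grid.size)      # uniform prior on the grid

# log R_n(mu) = log f_mu(X_n) - log f_{mu_0}(X_n); for unit-variance normals this
# simplifies to sum_i [ -(x_i - mu)^2 / 2 + x_i^2 / 2 ].
log_Rn = np.array([np.sum(-(x - mu) ** 2 / 2 + x ** 2 / 2) for mu in mu_grid])

# Asymptotic equipartition: (1/n) log R_n(mu) should be close to -h(mu) = -mu^2 / 2.
print(np.max(np.abs(log_Rn / n + mu_grid ** 2 / 2)))   # small for large n

# Posterior over the grid and the mass of A = {mu : h(mu) > h(Theta) + eps}.
log_post = np.log(prior) + log_Rn
log_post -= np.max(log_post)
post = np.exp(log_post)
post /= post.sum()
eps = 0.05
A = mu_grid ** 2 / 2 > 0.5 + eps
print(post[A].sum())   # tends to 0 as n grows, at roughly exp(-n * J(A))
```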

The rest of our article is structured as follows. We treat the Gaussian process regression with normal and double exponential errors in Section 2. Specifically, our assumptions regarding the model and discussion of the assumptions are presented in Section 2.1. In Section 2.2 we present our main results of posterior convergence, along with the summary of the verification of Shalizi’s assumptions, for the Gaussian process setup. The complete details are provided in Sections S-2 and S-3 of the supplement. We deal with rate of convergence and model misspecification issue for Gaussian process regression in Sections 2.3 and 2.4, respectively.

The case of general stochastic process regression with thick tailed error distribution is taken up in Section 3. The assumptions with their discussion are provided in Section 3.1, the main posterior results are presented in Section 3.2, and Section 3.3 addresses the rate of convergence and model misspecification issue. Finally, we make concluding remarks in Section 4. The relevant details are provided in Section S-4 of the supplement.

2 The Gaussian process regression setup

As in Choi and Schervish (2007), we consider the following model:

(2.1)
(2.2)
(2.3)

In (2.2), stands for Gaussian process with mean function and positive definite covariance function , for any , where is the domain of .

As in Choi and Schervish (2007), we assume two separate distributions for the errors : independent zero-mean normal with variance , which we denote by , and independent double exponential with median and scale parameter , with density

We denote the double exponential distribution by .
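For concreteness, the following sketch simulates data from a model of the form (2.1)–(2.3); the squared-exponential covariance function and all numerical values are illustrative choices of ours and are not prescribed by the paper:

```python
import numpy as np

# Schematic data generation for the regression model: y_i = eta(x_i) + eps_i, with a
# Gaussian process prior on eta. The kernel and parameter values are illustrative only.
rng = np.random.default_rng(1)

def sq_exp_kernel(x1, x2, length_scale=0.2, amplitude=1.0):
    d = x1[:, None] - x2[None, :]
    return amplitude ** 2 * np.exp(-0.5 * (d / length_scale) ** 2)

n = 100
x = np.sort(rng.uniform(0.0, 1.0, size=n))        # covariates on a compact domain
K = sq_exp_kernel(x, x) + 1e-8 * np.eye(n)        # jitter for numerical stability
eta = rng.multivariate_normal(np.zeros(n), K)     # one draw from the GP prior

sigma = 0.3
errors_normal = rng.normal(0.0, sigma, size=n)        # zero-mean normal errors
errors_double_exp = rng.laplace(0.0, sigma, size=n)   # double exponential errors

y_normal = eta + errors_normal
y_double_exp = eta + errors_double_exp
```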

In our case, let be the infinite-dimensional parameter associated with our Gaussian process model and let be the true (infinite-dimensional) parameter. Let denote the infinite-dimensional parameter space.

2.1 Assumptions and their discussions

Regarding the model and the prior, we make the following assumptions:

  • is a compact, -dimensional space, for some finite , equipped with a suitable metric.

  • The functions are continuous on and for such functions the limit

    (2.4)

    exists for each , and is continuous on , for . In the above, is the

    -dimensional vector where the

    -th element is 1 and all the other elements are zero. We denote the above class of functions by .

  • We assume the following for the covariates , accordingly as they are considered an observed random sample, or non-random.

    1. is an observed sample associated with an i.i.d. sequence having some probability measure , supported on , which is independent of .

    2. is an observed non-random sample. In this case, we consider a specific partition of the -dimensional space into subsets such that each subset of the partition contains at least one and has Lebesgue measure , for some .

  • Regarding the prior for , we assume that for large enough ,

    for and .

  • The true regression function satisfies . We do not assume that . For random covariate , we assume that is measurable.

2.1.1 Discussion of the assumptions

The compactness assumption on in Assumption (A1) guarantees that continuous functions on have finite sup-norms. Here, by sup-norm of any function on , we mean . Hence, our Gaussian process prior on , which gives probability one to continuously differentiable functions, also ensures that , almost surely. Compact support of the functions is commonplace in the Gaussian process literature; see, for example, Cramer and Leadbetter (1967), Adler (1981), Adler and Taylor (2007), Choi and Schervish (2007). The metric on is necessary for partitioning in the case of non-random covariates.

Condition (A2) is required for constructing appropriate sieves for proving our posterior convergence results. In particular, this is required to ensure that is Lipschitz continuous in the sieves. Since a continuously differentiable function with bounded partial derivatives is Lipschitz, this serves our purpose, as continuity of the partial derivatives of guarantees boundedness on the compact domain . Conditions guaranteeing the above continuity and smoothness properties required by (A2) must also be reflected in the underlying Gaussian process prior for . The relevant conditions can be found in Cramer and Leadbetter (1967), Adler (1981) and Adler and Taylor (2007), which we assume in our case. In particular, these require adequate smoothness assumptions on the mean function and the covariance function of the Gaussian process prior. It follows that ; , are also Gaussian processes. It clearly holds that and its partial derivatives also have finite sup-norms.

As regards (A3) (i), thanks to the strong law of large numbers (SLLN), given any

in the complement of some null set with respect to the prior, and given any sequence this assumption ensures that for any , as ,

(2.5)

where is some probability measure supported on .

Condition (A3) (ii) ensures that is a particular Riemann sum and hence (2.5) holds with being the Lebesgue measure on . We continue to denote the limit in this case by .

In the light of (2.5), condition (A3) will play an important role in establishing the equipartition property, for both Gaussian and double exponential errors. Another important role of this condition is to ensure consistency of the posterior predictive distribution, in spite of some misspecifications.
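The following small check (our own; the test function $g$ and the designs are arbitrary illustrative choices) makes the two cases of (A3) concrete: empirical averages over i.i.d. covariates converge by the SLLN, while averages over a regular grid converge as Riemann sums, both to the corresponding integral in (2.5):

```python
import numpy as np

# Illustration of (2.5): for a continuous g on [0, 1], the empirical average of g(x_i)
# approaches the integral of g with respect to Q, both for i.i.d. covariates
# (Q = uniform, by the SLLN) and for a regular grid (a Riemann sum, Q = Lebesgue).
rng = np.random.default_rng(2)
g = lambda t: np.sin(2 * np.pi * t) ** 2           # arbitrary continuous test function
true_integral = 0.5                                # integral of g over [0, 1]

for n in (100, 10_000, 1_000_000):
    x_random = rng.uniform(0.0, 1.0, size=n)       # (A3)(i): random covariates
    x_grid = (np.arange(n) + 0.5) / n              # (A3)(ii): non-random, equispaced
    print(n,
          abs(g(x_random).mean() - true_integral),   # SLLN error, roughly n^{-1/2}
          abs(g(x_grid).mean() - true_integral))     # Riemann-sum error, much smaller
```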

Condition (A4) ensures that the prior probabilities of the complements of the sieves are exponentially small. Such a requirement is common to most Bayesian asymptotic theories.

The essence of (A5) is to allow misspecification of the prior for in a way that the true regression function is not even supported by the prior, even though it has finite sup-norm. In contrast, Choi and Schervish (2007) assumed that has continuous first-order partial derivatives. The assumption of measurability of is a very mild technical condition.

Let denote the infinite-dimensional parameter space for our Gaussian process model.

2.2 Posterior convergence of Gaussian process regression under normal and double exponential errors

In this section we provide a summary of our results leading to posterior convergence of Gaussian process regression when the errors are assumed to be either normal or double exponential. The details are provided in the supplement. The key results associated with the asymptotic equipartition property are provided in Lemma 2.1 and Theorem 1, the proofs of which are provided in the supplement in the context of detailed verification of Shalizi’s assumptions.

Lemma 2.1.

Under the Gaussian process model and conditions (A1) and (A3), the KL-divergence rate exists for , and is given by

(2.6)

for the normal errors, and

(2.7)

for the double exponential errors.
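For orientation, in the normal-error case a direct calculation (our own sketch, which need not match the exact parameterization used in (2.6)) suggests a KL-divergence rate of the form

$$h(\eta,\sigma)=\log\frac{\sigma}{\sigma_0}+\frac{\sigma_0^2}{2\sigma^2}-\frac{1}{2}+\frac{1}{2\sigma^2}\int\bigl(\eta(x)-\eta_0(x)\bigr)^2\,dQ(x),$$

where $(\eta_0,\sigma_0)$ denotes the truth and $Q$ the limiting distribution of the covariates appearing in (2.5). In particular, such an $h$ vanishes only when $\sigma=\sigma_0$ and $\eta=\eta_0$ $Q$-almost everywhere, and it blows up as $\sigma\rightarrow 0$, $\sigma\rightarrow\infty$, or the discrepancy between $\eta$ and $\eta_0$ grows, which is the kind of coercivity used later in the verification of (S6). The double-exponential case admits an analogous, though less explicit, expression.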

Theorem 1.

Under the Gaussian process model with normal and double exponential errors and conditions (A1) and (A3), the asymptotic equipartition property holds, and is given by

The convergence is uniform on any compact subset of .

Lemma 2.1 and Theorem 1 ensure that conditions (S1) – (S3) of Shalizi hold, and (S4) holds since is almost surely finite. We construct the sieves as

(2.8)

It follows that as and the properties of the Gaussian processes , , together with (A4) ensure that , for some . This result, continuity of , compactness of and the uniform convergence result of Theorem 1, together ensure (S5).
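One construction in the spirit of (2.8) (a sketch of ours; the paper's exact sieves may differ in the choice of bounds) is

$$\mathcal{G}_n=\left\{(\eta,\sigma):\ \|\eta\|_{\infty}\leq\exp(\beta n),\ \left\|\frac{\partial\eta}{\partial x_j}\right\|_{\infty}\leq\exp(\beta n),\ j=1,\ldots,d,\ \exp(-\beta n)\leq\sigma\leq\exp(\beta n)\right\},$$

where $d$ denotes the dimension of the covariate space and $\|\cdot\|_{\infty}$ the sup-norm. On such a sieve the regression function is bounded and Lipschitz (through its bounded partial derivatives) and the scale parameter is kept away from the problematic boundary values, while Gaussian process tail bounds of the type assumed in (A4) make $\pi\left(\mathcal{G}_n^c\right)$ exponentially small.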

Now observe that the aim of assumption (S6) is to ensure that (see the proof of Lemma 7 of Shalizi (2009)) for every and for all sufficiently large,

Since as , it is enough to verify that for every and for all sufficiently large,

(2.9)

In this regard, first observe that

(2.10)

where the last inequality holds since . Now, letting , where is as large as desired,

(2.11)

In Sections S-2.5.3 and S-3.5 we have proved continuity of for Gaussian and double exponential errors, respectively. Now observe that , so that implies (since ). Hence, for each , there exists a subset of depending upon such that and as . It then follows that and as . Hence observe that if and , or if tends to zero or some non-negative constant and . In both the cases , for both Gaussian and double exponential errors. In other words, is a continuous coercive function. Hence, is a compact set (see, for example, Lange (2010)). Now recall from the above arguments that if , then . Since for , it follows that must be bounded away from zero in . With these arguments, it is then easily seen that

(2.12)

We now show that

(2.13)

First note that if infinitely often, then for some infinitely often. But if and only if Hence, if we can show that

(2.14)

then (2.13) will be proved. We use the Borel-Cantelli lemma to prove (2.14). In other words, we prove in the supplement, in the context of verifying condition (S6) of Shalizi, that

Theorem 2.

For both normal and double exponential errors, under (A1)–(A5), it holds that

(2.15)

Since is continuous, (S7) holds trivially. In other words, all the assumptions (S1)–(S7) are satisfied for Gaussian process regression, for both normal and double exponential errors. Formally, our results lead to the following theorem.

Theorem 3.

Assume the Gaussian process regression model where the errors are either normally or double-exponentially distributed. Then under the conditions (A1) – (A5), (1.2) holds. Also, for any measurable set with , if , where is given by (2.6) for normal errors and (2.7) for double-exponential errors, or if for some , where is given by (2.8), then (1.2) and (1.3) hold.

2.3 Rate of convergence

Shalizi considered the set , where and , as , and proved the following result.

Theorem 4.

Under (S1)–(S7), if for each ,

(2.16)

eventually almost surely, then (1.4) holds almost surely.

To investigate the rate of convergence in our cases, we need to show that for any and all sufficiently large,

(2.17)

For such that as , it holds that . Since as well, , since is continuous in . Combining these arguments with (2.17) makes it clear that if we can show

(2.18)

for any and all sufficiently large, where such that as , then that is the rate of convergence. Now, the same steps as (2.10) lead to

(2.19)

Since is continuous and coercive for both Gaussian and double exponential errors, in the light of (2.19), (2.11), (2.12), (2.13) and (2.14) we only need to verify (2.15) to establish (2.18). As we have already verified (2.15) for both Gaussian and double exponential errors, (2.18) stands verified.

In other words, (2.16), and hence (1.4), hold for both the Gaussian process models with Gaussian and double exponential errors, so that their convergence rate is given by . That is, the posterior rate of convergence with respect to the KL-divergence is just slower than $n^{-1}$ (just slower than $n^{-1/2}$ with respect to the Hellinger distance), for both kinds of errors that we consider. Our result can be formally stated as the following theorem.

Theorem 5.

For Gaussian process regression with either normal or double exponential errors, under (A1)–(A5), (1.4) holds almost surely, for such that .
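For instance (an illustrative choice of ours, consistent with the requirements on the sequence in Theorem 5),

$$\varepsilon_n=\frac{\log n}{n}\qquad\text{satisfies}\qquad\varepsilon_n\rightarrow 0,\qquad n\varepsilon_n=\log n\rightarrow\infty,$$

which is the precise sense in which the rate with respect to the KL-divergence is "just slower than $n^{-1}$".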

2.4 Consequences of model misspecification

Suppose that the true function has a countable number of discontinuities but continuous first-order partial derivatives at all other points. Then , that is, is not in the parameter space. However, there exists some such that for all where is continuous. Then, if the probability measure of (A3) is dominated by the Lebesgue measure, it follows from (2.6) and (2.7) that for both the Gaussian and double exponential error models. In this case, the posterior of concentrates around , which is the same as except at the countable number of discontinuities of . If is such that , then the posterior concentrates around the minimizers of , provided such minimizers exist in .
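To make this concrete, here is a toy instance of ours on $[0,1]$ with $Q$ dominated by the Lebesgue measure: take

$$\eta_0(x)=\begin{cases}\sin(2\pi x), & x\neq\tfrac{1}{2},\\ 7, & x=\tfrac{1}{2},\end{cases}\qquad\eta^{\ast}(x)=\sin(2\pi x).$$

Then $\eta_0$ is discontinuous at $x=\tfrac{1}{2}$ and hence lies outside the parameter space, yet $\eta^{\ast}$ belongs to it, agrees with $\eta_0$ wherever $\eta_0$ is continuous, and satisfies $\int\left(\eta^{\ast}(x)-\eta_0(x)\right)^2 dQ(x)=0$, so this form of misspecification is invisible to the KL-divergence rate.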

Now, following Shalizi, let us define the one-step-ahead predictive distribution of by , with the convention that gives the marginal distribution of the first observation. Similarly, let , which is the best prediction one could make had been known. The posterior predictive distribution is given by . With the above definitions, Shalizi (2009) proved the following results:

Theorem 6.

Under assumptions (S1)–(S7), with probability 1,

(2.20)
(2.21)

where and are Hellinger and total variation metrics, respectively.

Since, for both our Gaussian process models with normal and double exponential errors, $h(\Theta)=0$ if the true regression function has only a countable number of discontinuities, it follows from (2.20) and (2.21) that, in spite of such misspecification, the posterior predictive distribution does a good job of learning the best possible predictive distribution in terms of the popular Hellinger and total variation distances. We state our result formally as the following theorem.

Theorem 7.

In the Gaussian process regression problem with either normal or double exponential errors, assume that the true function has a countable number of discontinuities but continuous first-order partial derivatives at all other points. Also assume that the probability measure of (A3) is dominated by the Lebesgue measure. Then under (A1) – (A5),

almost surely.
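The following minimal sketch (a conjugate toy example of ours, not the Gaussian process model of this paper) illustrates the content of Theorems 6 and 7: as $n$ grows, the posterior predictive distribution approaches the best possible predictive distribution in Hellinger distance.

```python
import numpy as np

# Toy check of the predictive approximation phenomenon (our own conjugate example).
# Data are i.i.d. N(0, 1); the model is N(mu, 1) with a N(0, 1) prior on mu, so the
# posterior predictive N(m_n, 1 + v_n) should approach the best predictive N(0, 1).
def hellinger_normal(m1, s1, m2, s2):
    # Closed-form Hellinger distance between N(m1, s1^2) and N(m2, s2^2).
    bc = np.sqrt(2 * s1 * s2 / (s1 ** 2 + s2 ** 2)) * \
         np.exp(-(m1 - m2) ** 2 / (4 * (s1 ** 2 + s2 ** 2)))
    return np.sqrt(1 - bc)

rng = np.random.default_rng(3)
for n in (10, 100, 1000, 10000):
    x = rng.normal(0.0, 1.0, size=n)
    v_n = 1.0 / (n + 1.0)                 # posterior variance of mu
    m_n = v_n * x.sum()                   # posterior mean of mu
    print(n, hellinger_normal(m_n, np.sqrt(1.0 + v_n), 0.0, 1.0))   # shrinks with n
```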

3 The general nonparametric regression setup

Following Choi (2009) we consider the following model:

(3.1)
(3.2)
(3.3)
(3.4)

In (3.2), we model the random errors ; as samples from some density . In (3.3), stands for any reasonable stochastic process prior, which may or may not be Gaussian, and in (3.4), is some appropriate prior on .

3.1 Additional assumptions and their discussions

Regarding the model and the prior, we make the following assumptions in addition to (A1) – (A5) presented in Section 2.1:

  • The prior on is chosen such that for ,

    (3.5)

    where and ; , are positive constants.

  • is symmetric about zero; that is, for any , . Further, is -Lipschitz, that is, there exists a such that , for any .

  • For , let . Then given ,

    are independent sub-exponential random variables satisfying for any

    ,

    (3.6)

    where, for , ,

    (3.7)
  • For , , where . Also, .

    1. is jointly continuous in ;

    2. as .

3.1.1 Discussion of the new assumptions

Condition (A6) ensures that the prior probabilities of the complements of the sieves are exponentially small. Such a requirement is common to most Bayesian asymptotic theories. In particular, the first two inequalities are satisfied by Gaussian process priors even if is replaced by .

Assumption (A7) is the same as that of Choi (2009), and holds in the case of double exponential errors, for instance.

Conditions (A8), (A9) and (A10) are reasonably mild conditions, and as shown in the supplement, are satisfied by double exponential errors.
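For example (a direct check of ours, illustrating why double-exponential errors satisfy a condition such as (A7)), the double-exponential density with scale $\sigma$, $f_{\sigma}(\varepsilon)=\frac{1}{2\sigma}\exp\left(-|\varepsilon|/\sigma\right)$, is symmetric about zero, and the mean value theorem applied to $t\mapsto e^{-t/\sigma}$ together with the reverse triangle inequality gives

$$\left|f_{\sigma}(\varepsilon_1)-f_{\sigma}(\varepsilon_2)\right|=\frac{1}{2\sigma}\left|e^{-|\varepsilon_1|/\sigma}-e^{-|\varepsilon_2|/\sigma}\right|\leq\frac{1}{2\sigma^2}\bigl||\varepsilon_1|-|\varepsilon_2|\bigr|\leq\frac{1}{2\sigma^2}\left|\varepsilon_1-\varepsilon_2\right|,$$

so the density is Lipschitz with a constant depending only on $\sigma$.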

As before, let denote the infinite-dimensional parameter space for our model.

3.2 Posterior convergence

As before, we provide a summary of our results leading to posterior convergence in our general setup. The details are provided in the supplement.

Lemma 3.1.

Under our model assumptions and conditions (A1) and (A3), the KL-divergence rate exists for , and is given by

(3.8)

where .

Theorem 8.

Under our model assumptions and conditions (A1) and (A3), the asymptotic equipartition property holds, and is given by

The convergence is uniform on any compact subset of .

Lemma 3.1 and Theorem 8 ensure that conditions (S1) – (S3) of Shalizi hold, and (S4) holds since is almost surely finite. We construct the sieves as in (2.8). Hence, as before, as and the assumptions on , given by (A6), together with (A4) ensure that , for some . This result, continuity of , compactness of and the uniform convergence result of Theorem 8, together ensure (S5).

As regards (S6), let us note that from the definition of and Lipschitz continuity of , it follows that is Lipschitz continuous in . However, we still need to assume that is jointly continuous in . Due to (A10) it follows that is continuous in and as . In other words, is a continuous coercive function. Hence, is a compact set. With these observations, we then have the following result, analogous to the Gaussian process case, the proof of which is provided in the supplement.

Theorem 9.

In our setup, under (A1)–(A10), it holds that

Since is continuous, (S7) holds trivially. Thus, all the assumptions (S1)–(S7) are satisfied, showing that Theorems S-1 and S-2 hold. Formally, our results lead to the following theorem.

Theorem 10.

Assume the hierarchical model given by (3.1), (3.2), (3.3) and (3.4). Then under the conditions (A1) – (A10), (1.2) holds. Also, for any measurable set with , if , where is given by (3.8), or if for some , where is given by (2.8), then (1.3) holds.

3.3 Rate of convergence and consequences of model misspecification

For the general nonparametric model, the same result as Theorem 5 holds under (A1)–(A10). Also, the same issues regarding model misspecification as detailed in Section 2.4 continue to be relevant in this setup. In other words, Theorem 7 holds under (A1) – (A10).

4 Conclusion

The fields of both theoretical and applied Bayesian nonparametric regression are dominated by Gaussian process priors and Gaussian noise. From the asymptotics perspective, even in the Gaussian setup, a comprehensive theory unifying posterior convergence for both random and non-random covariates, along with the rate of convergence in the case of general priors for the unknown error variance, while also allowing for misspecification, seems to be very rare. Even rarer are such investigations in the setting where a general stochastic process prior on the unknown regression function is considered and the noise distribution is non-Gaussian and thick-tailed.

The approach of Shalizi allowed us to consider the asymptotic theory incorporating all the above issues, for both Gaussian and general stochastic process prior for the regression function. The approach, apart from enabling us to ensure consistency for both random and non-random covariates, allows us to compute the rate of convergence, while allowing misspecifications. Perhaps the most interesting result that we obtained is that even if the unknown regression function is misspecified, the posterior predictive distribution still captures the true predictive distribution asymptotically, for both Gaussian and general setups.

It seems that the most important condition among the assumptions of Shalizi is the asymptotic equipartition property. This directly establishes the KL property of the posterior, which characterizes posterior convergence, the rate of posterior convergence and misspecification. Interestingly, this property, which plays the key role, turned out to be relatively easy to establish in our context under reasonably mild conditions. On the other hand, in all the applications that we have investigated so far, (S6) turned out to be the most difficult to verify. But the approach we devised to handle this condition and the others seems to be generally applicable for investigating posterior asymptotics in general Bayesian parametric and nonparametric problems.

Supplementary Material

S-1 Preliminaries for ensuring posterior consistency under general set-up

Following Shalizi (2009) we consider a probability space , and a sequence of random variables , taking values in some measurable space , whose infinite-dimensional distribution is . The natural filtration of this process is .

We denote the distributions of processes adapted to by , where is associated with a measurable space , and is generally infinite-dimensional. For the sake of convenience, we assume, as in Shalizi (2009), that and all the are dominated by a common reference measure, with respective densities and . The usual assumptions that or even lies in the support of the prior on , are not required for Shalizi’s result, rendering it very general indeed.

S-1.1 Assumptions and theorems of Shalizi

  • Consider the following likelihood ratio:

    $$R_n(\theta)=\frac{f_{\theta}(\mathbf{X}_n)}{f_{\theta_0}(\mathbf{X}_n)}.\qquad\text{(S-1.1)}$$

    Assume that $R_n(\theta)$ is appropriately measurable for all $n>0$.

  • For every $\theta\in\Theta$, the KL-divergence rate

    $$h(\theta)=\lim_{n\rightarrow\infty}\frac{1}{n}E_{\theta_0}\left(\log\frac{f_{\theta_0}(\mathbf{X}_n)}{f_{\theta}(\mathbf{X}_n)}\right)\qquad\text{(S-1.2)}$$

    exists (possibly being infinite) and is measurable in $\theta$.

  • For each $\theta\in\Theta$, the generalized or relative asymptotic equipartition property holds, and so, almost surely,

    $$\lim_{n\rightarrow\infty}\frac{1}{n}\log R_n(\theta)=-h(\theta).$$

  • Let $I=\left\{\theta:h(\theta)=\infty\right\}$. The prior $\pi$ satisfies $\pi(I)=0$.

Following the notation of Shalizi (2009), for $A\subseteq\Theta$, let

$$h(A)=\operatorname*{ess\,inf}_{\theta\in A}h(\theta),\qquad\text{(S-1.3)}$$
$$J(\theta)=h(\theta)-h(\Theta),\qquad\text{(S-1.4)}$$
$$J(A)=\operatorname*{ess\,inf}_{\theta\in A}J(\theta).\qquad\text{(S-1.5)}$$
  • There exists a sequence of sets $\mathcal{G}_n\rightarrow\Theta$ as $n\rightarrow\infty$ such that:

    1. $\pi\left(\mathcal{G}_n\right)\geq 1-\alpha\exp(-\beta n)$, for some $\alpha>0$ and $\beta>2h(\Theta)$; (S-1.6)
    2. The convergence in (S3) is uniform in $\theta$ over $\mathcal{G}_n\setminus I$.

    3. $h\left(\mathcal{G}_n\right)\rightarrow h(\Theta)$, as $n\rightarrow\infty$.

For each measurable $A\subseteq\Theta$, for every $\delta>0$, there exists a random natural number $\tau(A,\delta)$ such that

$$\frac{1}{n}\log\int_{A}R_n(\theta)\,\pi(\theta)\,d\theta\leq\delta+\limsup_{n\rightarrow\infty}\frac{1}{n}\log\int_{A}R_n(\theta)\,\pi(\theta)\,d\theta\qquad\text{(S-1.7)}$$

for all $n>\tau(A,\delta)$, provided $\limsup_{n\rightarrow\infty}\frac{1}{n}\log\int_{A}R_n(\theta)\,\pi(\theta)\,d\theta<\infty$. Regarding this, the following assumption has been made by Shalizi:

  • The sets $\mathcal{G}_n$ of (S5) can be chosen such that for every $\delta>0$, the inequality $n>\tau(\mathcal{G}_n,\delta)$ holds almost surely for all sufficiently large $n$.

  • The sets $\mathcal{G}_n$ of (S5) and (S6) can be chosen such that for any set $A$ with $\pi(A)>0$,

    $$h\left(\mathcal{G}_n\cap A\right)\rightarrow h(A)\qquad\text{(S-1.8)}$$

    as $n\rightarrow\infty$.

Under the above assumptions, Shalizi (2009) proved the following results.

Theorem S-1 (Shalizi (2009)).

Consider assumptions (S1)–(S7) and any set $A\subseteq\Theta$ with $\pi(A)>0$ and $h(A)>h(\Theta)$. Then,

$$\pi(A\mid\mathbf{X}_n)\rightarrow 0\quad\text{almost surely},$$

where $\pi(\cdot\mid\mathbf{X}_n)$ denotes the posterior distribution of $\theta$ given $\mathbf{X}_n$.

The rate of convergence of the log-posterior is given by the following result.

Theorem S-2 (Shalizi (2009)).

Consider assumptions (S1)–(S7) and any set $A\subseteq\Theta$ with $\pi(A)>0$. If $\beta>2h(A)$, where $\beta$ is given in (S-1.6) under assumption (S5), or if $A\subseteq\mathcal{G}_n$ for some $n$, then

$$\lim_{n\rightarrow\infty}\frac{1}{n}\log\pi(A\mid\mathbf{X}_n)=-J(A)\quad\text{almost surely},$$

where $\pi(\cdot\mid\mathbf{X}_n)$ denotes the posterior distribution of $\theta$ given $\mathbf{X}_n$.

S-2 Verification of the assumptions of Shalizi for the Gaussian process model with normal errors

S-2.1 Verification of (S1)

Note that

(S-2.1)
(S-2.2)

The equations (S-2.1) and (S-2.2) yield, in our case,

(S-2.3)

We show that the right hand side of (S-2.3), which we denote as , is continuous in , which is sufficient to confirm measurability of . Let , where is the Euclidean norm and