Assessment of model adequacy is always fundamental in statistics – this basic realization has given rise to a huge literature on testing goodness of model fit. However, compared to the classical literature, the Bayesian literature on model adequacy test is much scarce. A comprehensive overview of the existing approaches is provided in Vehtari12. Two relatively prominent existing formal and general approaches in this direction are those of Gelman96 and Bayarri00
. The former relies on posterior predictive-value associated with a discrepancy measure that is a function of the data as well as the parameters. The latter criticize this approach on account of ‘double use of the data’ and come up with two alternative -values, demonstrating their advantages over the posterior predictive -value. Indeed, double use of the data prevents the posterior predictive
-value to have uniform distribution on, while the -values of Bayarri00 at least asymptotically has the desired uniform distribution on .
introduced a different approach to Bayesian model assessment in ‘inverse problems’, where the model is built with response variables and covariates, but unlike ‘forward problems’, the interest is to predict unobserved covariates using the rest of the data, not response variables from the covariates and the remaining data. The palaeoclimate reconstruction problem provided the necessary motivation, where ‘modern data’ consisting of multivariate counts of species and observed climate values, and fossil assemblage data on the same species, deposited in lake sediments over thousands of years, are available. The interest is to reconstruct the past climate values corresponding to the fossil assemblages using the available data. Here, the species composition is modeled as a function of climate, since variations in climate is responsible for variations in species composition, but not vice versa. The inverse nature of the problem is evident since the interest lies in prediction of the past climate variables, not the species composition. Since the past climates are the unobserved (unknown) covariates, it is natural to consider a prior distribution for such unknown quantities.
The motivating example arises in quantitative palaeoclimate reconstruction where ‘modern data’ consisting of multivariate counts of species are available along with the observed climate values. Also available are fossil assemblages of the same species, but deposited in lake sediments for past thousands of years. This is the fossil species data. However, the past climates corresponding to the fossil species data are unknown, and it is of interest to predict the past climates given the modern data and the fossil species data. Roughly, the species composition are regarded as functions of climate variables, since in general ecological terms, variations in climate drives variations in species, but not vice versa. However, since the interest lies in prediction of climate variables, the inverse nature of the problem is clear. The past climates, which must be regarded as random variables, may also be interpreted asunobserved covariates
. It is thus natural to put a prior probability distribution on the unobserved covariates.
Broadly, the model assessment method of Bhattacharya13 is based on the simple idea that the model fits the data if the posterior distribution of the random variables corresponding to the covariates capture the observed values of the covariates. Now note that the true (observed) values will not be known in reality, which is why the training data (the modern data in the palaeoclimate problem, for instance) with observed covariates has been considered by Bhattacharya13. Assuming that the covariates are unobserved, one can predict these values in terms of the posterior distribution of the random quantities standing for the (assumed) missing covariates. Bhattacharya13 demonstrate that it makes more sense to consider leave-one-out cross-validation (LOO-CV) of the covariates particularly when some of the model parameters are given improper prior. From the traditional statistical perspective, LOO-CV is also a very natural method in model assessment. We henceforth concentrate on the LOO-CV approach proposed by Bhattacharya13. Briefly, based on the LOO-CV posteriors of the covariates, some appropriate ‘inverse reference distribution’ (IRD) is constructed. This IRD can be viewed as a distribution of some appropriate statistic associated with the unobserved covariates. If the distribution captures the observed statistic associated with the observed covariates, then the model is said to fit the data. Otherwise, the model does not fit the data. Bhattacharya13
provide a Bayesian decision theoretic justification of the key idea and show that the relevant IRD based posterior probability analogue of the aforementioned-values have the uniform distribution on . Furthermore, ample simulation studies and successful applications to several real, palaeoclimate models and data sets reported in Bhattacharya13, Bhattacharya06 and Sabya13, vindicate the practicality and usefulness of the IRD approach.
The rest of our paper is structured as follows. The general premise of our inverse regression model, LOO-CV and the IRD approach are described in Section 2. General consistency issues of the same are discussed in Section 3. We propose an appropriate prior for and investigate its properties in Section 4, and in Section 5 prove consistency of the LOO-CV posteriors under reasonably mild conditions. Relating consistency of the LOO-CV posteriors, we prove consistency of the IRD approach in Section 6. In Section 7 we provide a discussion on the issues and applicability of our asymptotic theory in various inverse regression contexts and in Section 8, we illustrate our asymptotic theory with simulation studies. Finally, we make concluding remarks in Section 9.
2 Preliminaries and general setup
We consider experiment with covariate observations along with responses . In other words, the experiment considered here will allow us to have samples of responses against covariate observations , for . Both and are allowed to be multidimensional. In this article, we consider large sample scenario where both .
For and , consider the following general model setup: conditionally on and ,
independently. In (2.1), is a known distribution depending upon (a set of) parameters , where is the parameter space, which may be infinite-dimensional. For the sake of generality, we shall consider , where is a function of the covariates, which we more explicitly denote as , where , being the space of covariates. The part of
will be assumed to consist of other parameters, such as the unknown error variance.
2.1 Examples of the above model setup
, where , where is some appropriate link function and is some function with known or unknown form. For known, suitably parameterized form, the model is parametric. If the form of is unknown, one may model it by a Gaussian process, assuming adequate smoothness of the function.
, where , where is some appropriate link function and is some function with known (parametric) or unknown (nonparametric) form. Again, in case of unknown form of , the Gaussian process can be used as a suitable model under sufficient smoothness assumptions.
, where is a parametric or nonparametric function and are Gaussian errors. In particular,
may be a linear regression function, that is,, where
is a vector of unknown parameters. Non-linear forms ofare also permitted. Also, may be a reasonably smooth function of unknown form, modeled by some appropriate Gaussian process.
2.2 The Bayesian inverse LOO-CV setup and the IRD approach
In the Bayesian inverse LOO-CV setup, for , we successively leave out from the data set, and attempt to predict the same using the rest of the dataset, in the form of the posterior , where , and , and is the random quantity corresponding to the left out .
In this article, we are interested in proving that almost surely as , where is any neighborhood of . Here, for any set , denotes the complement of .
Note that the -th LOO-CV posterior is given by
In the IRD approach, we consider the distribution of any suitable statistic , where the distribution of is induced by the respective LOO-CV posteriors of the form (2.2). The distribution of is referred to as the IRD in Bhattacharya13. Now consider the observed statistic . In a nutshell, if falls within the desired () of the IRD, then the model is said to fit the data; otherwise, the model does not fit the data. Typical examples of , which turned out to be useful in the palaeoclimate modeling context are (see Sabya13) are:
To obtain corresponding to above, we only need to replace with in (2.3) – (2.5). In the above, and denote the expectation and the variance, respectively, with respect to the LOO-CV posteriors. The statistic is itself, so that the posterior of is nothing but the -th LOO-CV posterior. Such a statistic can be important when there is particular interest in , for instance, if one suspects outlyingness of . An example of such an issue is considered in Bhatta07.
3 Discussion regarding consistency of the LOO-CV and the IRD approach
The question now arises if the IRD approach is at all consistent. That is, whether by increasing and , the distribution of will increasingly concentrate around . A sufficient condition for this to hold is consistency of the -th LOO-CV posterior at , for . From (2.2) it is clear that consistency of at the truth is required for this purpose, but even if in is replaced with , consistency of (2.2) at does not hold for arbitrary priors on , and for fixed . This has been demonstrated in Chatterjee17 with the help of a simple Poisson regression with mean , where both and are positive quantities. Special priors on is needed, along with the setup with , to achieve desired consistency of the LOO-CV posterior of at . In Section 4 we propose such an appropriate prior form and establish some requisite properties of the prior and . With such prior and with conditions that ensure consistency of at , we establish consistency of the LOO-CV posteriors in Section 5.
Indeed, in the setups that we consider, for any , is consistent at the true value . That is, for any neighbourhood of , for given , almost surely, as . Assuming complete separable metric space , this is again equivalent to weak convergence of to , as , for , for almost all data sequences (see, for example, Ghosh03, Ghosal17).
In our situations, we assume that the conditions of Shalizi09 hold for , which would ensure consistency of is consistent at the true value . The advantages of Shalizi’s results include great generality of the model and prior including dependent setups, and reasonably easy to verify conditions. The results crucially hinge on verification of the asymptotic equipartition property. In Section 3.1 we provide an overview of the main assumptions and result of Shalizi. The full details of the seven assumptions ()–() of Shalizi are provided in the Appendix. In Section 3.2 we show that Shalizi’s result leads to weak convergence of the posterior of to the point mass at , which will play an useful role in our proof of consistency of the LOO-CV posteriors.
3.1 A briefing of Shalizi’s approach
Let , and let and denote the observed and the true likelihoods respectively, under the given value of the parameter and the true parameter . We assume that , where is the (often infinite-dimensional) parameter space. However, it is not required to assume that , thus allowing misspecification. This is the general situation; however, as already mentioned, we do not consider misspecification for our purpose. The key ingredient associated with Shalizi’s approach to proving convergence of the posterior distribution of is to show that the asymptotic equipartition property holds. To elucidate, let us consider the following likelihood ratio:
Then, to say that for each , the generalized or relative asymptotic equipartition property holds, we mean
almost surely, where is the KL-divergence rate given by
provided that it exists (possibly being infinite), where denotes expectation with respect to the true model. Let
Thus, can be roughly interpreted as the minimum KL-divergence between the postulated and the true model over the set . If , this indicates model misspecification. For , , so that .
As regards the prior, it is required to construct an appropriate sequence of sieves such that and , for some .
With the above notions, verification of (3.1) along with several other technical conditions ensure that for any such that ,
almost surely, provided that .
3.2 Weak convergence of Shalizi’s result
From (3.3) it follows that for any ,
where . In our case, we shall not consider misspecification, as we are interested in ensuring posterior consistency. Thus, we have in our context. Now observe that given by (3.2) is not a proper KL-divergence between two distributions. Thus the question arises if (3.4) suffices for posterior consistency, and hence weak convergence of the posterior to . Lemma 3.1 below settles this question in the affirmative.
Given any neighborhood of , the set is contained in for sufficiently small .
It is sufficient to prove that if and only if . Note that is a proper KL-divergence and hence is non-decreasing with (see Erven14). Hence if , then there exists such that for all . Hence, given by (3.2) is larger than if . Of course, if , we must have , since otherwise, for all , which would imply . This proves the lemma. ∎
It follows from Lemma 3.1 that for any neighborhood of , , almost surely, as . Thus, , almost surely, as , where denotes weak convergence.
4 Prior for
We consider the following prior for : given ,
, where and for all . Here, under the prior , has uniform distribution on the set .
, where , with . Here is a known, one-to-one, continuously differentiable function and is an unknown function modeled by Gaussian process. Here, the prior for is the uniform distribution on
, where , with . Here
is a known, increasing, continuously differentiable, cumulative distribution function andis an unknown function modeled by some appropriate Gaussian process. Here, the prior for is the uniform distribution on .
, where is an unknown function modeled by some appropriate Gaussian process, and are zero-mean Gaussian noise with variance . Here, the prior for is the uniform distribution on .
4.2 Some properties of the prior
Our proposed prior for possesses several useful properties necessary for our asymptotic theory. These are formally provided in the lemmas below.
The prior density or any probability associated with is continuous with respect to .
Let be a sequence of functions such that , as , where denotes the sup norm. It then follows that for any set ,
Hence, as ,
where, for any set , denotes the Lebesgue measure of . This proves the lemma. ∎
If the density of given and , which we denote by , is continuous in and is bounded then it would follow from Lemma 4.1 and the dominated convergence theorem that and its associated probabilities are also continuous in . Below we formally present the result as Lemma 4.2.
If is continuous in and is bounded, then the density or any probability associated with is continuous with respect to .
However, we usually can not assume a compact parameter space. For example, such compactness assumption is invalid for Gaussian process priors for . But in most situations, continuity of the density of and its associated probabilities with respect to hold even without the compactness assumption, provided is continuous in . We thus make the following realistic assumption:
is continuous in .
The following result holds due to Assumption 1 and Scheffe’s theorem (see, for example, Schervish95).
If Assumption 1 holds, then any probability associated with is continuous in .
5 Consistency of the LOO-CV posteriors
For consistency of the LOO-CV posteriors given by (2.2), we first need to ensure weak convergence of almost surely to , as , for . This holds if and only if is consistent at . This can be seen by noting that the -th factor of , obtained by integrating out , does not play any role in by (3.1) and (3.2), so that these limits remain the same as in the case of . The other conditions of Shalizi also remain the same for both the posteriors and .
Hence, assuming that conditions (S1)–(S7) of Shalizi are verified for , for fixed , it follows that , almost surely, as .
For any neighborhood of , note that the probability is continuous in due to Lemma 4.3. Moreover, since it is a probability, it is bounded. Hence, by the Portmanteau theorem, using (2.2) and consistency of it holds almost surely that
We formalize this result as the following theorem.
Let us now make the following extra assumptions:
is continuous in .
is a one-to-one function.
With these assumptions, we have the following result.
for some positive constant . Now note that , and Assumption 3 ensures that almost surely, as , for all . It follows that there exists such that , for . Hence, , as . This implies, in conjunction with (5.4) and (5.3), that (5.2) holds.
6 Consistency of the IRD approach
Due to practical usefulness, we consider consistency of IRD associated with (2.3) – (2.5). Among these, the IRD associated with is just the -th LOO-CV posterior, which is consistent by Theorem 3. For and , we consider slight modification by dividing the right hand sides of (2.3) and (2.4) by , and adding some small quantity to . These adjustments are not significant for practical applications, but seems to be necessary for our asymptotic theory. With these, we provide the consistency result and its for the IRD corresponding to ; that corresponding to would follow in the same way.
The assumptions of this theorem ensures consistency of the LOO-CV posteriors due to Theorem 3. This again is equivalent to almost sure weak convergence of the -th cross-validation posterior to , for . This is again equivalent to convergence in (cross-validation posterior) distribution of , to the degenerate quantity , almost surely. Due to degeneracy, this is again equivalent to convergence in probability, almost surely.
For notational clarity we denote by , whose LOO-CV posterior is . Let also , so that we now denote by . It follows from the above arguments that for ,
Now consider , which is an average of terms, the -th term being
Now note that if were non-random, then , as , , would imply as , . Hence, by Theorem 7.15 of Schervish95 (page 398), it follows that
In other words, (6.1)) holds. ∎
7 Discussion of the applicability of our asymptotic results in the inverse regression contexts
From the development of the asymptotic results it is clear that there are two separate aspects that ensures consistency of the LOO-CV posteriors. The first is consistency of the posterior of the parameter(s) , and then consistency of . Once consistency of the posterior of is ensured, our prior for then guarantees consistency of the posterior of at . For verify consistency of the posterior of , we referred to the general conditions of Shalizi because of their wide applicability, including dependent setups, and relatively easy verifiability of the conditions. Indeed, the seven conditions of Shalizi have been verified in the contexts of general stochastic process (including Gaussian process) regression (Chatterjee1) with both Gaussian and double exponential errors, binary and Poisson regression involving general stochastic process (including Gaussian process) and known link functions (Chatterjee2) Moreover, for finite-dimensional parametric problems, the conditions are much simpler to verify. Thus, the examples provided in Section 4.1 are relevant in this context, and the LOO-CV posteriors, and hence the IRD, are consistent. Furthermore, Chandra1 and Chandra2 establish the conditions of Shalizi in an autoregressive regression context, even for the so-called “large , small ” paradigm. In such cases, our asymptotic results for the LOO-CV posteriors and the IRD, will hold.
There is one minor point to touch upon regarding our requirement for ensuring consistency. In all the aforementioned works regarding verification of Shalizi’s conditions, was considered. For our asymptotic theory, we first require consistency of as , for fixed , and then take the limit as . This is of course satisfied if consistency holds for , as for more information about brought in for larger values of , consistency automatically continues to hold. Indeed, for fixed , the limit as does not depend upon , as the posterior of converges weakly to the point mass at , almost surely. Thus, it is always sufficient to verify consistency of the posterior of for .
8 Simulation studies
8.1 Poisson parametric regression
Let us first consider the case where , as briefed in Section 4.1 (i). Here we investigate consistency of the posterior of . We generate the data by simulating , , , and then by generating , for and . We set ; , for the prior for .
Since numerical integration turned out to be unstable, we resort to Gibbs sampling from the posterior, noting that the full conditional distributions of and are of the forms
It follows that
has the gamma distribution with shape parameterand rate parameter , truncated on . Similarly, has the gamma distribution with shape parameter and rate parameter , truncated on .
For our investigation, we set . That is, without loss of generality, we address consistency of the posterior of via simulation study. As for the choice of , we set . This choice ensured that the full conditional distributions have reasonably large support, for given values of and . We run our Gibbs sampler for iterations, and discard the first iterations as burn-in.
Figure 8.1 displays the posterior densities of for different values of and ; here, for convenience of presentation, we have set . The vertical line denotes the true value . The diagram vividly depicts that the LOO-CV posterior of concentrates more and more around as and increase.
8.2 Poisson nonparametric regression
We now consider the case where , where , as briefed in Section 4.1 (ii). In particular, we let and be a Gaussian process with mean function and covariance , where is unknown. We assume that the true data-generating distribution is , with . We generate the data by simulating , and ; , and then finally simulating ; , .
For our convenience, we reparameterize as , where . For the prior on the parameters, we set , for . Now note that the prior for , which is uniform on , does not have a closed form, since the form of is unknown. However, if is large, the interval is small, and falling in this small interval can be reasonably well-approximated by a straight line. Hence, we set , for falling in this interval. In our case, it follows that , where