1 Introduction
The Bayesian statistical literature on model selection is rich in its collection of innovative methodologies. Among them, the most principled method of comparing different competing models seems to be offered by Bayes factors, through the ratio of the posterior and prior odds associated with the models under comparison, which reduces to the ratio of the marginal densities of the data under the two models. To illustrate, let us consider the problem of comparing any two models $M_1$ and $M_2$ given data $X_n=(X_1,\ldots,X_n)$, where $n$ is the sample size. Let $\Theta_1$ and $\Theta_2$ be the parameter spaces associated with $M_1$ and $M_2$, respectively. For $i=1,2$, let the likelihoods, priors and the marginal densities for the two models be $f_i(X_n\mid\theta_i)$, $\pi_i(\theta_i)$ and $m_i(X_n)=\int_{\Theta_i}f_i(X_n\mid\theta_i)\pi_i(\theta_i)\,d\theta_i$, respectively. Then the Bayes factor (BF) of model $M_1$ against $M_2$ is given by
\[ BF_{12}=\frac{m_1(X_n)}{m_2(X_n)}. \tag{1.1} \]
The above formula follows directly from the coherent procedure of Bayesian hypothesis testing of one model versus the other. In view of (1.1), $BF_{12}$ admits the interpretation as the quantification of the evidence of $M_1$ against $M_2$, given data $X_n$. A comprehensive account of BFs and their various advantages is provided in Kass95. BFs have interesting asymptotic convergence properties. Indeed, recently Chatterjee18 established the almost sure convergence theory of BF in a general setup that includes even dependent data and misspecified models. Their result depends explicitly on the average Kullback-Leibler (KL) divergence between the competing and the true models.
BFs are known to have several limitations. First, if the prior for the model parameter is improper, then the marginal density is also improper and hence does not admit any sensible interpretation. Second, BFs suffer from the Jeffreys-Lindley-Bartlett paradox (see Jeffreys39, Lindley57, Bartlett57, Robert93, Villa15 for details and general discussions of the paradox). Furthermore, a drawback of BFs in practical applications is that the marginal density of the data is usually quite challenging to compute accurately, even with sophisticated simulation techniques based on importance sampling, bridge sampling and path sampling (see, for example, Meng96, Gelman98; see also Gronau17 for a relatively recent tutorial and many relevant references), particularly when the posterior is far from normal and when the dimension of the parameter space is large. Moreover, the marginal density $m_i(X_n)$ is usually extremely close to zero if $n$ is even moderately large. This causes numerical instability in the computation of the BF.
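The near-zero marginal density is easy to see in a toy computation. The following sketch (an illustrative assumption with standard normal data, not from the paper) shows the naive product of $n$ density values collapsing to exactly zero in double precision, while the equivalent log-scale sum stays finite.

```python
import math
import random

random.seed(7)

def normpdf(x, m=0.0, s=1.0):
    # Density of N(m, s^2) at x.
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

# Toy data (illustrative assumption): n standard-normal observations.
n = 1000
data = [random.gauss(0.0, 1.0) for _ in range(n)]

# Naive product of the n density values: underflows to exactly 0.0
# once the running product drops below the smallest representable double.
prod = 1.0
for x in data:
    prod *= normpdf(x)

# The same quantity on the log scale stays finite and usable.
logsum = sum(math.log(normpdf(x)) for x in data)

print(prod, logsum)
```

Here the log of the product is roughly $-1.4\,n$, so for $n=1000$ the raw product is far below the smallest representable double; any ratio of such raw marginals is numerically meaningless, whereas log-scale quantities remain stable.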
The problems of BFs regarding improper priors, the Jeffreys-Lindley-Bartlett paradox, and the general computational difficulties associated with the marginal density can be simultaneously alleviated if the marginal density $m_i(X_n)$ for model $M_i$ is replaced with the product of leave-one-out cross-validation posteriors $\prod_{k=1}^{n}\pi_i(X_k\mid X_{n,-k})$, where $X_{n,-k}=X_n\setminus\{X_k\}$, and
\[ \pi_i(X_k\mid X_{n,-k})=\int_{\Theta_i}f_i(X_k\mid\theta_i,X_{n,-k})\,\pi_i(\theta_i\mid X_{n,-k})\,d\theta_i \tag{1.2} \]
is the $k$-th leave-one-out cross-validation posterior density evaluated at $X_k$. In the above equation (1.2), $f_i(X_k\mid\theta_i,X_{n,-k})$ is the density of $X_k$ given the model parameters $\theta_i$ and $X_{n,-k}$; $\pi_i(\theta_i\mid X_{n,-k})$ is the posterior density of $\theta_i$ given $X_{n,-k}$. Viewing $\prod_{k=1}^{n}\pi_i(X_k\mid X_{n,-k})$ as the surrogate for $m_i(X_n)$, it seems reasonable to replace $BF_{12}$ with the corresponding pseudo-Bayes factor (PBF) given by
\[ PBF_{12}=\prod_{k=1}^{n}\frac{\pi_1(X_k\mid X_{n,-k})}{\pi_2(X_k\mid X_{n,-k})}. \tag{1.3} \]
In the case of independent observations, the above formula and the terminology “pseudo-Bayes factor” appear to have been first proposed by Geisser79. Their motivation for PBF did not seem to arise from providing solutions to the problems of BFs, however, but rather from the urge to exploit the concept of cross-validation in Bayesian model selection, which had proved indispensable for constructing model selection criteria in the classical statistical paradigm. Below we argue how this cross-validation idea helps solve the aforementioned problems of BFs.
First note that the posterior $\pi_i(\theta_i\mid X_{n,-k})$ is usually proper, even for an improper prior on $\theta_i$, if $n$ is sufficiently large. Thus, $\pi_i(X_k\mid X_{n,-k})$ given by (1.2) is usually well-defined even for improper priors, unlike $m_i(X_n)$. So, even though the BF is ill-defined for improper priors, the PBF is usually still well-defined.
Second, a clear theoretical advantage of PBF over BF is that PBF is immune to the Jeffreys-Lindley-Bartlett paradox (see Gelfand94, for example), while BF is certainly not.
Finally, PBF enjoys significant computational advantages over BF. Note that straightforward Monte Carlo averages of $f_i(X_k\mid\theta_i,X_{n,-k})$ over realizations of $\theta_i$ obtained from $\pi_i(\theta_i\mid X_{n,-k})$ by simulation techniques are sufficient to ensure good estimates of the cross-validation posterior density $\pi_i(X_k\mid X_{n,-k})$. Since $\pi_i(X_k\mid X_{n,-k})$ is the density of the single observation $X_k$, its estimate is also numerically stable compared to estimates of $m_i(X_n)$. Hence, the sum of logarithms of the estimates of $\pi_i(X_k\mid X_{n,-k})$, for $k=1,\ldots,n$, results in quite accurate and stable estimates of $\log PBF_{12}$. In other words, PBF is far simpler to compute accurately than BF and is numerically far more stable and reliable.

In spite of the advantages of PBF over BF, it seems to have been largely ignored in the statistical literature, both theoretically and application-wise. Some asymptotic theory of PBF has been attempted by Gelfand94 using independent observations, Laplace approximations and some essentially ad hoc simplifying approximations and arguments. Application of PBF has been considered in Bhattacharya08 for demonstrating the superiority of his new Bayesian nonparametric Dirichlet process model over the traditional Dirichlet process mixture model. But apart from these works we are not aware of any other significant research involving PBF.
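The computational recipe just described can be sketched in a few lines. In the following toy example (all models, priors and numbers are illustrative assumptions, not from the paper), each leave-one-out cross-validation density of the form (1.2) is estimated by a Monte Carlo average of the likelihood over exact conjugate posterior draws, and the log-PBF is accumulated; the comparison pits a correctly specified Gaussian location model against a fixed, misspecified one.

```python
import math
import random

random.seed(42)

def normpdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

# Illustrative data: n observations from N(1, 1).
n = 50
data = [random.gauss(1.0, 1.0) for _ in range(n)]

# M1: X ~ N(mu, 1) with conjugate prior mu ~ N(0, tau2)  (correctly specified).
# M2: X ~ N(0, 1) with no free parameters                (misspecified).
tau2 = 10.0
B = 2000  # posterior draws per leave-one-out fit

log_pbf = 0.0
for k in range(n):
    rest = data[:k] + data[k + 1:]
    m = len(rest)
    # Exact conjugate posterior of mu given the leave-one-out data.
    post_var = 1.0 / (m + 1.0 / tau2)
    post_mean = post_var * sum(rest)
    # Monte Carlo estimate of the CV density pi_1(X_k | X_{n,-k}), as in (1.2):
    # average the likelihood over posterior draws of mu.
    draws = [random.gauss(post_mean, math.sqrt(post_var)) for _ in range(B)]
    cpo1 = sum(normpdf(data[k], mu, 1.0) for mu in draws) / B
    cpo2 = normpdf(data[k], 0.0, 1.0)  # M2 has no parameters to integrate out
    log_pbf += math.log(cpo1) - math.log(cpo2)

print(log_pbf)  # positive: PBF favours the correctly specified M1
```

Note that every quantity here is a density of a single observation, so no underflow occurs, and the final answer is assembled purely on the log scale.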
In this article, we establish the asymptotic theory for PBF in a general setup consisting of dependent observations, model misspecification, as well as covariates; the inclusion of covariates also validates our asymptotic theory in the variable selection framework. Judiciously exploiting the posterior convergence treatise of Shalizi09, we prove almost sure exponential convergence of PBF in favour of the true model, the convergence explicitly depending upon the KL-divergence rate from the true model. For any two models different from the true model, we prove almost sure exponential convergence of PBF in favour of the better model, where the convergence depends explicitly upon the difference between the KL-divergence rates from the true model. Thus, our PBF convergence results agree with the BF convergence results established in Chatterjee18.
An important aspect of our PBF research involves establishing its convergence properties even for “inverse regression problems”, and even when one of the two competing models involves “inverse regression” and the other “forward regression”. We distinguish forward and inverse regression as follows. In forward regression problems the goal is to predict the response from a given covariate value and the rest of the data. In inverse regression, on the other hand, unknown values of the covariates are to be predicted given the observed response and the rest of the data. Crucially, Bayesian inverse regression problems require priors on the covariate values to be predicted. In our case, the inverse regression setup is motivated by the quantitative palaeoclimate reconstruction problem, where ‘modern data’ consisting of multivariate counts of species are available along with the observed climate values. Also available are fossil assemblages of the same species, deposited in lake sediments over past thousands of years. This is the fossil species data. However, the past climates corresponding to the fossil species data are unknown, and it is of interest to predict the past climates given the modern data and the fossil species data. Roughly, the species compositions are regarded as functions of climate variables since, in general ecological terms, variations in climate drive variations in species, but not vice versa. However, since the interest lies in prediction of climate variables, the inverse nature of the problem is clear. The past climates, which must be regarded as random variables, may also be interpreted as unobserved covariate values. It is thus natural to put a prior probability distribution on the unobserved covariate values. Various other examples of inverse regression problems are provided in Chatterjee17.

In this article, we consider two setups of inverse regression and establish almost sure exponential convergence of PBF in general inverse regression for both setups. These include situations where one of the competing models involves forward regression and the other is associated with inverse regression.
We illustrate our asymptotic results with various theoretical examples in both forward and inverse regression contexts, including forward and inverse variable selection problems. We also follow up our theoretical investigations with simulation experiments in small samples involving Poisson and geometric forward and inverse regression models with relevant link functions and both linear regression and nonparametric regression, the latter modeled by Gaussian processes. We also illustrate variable selection in the aforementioned setups with two different covariates. The results that we obtain are quite encouraging and illuminating, providing useful insights into the behaviour of PBF for forward and inverse parametric and nonparametric regression.
The roadmap for the rest of our paper is as follows. We begin our progress by discussing and formalizing the relevant aspects of forward and inverse regression problems and the associated pseudoBayes factors in Section 2. Then in Section 3 we include a brief overview of Shalizi’s approach to treatment of posterior convergence which we usefully exploit for our treatise of PBF asymptotics; further details are provided in Appendix LABEL:subsec:assumptions_shalizi. Convergence of PBF in the forward regression context is established in Section 4, while in Sections 5 and 6 we establish convergence of PBF in the two setups related to inverse regression. In Sections 7 and LABEL:sec:illustrations_inverse we provide theoretical illustrations of PBF convergence in forward and inverse setups, respectively, with various examples including variable selection. Details of our simulation experiments with small samples involving Poisson and geometric linear and Gaussian process regression for relevant link functions, under both forward and inverse setups, are reported in Section LABEL:sec:simstudy, which also includes experiments on variable selection. Finally, we summarize our contributions and provide future directions in Section LABEL:sec:conclusion.
2 Preliminaries and general setup for forward and inverse regression problems
Let us first consider the forward regression setup.
2.1 Forward regression problem
For $i=1,\ldots,n$, let the observed response $y_i$ be related to the observed covariate $x_i$ through
\[ y_i\sim f_{\theta}(\cdot\mid x_i), \tag{2.1} \]
where $f_{\theta}(\cdot\mid x_i)$, for $i=1,\ldots,n$, are known densities depending upon (a set of) parameters $\theta\in\Theta$, where $\Theta$ is the parameter space, which may be infinite-dimensional. For the sake of generality, we shall consider $\theta=(\eta,\xi)$, where $\eta$ is a function of the covariates, which we more explicitly denote as $\eta(x)$. The covariate $x\in\mathcal X$, $\mathcal X$ being the space of covariates. The part $\xi$ of $\theta$ will be assumed to consist of other parameters, such as the unknown error variance. For Bayesian forward regression problems, some prior needs to be assigned on the parameter space $\Theta$. For notational convenience, we shall denote $f_{\theta}(\cdot\mid x_i)$ by $f(\cdot\mid x_i,\theta)$, so that we can represent (2.1) more conveniently as
\[ y_i\sim f(\cdot\mid x_i,\theta). \tag{2.2} \]
2.1.1 Examples of the forward regression setup

- $y_i\sim Poisson(\lambda(x_i))$, where $\lambda(x_i)=h(\eta(x_i))$, where $h$ is some appropriate link function and $\eta(\cdot)$ is some function with known or unknown form. For known, suitably parameterized form, the model is parametric. If the form of $\eta(\cdot)$ is unknown, one may model it by a Gaussian process, assuming adequate smoothness of the function.

- $y_i\sim Bernoulli(p(x_i))$, where $p(x_i)=h(\eta(x_i))$, where $h$ is some appropriate link function and $\eta(\cdot)$ is some function with known (parametric) or unknown (nonparametric) form. Again, in case of unknown form of $\eta(\cdot)$, the Gaussian process can be used as a suitable model under sufficient smoothness assumptions.

- $y_i=\eta(x_i)+\epsilon_i$, where $\eta(\cdot)$ is a parametric or nonparametric function and $\epsilon_i$ are Gaussian errors. In particular, $\eta(x_i)$ may be a linear regression function, that is, $\eta(x_i)=\beta^{T}x_i$, where $\beta$ is a vector of unknown parameters. Nonlinear forms of $\eta(\cdot)$ are also permitted. Also, $\eta(\cdot)$ may be a reasonably smooth function of unknown form, modeled by some appropriate Gaussian process.
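As a concrete instance of the first example above, one can simulate from a Poisson forward regression model; the log link and the linear form $\eta(x)=\beta_0+\beta_1 x$ below are illustrative assumptions. The sanity check compares the empirical mean of the simulated responses with the value implied by $E[y\mid x]=\exp(\beta_0+\beta_1 x)$ averaged over $x\sim\mathrm{Uniform}(-1,1)$.

```python
import math
import random

random.seed(0)

def rpois(lam):
    # Knuth's Poisson sampler; adequate for the small rates used here.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Assumed log link and linear eta(x) = b0 + b1 * x.
b0, b1 = 0.2, 0.5
xs = [random.uniform(-1.0, 1.0) for _ in range(5000)]
ys = [rpois(math.exp(b0 + b1 * x)) for x in xs]

# Sanity check: E[y | x] = exp(b0 + b1 * x), so with x ~ Uniform(-1, 1),
# E[y] = exp(b0) * (exp(b1) - exp(-b1)) / (2 * b1).
theory = math.exp(b0) * (math.exp(b1) - math.exp(-b1)) / (2.0 * b1)
print(sum(ys) / len(ys), theory)
```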
2.2 Forward pseudo-Bayes factor

Letting $X_n=(x_1,\ldots,x_n)$, $Y_n=(y_1,\ldots,y_n)$, $X_{n,-i}=X_n\setminus\{x_i\}$ and $Y_{n,-i}=Y_n\setminus\{y_i\}$, let $\pi(\theta_j\mid X_{n,-i},Y_{n,-i},M_j)$ denote the posterior density at $\theta_j$, given data $X_{n,-i}$, $Y_{n,-i}$ and model $M_j$. Let the density of $y_i$ given $x_i$ and $\theta_j$ under model $M_j$ be denoted by $f(y_i\mid x_i,\theta_j,M_j)$. Then note that
\[ \pi(y_i\mid x_i,X_{n,-i},Y_{n,-i},M_j)=\int_{\Theta_j}f(y_i\mid x_i,\theta_j,M_j)\,\pi(\theta_j\mid X_{n,-i},Y_{n,-i},M_j)\,d\theta_j, \tag{2.3} \]
where
\[ \pi(\theta_j\mid X_{n,-i},Y_{n,-i},M_j)\propto\pi(\theta_j\mid M_j)\prod_{k\neq i}f(y_k\mid x_k,\theta_j,M_j). \tag{2.4} \]
For any two models $M_1$ and $M_2$, the forward pseudo-Bayes factor (FPBF) of $M_1$ against $M_2$ based on the cross-validation posteriors of the form (2.3) is defined as follows:
\[ \frac{1}{n}\log FPBF_n(M_1,M_2)=\frac{1}{n}\sum_{i=1}^{n}\log\frac{\pi(y_i\mid x_i,X_{n,-i},Y_{n,-i},M_1)}{\pi(y_i\mid x_i,X_{n,-i},Y_{n,-i},M_2)}, \tag{2.5} \]
and we are interested in studying the limit of $\frac{1}{n}\log FPBF_n(M_1,M_2)$ as $n\rightarrow\infty$, for almost all data sequences.
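To make (2.5) concrete, the sketch below (an assumed toy comparison, not one of the paper's experiments) computes the average log forward pseudo-Bayes factor for a correctly specified Gaussian linear regression $M_1$ against a model $M_2$ that ignores the covariate; each cross-validation density of the form (2.3) is estimated by averaging the likelihood over exact conjugate posterior draws from the leave-one-out posterior (2.4).

```python
import math
import random

random.seed(3)

def normpdf(y, m, s):
    return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

# Assumed data-generating model: y_i = x_i + e_i, e_i ~ N(0, 1).
# M1: y | x ~ N(beta * x, 1), conjugate prior beta ~ N(0, 10).
# M2: y | x ~ N(0, 1), i.e. the covariate is ignored.
n = 100
xs = [random.uniform(-2.0, 2.0) for _ in range(n)]
ys = [x + random.gauss(0.0, 1.0) for x in xs]

B = 1000
log_fpbf = 0.0
for i in range(n):
    xr = xs[:i] + xs[i + 1:]
    yr = ys[:i] + ys[i + 1:]
    # Exact conjugate posterior of beta under M1, playing the role of (2.4).
    prec = sum(x * x for x in xr) + 0.1
    mean = sum(x * y for x, y in zip(xr, yr)) / prec
    draws = [random.gauss(mean, math.sqrt(1.0 / prec)) for _ in range(B)]
    # Cross-validation densities of the form (2.3), by Monte Carlo under M1.
    cv1 = sum(normpdf(ys[i], b * xs[i], 1.0) for b in draws) / B
    cv2 = normpdf(ys[i], 0.0, 1.0)
    log_fpbf += math.log(cv1) - math.log(cv2)

print(log_fpbf / n)  # positive: FPBF favours the model using the covariate
```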
2.3 Inverse regression problem: first setup
In inverse regression, the basic premise remains the same as in forward regression detailed in Section 2.1. In other words, the distribution $f(\cdot\mid x,\theta)$, the parameter $\theta$, the parameter space $\Theta$ and the covariate space $\mathcal X$ remain the same as in the forward regression setup. However, unlike in Bayesian forward regression problems, where a prior needs to be assigned only to the unknown parameter $\theta$, a prior is also required for $\tilde x$, the unknown covariate observation associated with a known response $\tilde y$, say. Given the entire dataset $(X_n,Y_n)$ and $\tilde y$, the problem in inverse regression is to predict $\tilde x$. Hence, in the Bayesian inverse setup, a prior on $\tilde x$ is necessary. Given model $M_j$ and the corresponding parameters $\theta_j$, we denote such a prior by $\pi(\tilde x\mid\theta_j,M_j)$. For Bayesian cross-validation in inverse problems it is pertinent to successively leave out $(x_i,y_i)$; $i=1,\ldots,n$, and compute the posterior predictive distribution of $x_i$ from $y_i$ and the rest of the data (see Bhatta07). But these posteriors are not useful for Bayes or pseudo-Bayes factors even in inverse regression setups. The reason is that the Bayes factor for inverse regression is still the ratio of the posterior odds and the prior odds associated with the competing models, which as usual translates to the ratio of the marginal densities of the data under the two competing models. The marginal densities now depend, however, upon the priors for the covariates under the competing models. The pseudo-Bayes factor for inverse models is then the ratio of products of the cross-validation posteriors of the $y_i$, where both the parameters and the covariates are marginalized out. Details of such inverse cross-validation posteriors and the definition of pseudo-Bayes factors for inverse regression are given below.

2.3.1 Inverse pseudo-Bayes factor in this setup
In the inverse regression setup, first note that
\[ \pi(y_i,x_i,\theta_j\mid X_{n,-i},Y_{n,-i},M_j)=f(y_i\mid x_i,\theta_j,M_j)\,\pi(x_i\mid\theta_j,M_j)\,\pi(\theta_j\mid X_{n,-i},Y_{n,-i},M_j). \tag{2.6} \]
Using (2.6) we obtain
\[ \pi(y_i\mid X_{n,-i},Y_{n,-i},M_j)=\int_{\Theta_j}f(y_i\mid\theta_j,M_j)\,\pi(\theta_j\mid X_{n,-i},Y_{n,-i},M_j)\,d\theta_j, \tag{2.7} \]
where
\[ f(y_i\mid\theta_j,M_j)=\int_{\mathcal X}f(y_i\mid x_i,\theta_j,M_j)\,\pi(x_i\mid\theta_j,M_j)\,dx_i, \tag{2.8} \]
and $\pi(\theta_j\mid X_{n,-i},Y_{n,-i},M_j)$ is the same as (2.4). For any two models $M_1$ and $M_2$, the inverse pseudo-Bayes factor (IPBF) of $M_1$ against $M_2$ based on cross-validation posteriors of the form (2.7) is given by
\[ \frac{1}{n}\log IPBF_n(M_1,M_2)=\frac{1}{n}\sum_{i=1}^{n}\log\frac{\pi(y_i\mid X_{n,-i},Y_{n,-i},M_1)}{\pi(y_i\mid X_{n,-i},Y_{n,-i},M_2)}, \tag{2.9} \]
and our goal is to investigate the limit of $\frac{1}{n}\log IPBF_n(M_1,M_2)$ as $n\rightarrow\infty$, for almost all data sequences.
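The nested integrals (2.7) and (2.8) are straightforward to estimate by Monte Carlo: draw the parameter from the leave-one-out posterior, draw the covariate from its prior, and average the likelihood. The sketch below uses an assumed toy model $y_i\sim N(\theta x_i,1)$ with prior $x_i\sim\mathrm{Uniform}(-1,1)$ and a conjugate prior on $\theta$; none of these choices come from the paper.

```python
import math
import random

random.seed(5)

def normpdf(y, m, s):
    return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

# Assumed toy model: y_i ~ N(theta * x_i, 1), prior x_i ~ Uniform(-1, 1),
# conjugate prior theta ~ N(0, 10); true theta = 1.
n = 30
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x + random.gauss(0.0, 1.0) for x in xs]

i = 0  # leave out (x_0, y_0)
xr, yr = xs[1:], ys[1:]

# Exact conjugate posterior of theta given the remaining data, as in (2.4).
prec = sum(x * x for x in xr) + 0.1
mean = sum(x * y for x, y in zip(xr, yr)) / prec

# (2.8): integrate the likelihood over the prior of x_i (draw x from the prior);
# (2.7): integrate the result over the posterior of theta (draw theta).
B = 2000
total = 0.0
for _ in range(B):
    th = random.gauss(mean, math.sqrt(1.0 / prec))
    x = random.uniform(-1.0, 1.0)
    total += normpdf(ys[i], th * x, 1.0)
cv_density = total / B

print(cv_density)  # Monte Carlo estimate of the inverse CV density at y_0
```

Repeating this for each left-out pair and taking log-ratios across two models gives a Monte Carlo estimate of (2.9).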
2.4 Inverse regression problem: second setup
In the inverse regression context, we consider another setup, under which Chat20 establish consistency of the inverse cross-validation posteriors of the covariate values. Here we consider $n$ experiments with covariate observations $x_1,\ldots,x_n$, along with responses $y_{ij}$; $i=1,\ldots,n$; $j=1,\ldots,m$. In other words, the experiment considered here will allow us to have $m$ samples of responses against each covariate observation $x_i$, for $i=1,\ldots,n$. Again, both $x_i$ and $y_{ij}$ are allowed to be multidimensional. Let $X_n=(x_1,\ldots,x_n)$ and $Y_n=\{y_{ij}:i=1,\ldots,n;\ j=1,\ldots,m\}$.

For $i=1,\ldots,n$, consider the following general model setup: conditionally on $x_i$, $\eta$ and $\xi$,
\[ y_{ij}\sim f(\cdot\mid x_i,\theta),\quad j=1,\ldots,m, \tag{2.10} \]
independently, where $\theta=(\eta,\xi)$ as before.
2.4.1 Prior for $\tilde x$

Following Chat20, we consider the following prior for $\tilde x$, the unknown covariate value to be predicted: given $\theta$,
\[ \pi(\tilde x\mid\theta)=\text{Uniform}(\mathcal S_m), \tag{2.11} \]
the uniform distribution on
\[ \mathcal S_m=\left\{\tilde x\in\mathcal X:\ g(\tilde x)\in\left[\bar{\tilde y}_m-\frac{c}{\sqrt m},\ \bar{\tilde y}_m+\frac{c}{\sqrt m}\right]\right\}, \tag{2.12} \]
where $g$ is some suitable transformation of $\tilde x$. In (2.12), $\bar{\tilde y}_m=\frac{1}{m}\sum_{j=1}^{m}\tilde y_j$, where $\tilde y_1,\ldots,\tilde y_m$ are the responses associated with $\tilde x$, and $c>0$ is some constant. We denote this prior by $\pi_m(\tilde x)$. Chat20 show that the density of, or any probability associated with, $\pi_m(\tilde x)$ is continuous with respect to $\theta$.
2.4.2 Examples of the prior

- $y_{ij}\sim N(\tilde x,\sigma^2)$, where $\tilde x\in\mathbb R$ and $\sigma^2>0$, for all $j=1,\ldots,m$. Here, under the prior $\pi_m$, $\tilde x$ has the uniform distribution on the set $\left[\bar{\tilde y}_m-\frac{c}{\sqrt m},\ \bar{\tilde y}_m+\frac{c}{\sqrt m}\right]$.

- $y_{ij}\sim Poisson(\lambda(\tilde x))$, where $\lambda(\tilde x)=h(\eta(\tilde x))$, with $\tilde x\in\mathcal X$. Here $h$ is a known, one-to-one, continuously differentiable function and $\eta$ is an unknown function modeled by a Gaussian process. Here, the prior for $\tilde x$ is the uniform distribution on $\left\{\tilde x:\ h(\eta(\tilde x))\in\left[\bar{\tilde y}_m-\frac{c}{\sqrt m},\ \bar{\tilde y}_m+\frac{c}{\sqrt m}\right]\right\}$.

- $y_{ij}\sim Bernoulli(p(\tilde x))$, where $p(\tilde x)=H(\eta(\tilde x))$, with $\tilde x\in\mathcal X$. Here $H$ is a known, increasing, continuously differentiable cumulative distribution function and $\eta$ is an unknown function modeled by some appropriate Gaussian process. Here, the prior for $\tilde x$ is the uniform distribution on $\left\{\tilde x:\ H(\eta(\tilde x))\in\left[\bar{\tilde y}_m-\frac{c}{\sqrt m},\ \bar{\tilde y}_m+\frac{c}{\sqrt m}\right]\right\}$.

- $y_{ij}=\eta(\tilde x)+\epsilon_{ij}$, where $\eta$ is an unknown function modeled by some appropriate Gaussian process, and $\epsilon_{ij}$ are zero-mean Gaussian noise with variance $\sigma^2$. Here, the prior for $\eta(\tilde x)$ is the uniform distribution on $\left[\bar{\tilde y}_m-\frac{c}{\sqrt m},\ \bar{\tilde y}_m+\frac{c}{\sqrt m}\right]$. If $\eta$ is one-to-one and increasing, then the prior for $\tilde x$ is the uniform distribution on $\left[\eta^{-1}(a_m),\ \eta^{-1}(b_m)\right]$, where $a_m=\bar{\tilde y}_m-\frac{c}{\sqrt m}$ and $b_m=\bar{\tilde y}_m+\frac{c}{\sqrt m}$.
Further examples of the prior in various other inverse regression models are provided in Sections LABEL:sec:illustrations_inverse and LABEL:sec:simstudy.
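For intuition, a prior of this kind can be constructed and sampled directly. The sketch below is in the spirit of (2.11) and (2.12) for the first example, but the centring at the sample mean of the responses, the identity transformation $g$, the interval width shrinking like $m^{-1/2}$, and the constant $c$ are all illustrative assumptions.

```python
import math
import random

random.seed(11)

# Sketch of a shrinking uniform prior for the unknown covariate x~,
# built from its m replicated responses (all numbers are assumptions).
m = 100
x_true = 0.7
y_rep = [x_true + random.gauss(0.0, 1.0) for _ in range(m)]  # y_j ~ N(x~, 1)

c = 3.0

def g(y):
    return y  # identity transformation in this toy example

centre = sum(g(y) for y in y_rep) / m
lo, hi = centre - c / math.sqrt(m), centre + c / math.sqrt(m)

x_draw = random.uniform(lo, hi)  # one draw from the uniform prior
print((lo, hi), x_draw)
```

As $m$ grows, the interval concentrates around the mean of the transformed responses, so the prior automatically places its mass near covariate values consistent with the observed responses.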
2.4.3 Inverse pseudo-Bayes factor in this setup
For any two models $M_1$ and $M_2$ we define the inverse pseudo-Bayes factor for model $M_1$ against model $M_2$, for any $i=1,\ldots,n$, as
\[ \frac{1}{m}\log IPBF^{(i)}_m(M_1,M_2)=\frac{1}{m}\sum_{j=1}^{m}\log\frac{\pi\left(y_{ij}\mid Y_{n,-(ij)},M_1\right)}{\pi\left(y_{ij}\mid Y_{n,-(ij)},M_2\right)}, \tag{2.13} \]
where $Y_{n,-(ij)}$ denotes $Y_n$ with $y_{ij}$ left out, and study the limit as $m\rightarrow\infty$ for almost all data sequences. Note that since $y_{ij}$; $j=1,\ldots,m$, are distributed independently as $f(\cdot\mid x_i,\theta)$ given any $x_i$ and $\theta$, it would follow that if the limit exists, it must be the same for all $i$.
Suppose that the true data-generating parameter $\theta_0$ is not contained in $\Theta$, the parameter space considered. This is a case of misspecification that we must incorporate in our convergence theory of PBF. Our PBF asymptotics draws on posterior convergence theory for (possibly infinite-dimensional) parameters that also allows misspecification. In this regard, the approach presented in Shalizi09 seems to be very appropriate. Before proceeding further, we first provide a brief overview of this approach, which we conveniently exploit for our purpose.
3 A brief overview of Shalizi’s approach to posterior convergence
Let $\mathbf Y_n=(Y_1,\ldots,Y_n)$, and let $f_{\theta}(\mathbf Y_n)$ and $f_{\theta_0}(\mathbf Y_n)$ denote the observed and the true likelihoods respectively, under the given value of the parameter $\theta$ and the true parameter $\theta_0$. We assume that $\theta\in\Theta$, where $\Theta$ is the (often infinite-dimensional) parameter space. However, we do not assume that $\theta_0\in\Theta$, thus allowing misspecification. The key ingredient associated with Shalizi’s approach to proving convergence of the posterior distribution of $\theta$ is to show that the asymptotic equipartition property holds. To elucidate, let us consider the following likelihood ratio:
\[ R_n(\theta)=\frac{f_{\theta}(\mathbf Y_n)}{f_{\theta_0}(\mathbf Y_n)}. \]
Then, to say that for each $\theta\in\Theta$, the generalized or relative asymptotic equipartition property holds, we mean that
\[ \lim_{n\rightarrow\infty}\frac{1}{n}\log R_n(\theta)=-h(\theta) \tag{3.1} \]
almost surely, where $h(\theta)$ is the KL-divergence rate given by
\[ h(\theta)=\lim_{n\rightarrow\infty}\frac{1}{n}E_{\theta_0}\left(\log\frac{f_{\theta_0}(\mathbf Y_n)}{f_{\theta}(\mathbf Y_n)}\right), \tag{3.2} \]
provided that it exists (possibly being infinite), where $E_{\theta_0}$ denotes expectation with respect to the true model. Let
\[ h(A)=\underset{\theta\in A}{\operatorname{ess\,inf}}\,h(\theta);\qquad J(\theta)=h(\theta)-h(\Theta);\qquad J(A)=\underset{\theta\in A}{\operatorname{ess\,inf}}\,J(\theta). \]
Thus, $h(A)$ can be roughly interpreted as the minimum KL-divergence rate between the postulated and the true model over the set $A$. If $h(\Theta)>0$, this indicates model misspecification. For $A\subseteq\Theta$, $h(A)\geq h(\Theta)$, so that $J(A)\geq0$.
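For i.i.d. data, the equipartition property (3.1) is just the strong law of large numbers applied to the log-likelihood ratio, which is easy to check numerically. The sketch below uses an assumed Gaussian location example (not from the paper), where the KL-divergence rate is $h(\theta)=(\theta-\theta_0)^2/2$.

```python
import math
import random

random.seed(1)

# Assumed i.i.d. Gaussian illustration of (3.1)-(3.2): the truth is
# N(0, 1) and the postulated parameter gives N(1, 1), so the
# KL-divergence rate is h(theta) = (1 - 0)^2 / 2 = 0.5.
mu0, mu = 0.0, 1.0
n = 200000

s = 0.0
for _ in range(n):
    x = random.gauss(mu0, 1.0)
    # log f_{theta0}(x) - log f_{theta}(x); normalising constants cancel.
    s += -0.5 * (x - mu0) ** 2 + 0.5 * (x - mu) ** 2

print(s / n)  # close to 0.5 = h(theta), i.e. -(1/n) log R_n(theta) -> h(theta)
```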
As regards the prior, it is required to construct an appropriate sequence of sieves $\mathcal G_n$ such that $\mathcal G_n\rightarrow\Theta$ and $\pi(\mathcal G_n^c)\leq\alpha\exp(-\beta n)$, for some $\alpha>0$ and $\beta>2h(\Theta)$.
With the above notions, verification of (3.1) along with several other technical conditions ensures that for any $A\subseteq\Theta$ such that $\pi(A)>0$,
\[ \lim_{n\rightarrow\infty}\frac{1}{n}\log\pi(A\mid\mathbf Y_n)=-J(A) \tag{3.3} \]
almost surely, provided that $h(A)>h(\Theta)$.
The seven assumptions of Shalizi leading to the above result, which we denote as (S1)–(S7), are provided in Appendix LABEL:subsec:assumptions_shalizi. In what follows, we denote almost sure convergence by “$\xrightarrow{a.s.}$”, almost sure equality by “$\overset{a.s.}{=}$” and weak convergence by “$\xrightarrow{w}$”.
4 Convergence of PBF in forward problems
Let $M_t$ denote the true model, which is associated with parameter $\theta_t\in\Theta_t$, where $\Theta_t$ is a parameter space containing the true parameter $\theta_t$. Then the following result holds.
Theorem 1.
Assume conditions (S1)–(S7) of Shalizi, and let the infimum of $h_1(\theta_1)$, the KL-divergence rate (3.2) for model $M_1$, over $\Theta_1$ be attained at $\theta^*_1$, where $\theta^*_1\in\Theta_1$. Also assume that $\Theta_1$ and $\Theta_t$ are complete separable metric spaces and that for $i\geq1$, $f(y_i\mid x_i,\theta_1,M_1)$ and $f(y_i\mid x_i,\theta_t,M_t)$ are bounded and continuous in $\theta_1$ and $\theta_t$, respectively. Then,
\[ \frac{1}{n}\log FPBF_n(M_1,M_t)\xrightarrow{a.s.}-h_1(\theta^*_1), \tag{4.1} \]
where, for any $\theta_1\in\Theta_1$,
\[ h_1(\theta_1)=-\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(y_i\mid x_i,\theta_1,M_1)}{f(y_i\mid x_i,\theta_t,M_t)},\ \text{almost surely}. \tag{4.2} \]
Proof.
Now, by hypothesis, the infimum of $h_1(\theta_1)$ over $\Theta_1$ is attained at $\theta^*_1$, where $\theta^*_1\in\Theta_1$. Then by (4.3), the posterior of $\theta_1$ given $X_{n,-i}$ and $Y_{n,-i}$, given by (2.4), concentrates around $\theta^*_1$, the minimizer of the limiting KL-divergence rate from the true distribution. Formally, given any neighborhood $U$ of $\theta^*_1$, the set $\{\theta_1:h_1(\theta_1)\leq h_1(\theta^*_1)+\kappa\}$ is contained in $U$ for sufficiently small $\kappa>0$. It follows that for any neighborhood $U$ of $\theta^*_1$, $\pi(U\mid X_{n,-i},Y_{n,-i},M_1)\rightarrow1$, almost surely, as $n\rightarrow\infty$. Since $\Theta_1$ is a complete, separable metric space, it follows that (see, for example, Ghosh03, Ghosal17)
\[ \pi(\cdot\mid X_{n,-i},Y_{n,-i},M_1)\xrightarrow{w}\delta_{\theta^*_1}(\cdot),\ \text{almost surely, as}\ n\rightarrow\infty. \tag{4.4} \]
Then, due to (4.4) and the Portmanteau theorem, as $f(y_i\mid x_i,\theta_1,M_1)$ is bounded and continuous in $\theta_1$, it holds using (2.3), that
\[ \pi(y_i\mid x_i,X_{n,-i},Y_{n,-i},M_1)\xrightarrow{a.s.}f(y_i\mid x_i,\theta^*_1,M_1),\ \text{as}\ n\rightarrow\infty. \tag{4.5} \]
Now, due to (4.5),
\[ \frac{1}{n}\sum_{i=1}^{n}\log\pi(y_i\mid x_i,X_{n,-i},Y_{n,-i},M_1)-\frac{1}{n}\sum_{i=1}^{n}\log f(y_i\mid x_i,\theta^*_1,M_1)\xrightarrow{a.s.}0. \tag{4.6} \]
Also, essentially the same arguments leading to (4.5) yield
\[ \pi(y_i\mid x_i,X_{n,-i},Y_{n,-i},M_t)\xrightarrow{a.s.}f(y_i\mid x_i,\theta_t,M_t),\ \text{as}\ n\rightarrow\infty, \]
which ensures
\[ \frac{1}{n}\sum_{i=1}^{n}\log\pi(y_i\mid x_i,X_{n,-i},Y_{n,-i},M_t)-\frac{1}{n}\sum_{i=1}^{n}\log f(y_i\mid x_i,\theta_t,M_t)\xrightarrow{a.s.}0. \tag{4.7} \]
From (4.6) and (4.7) we obtain
\[ \frac{1}{n}\log FPBF_n(M_1,M_t)\xrightarrow{a.s.}\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(y_i\mid x_i,\theta^*_1,M_1)}{f(y_i\mid x_i,\theta_t,M_t)}=-h_1(\theta^*_1), \tag{4.8} \]
where the rightmost step of (4.8), given by (4.2), follows due to (3.1). Hence, the result is proved. ∎
For postulated models $M_j$; $j=1,2$, let the KL-divergence rate in (3.2) be denoted by $h_j(\theta_j)$, for $\theta_j\in\Theta_j$.
Theorem 2.
For models $M_1$, $M_2$ and $M_t$ with complete separable parameter spaces $\Theta_1$, $\Theta_2$ and $\Theta_t$, assume conditions (S1)–(S7) of Shalizi, and for $j=1,2$, let the infimum of $h_j(\theta_j)$ over $\Theta_j$ be attained at $\theta^*_j$, where $\theta^*_j\in\Theta_j$. Also assume that for $i\geq1$, $f(y_i\mid x_i,\theta_1,M_1)$; $f(y_i\mid x_i,\theta_2,M_2)$, and $f(y_i\mid x_i,\theta_t,M_t)$ are bounded and continuous in the respective parameters. Then,
\[ \frac{1}{n}\log FPBF_n(M_1,M_2)\xrightarrow{a.s.}h_2(\theta^*_2)-h_1(\theta^*_1), \tag{4.9} \]
where, for $j=1,2$, and for any $\theta_j\in\Theta_j$,
\[ h_j(\theta_j)=-\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(y_i\mid x_i,\theta_j,M_j)}{f(y_i\mid x_i,\theta_t,M_t)},\ \text{almost surely}. \tag{4.10} \]
Proof.
5 Convergence results for PBF in inverse regression: first setup
Theorem 3.
Assume conditions (S1)–(S7) of Shalizi, and let the infimum of $h_1(\theta_1)$ over $\Theta_1$ be attained at $\theta^*_1$, where $\theta^*_1\in\Theta_1$. Also assume that $\Theta_1$ and $\Theta_t$ are complete separable metric spaces and that for $i\geq1$, $f(y_i\mid\theta_1,M_1)$ and $f(y_i\mid\theta_t,M_t)$, given by (2.8), are bounded and continuous in $\theta_1$ and $\theta_t$, respectively. Then,
\[ \frac{1}{n}\log IPBF_n(M_1,M_t)\xrightarrow{a.s.}-h_1(\theta^*_1), \tag{5.1} \]
where, for any $\theta_1\in\Theta_1$,
\[ h_1(\theta_1)=-\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(y_i\mid\theta_1,M_1)}{f(y_i\mid\theta_t,M_t)},\ \text{almost surely}, \]
provided that the limit exists.