 # Convergence of Pseudo-Bayes Factors in Forward and Inverse Regression Problems

In the Bayesian literature on model comparison, Bayes factors play the leading role. In the classical statistical literature, model selection criteria are often devised using cross-validation ideas. Amalgamating the ideas of Bayes factor and cross-validation, Geisser and Eddy (1979) created the pseudo-Bayes factor. The use of cross-validation confers several theoretical advantages, computational simplicity and numerical stability on Bayes factors, as the marginal density of the entire dataset is replaced with products of cross-validation densities of individual data points. However, the popularity of pseudo-Bayes factors is still negligible in comparison with Bayes factors, with respect to both theoretical investigations and practical applications. In this article, we establish almost sure exponential convergence of pseudo-Bayes factors for large samples under a general setup consisting of dependent data and model misspecifications. We particularly focus on general parametric and nonparametric regression setups in both forward and inverse contexts. We illustrate our theoretical results with various examples, providing explicit calculations. We also supplement our asymptotic theory with simulation experiments in small sample situations of Poisson log regression and geometric logit and probit regression, additionally addressing the variable selection problem. We consider both linear and nonparametric regression modeled by Gaussian processes for our purposes. Our simulation results provide quite interesting insights into the usage of pseudo-Bayes factors in forward and inverse setups.


## 1 Introduction

The Bayesian statistical literature on model selection is rich in its collection of innovative methodologies. Among them the most principled method of comparing different competing models seems to be offered by Bayes factors, through the ratio of the posterior and prior odds associated with the models under comparison, which reduces to the ratio of the marginal densities of the data under the two models. To illustrate, let us consider the problem of comparing any two models $M_1$ and $M_2$ given data $Y_n=\{y_1,\ldots,y_n\}$, where $n$ is the sample size. Let $\Theta_1$ and $\Theta_2$ be the parameter spaces associated with $M_1$ and $M_2$, respectively. For $j=1,2$, let $\pi(\theta_j|M_j)$ denote the prior on $\theta_j\in\Theta_j$ and $m(Y_n|M_j)$ the marginal density of the data, obtained by integrating the likelihood under $M_j$ with respect to this prior. Then the Bayes factor (BF) of model $M_1$ against $M_2$ is given by

$$BF^{(n)}(M_1,M_2)=\frac{m(Y_n|M_1)}{m(Y_n|M_2)}. \tag{1.1}$$

The above formula follows directly from the coherent procedure of Bayesian hypothesis testing of one model versus the other. In view of (1.1), $BF^{(n)}(M_1,M_2)$ admits the interpretation as the quantification of the evidence of $M_1$ against $M_2$, given data $Y_n$. A comprehensive account of BF and its various advantages is provided in Kass95. BFs have interesting asymptotic convergence properties. Indeed, recently Chatterjee18 established the almost sure convergence theory of BF in the general setup that includes even dependent data and misspecified models. Their result depends explicitly on the average Kullback-Leibler (KL) divergence between the competing and the true models.

BFs are known to have several limitations. First, if the prior for the model parameter is improper, then the marginal density is also improper and hence does not admit any sensible interpretation. Second, BFs suffer from the Jeffreys-Lindley-Bartlett paradox (see Jeffreys39, Lindley57, Bartlett57, Robert93, Villa15 for details and general discussions on the paradox). Furthermore, a drawback of BFs in practical applications is that the marginal density of the data is usually quite challenging to compute accurately, even with sophisticated simulation techniques based on importance sampling, bridge sampling and path sampling (see, for example, Meng96, Gelman98; see also Gronau17 for a relatively recent tutorial and many relevant references), particularly when the posterior is far from normal and when the dimension of the parameter space is large. Moreover, the marginal density is usually extremely close to zero if $n$ is even moderately large. This causes numerical instability in computation of the BF.

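The problems of BFs regarding improper prior, Jeffreys-Lindley-Bartlett paradox, and general computational difficulties associated with the marginal density can be simultaneously alleviated if the marginal density $m(Y_n|M_j)$ for model $M_j$ is replaced with the product of leave-one-out cross-validation posteriors $\prod_{i=1}^n\pi(y_i|Y_{n,-i},M_j)$, where $Y_{n,-i}=Y_n\setminus\{y_i\}$, and

$$\pi(y_i|Y_{n,-i},M_j)=\int_{\Theta_j}f(y_i|\theta_j,y_1,\ldots,y_{i-1},M_j)\,d\pi(\theta_j|Y_{n,-i},M_j) \tag{1.2}$$

is the $i$-th leave-one-out cross-validation posterior density evaluated at $y_i$. In the above equation (1.2), $f(y_i|\theta_j,y_1,\ldots,y_{i-1},M_j)$ is the density of $y_i$ given model parameters $\theta_j$ and $y_1,\ldots,y_{i-1}$; $\pi(\theta_j|Y_{n,-i},M_j)$ is the posterior distribution of $\theta_j$ given $Y_{n,-i}$. Viewing $\prod_{i=1}^n\pi(y_i|Y_{n,-i},M_j)$ as the surrogate for $m(Y_n|M_j)$, it seems reasonable to replace $BF^{(n)}(M_1,M_2)$ with the corresponding pseudo-Bayes factor (PBF) given by

$$PBF^{(n)}(M_1,M_2)=\frac{\prod_{i=1}^n\pi(y_i|Y_{n,-i},M_1)}{\prod_{i=1}^n\pi(y_i|Y_{n,-i},M_2)}. \tag{1.3}$$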

In the case of independent observations, the above formula and the terminology “pseudo-Bayes factor” seem to have been first proposed by Geisser79. Their motivation for PBF did not seem to be to solve the problems of BFs, however, but rather to exploit the concept of cross-validation in Bayesian model selection, which had proved indispensable for constructing model selection criteria in the classical statistical paradigm. Below we argue how this cross-validation idea helps solve the aforementioned problems of BFs.

First note that the posterior $\pi(\theta_j|Y_{n,-i},M_j)$ is usually proper even for an improper prior for $\theta_j$ if $n$ is sufficiently large. Thus, $\pi(y_i|Y_{n,-i},M_j)$ given by (1.2) is usually well-defined even for improper priors, unlike $m(Y_n|M_j)$. So, even though BF is ill-defined for improper priors, PBF is usually still well-defined.

Second, a clear theoretical advantage of PBF over BF is that PBF is immune to the problem of Jeffreys-Lindley-Bartlett paradox (see Gelfand94 for example), while BF is certainly not.

Finally, PBF enjoys significant computational advantages over BF. Note that straightforward Monte Carlo averages of $f(y_i|\theta_j,y_1,\ldots,y_{i-1},M_j)$ over realizations of $\theta_j$ obtained from $\pi(\theta_j|Y_{n,-i},M_j)$ by simulation techniques are sufficient to ensure good estimates of the cross-validation posterior density $\pi(y_i|Y_{n,-i},M_j)$. Since $\pi(y_i|Y_{n,-i},M_j)$ is the density of $y_i$ individually, the estimate is also numerically stable compared to estimates of $m(Y_n|M_j)$. Hence, the sum of logarithms of the estimates of $\pi(y_i|Y_{n,-i},M_j)$, for $i=1,\ldots,n$, results in quite accurate and stable estimates of $\log\prod_{i=1}^n\pi(y_i|Y_{n,-i},M_j)$. In other words, PBF is far simpler to compute accurately than BF and is numerically far more stable and reliable.
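The Monte Carlo recipe just described can be sketched in a few lines. The example below is an illustrative toy, not one of the experiments of this paper: it assumes independent count data, a Poisson model with a Gamma(1, 1) prior on the rate and a geometric model with a Beta(1, 1) prior on the success probability, so that each leave-one-out posterior can be sampled exactly. All function names are hypothetical.

```python
import math
import numpy as np


def log_mean_exp(log_values):
    """Numerically stable log of the average of exp(log_values)."""
    top = np.max(log_values)
    return top + np.log(np.mean(np.exp(log_values - top)))


def loo_log_density_poisson(y, i, rng, draws=2000, a=1.0, b=1.0):
    """Estimate log pi(y_i | Y_{n,-i}) under Poisson(lam), lam ~ Gamma(a, b)."""
    rest = np.delete(y, i)
    # Conjugate leave-one-out posterior: Gamma(a + sum(rest), rate b + n - 1).
    lam = rng.gamma(shape=a + rest.sum(), scale=1.0 / (b + rest.size), size=draws)
    log_f = y[i] * np.log(lam) - lam - math.lgamma(float(y[i]) + 1.0)
    return log_mean_exp(log_f)


def loo_log_density_geometric(y, i, rng, draws=2000, a=1.0, b=1.0):
    """Estimate log pi(y_i | Y_{n,-i}) under pmf p(1-p)^y, p ~ Beta(a, b)."""
    rest = np.delete(y, i)
    # Conjugate leave-one-out posterior: Beta(a + n - 1, b + sum(rest)).
    p = rng.beta(a + rest.size, b + rest.sum(), size=draws)
    log_f = np.log(p) + y[i] * np.log1p(-p)
    return log_mean_exp(log_f)


def log_pbf(y, rng, draws=2000):
    """Sum of log cross-validation densities, Poisson model minus geometric model."""
    n = len(y)
    poisson_part = sum(loo_log_density_poisson(y, i, rng, draws) for i in range(n))
    geometric_part = sum(loo_log_density_geometric(y, i, rng, draws) for i in range(n))
    return poisson_part - geometric_part


rng = np.random.default_rng(0)
y = rng.poisson(3.0, size=200)          # data simulated from the Poisson model
log_pbf_value = log_pbf(y, rng)
print("log PBF:", log_pbf_value, " per observation:", log_pbf_value / len(y))
```

Working on the log scale throughout avoids the underflow that affects direct estimates of the marginal density. In non-conjugate models the exact posterior draws would have to be replaced by MCMC or importance-sampling draws from each leave-one-out posterior.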

In spite of the advantages of PBF over BF, it seems to be largely ignored in the statistical literature, both theoretically and application-wise. Some asymptotic theory of PBF has been attempted by Gelfand94 using independent observations, Laplace approximations and some essentially ad-hoc simplifying approximations and arguments. Application of PBF has been considered in Bhattacharya08 for demonstrating the superiority of his new Bayesian nonparametric Dirichlet process model over the traditional Dirichlet process mixture model. But apart from these works we are not aware of any other significant research involving PBF.

In this article, we establish the asymptotic theory for PBF in the general setup consisting of dependent observations, model misspecifications as well as covariates; inclusion of covariates also validates our asymptotic theory in the variable selection framework. Judiciously exploiting the posterior convergence treatise of Shalizi09 we prove almost sure exponential convergence of PBF in favour of the true model, the convergence explicitly depending upon the KL-divergence rate from the true model. For any two models different from the true model, we prove almost sure exponential convergence of PBF in favour of the better model, where the convergence depends explicitly upon the difference between KL-divergence rates from the true model. Thus, our PBF convergence results agree with the BF convergence results established in Chatterjee18.

An important aspect of our PBF research involves establishing its convergence properties even for “inverse regression problems”, and even if one of the two competing models involves “inverse regression” and the other “forward regression”. We distinguish forward and inverse regression as follows. In forward regression problems the goal is to predict the response from a given covariate value and the rest of the data. On the other hand, in inverse regression unknown values of the covariates are to be predicted given the observed response and the rest of the data. Crucially, Bayesian inverse regression problems require priors on the covariate values to be predicted. In our case, the inverse regression setup has been motivated by the quantitative palaeoclimate reconstruction problem where ‘modern data’ consisting of multivariate counts of species are available along with the observed climate values. Also available are fossil assemblages of the same species, deposited in lake sediments over the past thousands of years. This is the fossil species data. However, the past climates corresponding to the fossil species data are unknown, and it is of interest to predict the past climates given the modern data and the fossil species data. Roughly, the species compositions are regarded as functions of climate variables, since in general ecological terms, variations in climate drive variations in species, but not vice versa. However, since the interest lies in prediction of climate variables, the inverse nature of the problem is clear. The past climates, which must be regarded as random variables, may also be interpreted as unobserved covariate values. It is thus natural to put a prior probability distribution on the unobserved covariate values. Various other examples of inverse regression problems are provided in Chatterjee17.

In this article, we consider two setups of inverse regression and establish almost sure exponential convergence of PBF in general inverse regression for both the setups. These include situations where one of the competing models involves forward regression and the other is associated with inverse regression.

We illustrate our asymptotic results with various theoretical examples in both forward and inverse regression contexts, including forward and inverse variable selection problems. We also follow up our theoretical investigations with simulation experiments in small samples involving Poisson and geometric forward and inverse regression models with relevant link functions and both linear regression and nonparametric regression, the latter modeled by Gaussian processes. We also illustrate variable selection in the aforementioned setups with two different covariates. The results that we obtain are quite encouraging and illuminating, providing useful insights into the behaviour of PBF for forward and inverse parametric and nonparametric regression.

The roadmap for the rest of our paper is as follows. We begin by discussing and formalizing the relevant aspects of forward and inverse regression problems and the associated pseudo-Bayes factors in Section 2. Then in Section 3 we include a brief overview of Shalizi’s approach to posterior convergence, which we exploit for our treatment of PBF asymptotics; further details are provided in the Appendix. Convergence of PBF in the forward regression context is established in Section 4, while in Sections 5 and 6 we establish convergence of PBF in the two setups related to inverse regression. In Section 7 and in a later section we provide theoretical illustrations of PBF convergence in forward and inverse setups, respectively, with various examples including variable selection. Details of our simulation experiments with small samples involving Poisson and geometric linear and Gaussian process regression for relevant link functions, under both forward and inverse setups, are reported in the section on simulation studies, which also includes experiments on variable selection. Finally, we summarize our contributions and provide future directions in the concluding section.

## 2 Preliminaries and general setup for forward and inverse regression problems

Let us first consider the forward regression setup.

### 2.1 Forward regression problem

For $i=1,\ldots,n$, let observed response $y_i$ be related to observed covariate $x_i$ through

$$y_1\sim f(\cdot|\theta,x_1)\ \text{ and }\ y_i\sim f(\cdot|\theta,x_i,Y^{(i-1)})\ \text{ for } i=2,\ldots,n, \tag{2.1}$$

where for $i=2,\ldots,n$, $Y^{(i-1)}=\{y_1,\ldots,y_{i-1}\}$, and $f(\cdot|\theta,x_1)$, $f(\cdot|\theta,x_i,Y^{(i-1)})$ are known densities depending upon (a set of) parameters $\theta\in\Theta$, where $\Theta$ is the parameter space, which may be infinite-dimensional. For the sake of generality, we shall consider $\theta=(\eta,\xi)$, where $\eta$ is a function of the covariates, which we more explicitly denote as $\eta(x)$. The covariate $x\in\mathcal{X}$, $\mathcal{X}$ being the space of covariates. The part $\xi$ of $\theta$ will be assumed to consist of other parameters, such as the unknown error variance. For Bayesian forward regression problems, some prior needs to be assigned on the parameter space $\Theta$. For notational convenience, we shall denote $f(\cdot|\theta,x_1)$ by $f(\cdot|\theta,x_1,Y^{(0)})$, so that we can represent (2.1) more conveniently as

$$y_i\sim f(\cdot|\theta,x_i,Y^{(i-1)})\ \text{ for } i=1,\ldots,n. \tag{2.2}$$

#### 2.1.1 Examples of the forward regression setup

• , where , where is some appropriate link function and is some function with known or unknown form. For known, suitably parameterized form, the model is parametric. If the form of is unknown, one may model it by a Gaussian process, assuming adequate smoothness of the function.

• , where , where is some appropriate link function and is some function with known (parametric) or unknown (nonparametric) form. Again, in case of unknown form of , the Gaussian process can be used as a suitable model under sufficient smoothness assumptions.

• $y_i=\eta(x_i)+\epsilon_i$, where $\eta$ is a parametric or nonparametric function and $\epsilon_i$ are Gaussian errors. In particular, $\eta$ may be a linear regression function, that is, $\eta(x)=\beta^\top x$, where $\beta$ is a vector of unknown parameters. Non-linear forms of $\eta$ are also permitted. Also, $\eta$ may be a reasonably smooth function of unknown form, modeled by some appropriate Gaussian process.

### 2.2 Forward pseudo-Bayes factor

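Letting $Y_n=\{y_i:i=1,\ldots,n\}$, $X_n=\{x_i:i=1,\ldots,n\}$, $Y_{n,-i}=Y_n\setminus\{y_i\}$ and $X_{n,-i}=X_n\setminus\{x_i\}$, let $\pi(\theta|Y_{n,-i},X_{n,-i},M)$ denote the posterior density at $\theta$, given data $Y_{n,-i}$, $X_{n,-i}$ and model $M$. Let the density of $y_i$ given $Y_{n,-i}$ and $X_n$ under model $M$ be denoted by $\pi(y_i|Y_{n,-i},X_n,M)$. Then note that

$$\pi(y_i|Y_{n,-i},X_n,M)=\int_{\Theta}f(y_i|\theta,x_i,Y^{(i-1)},M)\,d\pi(\theta|Y_{n,-i},X_{n,-i},M), \tag{2.3}$$

where

$$\pi(\theta|Y_{n,-i},X_{n,-i},M)\propto\pi(\theta)\prod_{j\neq i;\,j=1}^{n}f(y_j|\theta,x_j,Y^{(j-1)},M). \tag{2.4}$$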

For any two models $M_1$ and $M_2$, the forward pseudo-Bayes factor (FPBF) of $M_1$ against $M_2$ based on the cross-validation posteriors of the form (2.3) is defined as follows:

$$FPBF^{(n)}(M_1,M_2)=\frac{\prod_{i=1}^n\pi(y_i|Y_{n,-i},X_n,M_1)}{\prod_{i=1}^n\pi(y_i|Y_{n,-i},X_n,M_2)}, \tag{2.5}$$

and we are interested in studying the limit $\lim_{n\to\infty}\frac{1}{n}\log FPBF^{(n)}(M_1,M_2)$ for almost all data sequences.
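As a concrete instance of (2.3) to (2.5), the following sketch computes a log FPBF for a toy variable selection problem. It assumes independent Gaussian responses with known error variance and a zero-mean Gaussian prior on the regression coefficients, so that every leave-one-out predictive density in (2.3) is available in closed form; these modelling choices and all names are illustrative assumptions, not the setups analysed later in the paper.

```python
import numpy as np


def loo_log_predictive_sum(X, y, tau2=10.0, sigma2=1.0):
    """Sum over i of log pi(y_i | Y_{n,-i}, X_n, M) for y = X beta + N(0, sigma2) noise,
    with prior beta ~ N(0, tau2 * I) and known sigma2 (both are assumptions)."""
    n, p = X.shape
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        X_rest, y_rest = X[keep], y[keep]
        # Leave-one-out posterior of beta is Gaussian with covariance V and mean m.
        V = np.linalg.inv(X_rest.T @ X_rest / sigma2 + np.eye(p) / tau2)
        m = V @ X_rest.T @ y_rest / sigma2
        # Predictive density of y_i is Gaussian with the moments below.
        pred_mean = X[i] @ m
        pred_var = sigma2 + X[i] @ V @ X[i]
        total += -0.5 * (np.log(2 * np.pi * pred_var) + (y[i] - pred_mean) ** 2 / pred_var)
    return total


rng = np.random.default_rng(1)
n = 100
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y = 1.0 + 2.0 * x1 + rng.standard_normal(n)       # only x1 drives the response

X_model1 = np.column_stack([np.ones(n), x1])       # M1 uses the relevant covariate
X_model2 = np.column_stack([np.ones(n), x2])       # M2 uses an irrelevant covariate
log_fpbf = loo_log_predictive_sum(X_model1, y) - loo_log_predictive_sum(X_model2, y)
print("log FPBF(M1, M2):", log_fpbf, " per observation:", log_fpbf / n)
```

A positive value of `log_fpbf` favours the model built on the relevant covariate; dividing by `n` gives the quantity whose limit is studied in Section 4.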

### 2.3 Inverse regression problem: first setup

In inverse regression, the basic premise remains the same as in forward regression detailed in Section 2.1. In other words, the distribution $f(\cdot|\theta,x_i,Y^{(i-1)})$, parameter $\theta$, the parameter space $\Theta$ and the covariate space $\mathcal{X}$ remain the same as in the forward regression setup. However, unlike in Bayesian forward regression problems where a prior needs to be assigned only to the unknown parameter $\theta$, a prior is also required for $\tilde x$, the unknown covariate observation associated with known response $\tilde y$, say. Given the entire dataset and $\tilde y$, the problem in inverse regression is to predict $\tilde x$. Hence, in the Bayesian inverse setup, a prior on $\tilde x$ is necessary. Given model $M$ and the corresponding parameters $\theta$, we denote such prior by $\pi(\tilde x|\theta,M)$. For Bayesian cross-validation in inverse problems it is pertinent to successively leave out $x_i$; $i=1,\ldots,n$, and compute the posterior predictive distribution of $\tilde x_i$ from $y_i$ and the rest of the data (see Bhatta07). But these posteriors are not useful for Bayes or pseudo-Bayes factors even for inverse regression setups. The reason is that the Bayes factor for inverse regression is still the ratio of posterior odds and prior odds associated with the competing models, which as usual translates to the ratio of the marginal densities of the data under the two competing models. The marginal densities depend upon the prior for $\tilde x$, however, under the competing models. The pseudo-Bayes factor for inverse models is then the ratio of products of the cross-validation posteriors of $y_i$, where $\tilde x_i$ and $\theta$ are marginalized out. Details of such inverse cross-validation posteriors and the definition of pseudo-Bayes factors for inverse regression are given below.

#### 2.3.1 Inverse pseudo-Bayes factor in this setup

In the inverse regression setup, first note that

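$$
\begin{aligned}
\pi(\tilde x_i,\theta|Y_{n,-i},X_{n,-i},M)
&=\frac{\pi(\tilde x_i,\theta|M)\prod_{j\neq i;\,j=1}^{n}f(y_j|\theta,x_j,Y^{(j-1)},M)}{\int_{\mathcal X}\int_{\Theta}d\pi(u,\psi)\prod_{j\neq i;\,j=1}^{n}f(y_j|\psi,x_j,Y^{(j-1)},M)}\\
&=\frac{\pi(\tilde x_i|\theta,M)\,\pi(\theta|M)\prod_{j\neq i;\,j=1}^{n}f(y_j|\theta,x_j,Y^{(j-1)},M)}{\int_{\mathcal X}\int_{\Theta}d\pi(u|\psi,M)\,d\pi(\psi|M)\prod_{j\neq i;\,j=1}^{n}f(y_j|\psi,x_j,Y^{(j-1)},M)}\\
&=\frac{\pi(\tilde x_i|\theta,M)\,\pi(\theta|M)\prod_{j\neq i;\,j=1}^{n}f(y_j|\theta,x_j,Y^{(j-1)},M)}{\int_{\Theta}d\pi(\psi|M)\prod_{j\neq i;\,j=1}^{n}f(y_j|\psi,x_j,Y^{(j-1)},M)}\\
&=\pi(\tilde x_i|\theta,M)\,\pi(\theta|Y_{n,-i},X_{n,-i},M).
\end{aligned}
\tag{2.6}
$$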

Using (2.6) we obtain

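$$
\begin{aligned}
\pi(y_i|Y_{n,-i},X_{n,-i},M)
&=\int_{\mathcal X}\int_{\Theta}f(y_i|\theta,\tilde x_i,Y^{(i-1)},M)\,d\pi(\tilde x_i,\theta|Y_{n,-i},X_{n,-i},M)\\
&=\int_{\Theta}g(Y^{(i)},\theta,M)\,d\pi(\theta|Y_{n,-i},X_{n,-i},M),
\end{aligned}
\tag{2.7}
$$

where

$$g(Y^{(i)},\theta,M)=\int_{\mathcal X}f(y_i|\theta,\tilde x_i,Y^{(i-1)},M)\,d\pi(\tilde x_i|\theta,M), \tag{2.8}$$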

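and $\pi(\theta|Y_{n,-i},X_{n,-i},M)$ is the same as (2.4). For any two models $M_1$ and $M_2$, the inverse pseudo-Bayes factor (IPBF) of $M_1$ against $M_2$ based on cross-validation posteriors of the form (2.7) is given by

$$IPBF^{(n)}(M_1,M_2)=\frac{\prod_{i=1}^n\pi(y_i|Y_{n,-i},X_{n,-i},M_1)}{\prod_{i=1}^n\pi(y_i|Y_{n,-i},X_{n,-i},M_2)}, \tag{2.9}$$

and our goal is to investigate $\lim_{n\to\infty}\frac{1}{n}\log IPBF^{(n)}(M_1,M_2)$ for almost all data sequences.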
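The only new ingredient relative to the forward case is the function $g$ in (2.8), which averages the response density over the prior of the unobserved covariate. The sketch below evaluates $g$ by Monte Carlo in a toy case where the answer is known exactly: it assumes $y_i=\beta\tilde x_i+\epsilon_i$ with $\epsilon_i\sim N(0,\sigma^2)$ and a standard normal prior on $\tilde x_i$, so that $g$ is the $N(0,\beta^2+\sigma^2)$ density at $y_i$. The model, the prior and the function names are assumptions made for illustration only.

```python
import numpy as np


def g_monte_carlo(y_i, beta, sigma, rng, draws=200_000):
    """Monte Carlo version of (2.8): average f(y_i | theta, x_tilde) over prior draws."""
    x_tilde = rng.standard_normal(draws)            # draws from the assumed N(0, 1) prior
    resid = (y_i - beta * x_tilde) / sigma
    density = np.exp(-0.5 * resid ** 2) / (sigma * np.sqrt(2 * np.pi))
    return density.mean()


def g_closed_form(y_i, beta, sigma):
    """Exact value of the same integral: the N(0, beta^2 + sigma^2) density at y_i."""
    var = beta ** 2 + sigma ** 2
    return np.exp(-0.5 * y_i ** 2 / var) / np.sqrt(2 * np.pi * var)


rng = np.random.default_rng(2)
g_mc = g_monte_carlo(1.0, beta=2.0, sigma=1.0, rng=rng)
g_exact = g_closed_form(1.0, beta=2.0, sigma=1.0)
print("Monte Carlo:", g_mc, " exact:", g_exact)
```

Once $g$ is available, (2.7) has the same structure as (2.3) with $g$ in place of $f$, so the leave-one-out machinery used for FPBF carries over unchanged.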

### 2.4 Inverse regression problem: second setup

In the inverse regression context, we consider another setup under which Chat20 establish consistency of the inverse cross-validation posteriors of $\tilde x_i$. Here we consider experiments with covariate observations $x_1,\ldots,x_n$ along with responses $y_{ij}$; $i=1,\ldots,n$, $j=1,\ldots,m$. In other words, the experiment considered here will allow us to have $m$ samples of responses $y_{i1},\ldots,y_{im}$ against each covariate observation $x_i$, for $i=1,\ldots,n$. Again, both $x_i$ and $y_{ij}$ are allowed to be multidimensional. Let $Y_{nm}=\{y_{ij}:i=1,\ldots,n;\ j=1,\ldots,m\}$.

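For $i=1,\ldots,n$ consider the following general model setup: conditionally on $\theta$, $x_i$ and $Y^{(i-1)}_j$,

$$y_{ij}\sim f(\cdot|\theta,x_i,Y^{(i-1)}_j);\quad j=1,\ldots,m, \tag{2.10}$$

independently, where $f(\cdot|\theta,x_1,Y^{(0)}_j)=f(\cdot|\theta,x_1)$ as before.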

#### 2.4.1 Prior for ~xi

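Following Chat20, we consider the following prior for $\tilde x_i$: given $\theta$,

$$\tilde x_i\sim U\left(B_{im}(\theta)\right), \tag{2.11}$$

the uniform distribution on

$$B_{im}(\theta)=\left\{x:H(\eta(x))\in\left[\bar y_i-\frac{c\,s_i}{\sqrt m},\ \bar y_i+\frac{c\,s_i}{\sqrt m}\right]\right\}, \tag{2.12}$$

where $H$ is some suitable transformation of $\eta(x)$. In (2.12), $\bar y_i$ and $s_i$ are the sample mean and the sample standard deviation of $y_{i1},\ldots,y_{im}$, and $c$ is some constant. We denote this prior by . Chat20 show that the density or any probability associated with this prior is continuous with respect to $\theta$.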
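To make the set in (2.12) concrete, the sketch below computes it for one covariate in a toy Poisson log-linear model. It assumes $H=\exp$ and $\eta(x)=\alpha+\beta x$ with known $\alpha$ and $\beta>0$, so that $B_{im}$ is an interval obtained by inverting $H\circ\eta$ at the two endpoints. The divisor $m-1$ in the sample standard deviation, the value $c=5$ and the function name are assumptions of this illustration.

```python
import numpy as np


def prior_interval(y_i, alpha, beta, c=5.0):
    """Endpoints of B_im for H = exp and eta(x) = alpha + beta * x with beta > 0."""
    m = y_i.size
    y_bar = y_i.mean()
    s_i = y_i.std(ddof=1)                 # assumed definition of s_i (divisor m - 1)
    half_width = c * s_i / np.sqrt(m)
    lower, upper = y_bar - half_width, y_bar + half_width
    if lower <= 0 or beta <= 0:
        raise ValueError("this sketch needs a positive lower endpoint and beta > 0")
    # exp(alpha + beta * x) lies in [lower, upper] iff x lies in the interval below.
    return (np.log(lower) - alpha) / beta, (np.log(upper) - alpha) / beta


rng = np.random.default_rng(3)
alpha, beta, x_true, m = 0.5, 1.0, 1.0, 400
y_i = rng.poisson(np.exp(alpha + beta * x_true), size=m)   # m replicates at x_true

x_lo, x_hi = prior_interval(y_i, alpha, beta)
x_tilde = rng.uniform(x_lo, x_hi, size=5)                  # draws from U(B_im)
print("B_im =", (x_lo, x_hi), " prior draws:", x_tilde)
```

The interval shrinks at rate $1/\sqrt m$ around the covariate value that generated the replicates, which is what makes this prior informative about $\tilde x_i$ when $m$ is large.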

#### 2.4.2 Examples of the prior

• , where and for all . Here, under the prior , has uniform distribution on the set .

• , where , with . Here $H$ is a known, one-to-one, continuously differentiable function and $\eta$ is an unknown function modeled by Gaussian process. Here, the prior for $\tilde x_i$ is the uniform distribution on

$$B_{im}(\eta)=\left\{x:\eta(x)\in H^{-1}\left\{\left[\bar y_i-\frac{c\,s_i}{\sqrt m},\ \bar y_i+\frac{c\,s_i}{\sqrt m}\right]\right\}\right\}.$$

• , where , with . Here $H$ is a known, increasing, continuously differentiable, cumulative distribution function and $\eta$ is an unknown function modeled by some appropriate Gaussian process. Here, the prior for $\tilde x_i$ is the uniform distribution on .

• , where is an unknown function modeled by some appropriate Gaussian process, and are zero-mean Gaussian noise with variance . Here, the prior for is the uniform distribution on . If , then the prior for is the uniform distribution on , where and .

Further examples of the prior in various other inverse regression models are provided in the later sections on illustrations in inverse setups and on simulation studies.

#### 2.4.3 Inverse pseudo-Bayes factor in this setup

For any two models $M_1$ and $M_2$ we define the inverse pseudo-Bayes factor for model $M_1$ against model $M_2$, for any $k\in\{1,\ldots,m\}$, as

$$IPBF^{(n,m,k)}(M_1,M_2)=\frac{\prod_{i=1}^n\pi(y_{ik}|Y_{nm,-i},X_{n,-i},M_1)}{\prod_{i=1}^n\pi(y_{ik}|Y_{nm,-i},X_{n,-i},M_2)} \tag{2.13}$$

and study its limiting behaviour for almost all data sequences. Note that since $y_{ik}$; $k=1,\ldots,m$, are distributed independently as in (2.10) given any $\theta$ and $x_i$, it would follow that if the limit exists, it must be the same for all $k$.

Suppose that the true data-generating parameter $\theta_0$ is not contained in $\Theta$, the parameter space considered. This is a case of misspecification that we must incorporate in our convergence theory of PBF. Our PBF asymptotics draws on posterior convergence theory for (possibly infinite-dimensional) parameters that also allows misspecification. In this regard, the approach presented in Shalizi09 seems to be very appropriate. Before proceeding further, we first provide a brief overview of this approach, which we conveniently exploit for our purpose.

## 3 A brief overview of Shalizi’s approach to posterior convergence

Let $Y_n=\{y_1,\ldots,y_n\}$, and let $f_\theta(Y_n)$ and $f_{\theta_0}(Y_n)$ denote the observed and the true likelihoods respectively, under the given value of the parameter $\theta$ and the true parameter $\theta_0$. We assume that $\theta\in\Theta$, where $\Theta$ is the (often infinite-dimensional) parameter space. However, we do not assume that $\theta_0\in\Theta$, thus allowing misspecification. The key ingredient associated with Shalizi’s approach to proving convergence of the posterior distribution of $\theta$ is to show that the asymptotic equipartition property holds. To elucidate, let us consider the following likelihood ratio:

$$R_n(\theta)=\frac{f_\theta(Y_n)}{f_{\theta_0}(Y_n)}.$$

Then, to say that for each $\theta\in\Theta$, the generalized or relative asymptotic equipartition property holds, we mean

$$\lim_{n\to\infty}\frac{1}{n}\log R_n(\theta)=-h(\theta), \tag{3.1}$$

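almost surely, where $h(\theta)$ is the KL-divergence rate given by

$$h(\theta)=\lim_{n\to\infty}\frac{1}{n}E_{\theta_0}\left[\log\frac{f_{\theta_0}(Y_n)}{f_\theta(Y_n)}\right], \tag{3.2}$$

provided that it exists (possibly being infinite), where $E_{\theta_0}$ denotes expectation with respect to the true model. Let

$$
\begin{aligned}
h(A)&=\operatorname*{ess\,inf}_{\theta\in A}h(\theta);\\
J(\theta)&=h(\theta)-h(\Theta);\\
J(A)&=\operatorname*{ess\,inf}_{\theta\in A}J(\theta).
\end{aligned}
$$

Thus, $h(A)$ can be roughly interpreted as the minimum KL-divergence between the postulated and the true model over the set $A$. If $h(\Theta)>0$, this indicates model misspecification. For $A\subseteq\Theta$, $h(A)\geq h(\Theta)$, so that $J(A)\geq 0$.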
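A quick numerical check of (3.1) may help fix ideas. The sketch below uses an independent Gaussian toy case chosen for this illustration rather than taken from the paper: the data come from $N(0,1)$ and the postulated density is $N(\theta,1)$, for which the KL-divergence rate in (3.2) equals $h(\theta)=\theta^2/2$.

```python
import numpy as np


def log_likelihood_ratio_rate(y, theta):
    """(1/n) log R_n(theta) for the N(theta, 1) model against the true N(0, 1)."""
    log_f_theta = -0.5 * (y - theta) ** 2     # normalising constants cancel in the ratio
    log_f_true = -0.5 * y ** 2
    return np.mean(log_f_theta - log_f_true)


rng = np.random.default_rng(4)
y = rng.standard_normal(100_000)              # data from the true model
theta = 1.0
rate = log_likelihood_ratio_rate(y, theta)
h_theta = 0.5 * theta ** 2                    # KL-divergence rate in this toy case
print("(1/n) log R_n:", rate, " -h(theta):", -h_theta)
```

In this independent setting (3.1) is just the strong law of large numbers; the content of Shalizi's conditions is that the same limit holds for dependent data.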

As regards the prior, it is required to construct an appropriate sequence of sieves such that and , for some .

With the above notions, verification of (3.1) along with several other technical conditions ensures that for any $A\subseteq\Theta$ such that $\pi(A)>0$,

$$\lim_{n\to\infty}\pi(A|Y_n)=0, \tag{3.3}$$

almost surely, provided that $h(A)>h(\Theta)$.

The seven assumptions of Shalizi leading to the above result, which we denote as (S1)–(S7), are provided in the Appendix. In what follows, we denote almost sure convergence by “$\xrightarrow{a.s.}$”, almost sure equality by “$\stackrel{a.s.}{=}$” and weak convergence by “$\xrightarrow{w}$”.

## 4 Convergence of PBF in forward problems

Let $M_0$ denote the true model, which is also associated with parameter $\theta\in\Theta_0$, where $\Theta_0$ is a parameter space containing the true parameter $\theta_0$. Then the following result holds.

###### Theorem 1.

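Assume conditions (S1)–(S7) of Shalizi, and let the infimum of $h(\theta)$ over $\Theta$ be attained at $\tilde\theta\in\Theta$, where $\tilde\theta\neq\theta_0$. Also assume that $\Theta$ and $\Theta_0$ are complete separable metric spaces and that for $i\geq 1$, $f(y_i|\theta,x_i,Y^{(i-1)},M)$ and $f(y_i|\theta,x_i,Y^{(i-1)},M_0)$ are bounded and continuous in $\theta$. Then,

$$\frac{1}{n}\log FPBF^{(n)}(M,M_0)=\frac{1}{n}\log\left[\frac{\prod_{i=1}^n\pi(y_i|Y_{n,-i},X_n,M)}{\prod_{i=1}^n\pi(y_i|Y_{n,-i},X_n,M_0)}\right]\xrightarrow{a.s.}-h(\tilde\theta),\ \text{ as } n\to\infty, \tag{4.1}$$

where, for any $\theta\in\Theta$,

$$h(\theta)=\lim_{n\to\infty}\frac{1}{n}E_{\theta_0}\left\{\sum_{i=1}^n\log\left[\frac{f(y_i|\theta_0,x_i,Y^{(i-1)},M_0)}{f(y_i|\theta,x_i,Y^{(i-1)},M)}\right]\right\}. \tag{4.2}$$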
###### Proof.

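By the hypotheses, (3.3) holds, from which it follows that for any $\epsilon>0$,

$$\lim_{n\to\infty}\pi(N^c_\epsilon|Y_{n,-i},X_{n,-i},M)=0, \tag{4.3}$$

where $N_\epsilon=\{\theta:h(\theta)\leq h(\Theta)+\epsilon\}$.

Now, by hypothesis, the infimum of $h(\theta)$ over $\Theta$ is attained at $\tilde\theta\in\Theta$, where $\tilde\theta\neq\theta_0$. Then by (4.3), the posterior of $\theta$ given $Y_{n,-i}$ and $X_{n,-i}$, given by (2.4), concentrates around $\tilde\theta$, the minimizer of the limiting KL-divergence rate from the true distribution. Formally, given any neighborhood $U$ of $\tilde\theta$, the set $N_\epsilon$ is contained in $U$ for sufficiently small $\epsilon$. It follows that for any neighborhood $U$ of $\tilde\theta$, $\pi(U|Y_{n,-i},X_{n,-i},M)\to 1$, almost surely, as $n\to\infty$. Since $\Theta$ is a complete, separable metric space, it follows that (see, for example, Ghosh03, Ghosal17)

$$\pi(\cdot|Y_{n,-i},X_{n,-i},M)\xrightarrow{w}\delta_{\tilde\theta}(\cdot),\ \text{ almost surely, as } n\to\infty. \tag{4.4}$$

Then, due to (4.4) and the Portmanteau theorem, as $f(y_i|\theta,x_i,Y^{(i-1)},M)$ is bounded and continuous in $\theta$, it holds using (2.3), that

$$\pi(y_i|Y_{n,-i},X_n,M)\xrightarrow{a.s.}f(y_i|\tilde\theta,x_i,Y^{(i-1)},M),\ \text{ as } n\to\infty. \tag{4.5}$$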

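Now, due to (4.5),

$$\frac{1}{n}\sum_{i=1}^n\log\pi(y_i|Y_{n,-i},X_n,M)\xrightarrow{a.s.}\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n\log f(y_i|\tilde\theta,x_i,Y^{(i-1)},M),\ \text{ as } n\to\infty. \tag{4.6}$$

Also, essentially the same arguments leading to (4.5) yield

$$\pi(y_i|Y_{n,-i},X_n,M_0)\xrightarrow{a.s.}f(y_i|\theta_0,x_i,Y^{(i-1)},M_0),\ \text{ as } n\to\infty,$$

which ensures

$$\frac{1}{n}\sum_{i=1}^n\log\pi(y_i|Y_{n,-i},X_n,M_0)\xrightarrow{a.s.}\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n\log f(y_i|\theta_0,x_i,Y^{(i-1)},M_0),\ \text{ as } n\to\infty. \tag{4.7}$$

From (4.6) and (4.7) we obtain

$$\lim_{n\to\infty}\frac{1}{n}\log FPBF^{(n)}(M,M_0)\stackrel{a.s.}{=}\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n\log\left[\frac{f(y_i|\tilde\theta,x_i,Y^{(i-1)},M)}{f(y_i|\theta_0,x_i,Y^{(i-1)},M_0)}\right]\stackrel{a.s.}{=}-h(\tilde\theta), \tag{4.8}$$

where the rightmost step of (4.8), with $h$ given by (4.2), follows due to (3.1). Hence, the result is proved. ∎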
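The convergence in (4.1) can be observed numerically in a simple misspecified example. The sketch below is an illustration under assumptions of our own choosing, not one of the examples treated later: there are no covariates, the data are independent $N(0,1)$, the true model $M_0$ is $N(\theta,1)$ and the postulated model $M$ is $N(\theta,4)$, each with a $N(0,\tau^2)$ prior on $\theta$. The KL-divergence rate is then minimized at $\tilde\theta=0$ with $h(\tilde\theta)=\log 2+1/8-1/2$.

```python
import numpy as np


def loo_log_predictive_sum(y, s2, tau2=100.0):
    """Sum over i of log pi(y_i | Y_{n,-i}) for y_i ~ N(theta, s2), theta ~ N(0, tau2)."""
    n = y.size
    sum_rest = y.sum() - y                        # sum of the other n - 1 observations
    precision = (n - 1) / s2 + 1.0 / tau2         # leave-one-out posterior precision
    post_mean = (sum_rest / s2) / precision       # leave-one-out posterior mean
    pred_var = s2 + 1.0 / precision               # variance of the predictive density
    log_dens = -0.5 * (np.log(2 * np.pi * pred_var) + (y - post_mean) ** 2 / pred_var)
    return log_dens.sum()


rng = np.random.default_rng(5)
n = 5000
y = rng.standard_normal(n)                        # data from the true N(0, 1) model

log_fpbf = loo_log_predictive_sum(y, s2=4.0) - loo_log_predictive_sum(y, s2=1.0)
rate = log_fpbf / n
h_tilde = np.log(2.0) + 1.0 / 8.0 - 0.5           # KL(N(0,1) || N(0,4))
print("(1/n) log FPBF(M, M0):", rate, " -h(theta_tilde):", -h_tilde)
```

The negative limit means that the pseudo-Bayes factor of the misspecified model against the true one decays exponentially fast in $n$, as the theorem asserts.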

For postulated model $M_j$, let the KL-divergence rate $h(\theta)$ in (3.2) be denoted by $h_j(\theta)$, for $j\geq 1$.

###### Theorem 2.

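For models $M_0$, $M_1$ and $M_2$ with complete separable parameter spaces $\Theta_0$, $\Theta_1$ and $\Theta_2$, assume conditions (S1)–(S7) of Shalizi, and for $j=1,2$, let the infimum of $h_j(\theta)$ over $\Theta_j$ be attained at $\tilde\theta_j\in\Theta_j$, where $\tilde\theta_j\neq\theta_0$. Also assume that for $i\geq 1$, $f(y_i|\theta,x_i,Y^{(i-1)},M_j)$; $j=1,2$, and $f(y_i|\theta,x_i,Y^{(i-1)},M_0)$ are bounded and continuous in $\theta$. Then,

$$\frac{1}{n}\log FPBF^{(n)}(M_1,M_2)\xrightarrow{a.s.}-\left[h_1(\tilde\theta_1)-h_2(\tilde\theta_2)\right],\ \text{ as } n\to\infty, \tag{4.9}$$

where, for $j=1,2$, and for any $\theta\in\Theta_j$,

$$h_j(\theta)=\lim_{n\to\infty}\frac{1}{n}E_{\theta_0}\left\{\sum_{i=1}^n\log\left[\frac{f(y_i|\theta_0,x_i,Y^{(i-1)},M_0)}{f(y_i|\theta,x_i,Y^{(i-1)},M_j)}\right]\right\}. \tag{4.10}$$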
###### Proof.

The proof follows by noting that

$$\frac{1}{n}\log FPBF^{(n)}(M_1,M_2)=\frac{1}{n}\log FPBF^{(n)}(M_1,M_0)-\frac{1}{n}\log FPBF^{(n)}(M_2,M_0),$$

and then using (4.1) for $M=M_1$ and $M=M_2$. ∎

## 5 Convergence results for PBF in inverse regression: first setup

###### Theorem 3.

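Assume conditions (S1)–(S7) of Shalizi, and let the infimum of $h(\theta)$ over $\Theta$ be attained at $\tilde\theta\in\Theta$, where $\tilde\theta\neq\theta_0$. Also assume that $\Theta$ and $\Theta_0$ are complete separable metric spaces and that for $i\geq 1$, $g(Y^{(i)},\theta,M)$ and $g(Y^{(i)},\theta,M_0)$ are bounded and continuous in $\theta$. Then,

$$\frac{1}{n}\log IPBF^{(n)}(M,M_0)=\frac{1}{n}\log\left[\frac{\prod_{i=1}^n\pi(y_i|Y_{n,-i},X_{n,-i},M)}{\prod_{i=1}^n\pi(y_i|Y_{n,-i},X_{n,-i},M_0)}\right]\xrightarrow{a.s.}-h^*(\tilde\theta),\ \text{ as } n\to\infty, \tag{5.1}$$

where, for any $\theta\in\Theta$,

$$h^*(\theta)=\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n\log\left[\frac{g(Y^{(i)},\theta_0,M_0)}{g(Y^{(i)},\theta,M)}\right],$$

provided that the limit exists.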

###### Proof.

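Since $\pi(\theta|Y_{n,-i},X_{n,-i},M)$ remains the same as in Theorem 1, it follows as before that

$$\pi(\cdot|Y_{n,-i},X_{n,-i},M)\xrightarrow{w}\delta_{\tilde\theta}(\cdot),\ \text{ almost surely, as } n\to\infty.$$

Then, since $g(Y^{(i)},\theta,M)$ is bounded and continuous in $\theta$, the above ensures in conjunction with the Portmanteau theorem using (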