 # Asymptotic Theory of Bayes Factor for Nonparametric Model and Variable Selection in the Gaussian Process Framework

In this paper we consider the Bayes factor approach to general model and variable selection under a nonparametric, Gaussian process framework. Specifically, we establish that under reasonable conditions, the Bayes factor consistently selects the correct model with exponentially fast convergence rate. If the true model does not belong to the postulated model space, then the Bayes factor asymptotically selects the best possible model in the model space with exponentially fast convergence rate. We derive several theoretical applications of the proposed method to various setups like the simple linear regression, reproducing kernel Hilbert space (RKHS) models, autoregressive (AR) models, and combinations of such models.


## 1 Introduction

For $i = 1, \ldots, n$, let $y_i$ and $x_i$ denote the $i$-th response variable and the associated vector of covariates. We assume that the covariate $x_i$ consists of several components, and that it is required to select a subset of the components that best explains the response variable $y$.

Let $s$ denote any subset of the indices of the covariate components. We denote by $x_s$ the coordinates of $x$ associated with $s$. To relate $y$ to $x_s$ we consider the following nonparametric regression setup:

$$y = f(x_s) + \epsilon, \tag{1}$$

where $\epsilon$ is the random error and the function $f$ is considered unknown. We assume that $\epsilon \sim N(0, \sigma^2_\epsilon)$, where $\sigma^2_\epsilon > 0$.

By assuming this framework we allow $f$ to be a function of one, two, or any larger number of the available covariates. We further assume that there exists a set of regressors $s_0$ which truly influences the dependent variable $y$, so that $f(x_{s_0})$ is the true function. Our problem is to identify $s_0$, i.e., the set of truly active regressors. Note that we have not assumed any specific form for the function: irrespective of its form, we are only interested in identifying the active regressors.

## 2 Notations and concepts

By a model, here we mean a particular subset $s$. Our aim is to find the true model among all possible candidate models.

For any subset $s$, whatever the number of components in $s$, we represent $f(x_s)$ using basis functions as follows:

$$f(x_s) = \sum_{j=1}^{\infty} a_j K_j(x_s). \tag{2}$$

We assume that $K_j(x_s)$, henceforth abbreviated as $K_{j,s}$, denotes the $j$-th basis function spanning the relevant Hilbert space equipped with some appropriate inner product. $K_{j,s}$ can be expressed as the product of individual basis functions as

$$K_{j,s} = \prod_{\ell \in s} K_{j\ell}(x_{\ell,s}).$$

Here $K_{j\ell}(x_{\ell,s})$, henceforth $K_{j\ell}$, stands for the $j$-th basis function for the $\ell$-th component of the vector $x_s$. We make the following assumption regarding $K_{j\ell}$:

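• (A1) For every $j \geq 1$ and every $\ell$, $K_{j\ell}$ is uniformly bounded.

Note that assumption (A1) implies that the $K_{j,s}$ are uniformly bounded for all $j$ and $s$. Using assumption (A1), we have

$$\|f\| \lesssim \sum_{j=1}^{\infty} |a_j|, \tag{3}$$

where $\lesssim$ stands for "less than or equal to, up to some positive multiplicative constant".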

To ensure that $\|f\| < \infty$ almost surely we need to choose the prior on the coefficients $a_j$ carefully. In this regard we assume the following:

• (A2) $a_j \sim N(m_j, \sigma_j^2)$ independently for $j = 1, 2, \ldots$, where the $m_j$'s and $\sigma_j$'s satisfy

$$\sum_{j=1}^{\infty} |m_j| < \infty; \tag{4}$$

$$\sum_{j=1}^{\infty} \sigma_j < \infty. \tag{5}$$

The above two convergence assumptions ensure, by a simple application of Kolmogorov's three series theorem characterizing series convergence (see Chow and Teicher (1988)), that

$$\sum_{j=1}^{\infty} |a_j| < \infty, \quad \text{almost surely}. \tag{6}$$

The proof of (6) is provided in Appendix A. Now (6) guarantees that $\|f\| < \infty$ almost surely, via (3). Therefore, $f$ almost surely belongs to the Hilbert space spanned by the basis functions.

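Hence, the prior on $f$ is a Gaussian process with mean

$$\mu(\cdot) = \sum_{j=1}^{\infty} m_j \prod_{\ell \in s} K_{j\ell}(\cdot). \tag{7}$$

The covariance between $f(x_{s_1})$ and $f(x_{s_2})$ is given by

$$\mathrm{Cov}\big(f(x_{s_1}), f(x_{s_2})\big) = \sum_{j=1}^{\infty} \sigma_j^2 \prod_{\ell_1 \in s_1} K_{j\ell_1}(x_{\ell_1,s_1}) \prod_{\ell_2 \in s_2} K_{j\ell_2}(x_{\ell_2,s_2}). \tag{8}$$

By assumption (A1) that $K_{j\ell}$ is uniformly bounded, it is guaranteed using (4) and (5) that both (7) and (8) are well-defined.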

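Now, for the dataset $\{(y_i, x_{i,s}) : i = 1, \ldots, n\}$, where $x_{i,s}$ denotes the available $i$-th covariate vector associated with the indices $s$, (7) and (8) yield the $n$-component mean vector and the $n \times n$ covariance matrix, given by

$$\mu_{n,s} = \big(\mu(x_{1,s}), \ldots, \mu(x_{n,s})\big)^T; \tag{9}$$

$$\Sigma_{n,s} = \Big(\Big(\mathrm{Cov}\big(f(x_{i,s}), f(x_{j,s})\big)\Big)\Big); \quad i = 1, \ldots, n; \; j = 1, \ldots, n. \tag{10}$$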
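As a rough illustration of (9) and (10), the following sketch builds the mean vector and covariance matrix from a truncated version of the basis expansion. The cosine basis, the truncation level `J`, and the choices $m_j = \sigma_j = j^{-2}$ are assumptions made only for this example (they keep the basis uniformly bounded and satisfy (4) and (5)); the text does not prescribe any of them.

```python
import numpy as np

def basis(j, x):
    """Hypothetical uniformly bounded basis function K_{j,l}(x) = cos(j * x)."""
    return np.cos(j * x)

def gp_moments(X_s, J=50):
    """Truncated versions of the mean vector (9) and covariance matrix (10).

    X_s : (n, |s|) array holding the covariates selected by the subset s.
    J   : truncation level for the infinite sums (an assumption of this sketch).
    """
    n = X_s.shape[0]
    mu = np.zeros(n)
    Sigma = np.zeros((n, n))
    for j in range(1, J + 1):
        m_j = j ** -2.0        # assumed prior mean, so that sum |m_j| is finite
        sigma_j = j ** -2.0    # assumed prior sd, so that sum sigma_j is finite
        K_j = np.prod(basis(j, X_s), axis=1)  # K_{j,s}: product over components in s
        mu += m_j * K_j
        Sigma += sigma_j ** 2 * np.outer(K_j, K_j)
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3))   # 20 observations, 3 covariates
mu_s, Sigma_s = gp_moments(X[:, [0, 2]])   # subset s made of the 1st and 3rd covariates
```

Each summand of the covariance is an outer product scaled by $\sigma_j^2$, so the truncated matrix is symmetric and non-negative definite by construction.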

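The marginal distribution of $y_n = (y_1, \ldots, y_n)^T$ is then the $n$-variate normal, given by

$$y_n \sim N_n\big(\mu_{n,s}, \sigma^2_\epsilon I_n + \Sigma_{n,s}\big), \tag{11}$$

where $I_n$ is the identity matrix of order $n$. We denote this marginal model by $M_s$.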

##### The true model:

We assume that there exists exactly one particular subset of the covariate indices which is actually associated with the data-generating process of $y$. We term this subset the true subset. The evaluation of any proposed model selection procedure rests on its ability to identify this true subset, irrespective of the form of the function $f$. Once such a set is identified, a considerable amount of time and money can be saved by discarding the other regressors in future research, and this benefit does not depend on the functional form of the relation between the response and the regressors.

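Let us denote the true subset of covariate indices by $s_0$, and the true set of uniformly bounded basis functions by

$$\left\{ K_{j,s_0} = \prod_{\ell \in s_0} K_{j\ell};\; j = 1, 2, \ldots \right\}.$$

To distinguish the true model from the rest we add a superscript $t$ to the coefficients of the true model. The true function is then given by

$$f^t(x_{s_0}) = \sum_{j=1}^{\infty} a^t_j \prod_{\ell \in s_0} K_{j\ell}(x_{\ell,s_0}), \tag{12}$$

where $a^t_j \sim N\big(m^t_j, (\sigma^t_j)^2\big)$ independently, with $\sum_{j=1}^{\infty} |m^t_j| < \infty$ and $\sum_{j=1}^{\infty} \sigma^t_j < \infty$. We denote the mean vector and the covariance matrix of the Gaussian process prior associated with (12) by $\mu^t_{n,s_0}$ and $\Sigma^t_{n,s_0}$, respectively. We denote the corresponding marginal distribution of $y_n$ as $M^t_{s_0}$.

The Bayes factor of any model $M_s$ to the true model $M^t_{s_0}$ associated with the data $y_n$, given the uniform prior distribution on the model space, is given by

$$
\begin{aligned}
BF^n_{s,s_0} &= \frac{M_s(y_n)}{M^t_{s_0}(y_n)} \\
&= \frac{\left|\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0}\right|^{1/2}}{\left|\sigma^2_\epsilon I_n + \Sigma_{n,s}\right|^{1/2}} \exp\Big\{ -\tfrac{1}{2}(y_n - \mu_{n,s})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(y_n - \mu_{n,s}) \\
&\qquad\qquad + \tfrac{1}{2}(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1}(y_n - \mu^t_{n,s_0}) \Big\}.
\end{aligned}
\tag{13}
$$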

Consider the following lemma stating the expressions for the expectation and variance of the logarithm of $BF^n_{s,s_0}$. The proof is in the supplementary file.

###### Lemma 1.

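Under the given setup, the expectation and variance of the logarithm of the Bayes factor of any subset $s$ of regressors to the true subset $s_0$, computed under the true subset, are given as follows:

$$
\begin{aligned}
E_{s_0}\left[\log BF^n_{s,s_0}\right] &= -\tfrac{1}{2}\log\left|\sigma^2_\epsilon I_n + \Sigma_{n,s}\right| + \tfrac{1}{2}\log\left|\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0}\right| \\
&\quad - \tfrac{1}{2}\mathrm{tr}\left[\big(\Sigma^t_{n,s_0} - \Sigma_{n,s}\big)(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}\right] \\
&\quad - \tfrac{1}{2}(\mu_{n,s} - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0}).
\end{aligned}
\tag{14}
$$

$$
\begin{aligned}
\mathrm{Var}_{s_0}\left[\log BF^n_{s,s_0}\right] &= \tfrac{1}{2}\mathrm{tr}\left[\left\{I_n - (\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})\right\}^2\right] \\
&\quad + \mathrm{Cov}_{s_0}\Big[(y_n - \mu^t_{n,s_0})^T\left\{(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1} - (\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}\right\}(y_n - \mu^t_{n,s_0}), \\
&\qquad\qquad (\mu_{n,s} - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(y_n - \mu^t_{n,s_0})\Big] \\
&\quad + (\mu_{n,s} - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0}).
\end{aligned}
\tag{15}
$$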
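The expectation in Lemma 1 can be evaluated numerically once the mean vectors and covariance matrices are available. The sketch below is a minimal implementation under the assumption that the trace term is $\mathrm{tr}[(\Sigma^t_{n,s_0} - \Sigma_{n,s})(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}]$; the function name `expected_log_bf` and the randomly generated inputs are hypothetical and serve only as an illustration.

```python
import numpy as np

def expected_log_bf(mu_s, Sigma_s, mu_t, Sigma_t, sigma2_eps):
    """Expectation of log BF under the true model, following Lemma 1."""
    n = len(mu_s)
    S = sigma2_eps * np.eye(n) + Sigma_s   # covariance under the postulated model
    T = sigma2_eps * np.eye(n) + Sigma_t   # covariance under the true model
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_T = np.linalg.slogdet(T)
    d = mu_s - mu_t
    # tr[(Sigma_t - Sigma_s) S^{-1}] equals tr[S^{-1} (Sigma_t - Sigma_s)]
    trace_term = np.trace(np.linalg.solve(S, Sigma_t - Sigma_s))
    quad_term = d @ np.linalg.solve(S, d)
    return -0.5 * logdet_S + 0.5 * logdet_T - 0.5 * trace_term - 0.5 * quad_term

rng = np.random.default_rng(1)
n = 6
G = rng.normal(size=(n, n))
H = rng.normal(size=(n, n))
Sigma_t, Sigma_s = G @ G.T / n, H @ H.T / n   # arbitrary covariance matrices
mu_t, mu_s = rng.normal(size=n), rng.normal(size=n)

same = expected_log_bf(mu_t, Sigma_t, mu_t, Sigma_t, 1.0)   # model equals the truth
diff = expected_log_bf(mu_s, Sigma_s, mu_t, Sigma_t, 1.0)   # misspecified model
```

Since the expectation is the negative Kullback-Leibler divergence between the two Gaussian marginals, `same` is zero and `diff` is negative.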

For any square matrix $A$, let $\lambda_i(A)$ denote its $i$-th eigenvalue. For our purpose, let the eigenvalues be arranged in decreasing order, so that $\lambda_1(A)$ is the largest.

## 3 Weak consistency / probability convergence

In this section we modify the assumptions as follows:

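• (A3) Let $\Delta_{n,s} = (\mu_{n,s} - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0})$. We assume that for all $s$, as $n \to \infty$,

$$\frac{1}{n}\Delta_{n,s} \to \xi_s,$$

where $\xi_s$ is a non-negative constant.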

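To proceed, recall that $y_n = \mu^t_{n,s_0} + B_{s_0} z_n$, where $B_{s_0}$ denotes the appropriate lower triangular matrix associated with the Cholesky factorization $\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0} = B_{s_0} B_{s_0}^T$, and $z_n \sim N_n(0, I_n)$. Then

$$(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1}(y_n - \mu^t_{n,s_0}) = z_n^T z_n.$$

It also follows that

$$(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(y_n - \mu^t_{n,s_0}) = z_n^T B_{s_0}^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1} B_{s_0} z_n. \tag{16}$$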
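The two quadratic-form identities above are easy to confirm numerically. In the sketch below the matrices `T` and `S` are randomly generated positive definite stand-ins for $\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0}$ and $\sigma^2_\epsilon I_n + \Sigma_{n,s}$; they are assumptions of the example, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
G = rng.normal(size=(n, n))
H = rng.normal(size=(n, n))
T = 0.5 * np.eye(n) + G @ G.T / n   # stand-in for the true marginal covariance
S = 0.5 * np.eye(n) + H @ H.T / n   # stand-in for the postulated marginal covariance
mu_t = rng.normal(size=n)

B = np.linalg.cholesky(T)           # lower triangular factor with T = B @ B.T
z = rng.normal(size=n)              # standard normal vector
y = mu_t + B @ z                    # draw from the true marginal model

u = y - mu_t
quad_true = u @ np.linalg.solve(T, u)   # left-hand side of the first identity
quad_s = u @ np.linalg.solve(S, u)      # left-hand side of (16)
A = B.T @ np.linalg.solve(S, B)         # matrix appearing on the right of (16)
```

Here `quad_true` agrees with `z @ z` and `quad_s` agrees with `z @ A @ z` up to floating-point error.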

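Let $A_{n,s} = B_{s_0}^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1} B_{s_0}$, and let us make the following additional assumptions:

• (A4) $\frac{1}{n}\mathrm{tr}(A_{n,s}) \to \zeta_s$, as $n \to \infty$, where $\zeta_s$ is a non-negative constant.

• (A5)

$$\lambda_1(A_{n,s}) = O(1), \text{ as } n \to \infty; \tag{17}$$

$$\lambda_1\left[(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}\right] = O(1), \text{ as } n \to \infty. \tag{18}$$

• (A6) $\frac{1}{n}\sum_{i=1}^{n}\log\left[\frac{\sigma^2_\epsilon + \lambda_i(\Sigma^t_{n,s_0})}{\sigma^2_\epsilon + \lambda_i(\Sigma_{n,s})}\right] \to \log c_s$, as $n \to \infty$, where $c_s > 0$.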

###### Theorem 1.

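Assume (A1) – (A6). Then

$$\lim_{n \to \infty} E_{s_0}\left(\frac{1}{n}\log BF^n_{s,s_0}\right) = -\delta_s, \tag{19}$$

where $\delta_s = \frac{1}{2}\left(\zeta_s + \xi_s - 1 - \log c_s\right)$.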

###### Proof.

From (13) we find that the expectation of logarithm of the Bayes factor is given by

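$$
\begin{aligned}
\frac{1}{n}E_{s_0}\left[\log(BF^n_{s,s_0})\right] &= \frac{1}{2n}\log\frac{\left|\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0}\right|}{\left|\sigma^2_\epsilon I_n + \Sigma_{n,s}\right|} \\
&\quad - \frac{1}{2n}E_{s_0}\left[(y_n - \mu_{n,s})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(y_n - \mu_{n,s})\right] \\
&\quad + \frac{1}{2n}E_{s_0}\left[(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1}(y_n - \mu^t_{n,s_0})\right].
\end{aligned}
\tag{20}
$$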

To evaluate the first part in the above equation, note that

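$$
\begin{aligned}
\frac{1}{2n}\log\frac{\left|\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0}\right|}{\left|\sigma^2_\epsilon I_n + \Sigma_{n,s}\right|} &= \frac{1}{2n}\log\prod_{i=1}^{n}\frac{\sigma^2_\epsilon + \lambda_i(\Sigma^t_{n,s_0})}{\sigma^2_\epsilon + \lambda_i(\Sigma_{n,s})} \\
&= \frac{1}{2n}\sum_{i=1}^{n}\log\left[\frac{\sigma^2_\epsilon + \lambda_i(\Sigma^t_{n,s_0})}{\sigma^2_\epsilon + \lambda_i(\Sigma_{n,s})}\right] \\
&\to \frac{1}{2}\log c_s, \text{ as } n \to \infty \quad [\text{due to (A6)}].
\end{aligned}
\tag{21}
$$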

Note that , due to (A5).

For the second term of (20) we obtain

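$$
\begin{aligned}
&\frac{1}{2n}E_{s_0}\left[(y_n - \mu_{n,s})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(y_n - \mu_{n,s})\right] \\
&\quad = \frac{1}{2n}\mathrm{tr}\left[(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})\right] + \frac{1}{2n}\mathrm{tr}\left[(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0})(\mu_{n,s} - \mu^t_{n,s_0})^T\right] \\
&\quad = \frac{1}{2n}\mathrm{tr}(A_{n,s}) + \frac{1}{2n}(\mu_{n,s} - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0}) \\
&\quad \to \frac{\zeta_s}{2} + \frac{\xi_s}{2}, \text{ as } n \to \infty \quad [\text{due to (A3) and (A4)}].
\end{aligned}
\tag{22}
$$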

The last term of (20) is given by

$$\frac{1}{2n}E_{s_0}\left[(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1}(y_n - \mu^t_{n,s_0})\right] = \frac{1}{2}, \tag{23}$$

so that combining (21), (22) and (23) yields

$$\frac{1}{n}E_{s_0}\left[\log(BF^n_{s,s_0})\right] \to -\delta_s, \text{ as } n \to \infty. \tag{24}$$

The result (19) follows from (24). ∎
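As a sanity check on the three limits used in the proof, consider the degenerate case in which the postulated model coincides with the true one, so that $\Sigma_{n,s} = \Sigma^t_{n,s_0}$ and $\mu_{n,s} = \mu^t_{n,s_0}$ for every $n$. This special case is an illustration added here and is not part of the original argument:

```latex
\begin{aligned}
\frac{1}{2n}\log\frac{\left|\sigma^2_\epsilon I_n+\Sigma^t_{n,s_0}\right|}{\left|\sigma^2_\epsilon I_n+\Sigma_{n,s}\right|} &= 0
  && \text{(the two determinants are equal)},\\
\frac{1}{2n}\mathrm{tr}\left[(\sigma^2_\epsilon I_n+\Sigma_{n,s})^{-1}(\sigma^2_\epsilon I_n+\Sigma^t_{n,s_0})\right] &= \frac{1}{2n}\mathrm{tr}(I_n) = \frac{1}{2}
  && \text{(trace part of the second term)},\\
\frac{1}{2n}(\mu_{n,s}-\mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n+\Sigma_{n,s})^{-1}(\mu_{n,s}-\mu^t_{n,s_0}) &= 0
  && \text{(the mean vectors are equal)},\\
\frac{1}{n}E_{s_0}\left[\log(BF^n_{s,s_0})\right] &= 0 - \left(\frac{1}{2} + 0\right) + \frac{1}{2} = 0 .
\end{aligned}
```

The limit is therefore zero in this case, as it must be, because the Bayes factor of the true model against itself is identically one.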

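Our next theorem shows that $\frac{1}{n}\log(BF^n_{s,s_0})$ converges to $-\delta_s$ in mean square, as $n \to \infty$.

###### Theorem 2.

Under assumptions (A1) – (A6),

$$E_{s_0}\left[\frac{1}{n}\log(BF^n_{s,s_0}) + \delta_s\right]^2 \to 0, \tag{25}$$

as $n \to \infty$.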

Instead of proving Theorem 2 we shall prove a stronger version of the theorem in Section 4 in the context of almost sure convergence.

Combining Theorems 1 and 2 and applying Chebychev’s inequality, we obtain the following theorem:

###### Theorem 3.

Under assumptions (A1) – (A6),

$$\frac{1}{n}\log(BF^n_{s,s_0}) \stackrel{P}{\longrightarrow} -\delta_s. \tag{26}$$

## 4 Almost sure convergence

Now, let us replace assumption () with the slightly stronger assumption

• , for .

###### Theorem 4.

Assume (), (), (), (), () and (). Then

$$\sum_{n=1}^{\infty} E_{s_0}\left[\frac{1}{n}\log(BF^n_{s,s_0}) + \delta_s\right]^4 < \infty. \tag{27}$$

###### Proof.

For convenience, we shall work with

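$$E_{s_0}\left[\frac{1}{n}\left(\log(BF^n_{s,s_0}) - \tilde{E}_n\right) + \frac{1}{n}\tilde{E}_n + \delta_s\right]^4,$$

where $\tilde{E}_n = E_{s_0}\left[\log(BF^n_{s,s_0})\right]$. Observe that

$$E_{s_0}\left[\frac{1}{n}\left(\log(BF^n_{s,s_0}) - \tilde{E}_n\right) + \frac{1}{n}\tilde{E}_n + \delta_s\right]^4 \leq 8\left\{n^{-4}E_{s_0}\left[\log(BF^n_{s,s_0}) - \tilde{E}_n\right]^4 + \left[\frac{1}{n}\tilde{E}_n + \delta_s\right]^4\right\}. \tag{28}$$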

Now note that

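$$
\begin{aligned}
&E_{s_0}\left[\log(BF^n_{s,s_0}) - \tilde{E}_n\right]^4 \\
&\quad = E_{s_0}\Big[-\tfrac{1}{2}\left\{z_n^T A_{n,s} z_n - E_{s_0}(z_n^T A_{n,s} z_n)\right\} + (y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0}) \\
&\qquad\qquad + \tfrac{1}{2}\left\{(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1}(y_n - \mu^t_{n,s_0}) - n\right\}\Big]^4 \\
&\quad \leq E_{s_0}\Big[\tfrac{1}{2}\left|z_n^T A_{n,s} z_n - E_{s_0}(z_n^T A_{n,s} z_n)\right| + \left|(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0})\right| \\
&\qquad\qquad + \tfrac{1}{2}\left|(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1}(y_n - \mu^t_{n,s_0}) - n\right|\Big]^4 \\
&\quad \leq C\Big[E_{s_0}\left\{z_n^T A_{n,s} z_n - E_{s_0}(z_n^T A_{n,s} z_n)\right\}^4 + E_{s_0}\left\{(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0})\right\}^4 \\
&\qquad\qquad + E_{s_0}\left\{(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1}(y_n - \mu^t_{n,s_0}) - n\right\}^4\Big],
\end{aligned}
\tag{29}
$$

where $C$ is a positive constant. The above result follows by repeated application of the inequality $(a + b)^p \leq 2^{p-1}(a^p + b^p)$, for non-negative $a$, $b$, where $p \geq 1$.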

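Let us first obtain the asymptotic order of $E_{s_0}\left\{z_n^T A_{n,s} z_n - E_{s_0}(z_n^T A_{n,s} z_n)\right\}^4$. Note that

$$
\begin{aligned}
E_{s_0}\left\{z_n^T A_{n,s} z_n - E_{s_0}(z_n^T A_{n,s} z_n)\right\}^4 &= E_{s_0}(z_n^T A_{n,s} z_n)^4 - 4E_{s_0}(z_n^T A_{n,s} z_n)^3 E_{s_0}(z_n^T A_{n,s} z_n) \\
&\quad + 6E_{s_0}(z_n^T A_{n,s} z_n)^2\left\{E_{s_0}(z_n^T A_{n,s} z_n)\right\}^2 \\
&\quad - 4E_{s_0}(z_n^T A_{n,s} z_n)\left\{E_{s_0}(z_n^T A_{n,s} z_n)\right\}^3 + \left\{E_{s_0}(z_n^T A_{n,s} z_n)\right\}^4.
\end{aligned}
\tag{30}
$$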

The following results (see, for example, Magnus (1978), Kendall and Stuart (1947)) will be useful for our purpose.

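$$E_{s_0}(z_n^T A_{n,s} z_n) = \mathrm{tr}(A_{n,s}); \tag{31}$$

$$E_{s_0}(z_n^T A_{n,s} z_n)^2 = [\mathrm{tr}(A_{n,s})]^2 + 2\,\mathrm{tr}(A_{n,s}^2); \tag{32}$$

$$E_{s_0}(z_n^T A_{n,s} z_n)^3 = [\mathrm{tr}(A_{n,s})]^3 + 6\,\mathrm{tr}(A_{n,s})\,\mathrm{tr}(A_{n,s}^2) + 8\,\mathrm{tr}(A_{n,s}^3); \tag{33}$$

$$
\begin{aligned}
E_{s_0}(z_n^T A_{n,s} z_n)^4 &= [\mathrm{tr}(A_{n,s})]^4 + 32\,\mathrm{tr}(A_{n,s})\,\mathrm{tr}(A_{n,s}^3) + 12\,[\mathrm{tr}(A_{n,s}^2)]^2 \\
&\quad + 12\,[\mathrm{tr}(A_{n,s})]^2\,\mathrm{tr}(A_{n,s}^2) + 48\,\mathrm{tr}(A_{n,s}^4).
\end{aligned}
\tag{34}
$$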
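The algebra of combining these four raw moments into the fourth central moment can be checked numerically. The sketch below plugs the trace formulas into the binomial expansion and compares the result with $48\,\mathrm{tr}(A^4) + 12\,[\mathrm{tr}(A^2)]^2$, the standard closed form for the fourth central moment of a Gaussian quadratic form; the helper names and the small diagonal test matrix are hypothetical choices for illustration.

```python
import numpy as np

def power_traces(A):
    """Return tr(A), tr(A^2), tr(A^3), tr(A^4) for a symmetric matrix A."""
    A2 = A @ A
    return np.trace(A), np.trace(A2), np.trace(A2 @ A), np.trace(A2 @ A2)

def fourth_central_moment(A):
    """Fourth central moment of z^T A z, z ~ N(0, I), from the raw moments."""
    t1, t2, t3, t4 = power_traces(A)
    m1 = t1
    m2 = t1**2 + 2 * t2
    m3 = t1**3 + 6 * t1 * t2 + 8 * t3
    m4 = t1**4 + 32 * t1 * t3 + 12 * t2**2 + 12 * t1**2 * t2 + 48 * t4
    # Binomial expansion of E(Q - EQ)^4 in terms of the raw moments of Q.
    return m4 - 4 * m3 * m1 + 6 * m2 * m1**2 - 4 * m1 * m1**3 + m1**4

A = np.diag([1.0, 2.0])
_, t2, _, t4 = power_traces(A)
central = fourth_central_moment(A)
closed_form = 48 * t4 + 12 * t2**2
```

All terms involving $\mathrm{tr}(A)$ cancel in the expansion, which is why only $\mathrm{tr}(A^2)$ and $\mathrm{tr}(A^4)$ remain in the closed form.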

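Substituting (31), (32), (33) and (34) in (30) we obtain

$$E_{s_0}\left\{z_n^T A_{n,s} z_n - E_{s_0}(z_n^T A_{n,s} z_n)\right\}^4 = 48\,\mathrm{tr}(A_{n,s}^4) + 12\,[\mathrm{tr}(A_{n,s}^2)]^2. \tag{35}$$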

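Since $A_{n,s}$ is positive definite for any $n$, it follows from Lemma 2 of the Appendix that for any $k \geq 2$,

$$\mathrm{tr}(A_{n,s}^k) \leq \lambda_1(A_{n,s})\,\mathrm{tr}(A_{n,s}^{k-1}) \leq (\lambda_1(A_{n,s}))^2\,\mathrm{tr}(A_{n,s}^{k-2}) \leq \cdots \leq (\lambda_1(A_{n,s}))^{k-1}\,\mathrm{tr}(A_{n,s}). \tag{36}$$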
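The chain of trace inequalities can be illustrated on a small example. The sketch below uses a randomly generated symmetric positive definite matrix, which is an assumption of the example rather than a matrix arising from the model, and compares $\mathrm{tr}(A^k)$ with the final bound $\lambda_1(A)^{k-1}\,\mathrm{tr}(A)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
G = rng.normal(size=(n, n))
A = G @ G.T + 0.1 * np.eye(n)       # symmetric positive definite test matrix
lam1 = np.linalg.eigvalsh(A)[-1]    # largest eigenvalue (eigvalsh sorts ascending)

# tr(A^k) and the bound lam1^(k-1) * tr(A) for k = 1, ..., 4
traces = {k: np.trace(np.linalg.matrix_power(A, k)) for k in range(1, 5)}
bounds = {k: lam1 ** (k - 1) * traces[1] for k in range(1, 5)}
```

For $k = 1$ the bound holds with equality; for larger $k$ it is strict unless all eigenvalues of the matrix coincide.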

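Now, $\lambda_1(A_{n,s}) = O(1)$, as $n \to \infty$, by (A5) (17). Hence, it follows using (36) and (A4), that for any fixed positive integer $k$,

$$\mathrm{tr}(A_{n,s}^k) = O(n). \tag{37}$$

Substituting (37) in (35) we see that

$$E_{s_0}\left\{z_n^T A_{n,s} z_n - E_{s_0}(z_n^T A_{n,s} z_n)\right\}^4 = O(n^2). \tag{38}$$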

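Let us now obtain the asymptotic order of $E_{s_0}\left\{(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0})\right\}^4$. Note that $(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0})$ is univariate normal with mean zero and variance

$$\hat{\sigma}^2_n = (\mu_{n,s} - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0}).$$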

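Now (18) holds if and only if there exists $c > 0$ such that $cI_n - (\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}$ is non-negative definite, for large enough $n$. Hence, further using (A1) and (A2) to see that the components of $\mu_{n,s} - \mu^t_{n,s_0}$ are uniformly bounded for all $n$, we obtain $\hat{\sigma}^2_n = O(n)$. Hence it follows that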

$$E_{s_0}\left\{(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma_{n,s})^{-1}(\mu_{n,s} - \mu^t_{n,s_0})\right\}^4 = 3\hat{\sigma}^4_n = O(n^2). \tag{39}$$

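Finally, we deal with $E_{s_0}\left\{(y_n - \mu^t_{n,s_0})^T(\sigma^2_\epsilon I_n + \Sigma^t_{n,s_0})^{-1}(y_n - \mu^t_{n,s_0}) - n\right\}^4$, which is the same as $E_{s_0}(z_n^T z_n - n)^4$. Since $z_n^T z_n = \sum_{i=1}^{n} z_i^2$, where, for $i = 1, \ldots, n$, the $z_i$ are independent $N(0, 1)$ variables, by Lemma B of Serfling (1980) (page 68), it follows that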

 Es0(zT