Bayesian model selection consistency and oracle inequality with intractable marginal likelihood

01/02/2017, by Yun Yang, et al., Florida State University

In this article, we investigate large sample properties of model selection procedures in a general Bayesian framework when a closed form expression of the marginal likelihood function is not available or a local asymptotic quadratic approximation of the log-likelihood function does not exist. Under appropriate identifiability assumptions on the true model, we provide sufficient conditions for a Bayesian model selection procedure to be consistent and to exhibit the Occam's razor phenomenon, i.e., the probability of selecting the "smallest" model that contains the truth tends to one as the sample size goes to infinity. In order to show that a Bayesian model selection procedure selects the smallest model containing the truth, we impose a prior anti-concentration condition, requiring the prior mass assigned by large models to a neighborhood of the truth to be sufficiently small. In a more general setting where the strong model identifiability assumption may not hold, we introduce the notion of local Bayesian complexity and develop oracle inequalities for Bayesian model selection procedures. Our Bayesian oracle inequality characterizes a trade-off between the approximation error and a Bayesian characterization of the local complexity of the model, illustrating the adaptive nature of averaging-based Bayesian procedures towards achieving an optimal rate of posterior convergence. Specific applications of the model selection theory are discussed in the context of high-dimensional nonparametric regression and density regression, where the regression function or the conditional density is assumed to depend on a fixed subset of predictors. As a result of independent interest, we propose a general technique for obtaining upper bounds on certain small ball probabilities of stationary Gaussian processes.


1 Introduction

A Bayesian framework offers a flexible and natural way to conduct model selection by placing prior weights over different models and using the posterior distribution to select the best one. However, unlike penalization-based model selection methods, there is a lack of general theory for understanding the large sample properties of Bayesian model selection procedures from a frequentist perspective. As a motivating example, we consider the problem of selecting a model from a sequence of nested models. In this special example, the Occam's Razor [4] phenomenon suggests that a good model selection procedure is expected to select the smallest model that contains the truth. This example motivates us to investigate the consistency of a Bayesian model selection procedure, that is, whether the posterior tends to concentrate all its mass on the smallest model space that contains the true data generating model.

In the frequentist literature, most model selection methods are based on optimization, where penalty terms are incorporated to penalize models with higher complexity. A large volume of the literature focuses on excess risk bounds and oracle inequalities, which are characterized via either global measures of model complexity [36, 39, 3, 7, 16], which typically yield a suboptimal "slow rate", or improved local measures of complexity [16, 2, 19], which yield an optimal "fast rate" [1, 26]. An overwhelming amount of recent literature on penalization methods for high-dimensional statistical problems can also be analyzed from the model selection perspective. For example, in high dimensional linear regression, the famous Lasso [31] places an $\ell_1$ penalty to induce sparsity, which can be viewed as selecting a model from the model space consisting of a sequence of $\ell_1$-balls with increasing radii; in sparse additive regression, viewing the model space as the collection of additive function spaces involving different subsets of covariates, with each univariate component lying in a ball of increasing radius in a univariate reproducing kernel Hilbert space (RKHS), [27] proposed a minimax-optimal penalized method with a penalty term proportional to the sum of the empirical norms and the RKHS norms of the univariate component functions.

In the classical literature on Bayesian model selection in low dimensional parametric models, most results on model selection consistency rely on the critical property that the log-likelihood function can be locally approximated by a quadratic form of the parameter (such as the local asymptotic normality property) under a set of regularity assumptions in the asymptotic regime where the sample size tends to infinity. There is a growing body of literature providing theoretical understanding of Bayesian variable selection for linear regression with a growing number of covariates, which is a special case of model selection. In the moderate-dimension scenario, where the number of covariates is allowed to grow with, but remain smaller than, the sample size, [29] established variable selection consistency in a Bayesian linear model, meaning that the posterior probability of the true model that contains all influential covariates tends to one as the sample size grows to infinity. [15] showed a selection inconsistency phenomenon for several commonly used mixture priors, including local mixture priors (a point mass at zero combined with a slab density that is nonzero at the null value 0), when the number of covariates grows too quickly relative to the sample size. To address this, they advocated the use of non-local mixture priors (whose slab density vanishes at the null value 0) and obtained selection consistency when the dimension grows at a suitably restricted rate. [8] provided several conditions on the design matrix and the minimum signal strength to ensure selection consistency with local priors in a comparable regime. [24] considered selection consistency using a spike and slab local prior in a high-dimensional scenario where the number of covariates can grow nearly exponentially with the sample size. [40] showed variable selection consistency for high-dimensional Bayesian linear regression in which a prior is placed directly over the model space and penalizes each covariate included in a model by a prespecified multiplicative factor; in this setting, they also showed that a particular Markov chain Monte Carlo algorithm for sampling from the model space is rapidly mixing, meaning that the number of iterations required for the chain to come within a prescribed distance of its stationary distribution from any initial configuration is at most polynomial in the problem dimensions. Although the aforementioned results on variable selection consistency in Bayesian linear models are promising, their proofs are all based on analyzing closed form expressions of the marginal likelihood function, that is, the likelihood function integrated with respect to the conditional prior distribution of the parameters given the model.

The assumption of either the existence of a closed form expression for the marginal likelihood function or the existence of a local asymptotic quadratic approximation of the log-likelihood function significantly impedes the applicability of current proof techniques to general model selection problems. For example, this assumption precludes the case when the parameter space is infinite dimensional, such as a space of functions or of conditional densities indexed by predictors. To the best of our knowledge, little is known about the model selection consistency of an infinite dimensional Bayesian model, or more generally when the marginal likelihood is intractable. The only relevant work in this direction is [12], where the authors considered Bayesian nonparametric density estimation under an unknown regularity parameter, such as the smoothness level, which serves as the model index. In this setting, they showed that the posterior distribution tends to give negligible weight to models that are bigger than the optimal one, and thus selects the optimal model or smaller models that also approximate the true density well.

The goal of the current paper is to build a general theory for studying large sample properties of Bayesian model selection procedures, for example, model selection consistency and oracle inequalities. We show that in the Bayesian paradigm, the local Bayesian complexity, defined through the negative logarithm of the prior probability mass assigned to a certain Kullback-Leibler divergence ball around the true model, plays the same role as the local complexity measures appearing in oracle inequalities for penalized model selection methods. For example, when the conditional prior within each model is close to a "uniform" distribution over the parameter space, the local Bayesian complexity behaves like a local covering entropy, recovering classical results [18] by Le Cam for characterizing rates of convergence using local entropy conditions. In the special case of parametric models, when the prior is thick at the truth (a common assumption in Bernstein-von Mises type results; refer to [32]), the local Bayesian complexity scales linearly in the dimension of the parameter space, recovering classical asymptotic theory on Bayesian model selection based on the Bayesian information criterion (BIC). In this article, we build oracle inequalities for Bayesian model selection procedures using the notion of local Bayesian complexity. Our oracle inequality implies that by properly distributing prior mass over different models, the resulting posterior distribution adaptively allocates its mass to models with the optimal rate of posterior convergence, revealing the adaptive nature of averaging-based Bayesian procedures.
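To make the BIC connection concrete, here is a standard back-of-the-envelope calculation (our own illustration, not reproduced from the paper), assuming a $d$-dimensional model whose prior density is bounded away from zero near the truth and whose Kullback-Leibler neighborhoods behave like Euclidean balls:
\[
\Pi\big(B_n(\theta^*, \varepsilon)\big) \;\gtrsim\; \varepsilon^{d}
\quad\Longrightarrow\quad
-\log \Pi\big(B_n(\theta^*, \varepsilon_n)\big) \;\lesssim\; d \log(1/\varepsilon_n) \;=\; \tfrac{d}{2}\log n
\quad\text{for } \varepsilon_n \asymp n^{-1/2},
\]
so that the local Bayesian complexity matches the BIC penalty $\tfrac{d}{2}\log n$ up to constants.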

Under an appropriate identifiability assumption on the true model, we show that a Bayesian model selection procedure is consistent, that is, the probability of selecting the "smallest" model that contains the truth tends to one as the sample size goes to infinity. Here, the size of a model is determined by its local Bayesian complexity. In concrete examples, in order to show that a Bayesian model selection procedure tends to select the model that contains the truth and is smallest in the physical sense (for example, in the variable selection case, the smallest model is the one that contains exactly the influential covariates), we impose a prior anti-concentration condition, requiring the prior mass assigned by large models to a neighborhood of the truth to be sufficiently small. As a result of independent interest, in our proof of variable selection consistency for Bayesian high dimensional nonparametric regression using Gaussian process (GP) priors, we propose a general technique for obtaining upper bounds on certain small ball probabilities of stationary GPs. These results complement the lower bound results obtained in [17, 20, 34].

Our results reveal that in the framework of Bayesian model selection, averaging-based estimation procedures can gain advantages over optimization-based procedures [6] in that (i) the derivation of the convergence rate is simpler, since expectation exchanges with integration and elementary probability inequalities such as Chebyshev's inequality and Markov's inequality can be used in place of the more sophisticated empirical process tools needed for analyzing optimization-based estimators; and (ii) the averaging-based approach offers a more flexible framework for incorporating additional information and achieving adaptation to unknown hyper- or tuning parameters, owing to its average-case analysis, in contrast to the worst-case analysis of the optimization-based approach. Overall, our results indicate that a Bayesian model selection approach naturally penalizes larger models, since the prior distribution becomes more dispersed and the prior mass concentrating around the true model diminishes, manifesting the Occam's Razor phenomenon. This renders a Bayesian approach naturally rate-adaptive to the best model by optimally trading off between goodness-of-fit and model complexity.

The remainder of this paper is organized as follows. In §1.1, we introduce the notation used in the subsequent sections. In §2, we introduce the background and formulate the model selection problem. The assumptions required for an optimal posterior contraction rate are discussed in §2.1, with corresponding PAC-Bayes bounds in §2.2. The main results are stated in §3, with the Bayesian model selection consistency theorems in §3.1 and Bayesian oracle inequalities in §3.2. In §3.3 and §3.4, we discuss applications of the model selection theory in the context of high-dimensional nonparametric regression and density regression, where the regression function or the conditional density is assumed to depend on a fixed subset of predictors.

1.1 Notations

Let $h(p, q) = \{\int (\sqrt{p} - \sqrt{q})^2 \, d\mu\}^{1/2}$ and $D(p, q) = \int p \log(p/q)\, d\mu$ stand for the Hellinger distance and Kullback-Leibler divergence, respectively, between two probability density functions $p$ and $q$ relative to a common dominating measure $\mu$. We also use the additional discrepancy measure $V(p, q) = \int p\, \{\log(p/q)\}^2\, d\mu$. For any $\alpha \in (0, 1)$, let
\[
D_\alpha(p, q) \;=\; \frac{1}{\alpha - 1}\, \log \int p^{\alpha} q^{1 - \alpha} \, d\mu \qquad (1)
\]
denote the Rényi divergence of order $\alpha$. Let us also denote by $A_\alpha(p, q)$ the quantity $\int p^{\alpha} q^{1-\alpha}\, d\mu$, which we shall refer to as the $\alpha$-affinity. When $\alpha = 1/2$, the $\alpha$-affinity equals the Hellinger affinity. Moreover, $A_\alpha(p, q) \le 1$ for any $\alpha \in (0, 1)$, implying that $D_\alpha(p, q) \ge 0$ for any $\alpha \in (0, 1)$, with equality if and only if $p = q$ almost everywhere $[\mu]$. Relevant inequalities and properties related to the Rényi divergence can be found in [35]. Let $N(\varepsilon, \mathcal{F}, d)$ denote the $\varepsilon$-covering number of the space $\mathcal{F}$ with respect to a semimetric $d$. The operator "$\lesssim$" denotes less than or equal to up to a multiplicative positive constant. For a finite set $S$, let $|S|$ denote the cardinality of $S$. The set of natural numbers is denoted by $\mathbb{N}$. The $d$-dimensional simplex is denoted by $\Delta^{d}$. $I_d$ stands for the $d \times d$ identity matrix. Let $N(\mu_0, \Sigma)$ denote a multivariate normal density with mean $\mu_0$ and covariance matrix $\Sigma$.
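As a small sanity check (our own addition, using the standard definitions above): at $\alpha = 1/2$ the affinity reduces to the Hellinger affinity and relates to the Hellinger distance via
\[
A_{1/2}(p, q) \;=\; \int \sqrt{p\,q}\, d\mu \;=\; 1 - \tfrac{1}{2} h^2(p, q),
\qquad\text{so}\qquad
D_{1/2}(p, q) \;=\; -2 \log\big(1 - \tfrac{1}{2}h^2(p, q)\big) \;\ge\; h^2(p, q).
\]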

2 Background and problem formulation

Let $\{(\mathcal{X}^{(n)}, \mathcal{A}^{(n)}, P_\theta^{(n)}) : \theta \in \Theta\}$ be a sequence of statistical experiments with observations $X^{(n)}$, where $\theta$ is the parameter of interest living in an arbitrary parameter space $\Theta$, and $n$ is the sample size. Our framework allows the observations to deviate from the independent and identically distributed setting (abbreviated as non-i.i.d.) [13]. For example, this framework covers Gaussian regression with fixed design, where observations are independent but nonidentically distributed (i.n.i.d.). For each $\theta \in \Theta$, let $P_\theta^{(n)}$ admit a density $p_\theta^{(n)}$ relative to a $\sigma$-finite measure $\mu^{(n)}$. Assume that $(x, \theta) \mapsto p_\theta^{(n)}(x)$ is jointly measurable relative to $\mathcal{A}^{(n)} \otimes \mathcal{B}$, where $\mathcal{B}$ is a $\sigma$-field on $\Theta$.

For a model selection problem, let $\{\mathcal{M}_k\}$ be the model space, consisting of a countable number of models of interest. Here $k$ ranges over a countable index set, $\mathcal{M}_k$ is the model indexed by $k$, and $\Theta_k$ is its associated parameter space. Assume that the union of all the $\Theta_k$'s constitutes the entire parameter space $\Theta$, that is, $\Theta = \bigcup_k \Theta_k$. We consider the model selection problem in its full generality by allowing the $\Theta_k$'s to be arbitrary. For example, they can be Euclidean spaces of different dimensions or spaces of functions depending on different subsets of covariates. Moreover, the $\Theta_k$'s may overlap or have inclusion relationships.

We use the notation $\theta^*$ to denote the true parameter, also referred to as the truth, corresponding to the data generating model $P_{\theta^*}^{(n)}$. Let $k^*$ denote the index corresponding to the smallest model that contains $\theta^*$. More formally, the smallest model is defined through
\[
\theta^* \in \Theta_{k^*} \quad \text{and} \quad \Theta_{k^*} \subseteq \Theta_k \ \text{ for every } k \text{ such that } \theta^* \in \Theta_k. \qquad (2)
\]

Here, we have made the implicit assumption that for any $\theta^* \in \Theta$, there always exists a unique model $\mathcal{M}_{k^*}$ such that the preceding display is true. This assumption rules out pathological cases where $\theta^*$ may belong to several incomparable models and a smallest model cannot be defined.

Let $\lambda_k$ be the prior weight assigned to model $\mathcal{M}_k$ and $\Pi_k$ be the prior distribution over $\Theta_k$ in model $\mathcal{M}_k$; that is, $\Pi_k$ is the conditional prior distribution of $\theta$ given that model $\mathcal{M}_k$ is selected. Under this joint prior distribution on $(k, \theta)$, we obtain a joint posterior distribution using Bayes' theorem,
\[
\Pi_n\big(k,\, \theta \in A \mid X^{(n)}\big) \;=\; \frac{\lambda_k \int_A p_\theta^{(n)}(X^{(n)})\, \Pi_k(d\theta)}{\sum_{j} \lambda_j \int_{\Theta_j} p_\theta^{(n)}(X^{(n)})\, \Pi_j(d\theta)}, \qquad A \subseteq \Theta_k. \qquad (3)
\]

By integrating this posterior distribution over $\theta$ or summing it over $k$, we obtain respectively the marginal posterior distribution of the model index $k$ or of the parameter $\theta$. In this paper, we also consider a class of quasi-posterior distributions obtained by using the $\alpha$-fractional likelihood [38, 21, 6], which is the usual likelihood raised to the power $\alpha \in (0, 1)$, namely $\big[p_\theta^{(n)}(X^{(n)})\big]^{\alpha}$. Let $\Pi_{n,\alpha}$ denote the quasi-posterior distribution, also referred to as the $\alpha$-fractional posterior distribution, obtained by combining the fractional likelihood with the prior,
\[
\Pi_{n,\alpha}\big(k,\, \theta \in A \mid X^{(n)}\big) \;=\; \frac{\lambda_k \int_A \big[p_\theta^{(n)}(X^{(n)})\big]^{\alpha}\, \Pi_k(d\theta)}{\sum_{j} \lambda_j \int_{\Theta_j} \big[p_\theta^{(n)}(X^{(n)})\big]^{\alpha}\, \Pi_j(d\theta)}, \qquad A \subseteq \Theta_k. \qquad (4)
\]

The posterior distribution in (3) is a special case of the fractional posterior with $\alpha = 1$. For this reason, we also refer to the posterior in (3) as the regular posterior distribution. As described in [6] (refer to Section 2.1 of the current article for a brief review), the development of asymptotic theory for fractional posterior distributions demands much simpler conditions than for the regular posterior while maintaining the same rate of convergence. However, the downside is two-fold: (i) the credible intervals from the fractional posterior distribution may be roughly $\alpha^{-1/2}$ times wider than those from the regular posterior, at least for regular parametric models where the Bernstein-von Mises theorem holds; (ii) the simplified asymptotic results only apply to a certain class of distance measures. The first downside can be remedied by post-processing the credible intervals, for example, by shrinking their width about the center by a factor of $\sqrt{\alpha}$; and the second by considering a general class of risk function-induced fractional quasi-posteriors. We leave the latter as a topic of future research.
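A heuristic for point (i) (our own sketch, assuming a regular parametric model in which a Bernstein-von Mises-type approximation applies, with $\hat{\theta}_n$ the maximum likelihood estimator and $I(\theta^*)$ the Fisher information): raising the likelihood to the power $\alpha$ effectively multiplies the Fisher information by $\alpha$, so
\[
\Pi_{n,\alpha}\big(\cdot \mid X^{(n)}\big) \;\approx\; N\!\Big(\hat{\theta}_n, \ \tfrac{1}{\alpha\, n}\, I(\theta^*)^{-1}\Big),
\]
and credible intervals are inflated by a factor of $\alpha^{-1/2}$ relative to the regular posterior; they can be recalibrated by shrinking them about their centers by $\sqrt{\alpha}$.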

Definition:

We say that a Bayesian procedure has model selection consistency if
\[
\Pi_{n,\alpha}\big(k = k^* \mid X^{(n)}\big) \;\to\; 1 \quad \text{in } P_{\theta^*}^{(n)}\text{-probability as } n \to \infty,
\]
where either $\alpha = 1$ or $\alpha \in (0, 1)$, depending on whether the regular or the fractional posterior distribution is used.

Our general framework allows the truth $\theta^*$ to belong to multiple, even infinitely many $\Theta_k$'s, for example, when the model space is a sequence of nested models. In such a situation, the Occam's Razor principle suggests that a good statistical model selection procedure should be able to select the most parsimonious model that fits the data well. This criterion is consistent with our definition of Bayesian model selection consistency, which requires the marginal posterior distribution over the model space to concentrate on the smallest model that contains $\theta^*$. If a Bayesian procedure results in model selection consistency, then we can define a single selected model $\mathcal{M}_{\hat{k}}$, with its model index selected as
\[
\hat{k} \;=\; \operatorname*{arg\,max}_{k} \; \Pi_{n,\alpha}\big(k \mid X^{(n)}\big),
\]
which is the posterior mode over the index set. This model selection procedure satisfies the model selection consistency criterion from the frequentist perspective, i.e., $P_{\theta^*}^{(n)}\big(\hat{k} = k^*\big) \to 1$ as $n \to \infty$.
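As a concrete toy illustration of the selection rule above (our own sketch, not taken from the paper; all names and the conjugate Gaussian setup are purely illustrative), the following Python code computes regular ($\alpha = 1$) and fractional ($\alpha < 1$) posterior model probabilities for two nested models of a Gaussian mean, a point-null model and a conjugate slab model, where the (fractional) marginal likelihoods are available in closed form, and then selects the posterior-mode model.

```python
import numpy as np

def log_frac_marginal_null(y, alpha=1.0):
    # Point-null model (mu = 0): the alpha-fractional marginal is just likelihood(0)**alpha.
    n = len(y)
    return alpha * (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(y ** 2))

def log_frac_marginal_slab(y, tau2=1.0, alpha=1.0):
    # Slab model: y_i ~ N(mu, 1) with prior mu ~ N(0, tau2); integrate likelihood**alpha over the prior.
    n = len(y)
    ybar, ss = float(np.mean(y)), float(np.sum(y ** 2))
    v = tau2 + 1.0 / (alpha * n)          # marginal variance of ybar after tempering the likelihood by alpha
    return (-0.5 * alpha * n * np.log(2 * np.pi)
            - 0.5 * alpha * (ss - n * ybar ** 2)
            - 0.5 * np.log(alpha * n * v)
            - 0.5 * ybar ** 2 / v)

def posterior_model_probs(y, prior_weights=(0.5, 0.5), alpha=1.0):
    # Posterior over the two model indices: prior weight times (fractional) marginal likelihood.
    logs = np.log(prior_weights) + np.array([
        log_frac_marginal_null(y, alpha=alpha),
        log_frac_marginal_slab(y, alpha=alpha),
    ])
    logs -= logs.max()                    # stabilize before exponentiating
    probs = np.exp(logs)
    return probs / probs.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(0.0, 1.0, size=500)    # truth: mu = 0, so the smallest model containing it is the null
    for alpha in (1.0, 0.5):
        probs = posterior_model_probs(y, alpha=alpha)
        print(f"alpha={alpha}: P(null)={probs[0]:.3f}, P(slab)={probs[1]:.3f},",
              "selected model:", ["null", "slab"][int(np.argmax(probs))])
```

With data generated from the null model, the posterior mode is typically the null model, illustrating the Occam's razor behavior discussed above; replacing the data with a clearly nonzero mean flips the selection to the slab model.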

2.1 Contraction of posterior distributions

In this subsection, we review the theory [11, 13, 6] on the contraction rate of regular and fractional posterior distributions. For notational simplicity, we drop the dependence on the model index in the current (§2.1) and the next (§2.2) subsections; the results can be applied to any model with parameter space $\Theta$ and prior $\Pi$. Recall that the observations $X^{(n)}$ are realizations from the data generating model $P_{\theta^*}^{(n)}$, where the true parameter $\theta^*$ may or may not belong to the parameter space $\Theta$. When $\theta^*$ does not belong to $\Theta$, the model is considered to be misspecified.

Regular posterior distribution:

We consider the regular posterior distribution
\[
\Pi_n\big(\theta \in A \mid X^{(n)}\big) \;=\; \frac{\int_A p_\theta^{(n)}(X^{(n)})\, \Pi(d\theta)}{\int_{\Theta} p_\theta^{(n)}(X^{(n)})\, \Pi(d\theta)}. \qquad (5)
\]

We introduce three common assumptions that are adopted in the literature [11, 13]. Let $d$ be a semimetric on $\Theta$ used to quantify the distance between $\theta$ and $\theta^*$.

Assumption A1 (Test condition):

There exist constants $\xi > 0$ and $K > 0$ such that for every $\varepsilon > 0$ and every $\theta_1 \in \Theta$ with $d(\theta_1, \theta^*) > \varepsilon$, there exists a test $\phi_n$ (depending on $\theta_1$) such that
\[
P_{\theta^*}^{(n)}\, \phi_n \;\le\; e^{-K n \varepsilon^2}, \qquad \sup_{\theta \in \Theta:\; d(\theta, \theta_1) \le \xi \varepsilon} P_{\theta}^{(n)}\, (1 - \phi_n) \;\le\; e^{-K n \varepsilon^2}. \qquad (6)
\]

Assumption A1 ensures that $\theta^*$ is statistically identifiable: it guarantees the existence of a test function for testing $\theta^*$ against parameters close to $\theta_1$ in $\Theta$, and provides upper bounds on its Type I and Type II errors. In the special case when the observations are i.i.d. and $d$ is the Hellinger distance between $p_\theta$ and $p_{\theta^*}$, such a test always exists [13].

Assumption A2 (Prior concentration):

There exist a constant $C > 0$ and a sequence $\varepsilon_n \to 0$ with $n\varepsilon_n^2 \to \infty$ such that
\[
\Pi\big(B_n(\theta^*, \varepsilon_n)\big) \;\ge\; e^{-C n \varepsilon_n^2}.
\]

Here, for any $\varepsilon > 0$, $B_n(\theta^*, \varepsilon)$ is defined as the following $\varepsilon$-KL neighbourhood in $\Theta$ around $\theta^*$,
\[
B_n(\theta^*, \varepsilon) \;=\; \Big\{\theta \in \Theta : \; \tfrac{1}{n}\, D\big(p_{\theta^*}^{(n)}, p_{\theta}^{(n)}\big) \le \varepsilon^2, \;\; \tfrac{1}{n}\, V\big(p_{\theta^*}^{(n)}, p_{\theta}^{(n)}\big) \le \varepsilon^2 \Big\}.
\]

When the model is well-specified, a common way to verify Assumption A2 is to show that it holds for all $\theta^*$ in $\Theta$. This stronger version of the parameter prior concentration condition characterizes the compatibility of the prior distribution with the parameter space. In fact, this assumption together with Assumption A3 below implies that the prior distribution is almost "uniformly distributed" over $\Theta$: if the covering entropy $\log N(\varepsilon_n, \Theta, d)$ is of the order $n\varepsilon_n^2$, then there are roughly $e^{n\varepsilon_n^2}$ disjoint $\varepsilon_n$-balls in the parameter space, and Assumption A2 requires each such ball to receive prior mass at least $e^{-C n\varepsilon_n^2}$, which matches, up to a constant in the exponent, the average prior probability mass $e^{-n\varepsilon_n^2}$ received by those disjoint $\varepsilon_n$-balls.

Assumption A3 (Sieve sequence condition):

For some constant $c > 0$, there exists a sequence of sieves $\mathcal{F}_n \subset \Theta$, $n \ge 1$, such that

Assumption A3 allows us to focus attention on the most important region of the parameter space, one that is not too large but still possesses most of the prior mass. Roughly speaking, the sieve $\mathcal{F}_n$ can be viewed as the effective support of the prior distribution at sample size $n$. The construction of the sieve sequence is required merely for the purposes of the proof and is not needed for implementing the methods, in contrast to sieve estimators.
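For reference, the typical form such a sieve condition takes in this literature (our own rendering, following [13]; the exact constants in the paper's display may differ) is
\[
\log N\big(\varepsilon_n, \mathcal{F}_n, d\big) \;\le\; c\, n\varepsilon_n^2
\qquad\text{and}\qquad
\Pi\big(\Theta \setminus \mathcal{F}_n\big) \;\le\; e^{-c'\, n\varepsilon_n^2}
\]
for a sufficiently large constant $c'$ (depending on the constant $C$ in Assumption A2), so that the effective parameter space is both entropy-controlled and carries all but an exponentially small fraction of the prior mass.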

Theorem 1 (Contraction of the regular posterior [13]).

If Assumptions A1-A3 hold, then there exists a sufficiently large constant $M$ such that the posterior distribution (3) satisfies
\[
\Pi_n\big(\theta : d(\theta, \theta^*) \ge M \varepsilon_n \mid X^{(n)}\big) \;\to\; 0 \quad \text{in } P_{\theta^*}^{(n)}\text{-probability as } n \to \infty.
\]

We say that the Bayesian procedure for a model has posterior contraction rate at least $\varepsilon_n$ relative to $d$ if the display in Theorem 1 holds, or simply call $\varepsilon_n$ the posterior contraction rate relative to $d$ when no ambiguity arises. If we have an additional prior anti-concentration condition

(7)

where and is a sufficiently large constant, then by defining a new sieve sequence in Assumption A3 as

we have the following two-sided posterior contraction result, as a direct consequence of Theorem 1.

Corollary 1 (Two-sided contraction rate).

Under the assumptions of Theorem 1, if (7) is true for some sufficiently large , then

As we will show in the next section, a prior anti-concentration condition similar to (7) plays an important role in Bayesian model selection consistency, ensuring that overly large models containing the truth receive negligible posterior mass as $n \to \infty$.

Fractional posterior distributions:

Now let us turn to the fractional posterior distribution below, which is based on the fractional likelihood [38, 21, 6],
\[
\Pi_{n,\alpha}\big(\theta \in A \mid X^{(n)}\big) \;=\; \frac{\int_A \big[p_\theta^{(n)}(X^{(n)})\big]^{\alpha}\, \Pi(d\theta)}{\int_{\Theta} \big[p_\theta^{(n)}(X^{(n)})\big]^{\alpha}\, \Pi(d\theta)}. \qquad (8)
\]

Observe that the fractional likelihood implicitly penalizes all models that are far away from the true model through the following identity,
\[
P_{\theta^*}^{(n)}\Big[\Big(\frac{p_\theta^{(n)}(X^{(n)})}{p_{\theta^*}^{(n)}(X^{(n)})}\Big)^{\alpha}\Big] \;=\; e^{-(1-\alpha)\, D_\alpha^{(n)}(\theta, \theta^*)}, \qquad (9)
\]
where we use the shorthand $D_\alpha^{(n)}(\theta, \theta^*)$ for the Rényi divergence $D_\alpha\big(p_\theta^{(n)}, p_{\theta^*}^{(n)}\big)$ between the joint distributions. Hence, by dividing both the denominator and the numerator in (8) by the $\theta$-independent quantity $\big[p_{\theta^*}^{(n)}(X^{(n)})\big]^{\alpha}$, we can dispense with the test condition A1 and the sieve sequence condition A3, which are used to construct a test procedure discriminating against all far-away models in the theory for the contraction of the regular posterior distribution. It is discussed in [6] that some control on the complexity of the parameter space is also necessary to ensure the consistency of the regular posterior distribution [11, 13]. Therefore, the fractional posterior distribution is, at least theoretically, more appealing than the regular posterior distribution, since the prior concentration condition A2 alone is sufficient to guarantee the contraction of the posterior (see Theorem 2 below). More discussion and comparisons can be found in [6].
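To spell out the penalization mechanism (our own added step, combining the identity (9) with Fubini's theorem): the expected mass that the normalized numerator assigns to any measurable set $A \subset \Theta$ is
\[
P_{\theta^*}^{(n)}\!\left[\int_A \Big(\frac{p_\theta^{(n)}(X^{(n)})}{p_{\theta^*}^{(n)}(X^{(n)})}\Big)^{\alpha} \Pi(d\theta)\right]
\;=\; \int_A e^{-(1-\alpha)\, D_\alpha^{(n)}(\theta, \theta^*)}\, \Pi(d\theta),
\]
which is exponentially small whenever $A$ consists only of parameters far from $\theta^*$ in the Rényi sense; the denominator, in contrast, is lower bounded using the prior concentration condition A2 alone.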

Theorem 2 (Contraction of the fractional posterior [6]).

If Assumption A2 holds, then there exists a sufficiently large constant independent of such that the fractional posterior distribution (8) with order satisfies

and for any subset satisfying , where is a sufficiently large constant independent of ,

In the special case of i.i.d. observations, the Rényi divergence factorizes as $D_\alpha^{(n)}(\theta, \theta^*) = n\, D_\alpha(p_\theta, p_{\theta^*})$, where $D_\alpha(p_\theta, p_{\theta^*})$ is the $\alpha$-divergence for one observation. In this case, Theorem 2 shows that a Bayesian procedure using the fractional posterior distribution has a posterior contraction rate of at least $\varepsilon_n$ relative to the average $\alpha$-Rényi divergence $n^{-1} D_\alpha^{(n)}$. Similarly to Corollary 1 for the regular posterior distribution, if we have an additional prior anti-concentration condition

(10)

where and is a sufficiently large constant, then we have the following two-sided posterior contraction result as a corollary.

Corollary 2.

Under the assumptions of Theorem 2, if (10) is also true for a sufficiently large constant , then for some sufficiently large constant ,

2.2 PAC-Bayes bound

Model selection is typically a more difficult task than parameter estimation. As we will see in the next section, Bayesian model selection consistency is stronger than the property that the (fractional) posterior achieves a certain rate of contraction. In fact, Bayesian model selection consistency requires an additional identifiability condition (see Assumption B3 in the next section) on the true model, which assumes a proper gap between models that do not contain the truth and models containing it. However, a posterior contraction rate is still attainable even when such misspecified models receive considerable posterior mass, as long as the truth can be approximated by parameters in those models with approximation error not exceeding the target rate. Therefore, a Bayesian procedure may still achieve estimation optimality without model selection consistency. In our model selection framework, such a property can be best captured and characterized by a Bayesian version of a frequentist oracle inequality, namely a PAC-Bayes type inequality [22, 23]. For convenience and simplicity, we focus only on the fractional posterior distribution (8), which requires only Assumption A2. Similar results also apply to the regular posterior distribution when more technical conditions such as Assumptions A1 and A3 are assumed.

Under the setup and notation of Section 2.1, a typical PAC-Bayes type inequality takes the form
\[
\int R_n(\theta, \theta^*)\, \widehat{\Pi}_n(d\theta \mid X^{(n)}) \;\le\; \int \Delta_n(\theta, \theta^*)\, \rho(d\theta) \;+\; \frac{1}{\gamma}\, D(\rho, \Pi) \;+\; \mathrm{Rem},
\]
for all probability measures $\rho$ that are absolutely continuous with respect to the prior $\Pi$. Here $R_n$ is the risk function, $\Delta_n$ is a certain function that measures the discrepancy between $\theta$ and $\theta^*$ on the support of the measure $\rho$, $\gamma$ is a tuning parameter, $\mathrm{Rem}$ is a remainder term, and $\widehat{\Pi}_n$ denotes the (quasi-)posterior under study. The PAC-Bayes inequality of [6] that we review is for the fractional posterior distribution (8), where the risk function is a multiple of the $\alpha$-Rényi divergence in (9), and the discrepancy is a multiple of the negative log-likelihood ratio between $\theta$ and $\theta^*$.

Theorem 3 (PAC-Bayes inequality [6]).

Fix . Then,

(11)

holds with probability at least .
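For concreteness, here is a schematic version of (11), patterned after the fractional-posterior PAC-Bayes bounds in [6] (the exact constants and form in the paper may differ): for fixed $\alpha \in (0, 1)$ and $\varepsilon \in (0, 1)$, with probability at least $1 - \varepsilon$,
\[
\int \frac{1}{n} D_\alpha^{(n)}(\theta, \theta^*)\, \Pi_{n,\alpha}(d\theta \mid X^{(n)})
\;\le\;
\frac{\alpha}{1-\alpha} \int \frac{1}{n} \log\frac{p_{\theta^*}^{(n)}(X^{(n)})}{p_{\theta}^{(n)}(X^{(n)})}\, \rho(d\theta)
\;+\;
\frac{D(\rho, \Pi) + \log(1/\varepsilon)}{n(1-\alpha)},
\]
for any probability measure $\rho$ absolutely continuous with respect to the prior $\Pi$.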

In particular, if we choose the measure $\rho$ to be the conditional prior given the KL neighborhood $B_n(\theta^*, \varepsilon_n)$ in Assumption A2, then we have the following corollary.

Corollary 3 (Bayesian oracle inequality [6]).

Let satisfy and fix . Then, with probability at least ,

(12)
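Again schematically (our own rendering; the paper's statement may differ in constants), taking $\rho = \Pi(\cdot \mid B_n(\theta^*, \varepsilon_n))$ in the display above and invoking Assumption A2 yields a bound of the form, with high probability,
\[
\int \frac{1}{n} D_\alpha^{(n)}(\theta, \theta^*)\, \Pi_{n,\alpha}(d\theta \mid X^{(n)})
\;\lesssim\;
\frac{\alpha}{1-\alpha}\, \varepsilon_n^2
\;+\;
\frac{-\log \Pi\big(B_n(\theta^*, \varepsilon_n)\big) + \log(1/\varepsilon)}{n(1-\alpha)},
\]
where the first term plays the role of an approximation error and the second reflects the prior concentration, i.e., the local Bayesian complexity defined below.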

Note that the posterior expected risk always provides an upper bound on the estimation loss of the posterior mean, by Jensen's inequality, since the loss function is convex in its first argument. Corollary 3 shows that the posterior estimation risk reflects a trade-off between two terms: a term due to the approximation error, and a term that characterizes the prior concentration around the truth. Similar to the discussion after Assumption A2, this second term can also be viewed as a measure of the local covering entropy of the parameter space around the truth when the prior is "compatible" with the parameter space, and it therefore reflects the local complexity of the parameter space.

Definition:

For a model $\mathcal{M}_k$ with conditional prior $\Pi_k$, we define its local Bayesian complexity at a parameter $\theta$ with radius $\varepsilon$ via the negative logarithm of the prior mass $\Pi_k\big(B_n(\theta, \varepsilon)\big)$ of the KL neighbourhood given in Assumption A2. Here $\theta$ may or may not belong to $\Theta_k$.

According to Corollary 3, the Bayesian risk of an $\alpha$-fractional posterior is bounded, up to a constant, by the squared critical radius, where the critical radius is the smallest solution of the fixed-point relation that balances the local Bayesian complexity against $n\varepsilon^2$ (more discussion of the local Bayesian complexity can be found in [6]). Using this notation, Assumption A2 can be translated into the statement that the local Bayesian complexity at the truth with radius $\varepsilon_n$ is upper bounded in terms of $n\varepsilon_n^2$. When we refer to a local Bayesian complexity without mentioning its radius, we implicitly mean the local Bayesian complexity at the critical radius. Moreover, we will refer to a Bayesian risk bound like (12) as a Bayesian oracle inequality.
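In symbols (a schematic rendering under the notation used above, with a normalization of our own choosing; the paper may scale this quantity by $1/n$), the local Bayesian complexity of model $\mathcal{M}_k$ at $\theta$ with radius $\varepsilon$ can be written as
\[
\mathcal{C}_{n,k}(\theta, \varepsilon) \;=\; -\log \Pi_k\big(B_n(\theta, \varepsilon)\big),
\]
the critical radius is the smallest $\varepsilon_n$ satisfying $\mathcal{C}_{n,k}(\theta^*, \varepsilon_n) \le n\varepsilon_n^2$, and Assumption A2 says precisely that $\mathcal{C}_{n,k}(\theta^*, \varepsilon_n) \le C\, n\varepsilon_n^2$ along the stated sequence.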

3 Model selection consistency and oracle inequalities for Bayesian procedures

In this section, we start with our main result on the Bayesian model selection consistency for both the regular and fractional posterior distribution under some suitable identifiability condition on the truth. Next, we present Bayesian oracle inequalities for Bayesian model selection procedures that hold without the strong identifiability condition. Our Bayesian oracle inequality characterizes a trade-off between the approximation error and a local complexity of the model via the local Bayesian complexity, and illustrates the adaptive nature of averaging-based Bayesian procedures towards achieving an optimal rate of posterior convergence. Finally, we apply our theory to high dimensional nonparametric regression and density regression with variable selection.

3.1 Bayesian model selection consistency

Let us first introduce a few pieces of notation. Recall that $k^*$ is the index corresponding to the smallest model that contains $\theta^*$. We refer to models that are different from $\mathcal{M}_{k^*}$ but still contain $\theta^*$ as overly large models; according to our assumption (2) on the model space, $\Theta_{k^*} \subseteq \Theta_k$ for each such model. Similarly, we refer to models that do not contain $\theta^*$ as misspecified models. Under this terminology, the smallest true model, the overly large models and the misspecified models form a partition of the index set. As is common practice, when $A$ is a set, we define $d(\theta^*, A)$ as the infimum $d$-distance between $\theta^*$ and any point in $A$; the distance is defined in a similar way for the Rényi divergence $D_\alpha^{(n)}$. We make the following additional assumptions for Bayesian model selection consistency.

Assumption B1 (Prior concentration for model selection):

  1. (Parameter space prior concentration). There exists a sequence, denoted by , such that Assumption A2 holds with , and .

  2. (Model space prior concentration). The prior weight for satisfies .

The parameter space prior concentration condition requires the prior distribution associated with the target model $\mathcal{M}_{k^*}$ to put enough mass in a KL neighbourhood around the truth $\theta^*$. However, we do not need any prior concentration condition for the other models. The model space prior concentration condition requires a lower bound on the prior weight assigned to the target model $\mathcal{M}_{k^*}$, which is in the same spirit as the first part on parameter space prior concentration. In fact, if the prior weight of the target model is exponentially small with a sufficiently large constant in the exponent, then Theorem 1 implies that the marginal posterior probability of $\mathcal{M}_{k^*}$ tends to zero in probability as $n$ grows, no matter how well it fits the data.

Assumption B2 (Parameter space prior anti-concentration):

For each , there exists a sequence, denoted by , such that , and for the regular posterior distribution, holds for some sufficiently large constants and ; for the fractional posterior distribution, .

This assumption ensures that the posterior probability of overly large models that contain the truth tends to zero as $n \to \infty$. The first part requires the "anti-contraction rate" to honestly reflect the complexity of its associated model: the rate associated with a model is expected to become slower as its parameter space becomes larger. We remark that the rate defined in Assumption B2 can be identified with the rate in the lower bound in Corollary 1 or Corollary 2; this is the reason we call it an "anti-contraction rate". According to Corollary 1 or Corollary 2, Assumption B2 implies that the Bayes factor of an overly large model $\mathcal{M}_k$ versus the target model $\mathcal{M}_{k^*}$,
\[
\mathrm{BF}_{k, k^*} \;=\; \frac{\int_{\Theta_k} \big[p_\theta^{(n)}(X^{(n)})\big]^{\alpha}\, \Pi_k(d\theta)}{\int_{\Theta_{k^*}} \big[p_\theta^{(n)}(X^{(n)})\big]^{\alpha}\, \Pi_{k^*}(d\theta)}, \qquad (13)
\]
becomes exponentially small as $n \to \infty$, where $\alpha = 1$ or $\alpha \in (0, 1)$ depending on whether the regular or the fractional posterior distribution is used. A combination of this fact with the model space prior concentration condition in B1 implies that the posterior probability of the set of all model indices corresponding to overly large models tends to zero as $n \to \infty$.
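A heuristic for how the anti-concentration condition enters (our own sketch, not the paper's proof; $\delta_{n,k}$ denotes the anti-contraction radius of Assumption B2 in our notation): split the numerator of the Bayes factor (13), after dividing through by $\big[p_{\theta^*}^{(n)}(X^{(n)})\big]^{\alpha}$, according to whether $\theta$ lies inside or outside the anti-concentration neighborhood of $\theta^*$,
\[
\int_{\Theta_k} \Big(\tfrac{p_\theta^{(n)}}{p_{\theta^*}^{(n)}}\Big)^{\alpha} d\Pi_k
\;=\;
\int_{d(\theta, \theta^*) \le \delta_{n,k}} \Big(\tfrac{p_\theta^{(n)}}{p_{\theta^*}^{(n)}}\Big)^{\alpha} d\Pi_k
\;+\;
\int_{d(\theta, \theta^*) > \delta_{n,k}} \Big(\tfrac{p_\theta^{(n)}}{p_{\theta^*}^{(n)}}\Big)^{\alpha} d\Pi_k ;
\]
the first integral is small because the prior mass of the neighborhood is small (anti-concentration), the second is small in expectation via the identity (9) (or, for the regular posterior, via the tests of Assumption A1), and the denominator is bounded below through the prior concentration of $\Pi_{k^*}$ at $\theta^*$.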

Assumption B3 (Model identifiability):

There exists , where is some sufficiently large constant independent of , such that for all , we have for the regular posterior distribution, or for the fractional posterior distribution.

The separation gap in Assumption B3 is the best approximation error of the truth using elements of the parameter spaces corresponding to such misspecified models. The lower bound condition requires the best approximation error of misspecified models to be larger than the estimation error associated with the target model $\mathcal{M}_{k^*}$, so that the Bayes factor defined in (13) for a misspecified model also becomes exponentially small as $n \to \infty$, implying that $\mathcal{M}_{k^*}$ is identifiable. In the special case of high dimensional sparse linear regression, Assumption B3 becomes the beta-min condition [37] on the minimal magnitude of the nonzero regression coefficients, which is necessary for variable selection consistency.

Assumption B4 (Control on model complexity):

For each , Assumptions A1 and A3 hold with , and ; and for each , Assumptions A1 and A3 hold with , and . Moreover, holds for some sufficiently large constant .

The first part of this assumption is reminiscent of Assumptions A1 and A3 for controlling the complexity of each model when the regular posterior distribution is used. The second part is needed for technical reasons in the proof of Theorem 4 (to retain control when applying a union bound).

Theorem 4 (Model selection consistency).
  1. (Regular posteriors). Under Assumptions B1-B4, we have

  2. (Fractional posteriors). Under Assumptions B1-B3, we have

As with the theorems in Section 2.1 on the contraction rates of the regular and fractional posterior distributions, the latter is more flexible than the former, since it does not demand any assumption like B4 to control the model complexity in order to achieve Bayesian model selection consistency. Comparing the anti-concentration conditions in B2 for the regular and the fractional posteriors, the former has additional flexibility in the choice of the distance measure $d$ due to the extra test condition A1 made for the regular posterior. However, $d$ is typically weaker than, or equivalent to, the Rényi divergence of a given order $\alpha$. Therefore, the anti-concentration condition for the regular posterior is often a stronger condition than that for the fractional posterior. For the fractional posterior distribution, when a $d$-neighborhood of the truth is comparable to the KL neighborhood in the definition of the local Bayesian complexity (for example, in regression problems with Gaussian errors), a combination of Assumptions B1 and B2 imposes a two-sided constraint on the local Bayesian complexity of the true model, implying that the fractional posterior tends to concentrate all its mass on the model that contains the truth and has the smallest local Bayesian complexity.

3.2 Posterior contraction rate and Bayesian oracle inequality for model selection

Following Section 2.2, we show in this subsection that even when model selection consistency does not hold, a Bayesian procedure may still achieve estimation optimality by adaptively concentrating on models with the fastest rate of contraction.

Regular posterior distributions:

First, we focus on the use of the regular posterior distribution (3). We require a stronger version of the sieve sequence condition.

Assumption A3′ (Stronger sieve sequence condition):

For some constant , and any , there exists a sequence of sieves , , such that

Assumption A3′ is stronger than Assumption A3 since it assumes the existence of the sieve sequence for every relevant radius rather than only the single target rate. In most cases, the sieve sequence constructed for verifying Assumption A3 naturally extends to this more general setting (see the examples in the following subsections). The following theorem shows that when Assumptions A1 and A3′ hold for each model, the posterior distribution automatically adapts to the best contraction rate over all models.

Theorem 5 (Contraction rate of regular posterior distributions).

If Assumptions A1 and A3′ hold for each model $k$, then as $n \to \infty$,

Theorem 5 shows that the overall rate of contraction under model selection is composed of three terms. The first term characterizes the rate of posterior contraction under model $k$, through a balance between the approximation risk and the local Bayesian complexity. The second term can be viewed as the local Bayesian complexity over the model index set, reflecting the prior belief over different models. The third term is a model selection uncertainty term, proportional to the logarithm of the number of models, allowing us to attain estimation consistency with the number of models growing at most exponentially in the sample size. In the special case of an "objective prior" that assigns equal prior weight to each model, the second (model space complexity) term has the same order as the third (model selection uncertainty) term. In general, the posterior distribution adaptively achieves the fastest rate of contraction that optimally trades off these three terms.
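Schematically (our own summary of the trade-off just described, with constants suppressed and using the notation $\mathcal{C}_{n,k}$ and $\lambda_k$ introduced above), the adaptive squared rate behaves like
\[
\min_{k}\Big\{
\underbrace{\operatorname{dist}^2(\theta^*, \Theta_k)}_{\text{approximation error}}
\;+\;
\underbrace{\tfrac{1}{n}\,\mathcal{C}_{n,k}}_{\text{local Bayesian complexity}}
\;+\;
\underbrace{\tfrac{1}{n}\log\tfrac{1}{\lambda_k}}_{\text{model space complexity}}
\Big\}
\;+\;
\underbrace{\tfrac{1}{n}\log(\#\text{models})}_{\text{model selection uncertainty}},
\]
where $\operatorname{dist}$ denotes the relevant discrepancy (for example, $d$ or an average Rényi divergence).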

Theorem 5 requires much weaker conditions than Theorem 4: it requires neither a model identifiability condition nor an anti-concentration condition, which is consistent with the intuition that an optimal convergence rate is attainable without model selection consistency.

Fractional posterior distributions:

Now let us turn to the fractional posterior distribution (4) based on fractional likelihood functions. First, we present our result on its contraction rate.

Theorem 6 (Contraction rate of fractional posterior distributions).

For any $\alpha \in (0, 1)$, we have, as $n \to \infty$,

Compared to Theorem 5, Theorem 6 does not demand any conditions like Assumptions A1 or A3′ for controlling the complexity of models, and it obviates the model selection uncertainty term (which arises from a union bound for controlling the covering entropy of the sieves). As a consequence, a fractional posterior distribution may lead to a faster posterior contraction rate than a regular posterior distribution (see our example in Section 3.3 on variable selection in nonparametric regression). More interestingly, a fractional posterior distribution may still achieve estimation consistency without the typical constraint on the number of models, provided the prior is properly chosen (more discussion is provided after Theorem 7 below). This phenomenon can be explained by the average-case analysis nature of Bayesian approaches, as opposed to the worst-case analysis nature of optimization-based approaches (more discussion of the comparison between typical Bayesian and frequentist analyses can be found in [6]). By applying Theorem 3, we obtain the following Bayesian oracle inequality for the fractional posterior distribution.

Theorem 7 (Bayesian oracle inequality for model selection).

For any $\alpha \in (0, 1)$ and a sufficiently large constant independent of $n$, with probability tending to one as $n \to \infty$,

Similar to the observations in Theorem 6 and the remarks after Theorem 4, this theorem shows that the Bayesian risk for model selection is the best trade-off between an approximation error term, evaluated at the minimizer of the inner minimization step for a fixed model, and a model complexity penalty term consisting of the local Bayesian complexity of the model and the local Bayesian complexity of the model space. Due to the averaging-based analysis nature of a Bayesian procedure, the proof does not rely on any sophisticated empirical process tools, but only on an elementary Chebyshev inequality. Again, the overall Bayesian risk does not explicitly depend on the cardinality of the model space. However, when no informative prior knowledge is available on which models are more likely, we would like to put a non-informative prior distribution on the model space, which implicitly entails some constraints on the number of models. For example, in order for the model space complexity term not to substantially exceed the within-model complexity terms for any model, we need a lower bound condition on the prior model weights, which in turn implies a constraint on the cardinality of the model space.