
On frequentist coverage errors of Bayesian credible sets in high dimensions

In this paper, we study frequentist coverage errors of Bayesian credible sets for an approximately linear regression model with (moderately) high dimensional regressors, where the dimension of the regressors may increase with, but is smaller than, the sample size. Specifically, we consider Bayesian inference on the slope vector by fitting a Gaussian distribution to the error term and putting priors on the slope vector together with the error variance. The Gaussian specification of the error distribution may be incorrect, so we work with quasi-likelihoods. Under this setup, we derive finite sample bounds on frequentist coverage errors of Bayesian credible rectangles. The derivation of these bounds builds on a novel Berry–Esseen type bound on quasi-posterior distributions and recent results on the high-dimensional CLT over hyper-rectangles. We use this general result to quantify coverage errors of Castillo–Nickl and L^∞-credible bands for Gaussian white noise models, linear inverse problems, and (possibly non-Gaussian) nonparametric regression models. In particular, we show that Bayesian credible bands for those nonparametric models have coverage errors decaying polynomially fast in the sample size, implying advantages of Bayesian credible bands over confidence bands based on extreme value theory.


1. Introduction

Bayesian inference for high-dimensional or nonparametric statistical models is an active research area in the recent statistics literature. Posterior distributions provide not only point estimates but also credible sets. In a classical regular statistical model with a fixed finite-dimensional parameter space, it is well known that the Bernstein–von Mises (BvM) theorem holds under mild conditions: as the sample size increases, the posterior distribution can be approximated (in the total variation distance) by a normal distribution centered at an efficient estimator (e.g., the MLE) and with covariance matrix equal to the inverse of the Fisher information matrix. The BvM theorem implies that a Bayesian credible set is typically a valid confidence set in the frequentist sense, namely, the coverage probability of a Bayesian credible set evaluated under the true parameter value approaches its nominal level as the sample size increases; cf. [58], Chapter 10. There is also a large literature on the BvM theorem in nonparametric statistical models. Compared to the finite-dimensional case, however, Bayesian uncertainty quantification is more complicated and more sensitive to prior choices in the infinite-dimensional case. [21, 25] find some negative results on the BvM theorem in the infinite-dimensional case. [37, 40, 7] develop conditions under which the BvM theorem holds for Gaussian white noise models and nonparametric regression models; see also [20, 27, 53]. Employing weaker topologies, [10] elegantly formulate and establish the BvM theorem for Gaussian white noise models; see also [48] for the adaptive BvM theorem for Gaussian white noise models. Subsequently, [11] establish the BvM theorem in a weighted norm for nonparametric regression and density estimation. There are also several papers on frequentist coverage errors of norm-based Bayesian credible sets. [39] study asymptotic frequentist coverage errors of Bayesian credible sets based on Gaussian priors for linear inverse problems; see also [52, 55] for related results. Using an empirical Bayes approach, [54] develop Bayesian credible sets adaptive to unknown smoothness of the function of interest. We refer the reader to Chapter 7 in [32] and Chapter 12 in [29] for further references on these topics.

This paper studies frequentist coverage errors of Bayesian credible rectangles in an approximately linear regression model with an increasing number of regressors. We provide finite sample bounds on frequentist coverage errors of (quasi-)Bayesian credible rectangles based on sieve priors, where the model allows both an unknown bias term and an unknown error variance, and the true distribution of the error term may not be Gaussian. Sieve priors are prior distributions on a slope vector whose dimension increases with the sample size. We allow sieve priors to be non-Gaussian and not necessarily independent products across coordinates. We employ a “quasi-Bayesian” approach because we fit a Gaussian distribution to the error term even though the Gaussian specification may be incorrect; the resulting posterior distribution is called a “quasi-posterior.”

An important application of our results is finite sample quantification of Bayesian nonparametric credible bands based on sieve priors. We derive finite sample bounds on coverage errors of Castillo–Nickl [11] and L^∞-credible bands in Gaussian white noise models, linear inverse problems, and (possibly non-Gaussian) nonparametric regression models; see Section 3.1 ahead for the definition of Castillo–Nickl credible bands. The literature on frequentist confidence bands is broad. Frequentist approaches to constructing confidence bands date back to Smirnov and Bickel–Rosenblatt [51, 6]; see also [14, 19, 30] for more recent results. In contrast, there are relatively limited results on Bayesian uncertainty quantification based on L^∞-type norms. [31] study posterior contraction rates in L^∞-type norms, and [9] derive sharp posterior contraction rates in the L^∞-norm. [35] derive adaptive posterior contraction rates in the L^∞-norm for Gaussian white noise models and density estimation; see also [63] for adaptive posterior contraction rates. Building on their new BvM theorem, [11] develop credible bands (Castillo–Nickl bands) based on product priors that have correct frequentist coverage probabilities and at the same time shrink at (nearly) minimax optimal rates for Gaussian white noise models. [62] study conditions under which frequentist coverage probabilities of credible bands based on Gaussian series priors approach one as the sample size increases for nonparametric regression models with sub-Gaussian errors. [48] establish qualitative results on adaptive credible bands for Gaussian white noise models. Still, quantitative results on frequentist coverage errors of nonparametric credible bands are scarce. Our quantitative result complements the qualitative results established by [11] and [62] and contributes to the literature on Bayesian nonparametrics by developing a deeper understanding of Bayesian uncertainty quantification in nonparametric models. More recently, [61] also derive a quantitative result on coverage errors of Bayesian credible bands based on Gaussian process priors. We clarify the difference between their results and ours in Section 1.1 ahead.

Notably, our results lead to an implication that supports the use of Bayesian approaches to constructing nonparametric confidence bands. It is well known that confidence bands based on extreme value theory (such as those of [6]) perform poorly because of the slow convergence of Gaussian maxima. In the kernel density estimation case, [33] shows that confidence bands based on extreme value theory have coverage errors decaying only at the rate 1/log n, regardless of how we choose bandwidths, where n is the sample size, while those based on the bootstrap have coverage errors (for the surrogate function) decaying polynomially fast in the sample size; see also [14]. Our result shows that Bayesian credible bands (for the surrogate function) also have coverage errors decaying polynomially fast in the sample size and are comparable to bootstrap confidence bands, implying an advantage of Bayesian credible bands over confidence bands based on extreme value theory; see Remark 3.2 for more details.

The main ingredients in the derivation of the coverage error bound in Section 2 are (i) a novel Berry–Esseen type bound for the BvM theorem for sieve priors, i.e., a finite sample bound on the total variation distance between the quasi-posterior distribution based on a sieve prior and the corresponding Gaussian distribution, and (ii) recent results on the high-dimensional CLT over hyper-rectangles [13, 17]. Our Berry–Esseen type bound improves upon existing BvM-type results for sieve priors; see the discussion in Section 1.1. The high-dimensional CLT is used to approximate the sampling distribution of the centering estimator by a Gaussian distribution that matches the Gaussian distribution approximating the (normalized) posterior distribution.
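At a high level, the proof strategy can be summarized by a triangle-inequality decomposition of the following schematic form, where the notation (the true coefficient vector, the credible rectangle centered at the OLS estimator, the class of hyper-rectangles, and the approximating Gaussian law) is ours and the display is a heuristic sketch rather than the precise statement of Theorem 2.1:

\[
\bigl|\,\mathbb{P}\bigl(\theta_0 \in \widehat{R}_\alpha\bigr) - (1-\alpha)\,\bigr|
\;\lesssim\;
\underbrace{d_{\mathrm{TV}}\bigl(\Pi_n(\cdot \mid Y),\, N(\widehat{\theta}, \Sigma_n)\bigr)}_{\text{Berry--Esseen bound for the BvM theorem}}
\;+\;
\underbrace{\sup_{R \in \mathcal{R}}\bigl|\,\mathbb{P}\bigl(\widehat{\theta} - \theta_0 \in R\bigr) - \mathbb{P}\bigl(Z \in R\bigr)\,\bigr|}_{\text{high-dimensional CLT over hyper-rectangles}}
\;+\;
\underbrace{\text{(Gaussian anti-concentration term)}}_{\text{quantile comparison}},
\qquad Z \sim N(0, \Sigma_n).
\]

The first term controls the error in calibrating the posterior quantile, the second compares the frequentist distribution of the centering estimator with its Gaussian approximation on hyper-rectangles, and the third converts a small perturbation of the quantile into a small perturbation of the coverage probability.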

In addition, importantly, the derivations of coverage error bounds for nonparametric models in Section 3 are by no means trivial and require further technical arguments. Specifically, for Gaussian white noise models, we will consider both confidence bands with fixed cut-off dimensions and multi-scale confidence bands without fixed cut-off dimensions, which require different analyses to bound the effect of the bias on the coverage error. For linear inverse problems, we will cover both mildly and severely ill-posed cases. For nonparametric regression models, we will consider random designs and so cannot directly apply the result of Section 2, which assumes fixed designs; hence we have to account for the randomness of the design, and to this end we employ empirical process techniques.

1.1. Literature review and contributions

For a nonparametric regression model, [61] derive finite sample bounds on frequentist coverage errors of Bayesian credible bands based on Gaussian process priors. They assume (i) Gaussian process priors, (ii) that the error term follows a sub-Gaussian distribution, and (iii) that the error variance is known. The present paper markedly differs from [61] in that (i) we work with possibly non-Gaussian priors; (ii) we allow a more flexible error distribution; and (iii) we allow the error variance to be unknown. More specifically, (i) to allow for non-Gaussian priors, we develop novel Berry–Esseen type bounds on quasi-posterior distributions in (mildly) high dimensions; (ii) to weaken the dimensionality restriction and the moment assumption on the error distribution, we make use of the high-dimensional CLT over hyper-rectangles developed in [13, 17]; and (iii) when the error variance is unknown, the quasi-posterior contraction for the error variance affects the coverage error for the slope vector, so a careful analysis is required to handle the unknown variance.

The present paper also contributes to the literature on the BvM theorem in nonparametric statistics, which is now quite broad; see [10, 11, 25, 37, 40, 48] for Gaussian white noise models, [7, 27] for linear regression models with high dimensional regressors, and [61, 62] for nonparametric regression models with Gaussian process priors. See also [8, 12, 26, 28, 49, 44, 43] for related results. We refer the reader to [3, 18, 24, 38, 41] on the BvM theorem for quasi-posterior distributions.

Importantly, our Berry–Esseen type bound improves on conditions on the critical dimension for the BvM theorem. [27, 7, 53] study such critical dimensions for sieve priors. First, [7] does not cover the case of an unknown error variance, while the results in [27, 53] do. Our result is consistent with the result of [7] when the error variance is assumed to be known. Meanwhile, our result substantially improves on the results of [27, 53] for the unknown error variance case: in typical situations with unknown error variance, the results of [27, 53] require a more stringent condition relating the number of regressors to the sample size, whereas our result shows that the BvM theorem holds under a weaker growth condition, thereby improving on the condition of [27, 53]. See Remark 2.2 for more details. Our BvM-type result allows us to cover wider smoothness classes of functions when applied to the analysis of Bayesian credible bands in nonparametric models.

1.2. Organization and notation

The rest of the paper is organized as follows. In Section 2, we consider Bayesian credible rectangles for the slope vector in an approximately linear regression model and derive finite sample bounds on frequentist coverage errors of those credible rectangles. In Section 3, we discuss applications of the general result established in Section 2 to nonparametric models. Specifically, we cover Gaussian white noise models, linear inverse problems, and nonparametric regression models with possibly non-Gaussian errors. In Section 4, we present the proof of the main theorem (Theorem 2.1). In Section 5, we provide the proofs of the Berry–Esseen type bounds on quasi-posterior distributions (Propositions 2.5 and 2.6). Section 6 contains proofs of the other propositions in Section 2, and Section 7 contains proofs of the propositions in Section 3.

Throughout the paper, we will use the following notation. Let ‖·‖ denote the Euclidean norm, and let ‖·‖_∞ denote the max norm for vectors or the supremum norm for functions. Let N(μ, Σ) denote the Gaussian distribution with mean vector μ and covariance matrix Σ. For two sequences a_n and b_n depending on n, the notation a_n ≲ b_n signifies that a_n ≤ C b_n for some universal constant C > 0, and the notation a_n ≍ b_n signifies that a_n ≲ b_n and b_n ≲ a_n. The notation O(1) indicates a term that is bounded as n → ∞. For any symmetric positive semidefinite matrices A and B, the notation A ⪯ B signifies that B − A is positive semidefinite. Generic constants such as c and C do not depend on the sample size n or the dimension p, and their values may differ at each appearance.

2. Bayesian credible rectangles in high dimensions

Consider an approximately linear regression model

(1)   Y = Xθ + r + ε,

where Y = (Y_1, …, Y_n)^⊤ is a vector of outcome variables, X is an n × p design matrix, θ ∈ ℝ^p is an unknown coefficient vector, r is a deterministic (i.e., non-random) bias term, and ε = (ε_1, …, ε_n)^⊤ is a vector of i.i.d. error terms with mean zero and variance σ². We are primarily interested in the situation where the number of regressors p increases with the sample size n, i.e., p = p_n → ∞ as n → ∞, but we often suppress the dependence on n for the sake of notational simplicity. In addition, we allow the error variance to depend on n, i.e., σ² = σ_n², which allows us to include Gaussian white noise models in the subsequent analysis as a special case. In the general setting, the error variance is also unknown. In the present paper, we work with the dense model with moderately high-dimensional regressors, where θ need not be sparse and p may increase with the sample size but remains smaller than n. To be precise, we will maintain the assumption that the design matrix X is of full column rank, i.e., rank(X) = p. The approximately linear model (1) is flexible enough to cover various nonparametric models, such as Gaussian white noise models, linear inverse problems, and nonparametric regression models, via series expansions of the functions of interest in those nonparametric models; see Section 3.
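To make the reduction from a nonparametric regression model to the approximately linear form (1) concrete, the following minimal Python sketch (not taken from the paper; the cosine basis, sample size, and true function are illustrative choices) builds the design matrix from the first p basis functions and treats the resulting approximation error as the bias term r.

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20                        # sample size and number of basis functions
x = rng.uniform(0.0, 1.0, size=n)     # design points on [0, 1]
f = lambda t: np.sin(2 * np.pi * t) + 0.5 * t ** 2   # true regression function

# Design matrix from a constant plus the first p-1 cosine basis functions (an illustrative sieve).
X = np.column_stack([np.ones(n)] + [np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, p)])

# Coefficients of the best approximation of f within the sieve (sample least-squares projection).
theta = np.linalg.lstsq(X, f(x), rcond=None)[0]
r = f(x) - X @ theta                  # bias term: approximation error of the truncated series

sigma = 0.3
eps = rng.normal(0.0, sigma, size=n)  # the errors need not be Gaussian in the paper's setting
Y = X @ theta + r + eps               # data from the approximately linear model (1)
print("max |r_i| =", np.abs(r).max())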

We consider Bayesian inference on the slope vector θ. To this end, we fit a Gaussian distribution to the error ε, but we allow the Gaussian specification of the error distribution to be incorrect. Namely, we work with the quasi-likelihood of the form

(θ, σ²) ↦ (2πσ²)^(−n/2) exp{ −‖Y − Xθ‖² / (2σ²) }.

We assume independent priors on θ and σ², i.e.,

(2)   (θ, σ²) ∼ Π_θ ⊗ Π_σ²,

where we assume that Π_θ is absolutely continuous with density π_θ, i.e., Π_θ(dθ) = π_θ(θ) dθ, and Π_σ² is supported in (0, ∞). Then the resulting quasi-posterior distribution for (θ, σ²) is proportional to the quasi-likelihood times the prior, and the marginal quasi-posterior distribution for θ is obtained by integrating out σ², i.e.,

Π_n(dθ | Y) ∝ π_θ(θ) dθ ∫ (2πσ²)^(−n/2) exp{ −‖Y − Xθ‖² / (2σ²) } Π_σ²(dσ²),

where the integral is taken over σ² and Π_n(· | Y) denotes the marginal quasi-posterior distribution for θ. We will assume that Π_σ² may be data-dependent, e.g., the Dirac delta at some estimator of σ² (in that case, the quasi-posterior for σ² is degenerate at that estimator), but Π_θ is data-independent.
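As a concrete special case (a minimal sketch, not the paper's general setting, which allows non-Gaussian priors), when Π_θ = N(0, τ²I) and Π_σ² is the Dirac delta at a plug-in value, the quasi-posterior for θ is Gaussian in closed form. The following Python snippet computes it; the prior scale τ² and the residual-based variance estimate are illustrative choices.

import numpy as np

def gaussian_quasi_posterior(X, Y, sigma2_hat, tau2):
    """Quasi-posterior N(mu_n, Sigma_n) for theta under a N(0, tau2 * I) prior and a
    Gaussian quasi-likelihood with plug-in error variance sigma2_hat."""
    p = X.shape[1]
    precision = X.T @ X / sigma2_hat + np.eye(p) / tau2
    Sigma_n = np.linalg.inv(precision)
    mu_n = Sigma_n @ (X.T @ Y) / sigma2_hat
    return mu_n, Sigma_n

# Small simulated example (any design matrix X and response Y can be used instead).
rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.standard_t(df=5, size=n)  # non-Gaussian errors are allowed

theta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
sigma2_hat = np.sum((Y - X @ theta_ols) ** 2) / (n - p)    # plug-in variance estimate
mu_n, Sigma_n = gaussian_quasi_posterior(X, Y, sigma2_hat, tau2=10.0)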

We will derive finite sample bounds on frequentist coverage errors of Bayesian credible rectangles for the approximately linear model (1) under a prior of the form (2). For a given center y ∈ ℝ^p, a radius c > 0, and a given positive weight sequence w = (w_1, …, w_p), let R(y, c) denote the hyper-rectangle of the form

R(y, c) = { θ ∈ ℝ^p : |θ_j − y_j| ≤ c w_j for all j = 1, …, p }.

Let θ̂ = (X^⊤X)^(−1) X^⊤Y denote the OLS estimator for θ. For given α ∈ (0, 1), we consider a (1 − α)-credible rectangle of the form Ĉ = R(θ̂, ĉ), where the radius ĉ is chosen in such a way that the quasi-posterior probability of the set R(θ̂, ĉ) is 1 − α, i.e., Π_n(R(θ̂, ĉ) | Y) = 1 − α.
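A minimal Monte Carlo sketch of this calibration (our own illustration, assuming the Gaussian quasi-posterior of the previous snippet and unit weights; with a non-Gaussian sieve prior one would instead use MCMC draws from the quasi-posterior):

import numpy as np

rng = np.random.default_rng(2)

# Toy data and a Gaussian quasi-posterior with plug-in variance (as in the previous sketch).
n, p = 300, 10
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
Y = X @ theta_true + rng.normal(0.0, 0.5, size=n)
theta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
sigma2_hat = np.sum((Y - X @ theta_ols) ** 2) / (n - p)
Sigma_n = np.linalg.inv(X.T @ X / sigma2_hat + np.eye(p) / 100.0)
mu_n = Sigma_n @ (X.T @ Y) / sigma2_hat

# Calibrate the rectangle radius: c_hat is the (1 - alpha) quasi-posterior quantile of the
# weighted max deviation from the OLS estimator.
alpha = 0.10
w = np.ones(p)                                        # weight sequence (illustrative)
draws = rng.multivariate_normal(mu_n, Sigma_n, size=20000)
max_dev = np.max(np.abs(draws - theta_ols) / w, axis=1)
c_hat = np.quantile(max_dev, 1 - alpha)

# The credible rectangle is { theta : |theta_j - theta_ols_j| <= c_hat * w_j for all j }.
covers = np.all(np.abs(theta_true - theta_ols) <= c_hat * w)
print("radius:", c_hat, "covers the true coefficient vector:", covers)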

We impose the following conditions on the priors Π_θ and Π_σ². For r > 0, let

(3)

where quantifies “flatness” of the prior density around the true value .

Condition 2.1.

There exists a positive constant such that

Condition 2.2.

There exist non-negative constants such that with probability at least ,

Condition 2.3.

The inequality holds.

Condition 2.1 assumes that the prior on θ has sufficient mass around the true value. Condition 2.2 is an assumption on the marginal posterior contraction for the error variance σ²; it includes the known error variance case as a special case, since if the error variance is known we may take Π_σ² to be the Dirac delta at the known value. Condition 2.3 is a preliminary flatness condition on the prior density. More detailed discussions of these conditions are provided after the main theorem (Theorem 2.1).

We also impose the following conditions on the model.

Condition 2.4.

There exists a positive constant such that

Condition 2.5.

There exists a positive constant such that one of the following conditions holds:

  1. for some integer ;

  2. .

Condition 2.4 controls the norm of the bias term. Condition 2.5 is a moment condition on the error distribution. These conditions are sufficiently weak and, in particular, cover all the applications we present.

The following theorem, which is the main result of this section, provides bounds on frequentist coverage errors of the Bayesian credible rectangle Ĉ together with bounds on the max-diameter of Ĉ. In what follows, let λ_max and λ_min denote the maximum and minimum eigenvalues of the matrix X^⊤X, respectively, and let w_max and w_min denote the maximal and minimal weights, respectively.

Theorem 2.1 (Coverage errors of credible rectangles).

Suppose that Conditions 2.1–2.4 and either Condition 2.5(a) or (b) hold. Then there exist positive constants depending only on and such that the following hold. For every , we have

(4)

where and

In addition, provided that the right-hand side of (4) is smaller than , for sufficiently large depending only on , the max-diameter of Ĉ is bounded as

with probability at least

Theorem 2.1 shows that the frequentist coverage error of the Bayesian credible rectangle depends on the prior on θ only through the flatness function defined in (3). The discussion below provides a typical bound on this flatness function.

2.1. Discussions on conditions

We first verify that a locally log-Lipschitz prior satisfies Conditions 2.1 and 2.3, providing an upper bound on the flatness function.

Definition 2.1.

A locally log-Lipschitz prior is defined as a prior distribution on that there exists such that for all with .

Proposition 2.1.

For a locally log-Lipschitz prior with log-Lipschitz constant , we have for any . Hence the prior satisfies Condition 2.3 if .

To provide examples of prior distributions on that satisfy Condition 2.1, we focus on the following two subclasses of locally log-Lipschitz priors. Let denote the Euclidean norm of .

  1. An isotropic prior is of the form where

    is a probability density function on

    such that is strictly positive and continuously differentiable on , and such that for all for some positive constant .

  2. A product prior of log-Lipschitz priors is of the form where each is strictly positive on and -Lipschitz for some .

For the sake of exposition, we impose the following additional condition in order to verify that isotropic or product priors satisfy Condition 2.1.

Condition 2.6.

There exists a positive constant such that .

This condition is satisfied as long as is bounded by some polynomial of and hence is not restrictive. In fact, this condition is satisfied in all the applications we will cover in Section 3. The following proposition shows that isotropic or product priors are locally log-Lipschitz priors satisfying Condition 2.1.

Proposition 2.2.

Under Condition 2.6, an isotropic prior and a product prior of log-Lipschitz priors satisfy Condition 2.1. An isotropic prior is a locally log-Lipschitz prior with locally log-Lipschitz constant such that

for some positive constant depending only on and that appears in the definition of and Condition 2.6. In particular, if is the standard Gaussian density, then . A product prior of log-Lipschitz priors with log-Lipschitz constant is locally log-Lipschitz with .
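As a quick check of the Gaussian case (a worked computation under the definition above; here θ_0 denotes the true coefficient vector and r the radius of the local neighbourhood, both placeholder symbols of ours), for the standard Gaussian density π(θ) ∝ exp(−‖θ‖²/2) and any θ_1, θ_2 in the Euclidean ball of radius r around θ_0,

\[
\bigl|\log \pi(\theta_1) - \log \pi(\theta_2)\bigr|
= \tfrac{1}{2}\bigl|\,\|\theta_2\|^2 - \|\theta_1\|^2\,\bigr|
= \tfrac{1}{2}\bigl|(\theta_2-\theta_1)^{\top}(\theta_2+\theta_1)\bigr|
\le \tfrac{1}{2}\,\|\theta_1-\theta_2\|\bigl(\|\theta_1\|+\|\theta_2\|\bigr)
\le \bigl(\|\theta_0\| + r\bigr)\,\|\theta_1-\theta_2\|,
\]

so the standard Gaussian prior is locally log-Lipschitz with a constant of order ‖θ_0‖ + r.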

Next, we discuss Condition 2.2. We consider the following two cases:

  1. the plug-in case, in which Π_σ² is the Dirac delta at an estimator of the error variance;

  2. the full-Bayes case, in which Π_θ is the standard Gaussian distribution and Π_σ² is the inverse Gamma distribution with given shape and scale parameters (a sampling sketch for this case is given below).
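For the full-Bayes case, the conditional distributions are conjugate, so the quasi-posterior can be explored with a simple Gibbs sampler. The sketch below is our own illustration under the specific choices θ ∼ N(0, I) and σ² ∼ inverse-Gamma(a, b); the hyperparameters and data are placeholders.

import numpy as np

rng = np.random.default_rng(3)

# Toy data (non-Gaussian errors are allowed; the Gaussian quasi-likelihood is used regardless).
n, p = 300, 10
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.laplace(0.0, 0.4, size=n)

a, b = 2.0, 1.0          # inverse-Gamma shape and scale hyperparameters (illustrative)
n_iter, burn = 3000, 500
theta = np.zeros(p)
sigma2 = 1.0
theta_draws = np.empty((n_iter, p))

for t in range(n_iter):
    # theta | sigma2, Y  ~  N(m, V) with V = (X'X / sigma2 + I)^(-1), m = V X'Y / sigma2.
    V = np.linalg.inv(X.T @ X / sigma2 + np.eye(p))
    m = V @ (X.T @ Y) / sigma2
    theta = m + np.linalg.cholesky(V) @ rng.normal(size=p)
    # sigma2 | theta, Y  ~  inverse-Gamma(a + n/2, b + ||Y - X theta||^2 / 2).
    rate = b + 0.5 * np.sum((Y - X @ theta) ** 2)
    sigma2 = 1.0 / rng.gamma(a + 0.5 * n, 1.0 / rate)
    theta_draws[t] = theta

posterior_mean = theta_draws[burn:].mean(axis=0)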

The following two propositions yield possible choices of the quantities appearing in Condition 2.2.

Proposition 2.3 (Plug-in).

Suppose that Condition 2.5 holds and also that for some . In addition, suppose that satisfies that . Then there exist positive constants and depending only on , and such that

Proposition 2.4 (Full-Bayes).

Suppose that Condition 2.5 holds and also for some . In addition, suppose that satisfies that . Then there exist positive constants and depending only on , , , and such that

with probability at least

To better understand implications of these propositions, Table 1 summarizes possible rates of when for some , , and is independent of .

Condition 2.5 and prior
(a) and plug-in
(a) and full Bayes
(b) and plug-in
(b) and full Bayes
Table 1. Possible rates of with respect to : is arbitrary.
Remark 2.1 (Comparison with [62]).

Proposition 4.1 in [62] studies possible rates for when a Gaussian prior is used for and the error distribution is sub-Gaussian. Our results in Propositions 2.3 and 2.4 are compatible with their result up to logarithmic factors under their setup.

2.2. Berry–Esseen type bounds on posterior distributions

Before presenting applications of the main theorem, we derive an important ingredient in the proof of Theorem 2.1, namely, the Berry–Esseen type bound on posterior distributions. For , let be the intersection of the events and . For two probability measures and , denotes the total variation distance between and .
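For reference, for two probability measures P and Q on a common measurable space (our generic notation), the total variation distance is

\[
d_{\mathrm{TV}}(P, Q) \;=\; \sup_{B} \bigl|P(B) - Q(B)\bigr|,
\]

where the supremum is taken over all measurable sets B.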

Proposition 2.5 (Berry–Esseen type bounds on posterior distributions).

Under Conditions 2.1–2.3, there exist positive constants and depending only on such that for every and for , we have

Proposition 2.6.

Under Conditions 2.4 and 2.5, there exist positive constants and depending only on , , and such that

Remark 2.2 (Critical dimension for the Bernstein–von Mises theorem).

The previous propositions immediately lead to the critical dimension for the BvM theorem. We will compare our result with the results on the critical dimension by [7, 28, 53]. In this comparison, we assume a locally log-Lipschitz prior with locally log-Lipschitz constant ; that and are independent of ; and that . The following is a summary of the existing results:

  • [28] shows that when the error distribution has a smooth density with known scale parameter, the BvM theorem holds if and some additional assumptions are verified;

  • [7] shows that when the error distribution is Gaussian with known variance, the BvM theorem holds if ;

  • [53] shows that when the high-dimensional local asymptotic normality holds, the BvM theorem holds if ; see also [46].

Our result (Propositions 2.1, 2.3, 2.5, and 2.6) improves on the existing work in that:

  • When the error variance is assumed to be known (i.e., ), our result implies that the BvM theorem (for the quasi-posterior distribution) holds if and the error distribution has finite fourth moment. Compared to [28], our result substantially improves on the critical dimension by employing the Gaussian likelihood even when the Gaussian specification is incorrect. When the error distribution is Gaussian, our result is consistent with [7];

  • Importantly, our result covers the unknown error variance case, which makes our analysis substantially different from [7]. In nonparametric regression, it is usually the case that the error variance is unknown, and hence it is important to take care of the unknown variance in such an application. When the error variance is unknown, our result shows that the BvM theorem holds for if for sub-Gaussian error distributions, thereby improving on the condition of [53].

3. Applications

In this section, we consider applications of the general results developed in the previous sections to quantifying coverage errors of Bayesian credible sets in Gaussian white noise models, linear inverse problems, and (possibly non-Gaussian) nonparametric regression models.

3.1. Gaussian white noise model

We first consider a Gaussian white noise model and analyze coverage errors of Castillo–Nickl and L^∞-credible bands. Consider a Gaussian white noise model

where is a canonical white noise and is an unknown function. We assume that is in the Hölder–Zygmund space with smoothness level for some . It will be convenient to define the Hölder–Zygmund space by employing a wavelet basis. Let be an integer and fix sufficiently large . Let be an -regular Cohen–Daubechies–Vial (CDV) wavelet basis of . Then the Hölder–Zygmund space is defined by

where denotes the inner product, i.e., . For the notational convention, let for and let for .
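For reference, with an S-regular CDV wavelet basis {ψ_{lk}} of L²([0, 1]) (our notation) and smoothness level s < S, the Hölder–Zygmund ball of radius B is commonly characterized through wavelet coefficients, up to the normalization convention of the basis, as

\[
\mathcal{C}^{s}(B) \;=\; \Bigl\{ f \in L^{2}([0,1]) \;:\; \sup_{l \ge 0}\,\max_{k}\; 2^{l(s+1/2)}\,\bigl|\langle f, \psi_{lk}\rangle\bigr| \;\le\; B \Bigr\},
\]

where ⟨·,·⟩ denotes the L² inner product.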

3.1.1. Castillo–Nickl credible bands

The Castillo–Nickl credible band is defined as

where is taken in such a way that for a positive non-decreasing sequence and . For a given prior on , we call the resulting set the -Castillo–Nickl credible band, where the radius is chosen in such a way that . We consider a sieve prior on induced from a prior on via the map

The following proposition establishes bounds on coverage errors of Castillo–Nickl credible bands. Let with for .

Proposition 3.1.

Under Conditions 2.1 and 2.3 for that corresponds to and under the condition that for some , there exist positive constants depending only on appearing in Condition 2.1 and such that the following hold: for satisfying , we have

In addition, provided that the right hand side above is smaller than , for sufficiently large depending only on and , the -diameter of is bounded from above as

with probability at least .

Remark 3.1 (Coverage error rates).

The finite sample bound in Proposition 3.1 leads to the following asymptotic results as . In this discussion, we assume a locally log-Lipschitz prior with locally log-Lipschitz constant . Since and , we have

(5)

In particular, for the standard Gaussian prior, we have

since from Proposition 2.2.

Remark 3.2 (Coverage errors for the surrogate function).

Consider coverage errors for the surrogate function . In this case, we can set and so

This shows that Bayesian credible bands have coverage errors (for the surrogate function) decaying polynomially fast in the sample size. In the kernel density estimation case, [33] shows that confidence bands based on Gumbel approximations have coverage errors decaying only at the 1/log n rate, while bootstrap confidence bands have coverage errors decaying polynomially fast in the sample size for the surrogate function.
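To put these rates in perspective (a back-of-the-envelope comparison, not a statement from the paper), at n = 10^6 one has

\[
\frac{1}{\log n} \;=\; \frac{1}{6\log 10} \;\approx\; 0.072,
\qquad\text{whereas}\qquad
n^{-1/4} \;=\; 10^{-3/2} \;\approx\; 0.032,
\]

so even a modest polynomial rate is already considerably smaller than the logarithmic rate at realistic sample sizes, and the gap widens rapidly as n grows.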

Remark 3.3 (Coverage errors for the true function).

Consider coverage errors for the true function . In this case, the term appears in (5) due to the approximation error and seems problematic. However, we have options to control the effect of this term since we can choose an undersmoothing sequence arbitrarily. Multiplying an undersmoothing sequence by a positive constant, we can reduce the effect of the term. When the smoothness levels that we cover are fixed, choosing a polynomially increasing undersmoothing sequence, we can recover polynomial decay of both the coverage errors and the -diameters. Suppose, for instance, that we cover smoothness levels of the true function in a fixed range with a least upper bound (e.g., ) but the true function is smoother than the levels we cover (e.g., ). In this case, taking a polynomially increasing undersmoothing sequence (e.g., ), we ensure that both the coverage errors and the high-probability -diameters of are of the order up to a logarithmic factor.

[11] also consider multi-scale sets based on an admissible sequence :

where and . Here we call a sequence such that an admissible sequence. In what follows, we will bound coverage errors and the -diameters of Bayesian credible sets of the form

where the radius is taken in such a way that

The following proposition provides bounds on coverage errors of multi-scale credible bands based on a sieve prior on , where is taken in such a way that . Let

respectively. For simplicity, we assume that .

Proposition 3.2.

Under Conditions 2.1 and 2.3 for that corresponds to , there exist positive constants depending only on such that the following hold: For satisfying and for any , we have