High-dimensional properties for empirical priors in linear regression with unknown error variance

02/11/2022
by Xiao Fang, et al.
University of Florida

We study full Bayesian procedures for high-dimensional linear regression. We adopt the data-dependent empirical priors introduced in [1], which were shown there to have nice posterior contraction properties and to be easy to compute. Our paper extends their theoretical results to the case of unknown error variance σ². Under suitable sparsity assumptions, we establish model selection consistency, posterior contraction rates, and a Bernstein-von Mises theorem by analyzing the multivariate t-distribution.


1 Introduction

In a series of articles, Ryan Martin and his colleagues introduced empirical priors for sparse high-dimensional linear regression models (see, for example, (Martin:2014), (Martin:2017), (Martin:2020) and (Martin:2020.1)). These priors are all data dependent and achieve nice posterior contraction rates; that is, the posterior concentrates around the true value of the parameter of interest at a very rapid rate. Moreover, as pointed out in those articles, these priors are quite satisfactory for both estimation and prediction.

While (Martin:2017) introduced priors both when the error variance in a linear regression model is known and when it is unknown, their theoretical results were proved only for the known-variance case. The objective of this paper is to fill this gap and obtain theoretical properties, namely posterior contraction rates, for unknown error variance σ². The technical novelty is that, unlike in the known-variance case, our algebraic manipulations require handling multivariate t-distributions rather than multivariate normal distributions. In addition, we establish model selection consistency as well as a Bernstein-von Mises theorem in our proposed setup.

The outline of this paper is as follows. Section 2 introduces the model and certain basic lemmas needed in the rest of the paper. Posterior concentration and model selection consistency results are stated in Section 3. The Bernstein-von Mises theorem and its applications are stated in Section 4. All proofs are given in Section 5. Some final remarks are made in Section 6.

2 The Model

Consider the standard linear regression model

y = Xβ + σε    (1)

where y is the n × 1 vector of response variables, X is the n × p design matrix, β is the p × 1 vector of regression parameters, σ is an unknown scale parameter, and ε is an n × 1 error vector with ε ~ N_n(0, I_n).

Let S ⊆ {1, …, p} denote a model, i.e., a subset of indices of the regression coefficients, and let |S| denote the cardinality of S. Also let X_S denote the submatrix of X formed by the columns corresponding to the elements of S; it is assumed that X_SᵀX_S is nonsingular (so, in particular, |S| does not exceed the rank of X). The corresponding subvector of the regression vector β is denoted by β_S, and β̂_S = (X_SᵀX_S)⁻¹X_Sᵀy denotes the least squares estimator of β_S.
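To fix ideas, here is a minimal simulation sketch in Python; the dimensions, variable names, and the helper function lse are illustrative choices, not quantities from the paper. It generates data from a sparse instance of model (1) and computes the least squares estimator restricted to a candidate model S.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: n observations, p predictors, a few true signals.
n, p, s_true = 100, 400, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s_true] = 3.0          # sparse regression vector
sigma = 1.5                  # scale parameter (unknown to the analyst)
y = X @ beta + sigma * rng.standard_normal(n)

def lse(S, X, y):
    """Least squares estimator restricted to the columns in model S."""
    X_S = X[:, list(S)]                 # submatrix X_S
    # Assumes X_S^T X_S is nonsingular, as in the text.
    return np.linalg.solve(X_S.T @ X_S, X_S.T @ y)

S = [0, 1, 2, 3, 4]                     # a candidate model
beta_hat_S = lse(S, X, y)
rss_S = float(np.sum((y - X[:, S] @ beta_hat_S) ** 2))
print(beta_hat_S, rss_S)
```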

2.1 The prior

The prior considered is as follows:

(i) Conditional on the model S, the active coefficients β_S are given a data-dependent (empirical) prior, while the remaining coefficients, i.e., the components of β outside S, are equal to 0 with probability 1.

(ii) σ² follows an inverse gamma distribution with fixed shape and scale parameters.

(iii) Marginal prior for S: a prior is placed on the model S, depending on tuning constants (referred to in the results below) that govern how model size is penalized.

Then the empirical joint prior for (β, σ², S) is

(2)

where δ_0 denotes the Dirac delta function.
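For orientation, one concrete prior of this general type, patterned after the empirical priors of Martin and colleagues, is written below; the symbols γ (a prior precision multiplier) and a, b (inverse gamma shape and scale) are placeholder names rather than the paper's notation.

```latex
% A sketch of an empirical prior of this general type (placeholder notation):
%   beta_S | S, sigma^2  ~  N_{|S|}( \hat\beta_S, \sigma^2 \gamma^{-1} (X_S^\top X_S)^{-1} )
%   beta_{S^c} | S       =  0 with probability 1
%   sigma^2              ~  IG(a, b),   S ~ \pi(S)
\[
  \pi(\beta, \sigma^2, S)
  = \pi(S)\,
    \mathrm{N}_{|S|}\!\bigl(\beta_S \mid \hat\beta_S,\; \sigma^2 \gamma^{-1} (X_S^\top X_S)^{-1}\bigr)\,
    \delta_0(\beta_{S^c})\,
    \mathrm{IG}(\sigma^2 \mid a, b).
\]
```

In a prior of this type, the data dependence enters only through the centering of β_S at the least squares estimator β̂_S.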

2.2 The posterior distribution

Following (Martin:2017), we also consider a fractional likelihood.
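In general, a fractional (or power) likelihood is the ordinary likelihood raised to a fractional power. Under model (1), with a generic fraction α ∈ (0, 1) (a placeholder symbol, not necessarily the paper's notation), it takes the form

```latex
\[
  L_n^{(\alpha)}(\beta, \sigma^2)
  = \Bigl\{ (2\pi\sigma^2)^{-n/2}
      \exp\!\Bigl( -\frac{\|y - X\beta\|^2}{2\sigma^2} \Bigr) \Bigr\}^{\alpha},
  \qquad 0 < \alpha < 1 .
\]
```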

Then the posterior for β_S, conditional on σ² and S, is

(3)

Consider the identity

‖y − X_Sβ_S‖² = ‖y − X_Sβ̂_S‖² + (β_S − β̂_S)ᵀX_SᵀX_S(β_S − β̂_S).

Now, recalling that the components of β outside S are concentrated at 0 with prior probability 1, it follows that

(4)

Also, the conditional posterior for σ² given S is

(5)

Thus the conditional distribution for β_S given S is

(6)

where

(7)

Finally, the marginal posterior of S is

(8)
Remark 1.

All our results except Corollary 2 are obtained for all values of the likelihood fraction and the prior dispersion parameter. However, for higher-order properties, such as the credible probability of a set, some conditions on these quantities are required; for example, Corollary 2, which concerns uncertainty quantification, imposes such a condition. In practice one can always take the likelihood fraction close to 1 and choose the prior dispersion close to its non-informative limit, so that the fractional likelihood is almost the same as the normal likelihood and the prior is essentially non-informative.
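For illustration, the following sketch shows how posteriors of this general shape can be computed, under the placeholder conjugate forms sketched above: a Gaussian empirical prior on β_S centered at the least squares fit with precision multiplier gamma, an IG(a, b) prior on σ², and a fractional likelihood with fraction alpha. The names alpha, gamma, a, b and the helper functions are illustrative assumptions, not a transcription of equations (3)-(8).

```python
import numpy as np
from itertools import combinations

def log_model_weight(S, X, y, alpha=0.99, gamma=0.01, a=1.0, b=1.0, log_prior_S=0.0):
    """Unnormalized log posterior weight of a model S under the assumed
    conjugate forms: fractional normal likelihood raised to the power alpha,
    a Gaussian empirical prior on beta_S centered at the least squares fit
    with precision multiplier gamma, and an IG(a, b) prior on sigma^2."""
    n = len(y)
    X_S = X[:, list(S)]
    beta_hat = np.linalg.solve(X_S.T @ X_S, X_S.T @ y)
    rss = float(np.sum((y - X_S @ beta_hat) ** 2))
    return (log_prior_S
            + 0.5 * len(S) * np.log(gamma / (alpha + gamma))          # from integrating out beta_S
            - (a + 0.5 * n * alpha) * np.log(b + 0.5 * alpha * rss))  # from integrating out sigma^2

def sigma2_posterior_params(S, X, y, alpha=0.99, a=1.0, b=1.0):
    """Shape and scale of the inverse gamma conditional posterior of
    sigma^2 given S (same assumptions as above)."""
    n = len(y)
    X_S = X[:, list(S)]
    beta_hat = np.linalg.solve(X_S.T @ X_S, X_S.T @ y)
    rss = float(np.sum((y - X_S @ beta_hat) ** 2))
    return a + 0.5 * n * alpha, b + 0.5 * alpha * rss

# Toy data so the sketch runs on its own; a flat prior over the enumerated
# models stands in for the marginal prior on S.
rng = np.random.default_rng(1)
n, p = 60, 8
X = rng.standard_normal((n, p))
y = X[:, :2] @ np.array([2.0, -1.5]) + 0.8 * rng.standard_normal(n)

models = [c for k in (1, 2, 3) for c in combinations(range(p), k)]
logw = np.array([log_model_weight(S, X, y) for S in models])
post = np.exp(logw - logw.max())
post /= post.sum()
print(models[int(np.argmax(post))], post.max())   # highest-weight model
```

Under these same assumed conjugate forms, β_S given S and the data is a multivariate t-distribution, matching the role that the t-distribution plays in (6)-(8).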

2.3 Tail bounds for the Chi squared distribution

Lemma 1.

For any , we have

Proof: See Lemma 4.1 of (Cao:2020).

Lemma 2.

(i) For any , we have

(ii) for then

where is a constant.

(iii) For any , .

Proof: See Lemma 4.2 of (Cao:2020) for (i) and (Shin:2019) for (ii). Part (iii) follows from the fact that the relevant quantity is strictly increasing in its argument when the remaining parameters are held fixed.
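The precise statements of Lemmas 1 and 2 are those of (Cao:2020) and (Shin:2019). For reference, one standard pair of chi-squared tail bounds of this general type is the Laurent-Massart inequality: if Z ~ χ²_k, then for any x > 0, P(Z ≥ k + 2√(kx) + 2x) ≤ e^(−x) and P(Z ≤ k − 2√(kx)) ≤ e^(−x). The short check below verifies these reference bounds numerically; it is a sanity check of the Laurent-Massart inequality only, not of the lemmas above.

```python
import numpy as np
from scipy import stats

# Numerical check of the Laurent-Massart chi-squared tail bounds:
#   P(chi2_k >= k + 2*sqrt(k*x) + 2*x) <= exp(-x)
#   P(chi2_k <= k - 2*sqrt(k*x))       <= exp(-x)
for k in (5, 50, 500):
    for x in (0.5, 2.0, 10.0):
        upper = stats.chi2.sf(k + 2 * np.sqrt(k * x) + 2 * x, df=k)
        lower = stats.chi2.cdf(k - 2 * np.sqrt(k * x), df=k)
        assert upper <= np.exp(-x) and lower <= np.exp(-x)
        print(f"k={k:4d} x={x:5.1f}  upper tail {upper:.2e} <= {np.exp(-x):.2e}, "
              f"lower tail {lower:.2e} <= {np.exp(-x):.2e}")
```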

2.4 Notations

Assume the data are generated from the true model, with true regression vector β*, true support S* = {j : β*_j ≠ 0} of size s* = |S*|, and true scale σ*.

We write a_n ≲ b_n if a_n ≤ C b_n for some constant C > 0 and all sufficiently large n, and a_n ≳ b_n if a_n ≥ c b_n for some constant c > 0 and all sufficiently large n.

3 Posterior concentration rates

Define the empirical Bayes posterior probability of an event as

(9)

Then, recalling the probability density function of a multivariate t-distribution, we have

(10)
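For reference, the density of a d-dimensional t-distribution with ν degrees of freedom, location μ, and positive definite scale matrix Σ (generic symbols, not necessarily the notation used in (10)) is

```latex
\[
  f(x \mid \nu, \mu, \Sigma)
  = \frac{\Gamma\!\bigl(\tfrac{\nu + d}{2}\bigr)}
         {\Gamma\!\bigl(\tfrac{\nu}{2}\bigr)\,(\nu\pi)^{d/2}\,|\Sigma|^{1/2}}
    \Bigl[\,1 + \tfrac{1}{\nu}\,(x - \mu)^{\top}\Sigma^{-1}(x - \mu)\Bigr]^{-(\nu + d)/2},
  \qquad x \in \mathbb{R}^{d}.
\]
```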

We need to impose some regularity conditions to obtain posterior concentration results for our model.

Regularity conditions:
(A1) There exist constants , , such that ,
(A2) .
(A3) .

For a given , define , i.e., the set of vectors with no fewer than non-zero entries.

The following theorem implies that the posterior distribution is actually concentrated on a space of dimension close to .

Theorem 1.

Let , and assume that conditions (A1)-(A3) hold. Then there exists a constant such that

with , uniformly in as n → ∞.

To get posterior concentration results, we first establish a model selection result. The following theorem demonstrates that, asymptotically, our empirical Bayes posterior will not include any unnecessary variables.

Theorem 2.

Let , and assume that conditions (A1)-(A3) hold. Also, let the constant in the marginal prior of S satisfy . Then , uniformly in .

To get model selection consistency, it remains to show that our empirical Bayes posterior asymptotically does not miss any true variables.

Define

(11)

Then the quantity defined in (11) is a non-increasing function of its argument. By (Martin:2017), for any , we have .

Theorem 3.

Assume , that conditions (A1)-(A3) hold, and that

where . We also assume , with as in Theorem 1, and that is a constant with , being the parameter in the prior of . Then

Remark 2.

Intuitively, we are not able to distinguish between a zero coefficient and a very small non-zero one. Hence, in Theorem 3 we define a cutoff , similar to Theorem 5 in (Martin:2017) and Theorem 5 in (Castillo:2015).

Corollary 1.

(Selection consistency) Assume that the conditions of Theorem 3 hold. Then .

Proof: Follows immediately from Theorems 2 and 3.

We now state our posterior concentration result. It is similar to the posterior concentration theorem in (Martin:2017), but our proof is completely different from theirs: they apply Hölder's inequality and a Rényi divergence formula, while the key to our proof is the model selection consistency result.

Set

Theorem 4.

Assume that conditions (A1)-(A3) hold. Then there exists a constant such that

uniformly in as n → ∞, where .

By adding some conditions on , we are able to separate from , so that we can get a posterior consistency result for .

Set

where is a positive sequence of constants to be specified.

Theorem 5.

There exists a constant such that

uniformly in as n → ∞, where

is a constant and is the constant in Theorem 1.

Proof: Same as the proof of Theorem 3 in (Martin:2017).

Remark 3.

If for any we have

where , are positive constants, then we get posterior consistency for the coefficient vector under the norm .

4 Bernstein-von Mises Theorem

In this section, we show that the posterior distribution of is asymptotically normal and this property leads to many interesting results.

Theorem 6.

(Bernstein-von Mises theorem) Let denote the Hellinger distance, denote the total variation distance, and denote the empirical Bayes posterior distribution for . If and , then

(12)

uniformly in as n → ∞, where denotes the point mass distribution for , concentrated at the origin.

Since convergence to zero in the Hellinger distance and in the total variation distance imply one another (via H² ≤ d_TV ≤ √2·H), we also have

(13)

uniformly in as n → ∞.

Define , .

Corollary 2.

(Valid uncertainty quantification) If , then Eq. (13) implies

(14)

which validates uncertainty quantification.

Consider a pair where is a given matrix of explanatory variable values at which we want to predict the corresponding response . Let

be the conditional posterior predictive distribution of the new response, given . It is a t-distribution with degrees of freedom, location , and scale matrix

(15)

Then the predictive distribution of the new response is

(16)
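As an illustration, the sketch below draws from a conditional predictive t-distribution of this general type by sampling hierarchically (sigma^2, then the active coefficients, then new responses). It uses the same placeholder conjugate forms and parameter names alpha, gamma, a, b as the earlier sketches, not the exact expressions in (15) and (16).

```python
import numpy as np

def sample_predictive(S, X, y, X_new, n_draws=1000,
                      alpha=0.99, gamma=0.01, a=1.0, b=1.0, rng=None):
    """Draw from the conditional posterior predictive of new responses given
    a model S, under the assumed conjugate forms (placeholder parameters).
    Sampling is hierarchical: sigma^2 from its inverse gamma conditional,
    beta_S from its normal conditional, then new responses."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    X_S, Xn_S = X[:, list(S)], X_new[:, list(S)]
    M = X_S.T @ X_S
    beta_hat = np.linalg.solve(M, X_S.T @ y)
    rss = float(np.sum((y - X_S @ beta_hat) ** 2))
    shape, scale = a + 0.5 * n * alpha, b + 0.5 * alpha * rss
    cov_chol = np.linalg.cholesky(np.linalg.inv(M) / (alpha + gamma))
    draws = np.empty((n_draws, X_new.shape[0]))
    for i in range(n_draws):
        sigma2 = scale / rng.gamma(shape)                 # sigma^2 ~ IG(shape, scale)
        beta_S = beta_hat + np.sqrt(sigma2) * cov_chol @ rng.standard_normal(len(S))
        draws[i] = Xn_S @ beta_S + np.sqrt(sigma2) * rng.standard_normal(X_new.shape[0])
    return draws   # marginally, each row follows a multivariate t predictive

# Usage on toy data: predictive draws at two new design points.
rng = np.random.default_rng(2)
X = rng.standard_normal((60, 6))
y = X[:, 0] * 2.0 + 0.7 * rng.standard_normal(60)
X_new = rng.standard_normal((2, 6))
pred = sample_predictive((0,), X, y, X_new, rng=rng)
print(pred.mean(axis=0), pred.std(axis=0))
```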

For this predictive distribution, we have a Bernstein-von Mises theorem similar to Theorem 6; the proof is also similar.

Theorem 7.

If and , then

uniformly in as n → ∞.

Since convergence to zero in the Hellinger distance and in the total variation distance imply one another, we also have

uniformly in as n → ∞.

5 Proofs

Proof of Theorem 1:
Let , .

(17)

Let and denote the two terms on the right-hand side of (17). It suffices to show that there exists a constant such that , with , uniformly in , for .

Since

(18)

by Lemma 1 we have

(19)

Let , then . By ,

(20)

Using the moment generating function of a chi-squared distribution,

(21)

where .

By (Martin:2017) we have . Also, since , . Hence

(22)

From the expression of , we get

(23)

and

So when , and ,

(24)

as n → ∞.

By (19) and (24), when , and , we have

with , uniformly in as n → ∞.


Proof of Theorem 2:
Since

by Theorem