In a series of articles, Ryan Martin and his colleagues introduced empirical priors for sparse high-dimensional linear regression models (see, for example, (Martin:2014), (Martin:2017), (Martin:2020) and (Martin:2020.1)). These priors are all data dependent and achieve good posterior contraction rates; that is, the posterior concentrates around the true value of the parameter of interest at a very rapid rate. Moreover, as pointed out in those articles, these priors are quite satisfactory for both estimation and prediction.
While (Martin:2017) introduced priors both when the error variance $\sigma^2$ in a linear regression model is known and when it is unknown, their theoretical results were proved only for known $\sigma^2$. The objective of this paper is to fill this gap and obtain theoretical properties, namely posterior contraction rates, for unknown $\sigma^2$. The technical novelty of our approach is that, unlike the known-variance case, our algebraic manipulations require handling multivariate $t$-distributions rather than multivariate normal distributions. In addition, we establish model selection consistency as well as a Bernstein-von Mises theorem in our proposed setup.
The outline of this paper is as follows. Section 2 introduces the model and some basic lemmas needed in the rest of the paper. Posterior concentration and model selection consistency results are stated in Section 3. A Bernstein-von Mises theorem and its applications are given in Section 4. All proofs are collected in Section 5. Some final remarks are made in Section 6.
2 The Model
Consider the standard linear regression model $y = X\beta + \varepsilon$, where $y \in \mathbb{R}^n$ is the response vector, $X$ is an $n \times p$ design matrix, $\beta \in \mathbb{R}^p$ is a sparse coefficient vector, and $\varepsilon \sim \mathrm{N}_n(0, \sigma^2 I_n)$ with $\sigma^2$ unknown.
Let $S \subseteq \{1,\dots,p\}$ denote a model, with $|S|$ denoting the cardinality of $S$, and let $R$ be the rank of $X$. Also let $X_S$ denote the $n \times |S|$ submatrix of $X$ formed by the column vectors of $X$ corresponding to the elements of $S$. It is assumed that $X_S^\top X_S$ is nonsingular. The corresponding subvector of the regression vector $\beta$ is denoted by $\beta_S$. Also, let $\hat\beta_S = (X_S^\top X_S)^{-1} X_S^\top y$ denote the least squares estimator of $\beta_S$ under model $S$.
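To fix ideas, the following minimal numerical sketch generates data from this model and computes $\hat\beta_S$ for a candidate model $S$; the dimensions, the helper name ls_estimator, and the sparsity pattern are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s0 = 100, 200, 5            # n observations, p predictors, s0 true signals
X = rng.standard_normal((n, p))   # design matrix
beta_true = np.zeros(p)
beta_true[:s0] = 3.0              # hypothetical sparse truth
sigma = 1.0                       # unknown in the paper; fixed here to simulate
y = X @ beta_true + sigma * rng.standard_normal(n)

def ls_estimator(X, y, S):
    """Least squares estimator for submodel S (a list of column indices):
    beta_hat_S = (X_S' X_S)^{-1} X_S' y, assuming X_S' X_S is nonsingular."""
    XS = X[:, S]
    return np.linalg.solve(XS.T @ XS, XS.T @ y)

print(ls_estimator(X, y, list(range(s0))))   # approximately the true nonzero block
```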
2.1 The prior
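For orientation, the empirical prior of (Martin:2017), which the present setup follows up to the treatment of the unknown $\sigma^2$, has (modulo notation) the hierarchical form

\[
S \sim \pi(S), \qquad
\beta_S \mid S \sim \mathrm{N}_{|S|}\bigl(\hat\beta_S,\; \sigma^2\gamma\,(X_S^\top X_S)^{-1}\bigr), \qquad
\beta_{S^c} = 0,
\]

where $\pi(S)$ penalizes large models through a marginal prior on $|S|$, $\gamma > 0$ is a dispersion parameter, and the data-dependent centering at $\hat\beta_S$ is what makes the prior empirical; with $\sigma^2$ unknown, a prior on $\sigma^2$ is needed as well.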
2.2 The posterior distribution
Following (Martin:2017), we also consider a fractional likelihood, namely the full likelihood raised to a power $\alpha \in (0,1)$:
Then the posterior for $\beta_S$, conditional on $S$ and $\sigma^2$, is
Consider the identity
where . Now, recalling that the prior for $\beta_{S^c}$ is concentrated at the origin
with prior probability one, it follows that
Also, the conditional posterior for $\sigma^2$ given $S$ is
Thus the conditional distribution for $\beta_S$ given $S$ is
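The step from a normal conditional on $\sigma^2$ to a $t$-distributed conditional is the standard normal-inverse-gamma marginalization; in generic notation (the symbols $a$, $b$, $m$, $V$ below are illustrative, not the paper's),

\[
\sigma^2 \sim \mathrm{IG}(a,b), \quad \beta_S \mid \sigma^2 \sim \mathrm{N}(m, \sigma^2 V)
\;\Longrightarrow\;
\beta_S \sim t_{2a}\bigl(m, \tfrac{b}{a}V\bigr),
\]

i.e., a multivariate $t$ with $2a$ degrees of freedom, location $m$, and scale matrix $(b/a)V$.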
Finally, the marginal posterior of $S$ is
All our results except Corollary 2 are obtained for all $\alpha \in (0,1)$ and all $\gamma > 0$. However, for higher-order properties, such as the coverage probability of credible sets, some conditions on $\alpha$ and $\gamma$ are required; for example, Corollary 2, which concerns uncertainty quantification, assumes an additional condition linking $\alpha$ and $\gamma$. In practice we can always set $\alpha$ close to $1$, so that the fractional likelihood is almost the same as the normal likelihood, and choose $\gamma$ so that the prior is nearly non-informative.
2.3 Tail bounds for the chi-squared distribution
For any , we have
Proof: See Lemma 4.1 of (Cao:2020).
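Bounds of this type are exemplified by the classical inequalities of Laurent and Massart (2000), recorded here for concreteness (the cited lemma may be stated in a different but related form): for $X \sim \chi^2_k$ and any $x > 0$,

\[
\mathbb{P}\bigl(X \ge k + 2\sqrt{kx} + 2x\bigr) \le e^{-x},
\qquad
\mathbb{P}\bigl(X \le k - 2\sqrt{kx}\bigr) \le e^{-x}.
\]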
(i) For any , we have
(ii) If , then
where is a constant.
(iii) For any , .
We define .
Assume the true model is
and let , .
We write $a_n \lesssim b_n$ to denote that $a_n \le C b_n$ for some constant $C > 0$ and all sufficiently large $n$, and $a_n \gtrsim b_n$ to denote that $a_n \ge c\, b_n$ for some constant $c > 0$ and all sufficiently large $n$.
3 Posterior concentration rates
Define the empirical Bayes posterior probability of an event as
and let . Then, recalling the pdf of a multivariate $t$-distribution, we have
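For reference, a $d$-dimensional multivariate $t$-distribution with $\nu$ degrees of freedom, location $\mu$, and scale matrix $\Sigma$ has pdf

\[
f(x) = \frac{\Gamma\bigl(\frac{\nu+d}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)\,(\nu\pi)^{d/2}\,|\Sigma|^{1/2}}
\Bigl(1 + \frac{(x-\mu)^\top \Sigma^{-1}(x-\mu)}{\nu}\Bigr)^{-(\nu+d)/2}.
\]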
Since , we have to add some regularity conditions to get posterior concentration results for our model.
(A1) There exist constants , , such that ,
For a given , define , i.e., the set of vectors with no fewer than  non-zero entries.
The following theorem implies that the posterior distribution is actually concentrated on a space of dimension close to .
Let , and assume that conditions (A1)-(A3) hold. Then there exists a constant , such that
with , uniformly in as .
To get posterior concentration results, we first establish a model selection result. The following theorem demonstrates that, asymptotically, our empirical Bayes posterior will not include any unnecessary variables.
Let , and assume that conditions (A1)-(A3) hold. Also, let the constant  in the marginal prior of  satisfy . Then , uniformly in .
To get model selection consistency, it remains to show that our empirical Bayes posterior will asymptotically not miss any true variables.
Then  is a non-increasing function of . By (Martin:2017), for any , , .
Assume , that conditions (A1)-(A3) hold, and that
where . We also assume , for  as in Theorem 1, where  is a constant with ,  being the parameter in the prior of . Then
(Selection consistency) Assume that the conditions of Theorem 3 hold. Then .
We now state our posterior concentration result. It is similar to the posterior concentration theorem in (Martin:2017), but our proof is completely different from theirs: they apply Hölder's inequality and a Rényi divergence formula, while the key to our proof is the model selection consistency result.
Assume conditions (A1)-(A3) hold. Then there exists a constant  such that
uniformly in as , where .
By adding some conditions on , we are able to separate  from , so that we can obtain a posterior consistency result for .
where is a positive sequence of constants to be specified.
There exists a constant such that
uniformly in as , where
is a constant, with  being the constant in Theorem 1.
Proof: Same as the proof of Theorem 3 in (Martin:2017).
If for any we have
where ,  are positive constants, then we obtain posterior consistency for the coefficients under the  norm .
4 Bernstein-von Mises Theorem
In this section, we show that the posterior distribution of  is asymptotically normal, and this property leads to many interesting results.
(Bernstein-von Mises theorem) Let  denote the Hellinger distance,  denote the total variation distance, and  be the empirical Bayes posterior distribution for . If  and , then
uniformly in as , where denotes the point mass distribution for concentrated at the origin.
Since , we also have
uniformly in as .
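For reference, for probability measures $P$ and $Q$ with densities $p$ and $q$ relative to a dominating measure $\mu$, the two distances above are (in one common normalization)

\[
H^2(P,Q) = \frac{1}{2}\int \bigl(\sqrt{p}-\sqrt{q}\bigr)^2 \, d\mu,
\qquad
\mathrm{TV}(P,Q) = \sup_{A}\, |P(A)-Q(A)|,
\]

and they satisfy $H^2 \le \mathrm{TV} \le \sqrt{2}\,H$, which is what lets a convergence statement in one distance be transferred to the other.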
Define , .
(Valid uncertainty quantification) If , then Eq. (13) implies
which validates uncertainty quantification.
Consider a pair , where  is a given matrix of explanatory variable values at which we want to predict the corresponding response . Let
be the conditional posterior predictive distribution of , given . This is a $t$-distribution with  degrees of freedom, location , and scale matrix
Then the predictive distribution of is
For this predictive distribution, we have a Bernstein-von Mises theorem analogous to Theorem 6; the proof is similar.
If and , then
uniformly in as .
Since , we also have
uniformly in as .
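As an illustration of how the $t$-shaped conditional predictive distribution above can be used, here is a small sampler sketch based on the standard representation $T = \mu + Z\sqrt{\nu/W}$ with $Z \sim \mathrm{N}(0,\Sigma)$ and $W \sim \chi^2_\nu$; the function name and the numerical parameters are hypothetical stand-ins for the degrees of freedom, location, and scale matrix given in the text.

```python
import numpy as np

def sample_mvt(loc, scale, df, size, rng):
    """Draw `size` samples from a multivariate t_df(loc, scale) via
    T = loc + Z * sqrt(df / W), Z ~ N(0, scale), W ~ chi2_df."""
    d = len(loc)
    L = np.linalg.cholesky(scale)              # scale = L @ L.T
    Z = rng.standard_normal((size, d)) @ L.T   # rows ~ N(0, scale)
    W = rng.chisquare(df, size=size)
    return loc + Z * np.sqrt(df / W)[:, None]

rng = np.random.default_rng(1)
draws = sample_mvt(np.array([0.0, 1.0]),                  # hypothetical location
                   np.array([[1.0, 0.3], [0.3, 2.0]]),    # hypothetical scale
                   df=10, size=5000, rng=rng)
print(draws.mean(axis=0))   # close to the location when df > 1
```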
Proof of Theorem 1:
Let , .
Let ,  denote the two terms on the right-hand side of (17). It suffices to show that there exists a constant , such that  with , uniformly in , for .
by Lemma 1 we have
Let , then . By ,
Using the mgf of a chi-squared distribution,
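Recall that for $X \sim \chi^2_k$ the mgf is $\mathbb{E}[e^{tX}] = (1-2t)^{-k/2}$ for $t < 1/2$, so Markov's inequality yields the Chernoff-type bound $\mathbb{P}(X \ge x) \le (1-2t)^{-k/2} e^{-tx}$ for any $t \in (0, 1/2)$.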
By (Martin:2017) we have . Also, since , . Hence
From the expression of , we get
So when , and ,