Revisiting High Dimensional Bayesian Model Selection for Gaussian Regression

05/15/2019 · by Zikun Yang, et al.

Model selection for regression problems with an increasing number of covariates continues to be an important problem both theoretically and in applications. Model selection consistency and mean structure reconstruction depend on the interplay between the Bayes factor learning rate and the penalization on model complexity. In this work, we present results for the Zellner-Siow prior for regression coefficients paired with a Poisson prior for model complexity. We show that model selection consistency restricts the dimension of the true model from increasing too quickly. Further, we show that the additional contribution to the mean structure from new covariates must be large enough to overcome the complexity penalty. The average Bayes factors for different sets of models involve random variables over the choices of columns from the design matrix. We show that a large class of these random variables have no moments asymptotically and need to be analyzed using stable laws. We derive the domain of attraction for these random variables and obtain conditions on the design matrix that provide for the control of false discoveries.


1 Introduction

We consider a generic Gaussian regression problem with a response vector y and a design matrix X, such that

y = α 1_n + X β + ε,    ε ~ N(0, σ² I_n).    (1)

In eq. 1, 1_n is an n-dimensional vector of 1's, α is an intercept term included in all the models, and β is a vector of regression coefficients. We assume that the data are generated by the true model. The dimensions of the true model and of the full model are assumed to grow with the sample size, hence the problem is in the high-dimensional regime. To address the problem, we also make a sparsity assumption, namely that a large portion of the coefficients in the full model are actually zero, and the goal here is to recover the exact support set,

{ j : β_j ≠ 0 }    (2)

of the true model.
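To fix ideas, the following sketch simulates data from a model of the form of eq. 1 with a sparse coefficient vector. The dimensions n, p, s and the coefficient values are illustrative assumptions, not quantities taken from the paper.

import numpy as np

# Minimal simulation of the sparse Gaussian regression set-up in eq. 1.
# The dimensions and coefficient values below are assumed for illustration.
rng = np.random.default_rng(0)
n, p, s = 200, 50, 5                      # sample size, full-model size, true-model size

X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized columns

beta = np.zeros(p)
beta[:s] = 1.0                            # only the first s coefficients are nonzero
alpha, sigma = 0.5, 1.0                   # intercept and noise standard deviation

y = alpha + X @ beta + sigma * rng.standard_normal(n)
true_support = set(np.flatnonzero(beta))  # the support set in eq. 2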

The problem of high-dimensional sparse solutions for Gaussian regression has been extensively investigated, and one of the most famous and inspiring methods is the Least Absolute Shrinkage and Selection Operator, the so-called Lasso [17]. It is well known that the Lasso estimator achieves sparsity while suffering from a persistent bias, and there have been tremendous efforts to fix the inadequacies of the Lasso, such as the Elastic net [25], the SCAD [5], etc. Meanwhile, Bayesian variants based on the idea of shrinking coefficients have also received great attention, e.g., the Bayesian Lasso [14], the Horseshoe prior [2], etc. All of these methods attain a sparse solution by shrinking certain coefficients to zero, and are assessed by some measure of predictive performance. Rather than focusing on criteria based on various norms, typically the ℓ1 and ℓ2 norms, the current paper focuses on variable-selection consistency under 0-1 loss through Bayesian methodology.

The consistency of Bayesian model selection has been established in many papers. The general procedure of Bayesian model selection is usually based on the posterior evidence provided by the Bayes factor, which makes a direct comparison between the marginals of the null and the alternative models [6]. The mathematical nature of the Bayes factor prevents the use of improper priors, since the arbitrary constants in such priors would cause an identifiability problem for any statistical inference. To fix this problem, the intrinsic prior was deliberately designed [1] to cancel out these nuisance constants, and the consistency of model selection for the pair-wise comparison between the null and the alternative model has been established [3], as well as for situations where the dimensions of the models grow with the sample size [12]. However, pair-wise consistency cannot guarantee that the posterior probability of the true model goes to one, due to the massive number of models.


Another choice of prior distribution for model selection is the famous Zellner's g-prior, given by

β_γ | g, σ² ~ N(0, g σ² (X_γ^T X_γ)^{-1}),    (3)

where g > 0 and X_γ is the corresponding design matrix composed of the predictors whose indexes are in the set γ. The popularity of Zellner's g-prior is primarily due to the fact that the corresponding Bayes factor can be computed in closed form. However, this relatively simple form of the Bayes factor gives rise to several paradoxes of model selection, and many empirical choices of the g parameter have been proposed to remedy the situation [10]. Liang et al. [10] proposed mixtures of g-priors, obtained by randomizing the g parameter, to overcome the paradoxes, and showed the consistency of model selection with mixtures of g-priors, especially with the Zellner-Siow prior [24]. It is also interesting to note that the intrinsic prior can be represented as a mixture of g-priors with a proven consistency of model selection [22].
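For illustration, the sketch below evaluates the closed-form Bayes factor of a model against the intercept-only null under a fixed-g Zellner g-prior, using the standard expression in terms of the coefficient of determination (see, e.g., Liang et al. [10]). The function names, the simulated data, and the unit-information choice g = n are assumptions made here for the example.

import numpy as np

def r_squared(y, X_gamma):
    # Coefficient of determination of the model with an intercept and the columns X_gamma.
    n = len(y)
    Z = np.column_stack([np.ones(n), X_gamma])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def log_bf_g_prior(y, X_gamma, g):
    # Log Bayes factor of model gamma against the intercept-only null under a fixed-g
    # g-prior: (1 + g)^((n - 1 - p)/2) / (1 + g (1 - R^2))^((n - 1)/2).
    n, p_gamma = X_gamma.shape
    R2 = r_squared(y, X_gamma)
    return 0.5 * (n - 1 - p_gamma) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - R2))

# Small synthetic example with the unit-information choice g = n.
rng = np.random.default_rng(1)
n, p_gamma = 100, 3
X_gamma = rng.standard_normal((n, p_gamma))
y = X_gamma @ np.array([1.0, 0.5, 0.0]) + rng.standard_normal(n)
print(log_bf_g_prior(y, X_gamma, g=n))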

One vital assumption in these papers is that the dimension of the full model does not grow with the sample size, hence the total number of models is well controlled. Whether or not the size of the true model is growing also has important consequences for the assumptions on the design matrix and for the consistency of model selection. With a fixed true model, procedures such as [13] [21] can tolerate harsher conditions, for example a full model growing at an exponential rate in the sample size, or eigenvalues of the Gram matrix converging to zero. On the other hand, the conditions or assumptions become harsher if the dimension of the true model is growing, especially when the sparsity is close to linear sparsity [20]. Also, when the dimensions are growing, the learning rate of the Bayes factor may break down, e.g., models nesting the true model become indistinguishable from the true model. Hence, the prior on the model space is often required to provide an extra penalty on dimensionality. These priors, e.g., the sparsity prior [4] [23] or the truncated Poisson prior [21], are all designed to penalize overfitted models and to control the model size for models that are not identifiable from the data [9].

In this paper, we establish model selection consistency using a truncated Poisson prior on the model space and a modified Zellner-Siow prior on the regression parameters. The paper is organized as follows. In Section 2, the main results are presented. We review the assumptions on the design matrix used in other papers, list the assumptions that we adopt, and emphasize their impact on the consistency. The difference between the sparsity prior and the truncated Poisson prior is highlighted. In Section 3, we show an interesting application of the stable law and the domain of attraction to the overfitted models. Section 4 contains the conclusion and possible directions for future work. The proofs of the theorems are in Section 5.

1.1 Notation

As stated above, let , , and denote the index sets of the predictors in the testing model, the true model, and the full model, respectively. For a vector and a set , is the vector of s for , and is the cardinality of . Let denote the null model, the model containing only the intercept term.

2 Main results

2.1 Mixture of g-priors & the Bayes factor

Now we specify the prior distributions of the parameters. For any model , the model and the priors are

(4)

where , , which corresponds to the recommendation of the unit information prior [6]. Zellner and Siow [24] place a multivariate Cauchy distribution on the coefficient vector, which can be represented as a mixture of normal distributions with mixing parameter g. The Zellner-Siow prior is a multivariate extension of Jeffreys's treatment of the normal mean hypothesis testing problem, where Jeffreys argued that the Cauchy prior is the simplest form that fixes the paradoxes mentioned above.
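In the usual parameterization (see [10], [24]), with X_γ denoting the design matrix of model γ, this mixture representation reads

β_γ | g, σ² ~ N(0, g σ² (X_γ^T X_γ)^{-1}),    g ~ Inverse-Gamma(1/2, n/2),

which, after integrating out g, yields a multivariate Cauchy prior on β_γ with scale matrix n σ² (X_γ^T X_γ)^{-1}.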

The goal of this paper is to show the selection consistency, such that

(5)

which is equivalent to showing that

(6)

The Bayes factor is defined as the ratio of the marginal densities between two competing models in eq. 6, such as

(7)

where is the marginal density of the model after integrating out the regression parameters. To simplify the computation, it is worth noting that the marginal density is invariant if switching by and assigning to . Integrating out the parameters , , and is straightforward, and the Bayes factor can be represented as a univariate integral over , such as

(8)

where is the coefficient of determination and is the projection matrix generated by the design matrix . In [10], the authors suggested approximating the integral in eq. 8 by the Laplace method [18], which requires solving a cubic equation in . Since the dimension of the true model is assumed to be fixed in [10], the solution is asymptotically stable as , which is not the case in the current set-up. In fact, this unsatisfactory approximation is the main reason that the Z&S prior never became popular in the first place, despite its appealing statistical properties.
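As an alternative to the Laplace approximation criticized above, a univariate integral over g of this type can simply be evaluated numerically. The sketch below does so for the original Zellner-Siow prior, combining the standard fixed-g Bayes factor against the null with the Inverse-Gamma(1/2, n/2) mixing density; the helper names and numerical values are assumptions for the example, and the paper's own choice (the Beta-prime mixing density introduced next) would replace the mixing density here.

import numpy as np
from scipy import integrate
from scipy.special import gammaln

def log_bf_fixed_g(g, n, p_gamma, R2):
    # Closed-form Bayes factor against the intercept-only null for a fixed g.
    return 0.5 * (n - 1 - p_gamma) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - R2))

def zs_mixing_density(g, n):
    # Inverse-Gamma(1/2, n/2) density on g, the mixture form of the Zellner-Siow prior.
    return np.exp(0.5 * np.log(n / 2.0) - gammaln(0.5) - 1.5 * np.log(g) - n / (2.0 * g))

def zs_bayes_factor(n, p_gamma, R2):
    # Numerical evaluation of a univariate integral over g of the type in eq. 8
    # (here: model gamma versus the intercept-only null).
    integrand = lambda g: np.exp(log_bf_fixed_g(g, n, p_gamma, R2)) * zs_mixing_density(g, n)
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

# Example: n = 200 observations, a model with 5 covariates, and R^2 = 0.3.
print(zs_bayes_factor(n=200, p_gamma=5, R2=0.3))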

In this paper, we choose the Beta-prime distribution with shape parameters and as a modified version of the Z&S prior. The parameters are deliberately chosen to cancel out the term in eq. 8, and also to maintain asymptotically equivalent behavior of the density function to the Zellner-Siow prior in the tail area and at the origin. The same set-up has been adopted in [11], and the resulting Bayes factor between the testing model and the true model can be shown to be

(9)

where is the Gamma function.

Remark 1

The major difference between the current paper and [23] is the choice between the original g-prior and the mixture of g-priors. Certainly, the original g-prior with an empirical choice of , e.g., or [23] [7], does exhibit a certain flexibility for model selection. However, the information paradox associated with the original g-prior demands more assumptions on the signals to prove consistency [10] [16] [23]. Also, when the null model is the true model, consistency holds for the Zellner-Siow prior but does not hold for the empirical g-prior [10]. The advantages of adopting the mixture of g-priors are easy to observe. First, it avoids the tuning step for g, since g has been integrated out of the marginals. Second, it can be shown that either when or , the Bayes factor associated with the mixture of g-priors retains selection consistency, which is not the case for the empirical g-prior. ∎

2.2 Model prior

In the fixed-dimension regime, each model is usually given equal prior probability, which cancels out during the procedure. Also, the convergence of the pair-wise comparison is enough to establish consistency, since the summation in eq. 6 is only over a finite number of models. However, things are different in the high-dimensional regime, and the convergence of the pair-wise comparison does not ensure the posterior selection consistency in eq. 5.

For a generic model γ, the prior probability of choosing γ can be represented as

π(γ) = π(|γ|) / C(p, |γ|),    (10)

where π(|γ|) is the prior probability of the size |γ|, C(p, |γ|) is the number of models of that size, and all the models with the same size share the probability mass π(|γ|) equally. In [4] [13] [23], the authors chose the sparsity prior to introduce a strong penalty on the dimension, with the form

(11)

for a positive constant. If we assume , then eq. 11 imposes a very strong penalization, and consistency has been proved in the papers adopting the sparsity prior. Equation 11 is also equivalent to , the prior inclusion probability of adding one generic covariate. From this perspective, it is obvious that this prior inclusion probability depends only on the size of the full model, without any consideration of the size of the current testing model, which makes it less adjustable or flexible for a variety of situations. Also, eq. 11 lacks a realistic interpretation for practitioners at face value. Last but not least, when comparing the growing true model to a finite model nested within it, the term in eq. 6 increases at an exponential rate in the sample size, which imposes harsh and unrealistic conditions on the design matrix and the signal-to-noise ratio to achieve model selection consistency.

We propose to use a truncated Poisson prior on the size of models, such that

π(|γ| = k) ∝ λ^k / k!,    (12)

where λ is the rate parameter, which can be tuned by practitioners. The truncated Poisson prior is derived by using the self-similarity property for model spaces of finite size and then letting the size grow; a short proof is given in [21]. The prior inclusion probability under the truncated Poisson distribution depends only on the size of the testing model, which makes it more adjustable than the sparsity prior. When dealing with overfitted models, the truncated Poisson prior provides a well-designed penalty that converts the summation of the Bayes factors generated by models of the same size into their arithmetic average; hence the convergence is governed by the probabilistic properties of the average of the Bayes factors instead of an overly strong dimensional penalty such as the sparsity prior. Also, the average of the Bayes factors is an average of random variables generated by randomly choosing the extraneous predictors, which leads to an interesting application of the stable law and the domain of attraction. We give a more detailed explanation in Section 3.
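To make the contrast concrete, the sketch below compares the per-dimension penalty implied by the two size priors, i.e., the ratio π(k+1)/π(k) of consecutive size-prior masses. It assumes, for illustration only, a sparsity-type prior proportional to p^(-a k) (in the spirit of the priors cited above; the exact form of eq. 11 is not reproduced here) and a Poisson prior proportional to λ^k / k!; the values of p, a, and λ are arbitrary.

import numpy as np

p, a, lam = 1000, 1.0, 2.0       # assumed full-model size, sparsity exponent, Poisson rate
k = np.arange(0, 15)

# Ratio pi(k+1)/pi(k) of consecutive size-prior masses ("cost" of one extra covariate):
sparsity_ratio = np.full(k.shape, p ** (-a))   # constant in k, depends only on p
poisson_ratio = lam / (k + 1.0)                # depends on the size of the testing model

for kk, sr, pr in zip(k, sparsity_ratio, poisson_ratio):
    print(f"k = {kk:2d}   sparsity-type: {sr:.2e}   truncated Poisson: {pr:.2e}")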

2.3 Assumptions

The assumptions that we make in this section are crucial to the proof of consistency. They reflect the trade-off inherent in the problem: a model with overly strict assumptions would lose practical relevance for real-world data, yet without any assumptions consistency cannot be proved.

Assumption 1 (Conditions on the design matrix).

  • Assume all the columns of the design matrix are standardized, such that

    (13)
  • There exists a positive constant , such that

    (14)

    where is the eigenvalue of a symmetric matrix .

  • We assume a constraint on the standardized errors, such that

    (15)

    where , and is the projection matrix generated by the design matrix .

Assumption 2 (Condition on the mean structure).

Given , , and , assume

(16)

where is a finite positive constant. ∎

Assumption 3 (Conditions on the growing rates).

The growing rate of the full model can be at most a fraction of the sample size, i.e., . ∎

Assumption 4 (Condition on the minimum signal).

Let denote ; then we assume that , which implies . Specifically, the minimum of the true coefficients is bounded above and below as

(17)

where is a finite constant bounded above by zero, and is a finite constant implied by Assumption 2. ∎

Remark 2

In Assumption 1, and are typically mild conditions that appear in much of the literature on consistency, e.g., [13] [21] [23] [16]. in Assumption 1 guarantees that the candidate models are not too close to one another to be distinguished. constrains the projection of the standardized errors onto the space generated by any extraneous predictors, excluding the space generated by the design matrix nesting the true model. This assumption is also a mild condition, intended only to prevent extreme behavior of the overfitted models and their extraneous predictors.

Assumption 2 is a reasonable condition placed on the true mean structure, which, surprisingly, is not considered explicitly in much of the literature. Without Assumption 2, under some circumstances the true mean would grow to infinity with the sample size, which is not sensible. Assumption 3 assumes the growing rate of the full model is linear in the sample size. Assumption 4 is related to the information-theoretic capacity of the model to recover the exact support [20]. As the growing rate of the true model hits the limit, i.e., , the minimum signal must satisfy a correspondingly harsh condition to separate itself from the noise. The reason for this rate is shown in Proposition 1. ∎

2.4 Result

Proposition 1.

Suppose that Assumptions 1, 2, 3, and 4 hold. Then the fastest growing rate of the true model that the procedure can tolerate is .

Proof.

See Section 5. ∎

Here is the main theorem of this paper.

Theorem 1.

Suppose that Assumptions 1, 2, 3, and 4 hold. Suppose . If

(18)

then, we have

(19)

with probability at least .

Proof.

See Section 5. ∎

The constant characterizes the relationship between the minimum of the eigenvalues and the minimum signal, both of which hinge on the growing rate of the true model. The probability in Theorem 1 is determined by the worst-case scenario, in which the testing models are overfitted models with only one extraneous predictor. The reason for such a slow convergence rate is the fast growing rate of the true model.

It is reasonable to consider the scenario in which the growing rate of the true model is for . It can be shown that the relatively strict assumptions can then be loosened, and the probability associated with the consistency converges at a faster order.

3 Stable law

One interesting aspect of the proof is an application of the stable law and the domain of attraction when dealing with the overfitted models. Let denote the set of overfitted models with a given number of extraneous predictors, such that

(20)

Then it is straightforward to show

(21)

where is the arithmetic average of the Bayes factors over the model set . As , each Bayes factor can be seen as a random variable randomly chosen from the set . Furthermore, as shown in eq. 9, the Bayes factor can be well approximated as

(22)

where for , and the convergence rate of the fractions associated with the Gamma-function terms is . Hence, the average of the Bayes factors is equivalent to the average of the random variables , generated by the models . However, a simple integration shows that the r.v. does not have any finite moments. Hence, to show the convergence of the average, the stable law is required. The next theorem establishes the convergence of the average in the i.i.d. situation.

Theorem 2.

Suppose and , where is a finite positive integer and . Then given the constants and , we have

(23)

where is a Cauchy random variable.

This theorem reveals that the average of the Bayes factors converges to a Cauchy r.v. and a slowly varying function associated with the total number of models in the set . Since the convergence rate provided by the Bayes factor is only , the next corollary shows that it is impossible to establish convergence for the overfitted models without assumptions controlling the behavior of the errors.
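The following simulation sketch illustrates the generic phenomenon behind Theorem 2 with a simple stand-in variable, not the paper's Bayes-factor variable: a Pareto(1)-type variable lies in the domain of attraction of a stable law with index one, so its sample mean does not concentrate at a constant; after the classical centering and scaling it keeps heavy, Cauchy-like tails.

import numpy as np

rng = np.random.default_rng(2)

def normalized_mean(n_samples):
    # X = 1/U with U ~ Uniform(0, 1) has tail P(X > x) = 1/x for x >= 1, so it lies
    # in the domain of attraction of a stable law with index alpha = 1 (no finite mean).
    x = 1.0 / rng.uniform(size=n_samples)
    # Classical normalization for alpha = 1: (S_n - n log n) / n.
    return (x.sum() - n_samples * np.log(n_samples)) / n_samples

reps = 2000
draws = np.array([normalized_mean(10_000) for _ in range(reps)])

# The centered and scaled sample means do not concentrate; their spread stays of
# order one with heavy tails, in contrast to a law-of-large-numbers limit.
print(np.quantile(draws, [0.05, 0.25, 0.50, 0.75, 0.95]))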

Corollary 1.

Given a dimensional difference and the corresponding model set defined in eq. 20, we have

(24)
Proof.

Let denote the number of models in . We have seen that the summation above can be represented as the average of the r.v.s , and the Gamma functions in eq. 22 only contribute a rate of . Suppose the s are i.i.d.; hence

(25)

where is a function of the difference and is not equal to 0 if is a finite positive integer. ∎

This corollary shows that consistency requires extra assumptions, e.g., (iii) in Assumption 1, to control the projection of the errors onto the subspace generated by the non-true predictors when the growing rate of the true model has reached the limit, i.e., . If the growing rate of the true model is slower than the limit, then the convergence rate provided by the Bayes factor will overcome the slowly varying function in and .

4 Conclusion

In this paper, the model selection problem for Gaussian regression in the high-dimensional regime has been studied. The mixture of g-priors and the truncated Poisson prior have been proposed to establish the consistency of model selection. We presented the consistency theorem under the situation of an extreme growing rate. It shows that as the limit of the growing rate of the true model is approached, the consistency theorem requires stricter assumptions. The stable law has been applied to make the argument for the unidentifiable situation of the overfitted models, when there are no extra assumptions on the extraneous predictors and the growing rate of the true model reaches the limit.

For the next step, it would be interesting to further investigate the stable law in the dependent situation, owing to the possible dependency structure among the extraneous covariates. Another direction would be to discuss the situation where the size of the full model is greater than the sample size. The authors in [23] simply truncated the model space by assigning zero probability to models whose size is greater than the sample size. We are considering adopting a PCA step to reduce the dimension as the first stage of the model selection.

5 Appendix

5.1 Lemmas

Let denote the projection matrix onto the span of .

Lemma 1.

A projection is a non-expansive mapping [23]. In other words, for any column , ,

(26)
Proof.
(27)

(i) is by the Cauchy-Schwarz inequality. ∎

Lemma 2.

For any column , ,

(28)
Proof.

Direct result from the use of Lemma 1 and in Assumption 1. ∎

Lemma 3.

Under Assumption 1, for any pair , , , we have

(29)
Proof.

W.l.o.g., assume ; then, by the blockwise inversion formula, one can show that the lower-right corner of the matrix is . The rest follows. ∎
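For reference, the blockwise-inversion identity invoked above is the standard Schur-complement formula; for a symmetric positive definite matrix partitioned into generic blocks A, B, and D (labels chosen here for illustration),

\begin{pmatrix} A & B \\ B^{\top} & D \end{pmatrix}^{-1}
= \begin{pmatrix} A^{-1} + A^{-1} B S^{-1} B^{\top} A^{-1} & -A^{-1} B S^{-1} \\ -S^{-1} B^{\top} A^{-1} & S^{-1} \end{pmatrix},
\qquad S = D - B^{\top} A^{-1} B,

so the lower-right block of the inverse is the inverse of the Schur complement S.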

Lemma 4.

For any vector and a generic symmetric matrix , we have

(30)

where and are the smallest and largest eigenvalues of .

Proof.

See [15]. ∎

Lemma 5.

Suppose is in the model set ; then it can be shown that

(31)
Proof.

Notice that .

(32)

∎

5.2 Proof of Theorem 1

5.2.1

We begin by proving the consistency under the situation . First, notice that there are models associated with for specific and . The Bayes factor for any is

(33)

The terms associated with the coefficients of determination can be shown as

(34)

where is due to Lemma 5. It can also be shown that

(35)

Let . Then,

(36)

The summand then can be represented as

(37)

If , then consistency requires that . On the other hand, suppose that for . Then , which grows faster than . This implies that the can converge to zero at a mild rate under this circumstance.

5.2.2 and

Under this circumstance, the dimensional penalty helps the procedure pick up the true model, compared to the previous situation. Define the model set as
; then the total number of models given any and is . The terms in the summand, excluding those associated with the coefficient of determination, can be shown to be

(38)

It is straightforward to show that

(39)

where . Combining Equation 38 and Equation 39 gives us

(40)

Given the condition in Theorem 1, the consistency holds in this situation.

5.3 Overfitted Model

Suppose that , for . Define the overfitted model set
associated with a given finite . Notice that the summation over the model set is

(41)

The individual Bayes factor is asymptotically equivalent to

(42)

where and for now. To show , it suffices to show that

(43)

This can be achieved by showing that the probability of the event defined by Equation 43 goes to 1, such that

(44)

where , and to guarantee the convergence. Notice that

(45)

which means that it suffices to show

(46)

Define a function

(47)

For any , , it can be shown