On the Computational Complexity of High-Dimensional Bayesian Variable Selection

05/29/2015 ∙ by Yun Yang, et al. ∙ 0

We study the computational complexity of Markov chain Monte Carlo (MCMC) methods for high-dimensional Bayesian linear regression under sparsity constraints. We first show that a Bayesian approach can achieve variable-selection consistency under relatively mild conditions on the design matrix. We then demonstrate that the statistical criterion of posterior concentration need not imply the computational desideratum of rapid mixing of the MCMC algorithm. By introducing a truncated sparsity prior for variable selection, we provide a set of conditions that guarantee both variable-selection consistency and rapid mixing of a particular Metropolis-Hastings algorithm. The mixing time is linear in the number of covariates up to a logarithmic factor. Our proof controls the spectral gap of the Markov chain by constructing a canonical path ensemble that is inspired by the steps taken by greedy algorithms for variable selection.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In many areas of science and engineering, it is common to collect a very large number of covariates

in order predict a response variable

. We are thus led to instances of high-dimensional regression, in which the number of covariates exceed the sample size . A large literature has emerged to address problems in the regime , where the ill-posed nature of the problem is addressed by imposing sparsity conditions—namely, that the response depends only on a small subset of the covariates. Much of this literature is based on optimization methods, where penalty terms are incorporated that yield both convex [33] and nonconvex [9, 39] optimization problems. Theoretical analysis is based on general properties of the design matrix and the penalty function.

Alternatively, one can take a Bayesian point of view on high-dimensional regression, placing a prior on the model space and performing the necessary integration so as to obtain a posterior distribution [12, 16, 5]

. Obtaining such a posterior allows one to report a subset of possible models along with their posterior probabilities as opposed to a single model. One can also report the marginal posterior probability of including each covariate. Some recent work has provided some theoretical understanding of the performance of Bayesian approaches to variable selection. In the moderate-dimension scenario (in which

is allowed to grow with but ), Shang and Clayton [28] establish posterior consistency for variable selection in a Bayesian linear model, meaning that the posterior probability of the true model that contains all influential covariates tends to one as grows. Narisetty and He [26] consider a high-dimensional scenario in which can grow nearly exponentially with ; in this setting, they show the Bayesian spike-and-slab variable-selection method achieves variable-selection consistency. Since this particular Bayesian method resembles a randomized version of -penalized methods, it could have better performance than -penalized methods for variable selection under high-dimensional settings [26, 29]. Empirical evidence for this conjecture is provided by Guan et al. [13] for SNP selection in genome-wide association studies, but it has not been confirmed theoretically.

The most widely used tool for fitting Bayesian models are sampling techniques based on Markov chain Monte Carlo (MCMC), in which a Markov chain is designed over the parameter space so that its stationary distribution matches the posterior distribution. Despite its popularity, the theoretical analysis of the computational efficiency of MCMC algorithms lags that of optimization-based methods. In particular, the central object of interest is the mixing time of the Markov chain, which characterizes the number of iterations required to converge to an -distance of stationary distribution from any initial configuration. In order for MCMC algorithms to be controlled approximations, one must provide meaningful bounds on the mixing time as a function of problem parameters such as the number of observations and the dimensionality. Of particular interest is determining whether the chain is rapidly mixing—meaning that the mixing time grows at most polynomially in the problem parameters—or slowly mixing meaning that the mixing time grows exponentially in the problem parameters. In the latter case, one cannot hope to obtain approximate samples from the posterior in any reasonable amount of time for large models.

Unfortunately, theoretical analysis of mixing time is comparatively rare in the Bayesian literature, with a larger number of negative results. On the positive side, Jones and Hobert [17]

consider a Bayesian hierarchical version of the one-way random effects model, and obtain upper bounds on the mixing time of Gibbs and block Gibbs samplers as a function of the initial values, data and hyperparameters. Belloni and Chernozhukov 

[4]

show that a Metropolis random walk is rapidly mixing in the dimension for regular parametric models in which the posterior converges to a normal limit. It is more common to find negative results in the literature. Examples include Mossel and Vigoda 

[25], who show that the MCMC algorithm for Bayesian phylogenetics takes exponentially long to reach the stationary distribution as data accumulates, and Woodard and Rosenthal [37], who analyze a Gibbs sampler used for genomic motif discovery and show that the mixing time increases exponentially as a function of the length of the DNA sequence.

The goal of the current paper is to study the computational complexity of Metropolis-Hastings procedures for high-dimensional Bayesian variable selection. For concreteness, we focus our analysis on a specific hierarchical Bayesian model for sparse linear regression, and an associated Metropolis-Hastings random walk, but these choices should be viewed as representative of a broader family of methods. In particular, we study the well-known Zellner -prior for linear regression [38]. The main advantage of this prior is the simple expression that it yields for the marginal likelihood, which is convenient in our theoretical investigations. As in past analyses [26], we consider the marginal probability of including each covariate into the model as being on the order of . Moreover, we restrict the support of the prior to rule out unrealistically large models. As a specific computational methodology, we focus on an iterative, local-move and neighborhood-based procedure for sampling from the model space, which is motivated by the shotgun stochastic search [14].

Our main contribution is to provide conditions under which Bayesian posterior consistency holds, and moreover, the mixing time grows linearly in (up to logarithmic factor), implying that the chain is rapidly mixing. As a by-product, we provide conditions on the hyper-parameter to achieve model-selection consistency. We also provide a counter-example to illustrate that although ruling out unrealistically large models is not necessary for achieving variable-selection consistency, it is necessary in order that the Metropolis-Hastings random walk is rapidly mixing. To be clear, while our analysis applies to a fully Bayesian procedure for variable selection, it is based on a frequentist point of view in assuming that the data are generated according to a true model.

There are a number of challenges associated with characterizing the computational complexity of Markov chain methods for Bayesian models. First, the posterior distribution of a Bayesian model is usually a much more complex object than the highly structured distributions in statistical physics for which meaningful bounds on the Markov chain mixing times are often obtained (e.g. [6], [23], [21]). Second, the transition probabilities of the Markov chain are themselves stochastic, since they depend on the underlying data-generating process. In order to address these challenges, our analysis exploits asymptotic properties of the Bayesian model to characterize the typical behavior of the Markov chain. We show that under conditions leading to Bayesian variable-selection consistency, the Markov chain over the model space has a global tendency of moving towards the true data-generating model, even though the posterior distribution can be highly irregular. In order to bound the mixing time, we make use of the canonical path technique developed by Sinclair [31, 30] and Diaconis and Stroock [8]. More precisely, the particular canonical path construction used in our proof is motivated by examining the solution path of stepwise regression procedures for linear model selection (e.g.,  [40, 2]), where a greedy criterion is used to decide at each step whether a covariate is to be included or deleted from the curent model.

Overall, our results reveal that there is a delicate interplay between the statistical and computational properties of Bayesian models for variable selection. On the one hand, we show that concentration of the posterior is not only useful in guaranteeing desirable statistical properties such as parameter estimation or model-selection consistency, but they also have algorithmic benefits in certifying the rapid mixing of the Markov chain methods designed to draw samples from the posterior. On the other hand, we show that posterior consistency on its own is

not sufficient for rapid mixing, so that algorithmic efficiency requires somewhat stronger conditions.

The remainder of this paper is organized as follows. Section 2 provides background on the Bayesian approach to variable selection, as well as Markov chain algorithms for sampling and techniques for analysis of mixing times. In Section 3, we state our two main results (Theorems 1 and 2) for a class of Bayesian models for variable selection, along with simulations that illustrate the predictions of our theory. Section 4 is devoted to the proofs of our results, with many of the technical details deferred to the appendices. We conclude in Section 5 with a discussion.

2 Background and problem formulation

In this section, we introduce some background on the Bayesian approach to variable selection, as well some background on Markov chain algorithms for sampling, and techniques for analyzing their mixing times.

2.1 Variable selection in the Bayesian setting

Consider a response vector

and a design matrix that are linked by the standard linear model

(1)

and is the unknown regression vector. Based on observing the pair , our goal is to recover the support set of —that is, to select the subset of covariates with non-zero regression weights, or more generally, a subset of covariates with absolute regression weights above some threshold.

In generic terms, a Bayesian approach to variable selection is based on first imposing a prior over the set of binary indicator vectors, and then using the induced posterior (denoted by ) to perform variable selection. Here each binary vector should be thought of as indexing the model which involves only the covariates indexed by . We make use of the shorthand corresponding to the number of non-zero entries in , or the number of active covariates in the associated model. It will be convenient to adopt a dualistic view of as both a binary indicator vector, and as a subset of . Under this identification, the expression for a pair of inclusion vectors can be understood as that the subset of variables selected by is contained in the subset of variables selected by . Similarly, it will be legitimate to use set operators on those indicator vectors, such as , and . Using the set interpretation, we let denote the submatrix formed of the columns indexed by , and we define the subvector in an analogous manner. We make use of this notation in defining the specific hierarchical Bayesian model analyzed in this paper, defined precisely in Section 3.1 to follow.

2.2 MCMC algorithms for Bayesian variable selection

Past work on MCMC algorithms for Bayesian variable selection can be divided into two main classes—Gibbs samplers (e.g., [12, 16, 26]) and Metropolis-Hastings random walks (e.g. [14, 13]). In this paper, we focus on a particular form of Metropolis-Hastings updates.

In general terms, a Metropolis-Hastings random walk is an iterative and local-move based procedure involving three steps:

Step 1:

Use the current state to define a neighborhood of proposal states.

Step 2:

Choose a proposal state in

according to some probability distribution

over the neighborhood, e.g. the uniform distribution.

Step 3:

Move to the new state with probability , and stay in the original state with probability , where the acceptance ratio is given by

(2)

In this way, for any fixed choice of the neighborhood structure , we obtain a Markov chain with transition probability given by

The specific form of Metropolis-Hastings update analyzed in this paper is obtained by randomly selecting one of the following two schemes to update , each with probability .

Single flip update:

Choose an index uniformly at random, and form the new state by setting .

Double flip update:

Define the subsets and let . Choose an index pair uniformly at random, and form the new state by flipping from to and from to . (If the set is empty, then we do nothing.)

This scheme can be understood as a particular of the general Metropolis-Hastings scheme in terms of a neighborhood to be all models that can be obtained from by either changing one component to its opposite (i.e., from to , or from to ) or switching the values of two components with different values.

Letting denote the Hamming distance between and . the overall neighborhood is given by the union , where

With these definitions, the transition matrix of the previously described Metropolis-Hastings scheme takes the form

(3)

2.3 Background on mixing times

Let be an irreducible, aperiodic Markov chain on the discrete state space , and described by the transition probability matrix with stationary distribution . We assume throughout that is reversible; i.e., it satisfies the detailed balance condition for all . It is easy to see that the previously described Metropolis-Hastings matrix satisfies this reversibility condition. It is convenient to identify a reversible chain with a weighted undirected graph on the vertex set , where two vertices and are connected if and only if the edge weight is strictly positive.

For and any subset , we write . If is the initial state of the chain, then the total variation distance to the stationary distribution after iterations is

The -mixing time is given by

(4)

which measures the number of iterations required for the chain to be within distance of stationarity. The efficiency of the Markov chain can be measured by the dependence of on the difficulty of the problem, for example, the dimension of the parameter space and the sample size. In our case, we are interesed in the dependence of on the covariate dimension and the sample size . Of particular interest is whether the chain is rapidly mixing, meaning that the mixing time grows at most polynomially in the pair , or slowly mixing, meaning that the mixing time grows exponentially.

3 Main results and their consequences

The analysis of this paper applies to a particular family of hierarchical Bayes models for variable selection. Accordingly, we begin by giving a precise description of this family of models, before turning to statements of our main results and a discussion of their consequences. Our first result (Theorem 1) provides sufficient conditions for posterior concentration, whereas our second result (Theorem 2) provides sufficient conditions for rapid mixing of the Metropolis-Hastings updates.

3.1 Bayesian hierarchical model for variable selection

In addition to the standard linear model (1), the Bayesian hierarchical model analyzed in this paper involves three other ingredients: a prior over the precision parameter

(or inverse noise variance) in the linear observation model, a prior on the regression coefficients, and a prior over the binary indicator vectors. More precisely, it is given by

(5a)
(5b)
(5c)
(5d)

For each model , there are three parameters to be specified: the integer is a prespecified upper bound on the maximum number of important covariates, the hyperparameter controls the degree of dispersion in the regression prior, and the hyperparameter penalizes models with large size. For a given integer , we let the class of all models involving at most covariates.

Let us make a few remarks on our choice of Bayesian model. First, the choice of covariance matrix in the regression prior—namely, involving —is made for analytical convenience, in particular in simplifying the posterior. A more realistic choice would be the independent prior

However, the difference between these choices will be negligible when , which, as shown by our theoretical analysis, is the regime under which the posterior is well-behaved. Another popular choice for the prior of is the spike-and-slab prior [16], where for each covariate , one specifies the marginal prior for

as a mixture of two normal distributions, one with a substantially larger variance than the other, and

can be viewed as the latent class indicator for this mixture prior. Our primary motivation in imposing Zellner’s -prior is in order to streamline the theoretical analysis: it leads to an especially simple form of the marginal likelihood function. However, we note that our conclusions remain valid under essentially the same conditions when the independent prior or the spike-and-slab prior is used, but with much longer proofs. The sparsity prior on is similar to the prior considered by Narisetty and He [26] and Castillo et al. [7]. The

decay rate for the marginal probability of including each covariate imposes a vanishing prior probability on the models of diverging sizes. The only difference is that we put a constraint

to rule out models with too many covariates. As will be clarified in the sequel, while this additional constraint is not needed for Bayesian variable-selection consistency, it is necessary for rapid mixing of the MCMC algorithm that we analyze.

Recall from our earlier set-up that the response vector is generated from the standard linear model , where , is the unknown regression vector, and

the unknown noise standard deviation. In rough terms, the goal of variable selection is to determine the subset

of “influential” covariates. In order to formalize this notion, let us fix a constant depending on that quantifies the minimal signal size requirement for a covariate to be “influential”. We then define to consist of the indices with relatively large signal—namely

(6)

and our goal is to recover this subset. Thus, the “non-influential” coefficients are allowed to be non-zero, but their magnitudes are constrained.

We let be the indicator vector that selects the influential covariates, and let be the size of the corresponding “true” model . Without loss of generality, we may assume that the first components of are ones, and the rest are zeros. We assume throughout this section that we are in the high-dimensional regime where , since the low-dimensional regime where is easier to analyze. For any symmetric matrix , let and

denote its smallest and largest eigenvalues. Our analysis involves the following assumptions:

Assumption A (Conditions on ):

The true regression vector has components that satisfy the bounds

(7a)
for some universal constant .

In the simplest case, the true regression vector is -sparse (meaning that ), so that the off-support condition holds trivially. As for the full condition, it is known [28] that some form of upper bound on the norm in terms of the -hyperparameter is required in order to prove Bayesian model selection consistency [28]. The necessity of such a condition is a manifestation of the so-called information paradox of -priors [22].

Our next assumption involves an integer parameter , which is set either to a multiple of the true sparsity (in order to prove posterior concentration) or the truncated sparsity (in order to prove rapid mixing).

Assumption B (Conditions on the design matrix):

The design matrix has been normalized so that for all ; moreover, letting , there exist constants and such that

(7b)

where denotes projection onto the span of . The lower restricted eigenvalue condition is a mild requirement, and one that plays a role in the information-theoretic limitations of variable selection [34]. On the other hand, the sparse projection condition can always be satisfied by choosing . To see this, notice that and there are at most different choice of distinct pair . Therefore, by the Gaussianity of , the sparse projection condition always holds with . On the other extreme, if the design matrix has orthogonal columns, then . As a consequence, due to the same argument, the sparse projection condition holds with , which depends neither on nor on .

Assumption C (Choices of prior hyperparameters):

The noise hyperparameter and sparsity penalty hyperparameter are chosen such that

(7c)

In the low-dimensional regime , the -prior with either the unit information prior , or the choice have been recommended [18, 10, 32]. In the intermediate regime where , Sparks et al. [32] show that must grow faster than for the Bayesian linear model without variable selection to achieve posterior consistency. These considerations motivate us to choose the hyperparameter for the high-dimensional setting as for some , and our theory establishes the utility of this choice.

Assumption D (Sparsity control):

For a constant , one of the two following conditions holds:

Version D:

We set in the sparsity prior (5d), and the true sparsity is bounded as for some constant .

Version D:

The sparsity parameter in the prior (5d) satisfies the sandwich relation

(7d)

where .

Assumptions A, B, C and D are a common set of conditions assumed in the existing literature (e.g., [28, 26]) for establishing Bayesian variable-selection consistency; i.e., that the posterior probability of the true model as .

3.2 Sufficient conditions for posterior consistency

Our first result characterizes the behavior of the (random) posterior . As we mentioned in Section 2.1, Bayesian variable-selection consistency does not require that the sparsity prior (5d) be truncated at some sparsity level much less than , so that we analyze the hierarchical model with , and use the milder Assumption D. The reader should recall from equation (6) the threshold parameter that defines the subset of influential covariates.

Theorem 1 (Posterior concentration).

Suppose that Assumption A, Assumption B with , Assumption C, and Assumption D hold. If the threshold satisfies

(8)

then we have with probability at least .

The threshold condition (8) requires the set of influential covariates to have reasonably large magnitudes; this type of signal-to-noise condition is needed for establishing variable selection consistency of any procedure [34]. We refer to it as the -condition in the rest of the paper. Due to the mildness of Assumption A (conditions on ), the claim in the theorem holds even when the true model is not exactly sparse: Assumption A allows the residual to be nonzero as long as it has small magnitude.

It is worth noting that the result of Theorem 1 covers two regimes, corresponding to different levels of signal-to-noise ratio. More precisely, it is useful to isolate the following two mutually exclusive possibilities:

High SNR: (9a)
Low SNR: (9b)

In terms of the parameter in Assumption A, The high SNR regime corresponds to , whereas the low SNR regime corresponds to . The intuition for the low SNR setting is that the signal in every component is so weak that the “penalty” induced by hyperparameters completely overwhelms it. Theorem 1 guarantees that the posterior concentrates around the model under the high SNR condition, and under the null model under the low SNR condition. More precisely, we have:

Corollary 1.

Under the conditions of Theorem 1, with probability at least :

  1. Under the high SNR condition (9a), we have .

  2. Conversely, under the low SNR condition (9b), we have .

Corollary 1 provides a complete characterization of the high or low SNR regimes, but it does not cover the intermediate regime, in which some component of is sandwiched as

(10)

On one hand, Theorem 1 still guarantees a form of Bayesian variable selection consistency in this regime. However, the MCMC algorithm for sampling from the posterior can exhibit slow mixing due to multimodality in the posterior. In Appendix A.2, we provide a simple example that satisfies the conditions of Theorem 1, so that posterior consistency holds, but the Metropolis-Hastings updates have mixing time growing exponentially in . This example reveals a phenomenon that might seem counter-intuitive at first sight: sharp concentration of the posterior distribution need not lead to rapid mixing of the MCMC algorithm.

3.3 Sufficient conditions for rapid mixing

With this distinction in mind, we now turn to developing sufficient conditions for Metropolis-Hastings scheme (3) to be rapidly mixing. As discussed in Section 2, this rapid mixing ensures that the number of iterations required to converge to an -ball of the stationary distribution grows only polynomially in the problem parameters. The main difference in the conditions is that we now require Assumption B—the RE and sparse projection conditions—to hold with parameter , as opposed to with the smaller parameter involved in Theorem 1.

Theorem 2 (Rapid mixing guarantee).

Suppose that Assumption A, Assumption B with , Assumption C, and Assumption D() all hold. Then under either the high SNR condition (9a) or the low SNR condition (9b), there are universal constants such that, for any , the -mixing time of the Metropolis-Hastings chain (3) is upper bounded as

(11)

with probability at least .

According to our previous definition (4) of the mixing time, Theorem 2 characterizes the worst case mixing time, meaning the number of iterations when starting from the worst possible initialization. If we start with a good intial state—for example, the true model would be a nice though impractical choice—then we can remove the term in the upper bound (11). In this way, the term can be understood as the worst-case number of iterations required in the burn-in period of the MCMC algorithm.

Theorem 1 and Theorem 2 lead to the following corollary, stating that after iterations, the MCMC algorithm will output with high probability.

Corollary 2.

Under the conditions of Theorem 2, for any fixed iterate such that

the iterate from the MCMC algorithm matches with probability at least .

As with Corollary 1, Theorem 2 does not characterize the intermediate regime in which some component of satisfies the sandwich inequality (10). Based on our simulations, we suspect that the Markov chain might be slowly mixing in this regime, but we do not have a proof of this statement.

3.4 Illustrative simulations

In order to illustrate the predictions of Theorem 2, we conducted some simulations. We also provide an example for which a frequentist method such as the Lasso fails to perform correct variable selection while our Bayesian method succeeds.

3.4.1 Comparison of mixing times

In order to study mixing times and their dependence on the model structure, we performed simulations for linear models with random design matrices, formed by choosing row

i.i.d. from a multivariate Gaussian distribution. In detail, setting the noise variance

, we considered two classes of linear models with random design matrices , in each case formed with i.i.d. rows :

Independent design:
Correlated design:

In all cases, we choose a design vector with true sparsity , taking the form

where is a signal-to-noise parameter. Varying the parameter SNR allows us to explore the behavior of the chains when the model lies on the boundary of the -condition. We performed simulations for the SNR parameter , sample sizes , and number of covariates . In all cases, we specify our prior model by setting the dispersion hyperparameter and the expected maximum model size .

Figure 1 plots the typical trajectories of log-posterior probability versus the number of iterations of the Markov chain under the independent design. In the strong signal regime (), the true model receives the highest posterior probability, and moreover the Metropolis-Hastings chain converges rapidly to stationarity, typically within iterations. This observation is confirmation of our theoretical prediction of the behavior when all nonzero components in have relative high signal-to-noise ratio (). In the intermediate signal regime (), Bayesian variable-selection consistency typically fails to hold, and here, we find that the chain converges even more quickly to stationarity, typically within iterations. This observation cannot be fully explained by our theory. A simulation to follow using a correlated design shows that it is not a robust phenomenon: the chain can have poor mixing performance in this intermediate signal regime when the design is sufficiently correlated.

(a) (b)
Figure 1: Log-posterior probability versus the number of iterations (divided by the number of covariates ) of randomly initialized Markov chains with , and in the independent design. In all cases, each grey curve corresponds to one trajectory of the chain ( chains in total). Half of the chains are initialized at perturbations of the null model and half the true model. (a) Weak signal case: . (b) Strong signal case: (the posterior probability of the true model coincides with that of the highest probability model).

In order to gain further insight into the algorithm’s performance, for each pair we ran the Metropolis-Hastings random walk based on six initializations: the first three of them are random perturbations of the null model, whereas the remaining three are the true model. We made these choices of initialization because our empirical observations suggest that the null model and the true model tend to be near local modes of the posterior distribution. We run the Markov chain for iterations and use the Gelman-Rubin (GR) scale factor [11] to detect whether the chains have reached stationarity. More precisely, we calculate the GR scale factor for the coefficient of determination summary statistics

where denotes the projection matrix onto the span of . Since the typical failing of convergence to stationarity is due to the multimodality of the posterior distribution, the GR scale factor can effectively detect the problem. If the chains fail to converge, then the GR scale factor will be much larger than ; otherwise, the scale factor should be close to . Convergence of the chain within at most iterations provides empirical confirmation of our theoretical prediction that the mixing time grows at most linearly in the covariate dimension . (As will be seen in our empirical studies, the sample size and have little impact on the mixing time, as long as remains small compared to .)

SP 100 100 100 100
H-T 113.4 24.6 0 0
N-T 113.4 11.4 -210.9 -383.6
SP 100 100 100 100
H-T 148.7 33.2 0 0
N-T 148.7 17.4 -216.6 -395.9
SP 100 100 100 100
H-T 117.1 34.8 0 0
N-T 117.1 -6.9 -342.4 -649.5
SP 100 100 100 100
H-T 160.4 32.8 0 0
N-T 160.4 -4.2 -377.6 -743.4
Table 1: Convergence behaviors of the Markov chain methods with sample sizes , ambient dimensions , and in the independent design. SP: proportion of successful trials (in which GR); H-T: log posterior probability difference between the highest probability model and the true model; N-T: log posterior probability difference between the null model and the true model. Each quantity is computed based on 20 simulated datasets.
SP 100 95 80 100
H-T 123.4 75.2 0 0
N-T 123.4 71.2 -107.3 -275.8
SP 100 15 100 100
H-T 170.0 81.0 0 0
N-T 170.0 78.7 -102.1 -288.9
SP 100 100 100 100
H-T 138.7 75.1 0 0
N-T 138.7 -67.0 -180.8 -431.7
SP 100 100 100 100
H-T 161.8 61.9 0 0
N-T 161.8 -58.8 -204.2 -445.4
Table 2: Convergence behavior of the Markov chain methods with sample size , ambient dimension , and parameter for the case of correlated design. SP: proportion of successful trials (in which GR ); H-T: log posterior probability difference between the highest probability model and the true model; N-T: log posterior probability difference between the null model and the true model. Each quantity is computed based on simulated datasets.

We report the percentage of simulated datasets for which the GR scale factor from six Markov chains is less than (success). Moreover, to see whether the variable-selection procedure based on the posterior is consistent, we also compute the difference between the highest posterior probability found during the Markov chain iterations and the posterior probability of the true model (H-T) and the difference in posterior probabilities between the null model and the true model (N-T). If the true model receives the highest posterior probability, then H-T would be ; if the null model receives the highest posterior probability, then N-T would be the same as H-T.

Table 1 shows the results for design matrices drawn from the independent ensemble. In this case, the Markov chain method has fast convergence in all settings (it converges within iterations). From the table, the setting (respectively ) corresponds to the weak (respectively strong) signal regime, while is in the intermediate regime where neither the null model nor the true model receives the highest posterior probability. Table 2 shows the results for design matrices drawn from the correlated ensemble. Now the Markov chain method exhibits poor convergence behavior in the intermediate regime with , but still has fast convergence in the weak and strong signal regimes. However, with larger sample size , the Markov chain has fast convergence in all settings on and SNR. Comparing the results under the two different designs, we find that correlations among the covariates increases the difficulty of variable-selection tasks when Markov chain methods are used. Moreover, the results under the correlated design suggest that there exists a regime, characterized by , and SNR, in which the Markov chain is slowly mixing. It would be interesting to see whether or not this regime characterizes some type of fundamental limit on computationally efficient procedures for variable selection. We leave this question open as a possible future direction.

3.4.2 Bayesian methods versus the Lasso

Our analysis reveals one possible benefit of a Bayesian approach as opposed to -based approaches such as the Lasso. It is well known that the performance of the Lasso and related -relaxations depends critically on fairly restrictive incoherence conditions on the design matrix. Here we provide an example of an ensemble of linear regression problems for which the Lasso fails to perform correct variable selection whereas the Bayesian approach succeeds with high probability.

For Lasso-based methods, the irrepresentable condition

(12)

is both sufficient and necessary for variable-selection consistency [24, 41, 35]. In our theory for the Bayesian approach, the analogous conditions are the upper bound in Assumption D on the maximum model size, namely

(13)

as well as the sparse projection condition in Assumption B. Roughly speaking, the first condition is needed to ensure that saturated models, i.e., models with size , receive negligible posterior probability, such that if too many unimportant covariates are included the removal of some of them does not hurt the goodness of fit (see Lemma 8 in the Appendix). This condition is weaker than the irrepresentable condition since we can always choose large enough so that holds, as long as Assumption B is not violated.

As an example, consider a design matrix that satisfies

with . (When , we may consider instead a random design where the rows of are generated i.i.d. from the -variate normal distribution .) This example was previously analyzed by Wainwright [34], who shows that it is an interesting case in which there is a gap between the performance of -based variable-selection recovery and that of an optimal (but computationally intractable) method based on searching over all subsets. For a design matrix of this form, we have , so that the irrepresentable condition fails if . Consequently, by known results on the necessity of the irrepresentable condition for Lasso [41, 35], it will fail in performing variable selection for this ensemble.

On the other hand, for this example, it can be verified that Assumption D is satisfied with , and moreover, that the the RE condition in Assumption B holds with , whereas the sparse projection condition is satisfied with . The only consequence for taking larger values of is in the -condition: in particular, the threshold is always lower bounded by . Consequently, our theory shows that the Bayesian procedure will perform correct variable selection with high probability for this ensemble.

Figure 2: Boxplots indicating variable-selection performance of the Bayesian approach (BVS) and the Lasso. The boxplots are based on the logarithms of the ratio between the posterior probability of the selected model and the true model over replicates. The model selected by the Bayesian approach is the median probability model [3] and the regularization parameter of the Lasso is chosen by cross-validation.

To compare the performance of the Bayesian approach and the Lasso under this setup, we generate our design matrix from a Gaussian version of this ensemble; i.e., the rows of are generated i.i.d. from the -variate normal distribution . We choose so that , i.e. the irrepresentable condition fails. Figure 2 shows the variable-selection performance for the Bayesian approach and the Lasso over replicates. We report the logarithm of the ratio between the posterior probability (see equation (42)) of the selected model and the true model, where we use the median probability model [3] as the selected model of the Bayesian approach. If a variable-selection approach has good performance, then we will expect this logarithm to be close to zero. Figure 2 shows that the Bayesian approach almost always selects the true model while the Lasso fails most of the time, which is consistent with the theory.

4 Proofs

We now turn to the proofs of our main results, beginning with the rapid mixing guarantee in Theorem 2, which is the most involved technically. We then use some of the machinery developed in Theorem 2 to prove the posterior consistency guarantee in Theorem 1. Finally, by combining these two theorems we prove Corollary 2. In order to promote readability, we defer the proofs of certain more technical results to the appendices.

4.1 Proof of Theorem 2

For the purposes of this proof, let denote the transition matrix of the original Metropolis-Hastings sampler (3). Now consider instead the transition matrix , corresponding to a lazy random walk that has a probability of at least in staying in its current state. By construction, the smallest eigenvalue of will always be nonnegative, and as a consequence, the mixing time of the Markov chain is completely determined by the second largest eigenvalue of . The difference is known as the spectral gap, and for any lazy Markov chain, we have the sandwich relation

(14)

See the papers [30, 37] for bounds of this form.

Using this sandwich relation, we claim that it suffices to show that there are universal constants such that with probability at least , the spectral gap of the lazy transition matrix is lower bounded as

(15)

To establish the sufficiency of this intermediate claim, we apply Theorem 1 and make use of the expression (42) for the posterior distribution, thereby obtaining that for , the posterior probability is lower bounded as

with probability at least . Combining the above two displays with the sandwich relation (14), we obtain that there exist constants such that for ,

with probability at least .

Accordingly, the remainder of our proof is devoted to establishing the spectral gap bound (15), and we do so via a version of the canonical path argument [30]. Let us begin by describing the idea of a canonical path ensemble associated with a Markov chain. Given a Markov chain with state space , consider the weighted directed graph with vertex set and edge set

in which an ordered pair

is included as an edge with weight if and only if . A canonical path ensemble for is a collection of paths that contains, for each ordered pair of distinct vertices, a unique simple path in the graph that connects and . We refer to any path in the ensemble as a canonical path.

In terms of this notation, Sinclair [30] shows that for any reversible Markov chain and any choice of canonical path , the spectral gap of is lower bounded as

(16)

where corresponds to the length of a longest path in the ensemble , and the quantity is known as the path congestion parameter.

In order to apply this approach to our problem, we need to construct a suitable canonical path ensemble . To begin with, let us introduce some notation for operations on simple paths. For two given paths and :

  • Their intersection corresponds to the subset of overlapping edges. (For instance, if and , then .)

  • If , then denotes the path obtained by removing all edges in from . (With the same specific choices of as above, we have .)

  • We use to denote the reverse of . (With the choice of as above, we have .)

  • If the endpoint of and the starting point of are the same, then we define the union as the path that connects and together. (If and , then their union is given by .)

We now turn to the construction of our canonical path ensemble. At a high level, our construction is inspired by the variable-selection paths carved out by greedy stepwise variable-selection procedures (e.g.,  [40, 2]).

Canonical path ensemble construction for :

First, we construct the canonical path from any to the true model . The following construction will prove helpful. We call a set of canonical paths memoryless with respect to the central state if: (1) for any state satisfying , there exists a unique simple path in that connects and ; (2) for any intermediate state on any path in , the unique path in that connects and is the sub-path of starting from and ending at . Intuitively, this memoryless property means that for any intermediate state on any canonical path towards the central state, the next move from this intermediate state towards the central state does not depend on the history. A memoryless canonical path ensemble has the property that in order to specify the canonical path connecting any state and the central state , we only need to specify which state to move to from any in ; i.e., we need a transition function that maps the current state to a next state . For simplicity, we define to make as the domain of . Clearly, each memoryless canonical path ensemble with respect to a central state corresponds to a transition function with , but the converse is not true. For example, if there exist two states and so that and , then is not the transition function corresponding to any memoryless canonical path ensemble. However, every valid transition function gives rise to a unique memoryless canonical path set consisting of paths connecting any to , with corresponding to the fixed point of . We call function a valid transition function if there exists a memoryless canonical path set for which is the corresponding transition function. The next lemma provides a suffcient condition for a function to be valid, which motivates our construction to follow. Recall that denotes the Hamming metric between a pair of binary strings.

Lemma 1.

If a function satisfies that for any state , the Hamming distance between and is strictly less than the Hamming distance between and , then is a valid transition function.

Proof.

Based on this function , we can construct the canonical path from any state to by defining as , where denotes the -fold self-composition of for any and . In order to show that the set is a memoryless canonical path set, we only need to verify two things:

  1. for any ,