Spike-and-Slab Group Lassos for Grouped Regression and Sparse Generalized Additive Models

03/05/2019 · Ray Bai, et al. · University of Florida · University of Pennsylvania

We introduce the spike-and-slab group lasso (SSGL) for Bayesian estimation and variable selection in linear regression with grouped variables. We further extend the SSGL to sparse generalized additive models (GAMs), thereby introducing the first nonparametric variant of the spike-and-slab lasso methodology. Our model simultaneously performs group selection and estimation, while our fully Bayes treatment of the mixture proportion allows for model complexity control and automatic self-adaptivity to different levels of sparsity. We develop theory to uniquely characterize the global posterior mode under the SSGL and introduce highly efficient coordinate ascent algorithms for maximum a posteriori (MAP) estimation. We further employ de-biasing methods to provide uncertainty quantification of our MAP estimates. Thus, implementation of our model avoids the computational intensiveness of Markov chain Monte Carlo (MCMC) in high dimensions. We derive posterior concentration rates for both grouped linear regression and sparse GAMs when the number of covariates grows at nearly exponential rate with sample size. Finally, we illustrate our methodology through extensive simulations and data analysis.


1 Introduction

1.1 Regression with Grouped Variables

Group structure of covariates arises in many statistical applications. For example, in multifactor analysis of variance, multi-level categorical predictors are each represented by a group of dummy variables. In genomics, genes within the same pathway may form a group at the pathway or gene set level and act in tandem to regulate a biological system. In each of these scenarios, the response $Y \in \mathbb{R}^n$ can be modeled as a linear regression problem with $G$ groups:

$$Y = \sum_{g=1}^{G} X_g \beta_g + \varepsilon, \qquad (1.1)$$

where $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$, $\beta_g$ is a coefficient vector of length $m_g$, and $X_g$ is an $n \times m_g$ covariate matrix corresponding to group $g$.

It is often of practical interest to select the groups of variables that are most significantly associated with the response. To facilitate this group-level selection, Yuan and Lin [40] introduced the group lasso, which solves the optimization problem,

$$\widehat{\beta} = \operatorname*{arg\,min}_{\beta} \; \frac{1}{2}\Big\| Y - \sum_{g=1}^{G} X_g \beta_g \Big\|_2^2 + \lambda \sum_{g=1}^{G} \|\beta_g\|_2, \qquad (1.2)$$

where $\|\cdot\|_2$ is the $\ell_2$ norm. In the frequentist literature, many variants of model (1.2) have been introduced, which use some combination of $\ell_1$ and $\ell_2$ penalties on the coefficients of interest (e.g., [10, 14, 28]).

Even in the absence of grouping information about the covariates, the model (1.1) subsumes a wide class of important nonparametric regression models called generalized additive models (GAMs). In GAMs, continuous covariates may be represented by groups of basis functions which have a nonlinear relationship with the response. We defer further discussion of this class of nonparametric models to Section 5. We note that for grouped linear regression, we use $G$ to denote the number of groups, while for both grouped regression and GAMs, we use $p$ to denote the total number of covariates.

In the Bayesian framework, selection of relevant groups under model (1.1) has often been carried out by placing spike-and-slab priors on each of the groups (e.g., [1, 16, 21, 38]). These priors typically take the form,

$$\pi(\beta_g \mid \theta) = (1-\theta)\,\delta_0(\beta_g) + \theta\,\pi_1(\beta_g), \qquad g = 1, \ldots, G, \qquad (1.3)$$

where $\theta \in (0,1)$ is the mixing proportion, $\delta_0$ represents a point mass at $0_{m_g}$ (the “spike”), and $\pi_1$ is an appropriate “slab” density (typically a multivariate normal distribution or a scale-mixture multivariate normal density). With a well-chosen prior on $\theta$, this model will adapt to the underlying sparsity level and favor parsimonious models in very high dimensions.

1.2 The Spike-and-Slab Lasso

For Bayesian variable selection, point mass spike-and-slab priors (1.3) are interpretable, but they are computationally intractable in high dimensions. As an alternative, fully continuous variants of spike-and-slab models have been developed. In the context of sparse normal means estimation and univariate linear regression, Ročková [24] and Ročková and George [26] introduced the spike-and-slab lasso (SSL). The SSL places a mixture prior of two Laplace densities on the individual coordinates $\beta_j$, i.e.

$$\pi(\beta \mid \theta) = \prod_{j=1}^{p} \left[ (1-\theta)\,\psi(\beta_j \mid \lambda_0) + \theta\,\psi(\beta_j \mid \lambda_1) \right], \qquad (1.4)$$

where $\psi(\beta \mid \lambda)$ represents a univariate Laplace density indexed by hyperparameter $\lambda$, i.e. $\psi(\beta \mid \lambda) = (\lambda/2)\, e^{-\lambda|\beta|}$. Typically, $\lambda_0 \gg \lambda_1$, so that the spike is heavily concentrated about zero and closely resembles the limiting ideal of a point mass at $0$. Under (1.4), the global posterior mode may be exactly sparse, while the “slab” stabilizes posterior estimates of the larger coefficients so they are not downward biased. Thus, the SSL posterior global mode can be used to perform variable selection and estimation simultaneously.
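To make the behavior of the prior (1.4) concrete, the short sketch below evaluates the two-component Laplace mixture and the conditional probability that a coordinate was drawn from the slab rather than the spike; the function names and the example values of $\lambda_0$, $\lambda_1$, and $\theta$ are our own illustrative choices, not quantities prescribed by the paper.

```python
import numpy as np

def laplace_density(beta, lam):
    # Univariate Laplace density psi(beta | lambda) = (lambda / 2) * exp(-lambda * |beta|).
    return 0.5 * lam * np.exp(-lam * np.abs(beta))

def slab_probability(beta, theta, lam0, lam1):
    # Conditional probability that beta was drawn from the slab rather than the spike.
    slab = theta * laplace_density(beta, lam1)
    spike = (1.0 - theta) * laplace_density(beta, lam0)
    return slab / (slab + spike)

# Illustrative hyperparameters: a heavy spike (lam0) and a light slab (lam1).
lam0, lam1, theta = 20.0, 1.0, 0.1
for b in [0.0, 0.1, 0.5, 2.0]:
    print(f"beta = {b:4.1f}, P(slab | beta) = {slab_probability(b, theta, lam0, lam1):.3f}")
```

Coordinates near zero are assigned almost entirely to the spike (and hence shrunk aggressively), while larger coordinates are assigned to the slab and left nearly unpenalized.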

Since its inception, the spike-and-slab lasso methodology has been adopted for a wide number of statistical problems. Apart from univariate linear regression, it has been used for factor analysis [25], multivariate regression [5], covariance/precision matrix estimation [5, 6], generalized linear models (GLMs) [33, 31], and Cox proportional hazards models [32].

Although the SSL induces sparsity on the individual coefficients, its potential for regression with grouped variables and nonparametric regression has largely been overlooked. To encourage a “grouping” effect for variable selection in GLMs, Tang et al. [31] introduced latent variables for group-specific probabilities along with the SSL on the individual coefficients. However, to our knowledge, no previous authors have developed an SSL model for the type of “all in, all out” group variable selection that is at the heart of the original group lasso model of Yuan and Lin [40]. The prior we introduce in this manuscript is also quite different from that of [31]. It is not clear if the prior formulation in Tang et al. [31] can be applied to nonparametric regression, whereas our prior can.

In this paper, we introduce the spike-and-slab group lasso (SSGL) for Bayesian grouped regression and variable selection. To widen the use of spike-and-slab lasso methodology for situations where the linear model is too inflexible, we extend the SSGL to sparse generalized additive models by introducing the nonparametric spike-and-slab lasso (NPSSL). To our knowledge, our work is the first to apply the spike-and-slab lasso methodology outside of a parametric setting. Our methodological, computational, and theoretical contributions can be summarized as follows:

  1. We propose a new group spike-and-slab prior for estimation and variable selection in both parametric and nonparametric settings. Unlike frequentist methods which rely on separable penalties, our model has a non-separable and self-adaptive penalty which allows us to automatically control the model complexity and adapt to ensemble information about sparsity.

  2. We introduce a highly efficient Bayesian EM algorithm for global posterior mode estimation. This allows us to rapidly identify significant groups of coefficients, while thresholding out insignificant ones.

  3. We show that de-biasing techniques that have been used for the original lasso [34] can be extended to our SSGL model to provide valid inference on the estimated regression coefficients.

  4. For both grouped regression and sparse additive models, we derive near-optimal posterior contraction rates for both the regression coefficients and the unknown variance under the SSGL prior.

The rest of the paper is structured as follows. In Section 2, we introduce the spike-and-slab group lasso (SSGL). In Section 3, we characterize the global posterior mode and introduce efficient algorithms for fast maximum a posteriori (MAP) estimation and selection. In Section 4, we utilize ideas from the de-biased lasso to perform inference on the SSGL model. In Section 5, we extend the SSGL to nonparametric settings by proposing the nonparametric spike-and-slab lasso (NPSSL). In Section 6, we present asymptotic theory for the SSGL and the NPSSL. Finally, in Sections 7 and 8, we provide extensive simulation studies and use our models to analyze real data sets.

1.3 Notation

We use the following notation for the rest of the paper. For two nonnegative sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \asymp b_n$ to denote $0 < \liminf_{n\to\infty} a_n/b_n \leq \limsup_{n\to\infty} a_n/b_n < \infty$. If $\lim_{n\to\infty} a_n/b_n = 0$, we write $a_n = o(b_n)$ or $a_n \prec b_n$. We use $a_n \lesssim b_n$ or $a_n = O(b_n)$ to denote that, for sufficiently large $n$, there exists a constant $C > 0$ independent of $n$ such that $a_n \leq C b_n$.

For a vector $v = (v_1, \ldots, v_p)^T \in \mathbb{R}^p$, we let $\|v\|_1 = \sum_{i=1}^{p}|v_i|$, $\|v\|_2 = (\sum_{i=1}^{p} v_i^2)^{1/2}$, and $\|v\|_\infty = \max_{1 \leq i \leq p}|v_i|$ denote the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms respectively. For a symmetric matrix $A$, we let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its minimum and maximum eigenvalues. For a set $S$, we denote its cardinality as $|S|$.

2 The Spike-and-Slab Group Lasso

Let $\beta_g$ denote a real-valued vector of length $m_g$. We define the group lasso density as

$$\Psi(\beta_g \mid \lambda) = C_g\, \lambda^{m_g} \exp\left(-\lambda \|\beta_g\|_2\right), \qquad (2.1)$$

where $C_g = \Gamma(m_g/2) / \left[ 2\, \pi^{m_g/2}\, \Gamma(m_g) \right]$ is the normalizing constant. This type of prior has been previously considered by [13] and [38] for Bayesian inference in the grouped regression model (1.1). Kyung et al. [13] considered a single prior (2.1) on each of the $\beta_g$'s, while Xu and Ghosh [38] employed (2.1) as the “slab” prior in the point-mass mixture (1.3). However, these authors did not provide an explicit expression for the normalizing constant $C_g$. These other authors also scaled the prior by the unknown standard deviation $\sigma$. In this paper, we will instead place an independent prior on the variance $\sigma^2$, in accordance with the recommendations of Moran et al. [20].

In this manuscript, we introduce a continuous spike-and-slab prior with the group lasso density (2.1) for both the “spike” and the “slab” components. The continuous nature of our prior is critical in allowing for efficient coordinate ascent algorithms for MAP estimation. Letting $\beta = (\beta_1^T, \ldots, \beta_G^T)^T$ under model (1.1), the spike-and-slab group lasso (SSGL) is defined as:

$$\pi(\beta \mid \theta) = \prod_{g=1}^{G} \left[ (1-\theta)\,\Psi(\beta_g \mid \lambda_0) + \theta\,\Psi(\beta_g \mid \lambda_1) \right], \qquad (2.2)$$

where $\Psi(\beta_g \mid \lambda)$ denotes the group lasso density (2.1) indexed by hyperparameter $\lambda$, and $\theta \in (0,1)$ is a mixing proportion. $\Psi(\beta_g \mid \lambda_0)$ corresponds to the spike which shrinks the entire vector $\beta_g$ towards $0_{m_g}$, while $\Psi(\beta_g \mid \lambda_1)$ corresponds to the slab. For shorthand notation, we denote $\Psi(\beta_g \mid \lambda_0)$ as $\Psi_0(\beta_g)$ and $\Psi(\beta_g \mid \lambda_1)$ as $\Psi_1(\beta_g)$ going forward.

Under the linear regression model with grouped variables (1.1), we place the SSGL prior (2.2) on $\beta$. To complete the prior specification, we place an independent Jeffreys prior on $\sigma^2$, i.e.

$$\pi(\sigma^2) \propto \frac{1}{\sigma^2}. \qquad (2.3)$$

The mixing proportion $\theta$ in (2.2) can either be fixed deterministically or endowed with a prior $\pi(\theta)$. We will discuss this in detail in Section 3.
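As a concrete illustration of the prior specification (2.1)–(2.3), the following sketch evaluates the log group lasso density and the log SSGL mixture for a single group. The normalizing constant follows from integrating $\exp(-\lambda\|\beta_g\|_2)$ over $\mathbb{R}^{m_g}$ in polar coordinates; the helper names are ours and the hyperparameter values are placeholders.

```python
import numpy as np
from scipy.special import gammaln

def log_group_lasso_density(beta_g, lam):
    # log Psi(beta_g | lambda) = log C_g + m_g * log(lambda) - lambda * ||beta_g||_2,
    # with C_g = Gamma(m_g / 2) / (2 * pi^{m_g / 2} * Gamma(m_g)).
    m = beta_g.size
    log_const = gammaln(m / 2.0) - np.log(2.0) - (m / 2.0) * np.log(np.pi) - gammaln(m)
    return log_const + m * np.log(lam) - lam * np.linalg.norm(beta_g)

def log_ssgl_prior(beta_g, theta, lam0, lam1):
    # log of the two-component mixture (1 - theta) * Psi(. | lam0) + theta * Psi(. | lam1).
    log_spike = np.log1p(-theta) + log_group_lasso_density(beta_g, lam0)
    log_slab = np.log(theta) + log_group_lasso_density(beta_g, lam1)
    return np.logaddexp(log_spike, log_slab)

beta_g = np.array([0.5, -0.2, 1.0])
print(log_ssgl_prior(beta_g, theta=0.1, lam0=20.0, lam1=1.0))
```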

3 Characterization and Computation of the Global Posterior Mode

Throughout this section, we let $p$ denote the total number of covariates, i.e. $p = \sum_{g=1}^{G} m_g$. Our goal is to find the maximum a posteriori estimates of the regression coefficients $\beta$. This optimization problem is equivalent to a penalized likelihood method in which the logarithm of the prior (2.2) may be reinterpreted as a penalty on the regression coefficients. Similarly to Ročková and George [26], we will leverage this connection between the Bayesian and frequentist paradigms and introduce the SSGL penalty. This strategy combines the adaptivity of the Bayesian approach with the computational efficiency of existing algorithms in the frequentist literature.

A key component of the SSGL model is $\theta$, the prior expected proportion of groups with large coefficients. Ultimately, we will pursue a fully Bayes approach and place a prior on $\theta$, allowing the SSGL to adapt to the underlying sparsity of the data and perform automatic multiplicity adjustment [27]. For ease of exposition, however, we will first consider the case where $\theta$ is fixed, echoing the development of Ročková and George [26]. In this situation, the regression coefficients are conditionally independent a priori, resulting in a separable SSGL penalty. Later we will consider the fully Bayes approach, which will yield the non-separable SSGL penalty.

Definition 1.

Given $\theta \in (0,1)$, the separable SSGL penalty is defined as

$$\mathrm{pen}_S(\beta \mid \theta) = \sum_{g=1}^{G} \left[ -\lambda_1 \|\beta_g\|_2 + \log\frac{p^*_\theta(0_{m_g})}{p^*_\theta(\beta_g)} \right], \qquad (3.1)$$

where

$$p^*_\theta(\beta_g) = \frac{\theta\,\Psi_1(\beta_g)}{\theta\,\Psi_1(\beta_g) + (1-\theta)\,\Psi_0(\beta_g)}. \qquad (3.2)$$

The separable SSGL penalty is almost the logarithm of the original prior (2.2); the only modification is an additive constant to ensure that $\mathrm{pen}_S(0_p \mid \theta) = 0$ [26]. The connection between the SSGL and penalized likelihood methods is made clearer when considering the derivative of the separable SSGL penalty, given in the following lemma.

Lemma 3.1.

The derivative of the separable SSGL penalty satisfies

$$\frac{\partial\, \mathrm{pen}_S(\beta \mid \theta)}{\partial \|\beta_g\|_2} = -\lambda^*_\theta(\beta_g), \qquad (3.3)$$

where

$$\lambda^*_\theta(\beta_g) = \lambda_1\, p^*_\theta(\beta_g) + \lambda_0\left[1 - p^*_\theta(\beta_g)\right]. \qquad (3.4)$$

Similarly to the SSL, the SSGL penalty is a weighted average of the two regularization parameters, $\lambda_0$ and $\lambda_1$. The weight $p^*_\theta(\beta_g)$ is the conditional probability that $\beta_g$ was drawn from the slab distribution rather than the spike. Hence, the SSGL features an adaptive regularization parameter which applies different amounts of shrinkage to each group, unlike the group lasso which applies the same shrinkage to each group.
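The adaptive weight and penalty in Lemma 3.1 are simple to compute because the normalizing constant $C_g$ cancels in the spike/slab ratio. The sketch below is our own illustration of this computation, not reference code from the paper.

```python
import numpy as np

def slab_weight(beta_g, theta, lam0, lam1):
    # p*_theta(beta_g): conditional probability that the group came from the slab.
    # The constant C_g is shared by spike and slab, so only the kernels are needed.
    m = beta_g.size
    norm_b = np.linalg.norm(beta_g)
    log_slab = np.log(theta) + m * np.log(lam1) - lam1 * norm_b
    log_spike = np.log1p(-theta) + m * np.log(lam0) - lam0 * norm_b
    return 1.0 / (1.0 + np.exp(log_spike - log_slab))

def lambda_star(beta_g, theta, lam0, lam1):
    # Adaptive regularization parameter: weighted average of lam1 (slab) and lam0 (spike).
    p_star = slab_weight(beta_g, theta, lam0, lam1)
    return lam1 * p_star + lam0 * (1.0 - p_star)

# A group far from zero is penalized like the slab; a group near zero like the spike.
print(lambda_star(np.array([2.0, -1.5]), theta=0.1, lam0=20.0, lam1=1.0))
print(lambda_star(np.array([0.01, 0.02]), theta=0.1, lam0=20.0, lam1=1.0))
```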

3.1 Optimization

Similarly to the group lasso [40], the separable nature of the penalty (3.1) lends itself naturally to a block coordinate ascent algorithm which cycles through the groups. In this section, we first outline the group updates resulting from the Karush-Kuhn-Tucker (KKT) conditions. We then derive a more refined condition for the global mode to aid in optimization for multimodal posteriors.

The KKT conditions provide a necessary condition for the global posterior mode. We assume that within each group, the covariates are orthonormal, i.e. $X_g^T X_g = n I_{m_g}$ for $g = 1, \ldots, G$. This assumption was also used by Yuan and Lin [40] in the original group lasso. If the orthonormality assumption does not hold, then the within-group covariates can be orthonormalized as a preprocessing step, as is often done in practice. For example, in most practical implementations of penalized nonparametric regression using basis expansions, the matrices of basis functions are typically orthonormalized in order to facilitate efficient group coordinate descent algorithms.
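A minimal preprocessing sketch, under the assumption that within-group orthonormality means $X_g^T X_g = n I_{m_g}$: each group's columns are replaced by a scaled QR factor, and the stored $R$ factors allow coefficients to be mapped back to the original scale. This is our illustration of the preprocessing step, not the paper's code.

```python
import numpy as np

def orthonormalize_groups(X, group_sizes):
    # Return X_tilde whose within-group Gram matrices satisfy X_g^T X_g = n * I,
    # together with the per-group R factors needed to map coefficients back.
    n = X.shape[0]
    X_tilde = np.array(X, dtype=float, copy=True)
    R_factors = []
    start = 0
    for m in group_sizes:
        Q, R = np.linalg.qr(X_tilde[:, start:start + m])   # X_g = Q R, with Q^T Q = I
        X_tilde[:, start:start + m] = np.sqrt(n) * Q
        R_factors.append(R)
        start += m
    return X_tilde, R_factors

def coefficients_on_original_scale(beta_tilde_g, R_g, n):
    # X_g beta_g = (sqrt(n) Q)(R beta_g / sqrt(n)), so beta_g = R^{-1} (sqrt(n) beta_tilde_g).
    return np.linalg.solve(R_g, np.sqrt(n) * beta_tilde_g)
```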

Proposition 3.1.

The necessary conditions for $\widehat\beta$ to be a global mode are:

$$X_g^T\big(Y - X\widehat\beta\big) = \sigma^2\,\lambda^*_\theta(\widehat\beta_g)\,\frac{\widehat\beta_g}{\|\widehat\beta_g\|_2} \quad \text{whenever } \widehat\beta_g \neq 0_{m_g}, \qquad (3.5)$$
$$\big\|X_g^T\big(Y - X\widehat\beta\big)\big\|_2 \leq \sigma^2\,\lambda^*_\theta(0_{m_g}) \quad \text{whenever } \widehat\beta_g = 0_{m_g}. \qquad (3.6)$$

Equivalently,

$$\widehat\beta_g = \frac{1}{n}\left( \|z_g\|_2 - \sigma^2\,\lambda^*_\theta(\widehat\beta_g) \right)_+ \frac{z_g}{\|z_g\|_2}, \qquad (3.7)$$

where $z_g = X_g^T\big( Y - \sum_{k \neq g} X_k \widehat\beta_k \big)$.

Proof.

Follows immediately from Lemma 3.1 and subdifferential calculus. ∎
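For intuition, here is a sketch of the group-wise update implied by (3.7), assuming $X_g^T X_g = n I_{m_g}$ and reusing the `lambda_star` helper from the previous sketch; it is our illustration of the characterization rather than the authors' implementation.

```python
import numpy as np

def group_score(X, y, beta, sl):
    # z_g = X_g^T (y - sum_{k != g} X_k beta_k): score based on the partial residual.
    partial_residual = y - X @ beta + X[:, sl] @ beta[sl]
    return X[:, sl].T @ partial_residual

def group_soft_threshold(z_g, lam_star, sigma2, n):
    # beta_g = (1/n) * (||z_g||_2 - sigma^2 * lambda*)_+ * z_g / ||z_g||_2, as in (3.7).
    norm_z = np.linalg.norm(z_g)
    if norm_z == 0.0:
        return np.zeros_like(z_g)
    return (max(norm_z - sigma2 * lam_star, 0.0) / n) * (z_g / norm_z)
```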

The above characterization of the global mode is necessary, but not sufficient. A more refined characterization may be obtained by considering the group-wise optimization problem, noting that the global mode is also a maximizer of the objective with respect to the $g$th group, keeping all other groups fixed.

Proposition 3.2.

The global mode satisfies $\widehat\beta_g = 0_{m_g}$ if and only if $\|z_g\|_2 \leq \Delta_g$, where

(3.8)
Proof.

See Appendix B. ∎

Unfortunately, the threshold $\Delta_g$ is difficult to compute. We instead find an approximation to this threshold. An upper bound is simply the threshold of the soft-threshold solution (3.7) with $\widehat\beta_g = 0_{m_g}$, namely $\sigma^2\lambda^*_\theta(0_{m_g})$. However, when $\lambda_0$ is large, this bound may be improved. Similarly to Ročková and George [26], we provide improved bounds on the threshold in Theorem 3.1. This result requires an auxiliary function $g$, defined analogously to the one in [26].

Theorem 3.1.

When and , the threshold is bounded by:

(3.9)

where

(3.10)
(3.11)

and

(3.12)

When $\lambda_0$ is large, the lower bound on the threshold approaches the upper bound, yielding a simple approximation to $\Delta_g$. We will ultimately use this approximation in our block coordinate ascent algorithm.

As discussed earlier, a key reason for adopting a Bayesian strategy is that it allows the procedure to adapt to the underlying sparsity in the data. This adaptivity is achieved by placing a prior on $\theta$, the proportion of groups with non-zero coefficients. We now outline this fully Bayes strategy and the resulting non-separable SSGL penalty.

3.2 The Non-Separable SSGL penalty

We now introduce the non-separable SSGL penalty. With the inclusion of the prior $\pi(\theta)$, the marginal prior for the regression coefficients has the following form:

$$\pi(\beta) = \int_0^1 \pi(\beta \mid \theta)\,\pi(\theta)\, d\theta \qquad (3.13)$$
$$= \int_0^1 \prod_{g=1}^{G} \left[ (1-\theta)\,\Psi_0(\beta_g) + \theta\,\Psi_1(\beta_g) \right] \pi(\theta)\, d\theta. \qquad (3.14)$$

The non-separable SSGL penalty is then defined similarly to the separable penalty, where again we have centered the penalty to ensure that it equals zero at $\beta = 0_p$.

Definition 2.

The non-separable SSGL (NS-SSGL) penalty with $\theta \sim \pi(\theta)$ is defined as

$$\mathrm{pen}_{NS}(\beta) = \log\left[ \frac{\pi(\beta)}{\pi(0_p)} \right]. \qquad (3.15)$$

Although the penalty (3.14) appears intractable, intuition is again obtained by considering the derivative. Following the same line of argument as Ročková and George [26], the derivative of (3.14) is given in the following lemma.

Lemma 3.2.
(3.16)

where

(3.17)

and

(3.18)

That is, the marginal prior from (3.14) is rendered tractable by considering each group of regression coefficients separately, conditional on the remaining coefficients. Such a conditional strategy is motivated by the group-wise updates for the separable penalty considered in the previous section. Thus, our optimization strategy for the non-separable penalty will be very similar to the separable case, except instead of a fixed value for $\theta$, we will impute the mean of $\theta$ conditioned on the remaining regression coefficients.

We now consider the form of the conditional mean, $E[\theta \mid \beta_{\setminus g}]$. As noted by Ročková and George [26], when the number of groups is large, this conditional mean can be replaced by $E[\theta \mid \beta]$; we will proceed with the same approximation. For the prior on $\theta$, we will use the standard beta prior $\theta \sim \mathcal{B}(a, b)$. With the choices $a = 1$ and $b = G$ for these hyperparameters, this prior results in automatic multiplicity adjustment for the regression coefficients [27].

We now examine the conditional distribution $\pi(\theta \mid \beta)$. Suppose that the number of groups with non-zero coefficients is $q$, and assume without loss of generality that the first $q$ groups have non-zero coefficients. Then,

(3.19)

with and . Similarly to Ročková and George [26], this distribution is a generalization of the Gauss hypergeometric distribution. Consequently, the expectation $E[\theta \mid \beta]$ may be written as

While the above expression (3.20) appears laborious to compute, it admits a much simpler form when $\lambda_0$ is very large. Using a slight modification of the arguments of [26], we obtain this simpler form in Lemma 3.3.

Lemma 3.3.

Assume $\theta$ is distributed according to (3.19). Let $q$ be the number of groups with non-zero coefficients. Then, as $\lambda_0 \to \infty$,

$$E[\theta \mid \beta] \to \frac{a + q}{a + b + G}. \qquad (3.21)$$

Proof.

See Appendix B. ∎

We note that the expression (3.21) is essentially the usual posterior mean of $\theta$ under a beta prior. Intuitively, as $\lambda_0$ diverges, the weights $p^*_\theta(\beta_g)$ concentrate at zero and one, yielding the familiar form for $E[\theta \mid \beta]$. With this in hand, we are now in a position to outline the block coordinate ascent algorithm for the non-separable SSGL.
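Under the approximation of Lemma 3.3, updating the mixing proportion amounts to counting the currently non-zero groups. The sketch below assumes a $\mathcal{B}(a, b)$ prior; the default hyperparameter values are placeholders.

```python
import numpy as np

def update_theta(beta, group_slices, a=1.0, b=1.0):
    # Approximate E[theta | beta] by (a + q) / (a + b + G), where q counts the
    # groups whose coefficient blocks are currently non-zero (Lemma 3.3).
    G = len(group_slices)
    q = sum(np.any(beta[sl] != 0.0) for sl in group_slices)
    return (a + q) / (a + b + G)
```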

3.3 Optimization

The KKT conditions for the non-separable SSGL penalty yield the following necessary condition for the global mode:

$$\widehat\beta_g = \frac{1}{n}\left( \|z_g\|_2 - \sigma^2\,\lambda^*_{\widehat\theta}(\widehat\beta_g) \right)_+ \frac{z_g}{\|z_g\|_2}, \qquad (3.22)$$

where $z_g = X_g^T\big( Y - \sum_{k \neq g} X_k \widehat\beta_k \big)$ and $\widehat\theta$ is the mean (3.21), conditioned on the previous value of $\beta$.

As before, (3.22) is sufficient for a local mode, but not the global mode. When $p > n$ and $\lambda_0$ is large, the posterior will be highly multimodal. As in the separable case, we require a refined thresholding scheme that will eliminate some of these suboptimal local modes from consideration. In approximating the group-wise conditional mean $E[\theta \mid \beta_{\setminus g}]$ with $E[\theta \mid \beta]$, we do not require group-specific thresholds. Instead, we can use the threshold given in Proposition 3.2 and Theorem 3.1, where $\theta$ is replaced with the current update (3.21). In particular, we shall use the upper bound on the threshold in our block coordinate ascent algorithm.

Similarly to Ročková and George [26], we combine the refined threshold with the soft-thresholding operation (3.22) to yield the following update for $\beta_g$ at iteration $k+1$:

$$\beta_g^{(k+1)} = \frac{1}{n}\left( \|z_g\|_2 - \sigma^2\,\lambda^*_{\widehat\theta}\big(\beta_g^{(k)}\big) \right)_+ \frac{z_g}{\|z_g\|_2}\;\mathbb{1}\big( \|z_g\|_2 > \Delta_g \big), \qquad (3.23)$$

where $z_g = X_g^T\big( Y - \sum_{k' \neq g} X_{k'}\beta_{k'}^{(k)} \big)$. Technically, $\widehat\theta$ should be updated after each group is updated; in practice, however, there will be little change after one group is updated, and so our strategy will be to update both $\widehat\theta$ and $\sigma^2$ only after every $M$ iterations, for a small default value of $M$.

With the Jeffreys prior $\pi(\sigma^2) \propto \sigma^{-2}$, the error variance $\sigma^2$ also has a closed form update:

$$\sigma^{2\,(k+1)} = \frac{\big\| Y - \sum_{g=1}^{G} X_g \beta_g^{(k+1)} \big\|_2^2}{n + 2}. \qquad (3.24)$$

3.4 Implementation Strategies

When the spike parameter $\lambda_0$ is very large, the continuous spike density approximates the point-mass spike. Consequently, we face a similar computational challenge of navigating a highly multimodal posterior. To ameliorate this problem for the spike-and-slab lasso, Ročková and George [26] recommend a “dynamic posterior exploration” strategy in which the slab parameter $\lambda_1$ is held fixed and $\lambda_0$ is gradually increased along a grid of values. Using the solution from a previous $\lambda_0$ as a “warm start” allows the procedure to more easily find optimal modes. In particular, when $\lambda_0 = \lambda_1$, the posterior is convex. Meanwhile, keeping the slab parameter fixed serves to stabilize the large coefficients without applying any shrinkage.

Moran et al. [20] modify this strategy for the case of unknown $\sigma^2$. This is because the posterior is always non-convex when $\sigma^2$ is unknown. In particular, when $p > n$ and $\lambda_0$ is small, the model can become saturated, causing the residual variance to go to zero. To avoid this suboptimal mode at $\sigma^2 = 0$, Moran et al. [20] recommend fixing $\sigma^2$ until the value of $\lambda_0$ at which the algorithm starts to converge in less than 100 iterations. Then, $\beta$ and $\sigma^2$ are simultaneously updated for the next largest $\lambda_0$ in the sequence. The intuition behind this strategy is that we first find a solution to the convex problem (in which $\sigma^2$ is fixed) and then use this solution as a warm start for the non-convex problem (in which $\sigma^2$ can vary).

We pursue a similar “dynamic posterior exploration” strategy, with the modification for the unknown variance case, for the SSGL. A key aspect of this algorithm is how to choose the maximum value of $\lambda_0$. Ročková and George [26] recommend this maximum to be the value at which the estimated coefficients stabilize. An alternative approach is to choose the maximum $\lambda_0$ using cross-validation, a strategy which is made computationally feasible by the speed of our block coordinate ascent algorithm. In our experience, the dynamic posterior exploration strategy favors more parsimonious models than cross-validation. In the simulation studies in Section 7, we utilize cross-validation to choose $\lambda_0$, as there, our primary goal is predictive accuracy rather than parsimony.
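The following schematic shows one way to organize the dynamic posterior exploration described above. It assumes a hypothetical `fit_ssgl` routine that runs the block coordinate ascent of Section 3.3 for a single $\lambda_0$ and reports its iteration count; the grid, the warm starts, and the rule for releasing $\sigma^2$ follow the text, while the function and variable names are ours.

```python
import numpy as np

def dynamic_posterior_exploration(fit_ssgl, X, y, lam1, lam0_grid, fast_converge_iters=100):
    # Warm-started ladder over an increasing grid of spike parameters lambda_0.
    # sigma^2 is held fixed until a lambda_0 at which the fit converges quickly,
    # after which beta and sigma^2 are updated jointly (Section 3.4).
    n, p = X.shape
    beta = np.zeros(p)
    sigma2 = np.var(y)          # simple fixed starting value for the error variance
    update_sigma = False
    path = []
    for lam0 in sorted(lam0_grid):
        beta, sigma2, n_iters = fit_ssgl(X, y, lam0=lam0, lam1=lam1,
                                         beta_init=beta, sigma2_init=sigma2,
                                         update_sigma=update_sigma)
        if not update_sigma and n_iters < fast_converge_iters:
            update_sigma = True  # release sigma^2 for the remaining lambda_0 values
        path.append((lam0, beta.copy(), sigma2))
    return path
```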

The entire block coordinate ascent algorithm is given in Algorithm 1.

Input: grid of increasing values $\lambda_0^{(1)} < \cdots < \lambda_0^{(L)}$ (with $\lambda_1$ fixed), update frequency $M$

For $l = 1, \ldots, L$:

  1. Set iteration counter $k = 0$

  2. Initialize: $\beta^{(0)}$, $\widehat\theta^{(0)}$, $\sigma^{2\,(0)}$, diff $= \infty$ (warm start from the solution at $\lambda_0^{(l-1)}$)

  3. While diff $> \varepsilon$:

    1. Increment $k \leftarrow k + 1$

    2. For $g = 1, \ldots, G$:

      1. Update $z_g = X_g^T\big(Y - \sum_{k' \neq g} X_{k'}\beta_{k'}^{(k)}\big)$

      2. Update $\beta_g^{(k)}$ via the thresholding rule (3.23)

      3. If $k \bmod M = 0$:

        1. Update $\widehat\theta$ via (3.21)

        2. If $\sigma^2$ is no longer held fixed (Section 3.4):

        3. Update $\sigma^{2\,(k)}$ via (3.24)

      4. diff $= \|\beta^{(k)} - \beta^{(k-1)}\|_2$

Algorithm 1 Spike-and-Slab Group Lasso
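For concreteness, here is a compact, self-contained sketch of the block coordinate ascent for a single $(\lambda_0, \lambda_1)$ pair. It assumes $X_g^T X_g = n I_{m_g}$, evaluates $\lambda^*_\theta$ at the current iterate, and uses plain soft-thresholding in place of the refined threshold of Theorem 3.1; all names, defaults, and simplifications are ours, so treat it as an illustration of Algorithm 1 rather than the authors' released implementation.

```python
import numpy as np

def ssgl_map(X, y, group_slices, lam0, lam1, a=1.0, b=1.0,
             update_sigma=True, update_freq=10, tol=1e-6, max_iter=500):
    # Block coordinate ascent for the SSGL MAP estimate at one (lam0, lam1) pair.
    n, p = X.shape
    G = len(group_slices)
    beta = np.zeros(p)
    sigma2 = np.var(y)
    theta = a / (a + b)
    residual = y.copy()                        # equals y - X @ beta since beta starts at 0
    for it in range(1, max_iter + 1):
        beta_old = beta.copy()
        for sl in group_slices:
            m = sl.stop - sl.start
            residual += X[:, sl] @ beta[sl]    # remove group g's current contribution
            z = X[:, sl].T @ residual          # group-wise score z_g
            # adaptive penalty lambda*_theta evaluated at the current beta_g
            norm_b = np.linalg.norm(beta[sl])
            log_slab = np.log(theta) + m * np.log(lam1) - lam1 * norm_b
            log_spike = np.log1p(-theta) + m * np.log(lam0) - lam0 * norm_b
            p_star = 1.0 / (1.0 + np.exp(log_spike - log_slab))
            lam_star = lam1 * p_star + lam0 * (1.0 - p_star)
            # group soft-thresholding update
            norm_z = np.linalg.norm(z)
            shrink = max(norm_z - sigma2 * lam_star, 0.0)
            beta[sl] = (shrink / n) * (z / norm_z) if norm_z > 0 else 0.0
            residual -= X[:, sl] @ beta[sl]    # restore residual with the updated group
        if it % update_freq == 0:
            q = sum(np.any(beta[sl] != 0.0) for sl in group_slices)
            theta = (a + q) / (a + b + G)                  # Lemma 3.3 approximation
            if update_sigma:
                sigma2 = np.sum(residual ** 2) / (n + 2)   # mode under the Jeffreys prior
        if np.linalg.norm(beta - beta_old) < tol:
            break
    return beta, sigma2, theta, it
```

In practice this routine would be called along the increasing $\lambda_0$ grid of Section 3.4, warm-starting each fit at the previous solution.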

4 Approaches to Inference

While the above procedure allows us to find the posterior mode of $\beta$, providing inference or a measure of uncertainty around our estimate is a challenging task. One possible way forward is to run MCMC where the algorithm is initialized at the posterior mode estimate. By starting the MCMC chain at the posterior mode, it should converge faster. However, this is still not ideal, as it can be computationally burdensome in high dimensions. Instead, we will adopt ideas from a recent line of research [35, 11] based on de-biasing estimates from high-dimensional regression models to assess uncertainty quantification. These ideas were derived in the context of lasso regression, and we will explore the extent to which they work for our spike-and-slab group lasso penalty. Define $\widehat\Sigma = X^T X / n$ and let $\widehat\Theta$ be an approximate inverse of $\widehat\Sigma$. Adopting notation from [35], we define

$$\widehat{b} = \widehat\beta + \frac{1}{n}\,\widehat\Theta X^T\big( Y - X\widehat\beta \big). \qquad (4.1)$$

In [35], it was shown that this quantity has the following asymptotic distribution:

$$\sqrt{n}\big( \widehat{b} - \beta_0 \big) = W + \Delta, \qquad W \mid X \sim \mathcal{N}_p\big( 0,\, \sigma^2\,\widehat\Theta\widehat\Sigma\widehat\Theta^T \big), \qquad \|\Delta\|_\infty = o_P(1). \qquad (4.2)$$

This assumes that $\sigma$ is known. However, our model will provide consistent estimates of $\sigma$, and thus, we can replace the population quantity with the estimate of it from our model. The success of this approach relies heavily on the estimate $\widehat\Theta$, so we will utilize the nodewise regression approach described in [17, 35]. For each $j = 1, \ldots, p$, let $X_j$ denote the $j$th column of $X$ and $X_{-j}$ denote the submatrix of $X$ with the $j$th column removed. Define $\widehat\gamma_j$ as

$$\widehat\gamma_j = \operatorname*{arg\,min}_{\gamma \in \mathbb{R}^{p-1}} \left\{ \frac{\|X_j - X_{-j}\gamma\|_2^2}{n} + 2\lambda_j \|\gamma\|_1 \right\}.$$

Now we can define the components of $\widehat{C}$ as $\widehat{C}_{jj} = 1$ for $j = 1, \ldots, p$ and $\widehat{C}_{jk} = -\widehat\gamma_{j,k}$ for $k \neq j$, and create the following matrix:

$$\widehat{C} = \begin{pmatrix} 1 & -\widehat\gamma_{1,2} & \cdots & -\widehat\gamma_{1,p} \\ -\widehat\gamma_{2,1} & 1 & \cdots & -\widehat\gamma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\widehat\gamma_{p,1} & -\widehat\gamma_{p,2} & \cdots & 1 \end{pmatrix}.$$

Lastly, let $\widehat\Theta = \widehat{T}^{-2}\widehat{C}$, where

$$\widehat{T}^2 = \mathrm{diag}\big( \widehat\tau_1^2, \ldots, \widehat\tau_p^2 \big), \qquad \widehat\tau_j^2 = \frac{\|X_j - X_{-j}\widehat\gamma_j\|_2^2}{n} + \lambda_j \|\widehat\gamma_j\|_1.$$

We can proceed with $\widehat\Theta = \widehat{T}^{-2}\widehat{C}$. This choice is used because it puts an upper bound on $\|\widehat\Sigma\widehat\Theta_j^T - e_j\|_\infty$. Other regression models such as the original spike-and-slab lasso [26] could be used instead of the lasso [34] regressions for each covariate. However, we will proceed with this choice, as it has already been studied theoretically and shown to have the required properties to be able to perform inference for $\beta$.

Let $\widehat{b}_j$ denote the $j$th coordinate of $\widehat{b}$. Using the modal estimate for $\beta$ under the SSGL prior, $\widehat\beta$, and the estimate $\widehat\Theta$ obtained from the above procedure, we have from (4.2) that the asymptotic pointwise $100(1-\alpha)\%$ confidence intervals for $\beta_{0j}$, $j = 1, \ldots, p$, are

$$\left[ \widehat{b}_j - c(\alpha, n, \sigma),\; \widehat{b}_j + c(\alpha, n, \sigma) \right], \qquad (4.3)$$

where $c(\alpha, n, \sigma) := \Phi^{-1}(1 - \alpha/2)\,\sqrt{\sigma^2\big(\widehat\Theta\widehat\Sigma\widehat\Theta^T\big)_{jj}/n}$ and $\Phi$ denotes the cumulative distribution function of the standard normal distribution.
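A sketch of the de-biasing construction: the nodewise regressions are fit with scikit-learn's `Lasso` purely for illustration (its penalty scaling differs from the convention in [35]), and the function names are ours.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def nodewise_theta(X, lam):
    # Approximate inverse of Sigma_hat = X^T X / n via nodewise lasso regressions.
    n, p = X.shape
    C = np.eye(p)
    tau2 = np.zeros(p)
    for j in range(p):
        X_minus_j = np.delete(X, j, axis=1)
        gamma = Lasso(alpha=lam, fit_intercept=False).fit(X_minus_j, X[:, j]).coef_
        C[j, np.arange(p) != j] = -gamma
        resid = X[:, j] - X_minus_j @ gamma
        tau2[j] = resid @ resid / n + lam * np.abs(gamma).sum()
    return C / tau2[:, None]                   # Theta_hat = T_hat^{-2} C_hat

def debiased_intervals(X, y, beta_hat, sigma_hat, theta_hat, alpha=0.05):
    # De-biased estimate (4.1) and pointwise confidence intervals (4.3).
    n = X.shape[0]
    sigma_mat = X.T @ X / n
    b = beta_hat + theta_hat @ X.T @ (y - X @ beta_hat) / n
    avar = sigma_hat ** 2 * np.diag(theta_hat @ sigma_mat @ theta_hat.T) / n
    half_width = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(avar)
    return b, np.column_stack([b - half_width, b + half_width])
```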

To assess the ability of this procedure to obtain accurate confidence intervals (4.3), we run a small simulation study with $G = 100$ groups, each group containing two covariates ($p = 200$). We generate the 100 original covariates from a multivariate normal distribution with mean $0$ and an AR(1) covariance structure with correlation $\rho$. The two covariates in each group are the linear and squared term of the corresponding original covariate. We set the first seven elements of $\beta_0$ to a common nonzero value and the remaining elements equal to zero. Lastly, we try $\rho = 0$ and $\rho = 0.7$. Table 1 shows the coverage probabilities across 1000 simulations for both scenarios. We see that important covariates, i.e. covariates with a nonzero corresponding $\beta_{0j}$, have coverage close to the nominal level of 95%, while the remaining covariates achieve the nominal level.

ρ      Important covariates    Null covariates
0.0            0.93                 0.95
0.7            0.90                 0.95

Table 1: Coverage probabilities for de-biasing simulation.
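For reference, here is one way the simulation design described above could be generated: 100 base covariates from an AR(1) Gaussian, each expanded into a two-column group containing the linear and squared term. The sample size and signal magnitude below are placeholders, since those values are not recoverable from the extracted text.

```python
import numpy as np

def ar1_covariance(p, rho):
    # AR(1) covariance matrix with (i, j) entry rho^{|i - j|}.
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def make_grouped_design(n, p_base=100, rho=0.7, seed=0):
    # Draw base covariates from N(0, AR1(rho)) and build groups of (x, x^2).
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(p_base), ar1_covariance(p_base, rho), size=n)
    X = np.empty((n, 2 * p_base))
    X[:, 0::2] = Z        # linear terms in even columns
    X[:, 1::2] = Z ** 2   # squared terms in odd columns
    return X

n, signal = 300, 1.0      # placeholder sample size and signal strength
X = make_grouped_design(n, rho=0.7, seed=1)
beta0 = np.zeros(X.shape[1])
beta0[:7] = signal        # first seven coefficients are non-zero, as in the text
y = X @ beta0 + np.random.default_rng(2).normal(size=n)
```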

5 Nonparametric Spike-and-Slab Lasso

We now introduce the nonparametric spike-and-slab lasso (NPSSL). The NPSSL allows for flexible modeling of a response surface with minimal assumptions regarding its functional form. We consider two cases for the NPSSL: (i) a main effects only model, and (ii) a model with both main and interaction effects.

5.1 Main Effects

We first consider the main effects NPSSL model. Here, we assume that the response surface may be decomposed into the sum of univariate functions of each of the covariates. That is, we have the following model:

$$y_i = \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad i = 1, \ldots, n. \qquad (5.1)$$

Following Ravikumar et al. [23], we assume that each $f_j$, $j = 1, \ldots, p$, may be approximated by a linear combination of $d$ basis functions $\{g_{j1}, \ldots, g_{jd}\}$, i.e.,

$$f_j(x_{ij}) \approx \sum_{k=1}^{d} g_{jk}(x_{ij})\,\widetilde\beta_{jk}, \qquad (5.2)$$

where the $\widetilde\beta_{jk}$ are the unknown weights. We let $\widetilde{X}_j$ denote the $n \times d$ matrix with $(i,k)$th entry $g_{jk}(x_{ij})$. Then, (5.1) may be represented in matrix form as

$$Y = \sum_{j=1}^{p} \widetilde{X}_j \widetilde\beta_j + \delta + \varepsilon, \qquad (5.3)$$

where $\delta$ is an $n \times 1$ vector of the lower-order truncation bias. Note that we assume the response has been centered and so we do not include a grand mean in (5.3). As we choose to center the response, we do not require the main effects to integrate to zero as in Wei et al. [37]. We do, however, require the matrices $\widetilde{X}_j$, $j = 1, \ldots, p$, to be orthogonal, as discussed in Section 3.

We assume that $Y$ depends on only a small number of the covariates, so that many of the $f_j$'s have a negligible contribution to (5.1). This is equivalent to assuming that most of the weight vectors $\widetilde\beta_j$ have all zero elements. If the $j$th covariate is determined to be predictive of $Y$, then $f_j$ has a non-negligible contribution to (5.1). In this case, we want to include the entire basis function approximation to $f_j$ in the model (5.1).

The above situation is a natural fit for the SSGL. We have $p$ groups where each group is either included as a whole or not included in the model. The design matrices for each group are exactly the matrices of basis functions, $\widetilde{X}_j$. We will utilize the non-separable SSGL penalty developed in Section 3.2 to enforce this group-sparsity behavior in the model (5.3). More specifically, we seek to maximize the following objective function with respect to $\widetilde\beta = (\widetilde\beta_1^T, \ldots, \widetilde\beta_p^T)^T$ and $\sigma^2$:

$$\mathcal{L}\big(\widetilde\beta, \sigma^2\big) = -\frac{1}{2\sigma^2}\Big\| Y - \sum_{j=1}^{p} \widetilde{X}_j \widetilde\beta_j \Big\|_2^2 - \Big( \frac{n}{2} + 1 \Big)\log\sigma^2 + \mathrm{pen}_{NS}\big(\widetilde\beta\big). \qquad (5.4)$$

To find the estimators of $\widetilde\beta$ and $\sigma^2$, we use Algorithm 1. Similar additive models for nonparametric regression have been proposed by a number of authors, including Ravikumar et al. [23] and Wei et al. [37]. However, our proposed NPSSL method has a number of advantages. Firstly, we allow the noise variance $\sigma^2$ to be unknown, unlike Ravikumar et al. [23]. Accurate estimates of $\sigma^2$ are important to avoid overfitting, which is especially a concern for overparameterized models. Secondly, we use a block coordinate ascent algorithm to quickly target the modes of the posterior, whereas Wei et al. [37] utilize MCMC. Thirdly, our SSGL algorithm automatically thresholds negligible groups to zero, negating the need for a post-processing thresholding step.
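To illustrate how a sparse GAM maps onto the grouped design of Section 2, the sketch below expands each covariate into a small polynomial basis, centers each block, and orthonormalizes it so that $\widetilde{X}_j^T \widetilde{X}_j = n I$. The basis family and its dimension are our illustrative choices; a spline basis could be substituted without changing the structure.

```python
import numpy as np

def polynomial_basis(x, degree):
    # n x degree matrix with columns x, x^2, ..., x^degree (no intercept column).
    return np.vander(x, N=degree + 1, increasing=True)[:, 1:]

def build_npssl_design(X_raw, degree=3):
    # Expand each covariate into a centered, orthonormalized basis block and stack
    # the blocks; returns the design and the column slices defining the groups.
    n, p = X_raw.shape
    blocks, group_slices, start = [], [], 0
    for j in range(p):
        B = polynomial_basis(X_raw[:, j], degree)
        B = B - B.mean(axis=0)            # centering removes the need for a grand mean
        Q, _ = np.linalg.qr(B)
        blocks.append(np.sqrt(n) * Q)     # within-block Gram matrix equals n * I
        group_slices.append(slice(start, start + degree))
        start += degree
    return np.hstack(blocks), group_slices

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 5))
X_design, groups = build_npssl_design(X_raw)   # pass to the SSGL sketch with centered y
```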

5.2 Main and Interaction Effects

The main effects model (5.1) allows each covariate to have a nonlinear contribution to the model, but assumes that the covariates act additively, with no interactions between them. In some applications, this assumption may be too restrictive. For example, in the environmental exposures data which we analyze in Section 8.2, we may expect that high levels of two toxins have an even more adverse effect on a person's health than a high level of either toxin alone. Such an effect may be modeled by including interaction effects between the covariates.

In this section, we extend the NPSSL to include interaction effects, as well as main effects. We consider only second-order interactions between the covariates as higher order interactions result in both decreased interpretability and increased computational load. We assume that the interaction effects may be decomposed into the sum of bivariate functions of each pair of covariates, yielding the model:

$$y_i = \sum_{j=1}^{p} f_j(x_{ij}) + \sum_{j < k} f_{jk}(x_{ij}, x_{ik}) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \qquad (5.5)$$

For the interaction terms, we follow Wei et al. [37] and approximate $f_{jk}$ using the outer product of the basis functions of the interacting covariates:

$$f_{jk}(x_{ij}, x_{ik}) \approx \sum_{l=1}^{d}\sum_{m=1}^{d} g_{jl}(x_{ij})\, g_{km}(x_{ik})\,\widetilde\beta_{jk,lm}, \qquad (5.6)$$

where $\widetilde\beta_{jk} = (\widetilde\beta_{jk,11}, \ldots, \widetilde\beta_{jk,dd})^T$ is the $d^2 \times 1$ vector of unknown weights. We let $\widetilde{X}_{jk}$ denote the $n \times d^2$ matrix with rows

$$\widetilde{x}_{jk,i} = \widetilde{x}_{j,i} \otimes \widetilde{x}_{k,i}, \qquad i = 1, \ldots, n,$$

where $\widetilde{x}_{j,i}$ denotes the $i$th row of $\widetilde{X}_j$. Then, (5.5) may be represented in matrix form as

$$Y = \sum_{j=1}^{p} \widetilde{X}_j \widetilde\beta_j + \sum_{j < k} \widetilde{X}_{jk} \widetilde\beta_{jk} + \delta + \varepsilon, \qquad (5.7)$$

where $\delta$ is again a vector of the lower-order truncation bias. We again assume $Y$ has been centered and so do not include a grand mean in (5.7). We do not constrain $f_{jk}$ to integrate to zero as in Wei et al. [37]. However, we do ensure that the main effects are not in the linear span of the interaction functions. That is, we require the “main effect” matrices $\widetilde{X}_j$ and $\widetilde{X}_k$ to be orthogonal to the “interaction” matrix $\widetilde{X}_{jk}$. This condition is needed to maintain identifiability between the main and interaction effects.

In this interaction model, we either include $\widetilde{X}_{jk}\widetilde\beta_{jk}$ in the model (5.7) if there is a non-negligible interaction between the $j$th and $k$th covariates, or we estimate $\widetilde\beta_{jk} = 0_{d^2}$ if such an interaction is negligible. With the non-separable SSGL penalty, the objective function is:

$$\mathcal{L}\big(\widetilde\beta, \sigma^2\big) = -\frac{1}{2\sigma^2}\Big\| Y - \sum_{j=1}^{p} \widetilde{X}_j \widetilde\beta_j - \sum_{j<k} \widetilde{X}_{jk} \widetilde\beta_{jk} \Big\|_2^2 - \Big( \frac{n}{2} + 1 \Big)\log\sigma^2 + \mathrm{pen}_{NS}\big(\widetilde\beta\big), \qquad (5.8)$$

where $\widetilde\beta$ now collects both the main effect and interaction weight vectors. We can again use Algorithm 1 to find the modal estimates of $\widetilde\beta$ and $\sigma^2$.
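Continuing the previous sketch, a second-order interaction block can be formed from row-wise outer (Kronecker) products of the two main-effect basis blocks and then residualized against the main-effect columns; this projection is our way of enforcing the orthogonality requirement stated above.

```python
import numpy as np

def interaction_block(B_j, B_k, main_design):
    # Row-wise outer products of two basis blocks, projected off the main effects.
    n = B_j.shape[0]
    Bjk = np.einsum("il,im->ilm", B_j, B_k).reshape(n, -1)   # Kronecker rows
    Bjk = Bjk - Bjk.mean(axis=0)
    # remove the component lying in the span of the main-effect design (identifiability)
    coef, *_ = np.linalg.lstsq(main_design, Bjk, rcond=None)
    Bjk = Bjk - main_design @ coef
    Q, _ = np.linalg.qr(Bjk)
    return np.sqrt(n) * Q                                    # orthonormalized group
```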

6 Asymptotic Theory for the SSGL and NPSSL

In this section, we derive asymptotic properties for the separable SSGL and NPSSL models. We first note some differences between our theory and the theory in Ročková and George [26]. First, we prove joint consistency in estimation of both the unknown $\beta_0$ and the unknown $\sigma^2$, whereas [26] proved their result only for $\beta_0$, assuming known variance. Second, Ročková and George [26] established optimal posterior concentration rates for the global posterior mode and the full posterior separately. We establish optimal posterior contraction for the full posterior only, from which it immediately follows that the global posterior mode approaches the truth at the same rate. Finally, we use proof techniques from [21, 29, 37] rather than the ones in [26]. However, none of these other papers considers both continuous spike-and-slab priors for groups of regression coefficients and an independent prior on the unknown variance.

6.1 Grouped Linear Regression

We work under the frequentist assumption that there is a true model,

$$Y = \sum_{g=1}^{G} X_g \beta_{0g} + \varepsilon, \qquad (6.1)$$

where $\varepsilon \sim \mathcal{N}_n(0, \sigma_0^2 I_n)$ and $\beta_0 = (\beta_{01}^T, \ldots, \beta_{0G}^T)^T$. Denote $X = [X_1, \ldots, X_G]$ and $p = \sum_{g=1}^{G} m_g$. Suppose we endow $(\beta, \sigma^2)$ under model (6.1) with the following prior specification: