# Fast Algorithms and Theory for High-Dimensional Bayesian Varying Coefficient Models

Nonparametric varying coefficient (NVC) models are widely used for modeling time-varying effects on responses that are measured repeatedly. In this paper, we introduce the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL) for Bayesian estimation and variable selection in NVC models. The NVC-SSL simultaneously estimates the functionals of the significant time-varying covariates while thresholding out insignificant ones. Our model can be implemented using a highly efficient expectation-maximization (EM) algorithm, thus avoiding the computational burden of Markov chain Monte Carlo (MCMC) in high dimensions. In contrast to frequentist NVC models, hardly anything is known about the large-sample properties for Bayesian NVC models. In this paper, we take a step towards addressing this longstanding gap between methodology and theory by deriving posterior contraction rates under the NVC-SSL model when the number of covariates grows at nearly exponential rate with sample size. Finally, we illustrate our methodology through simulation studies and data analysis.


## 1 Introduction

Consider the nonparametric varying coefficient (NVC) model with $p$ covariates,

$$
y_i(t_{ij}) = \sum_{k=1}^{p} x_{ik}(t_{ij})\beta_k(t_{ij}) + \varepsilon_i(t_{ij}), \qquad i = 1,\ldots,n, \; j = 1,\ldots,n_i, \tag{1.1}
$$

where $y_i(t_{ij})$ is the response for the $i$th subject at time point $t_{ij} \in \mathcal{T}$, $\mathcal{T}$ is the time interval on which the different measurements are taken, $x_{ik}(t_{ij})$ is a time-dependent covariate with corresponding smooth coefficient function $\beta_k$, and $\varepsilon_i(t_{ij})$ is random error. Throughout this paper, we denote the total number of observations as $N = \sum_{i=1}^{n} n_i$. To avoid the need for an intercept, we assume the within-subject responses have been centered. We also assume that the $\varepsilon_i$'s are independent, zero-mean Gaussian processes. That is, we assume that $\varepsilon_i = (\varepsilon_i(t_{i1}), \ldots, \varepsilon_i(t_{in_i}))' \sim \mathcal{N}_{n_i}(0, \Sigma_i)$, where $\Sigma_i$ is the $n_i \times n_i$ variance-covariance matrix that captures the temporal correlation between the responses for the $i$th subject.

NVC models (1.1) arise in many real applications. A prominent example is longitudinal data analysis, where we aim to model the response for the $i$th experimental subject at different time points [14]. NVC models can also be used for functional data analysis, where we wish to model smooth functional responses varying over a continuum [29]. See [12, 9] for examples of applications of these models. We note that while we model the response as a function of time in this manuscript, the model (1.1) can also be used more broadly to model functions of any "effect modifier" variable, such as spatial location, wavelength, etc. [12].

There has been extensive frequentist work on fitting NVC models. Typical approaches to fitting (1.1) use kernel-local polynomial smoothing [8, 36] or basis expansions [16, 28] to estimate the $\beta_k$'s. When the number of covariates $p$ is large, one also often wants to impose a low-dimensional structure such as sparsity. In order to perform simultaneous function estimation and model selection, many authors have applied a penalty such as the group SCAD [7] or the group lasso [38] to the vectors of basis coefficients. See, e.g. [33, 34, 35].

These frequentist penalized NVC models do not account for the within-subject temporal correlations, essentially solving objective functions that take $\Sigma_i \propto I_{n_i}$. In low-dimensional settings and without regularizing the parameter space, Krafty et al. [20] incorporated estimation of within-subject correlations into NVC models. However, to the best of our knowledge, no similar extension has been made for high-dimensional, penalized NVC models. While [33, 34, 35] show that consistent estimation of the $\beta_k$'s and model selection consistency can still be achieved for penalized NVC models, failing to account for the error covariance can nevertheless lead to invalid inferences and overfitting in finite samples [24, 26]. Thus, it seems prudent to model temporal dependence in NVC models.

While there are numerous theoretical results for frequentist NVC models, work on Bayesian NVC models has been primarily methodological. For example, Liu et al. [25] endow the smooth functions $\beta_k$ with a Gaussian process prior. Biller and Fahrmeir [3] and Huang et al. [17] use splines to model the $\beta_k$'s in (1.1) and place multivariate normal priors on the groups of basis coefficients. Li et al. [22] place a scale mixture of multivariate normal priors known as the Bayesian group lasso prior on groups of basis coefficients. Unlike the frequentist penalized approaches, [25, 22] explicitly model the temporal dependence of the within-subject measurements, either by including subject-specific random effects or by specifying a first-order autoregressive (AR(1)) covariance structure for the error terms $\varepsilon_i$. In spite of the benefits of being able to incorporate covariance structure into variable selection, existing Bayesian approaches to NVC models predominantly rely on Markov chain Monte Carlo (MCMC) to obtain posterior estimates of the $\beta_k$'s. In high dimensions, however, MCMC can be very slow and even computationally impractical. In addition, hardly anything is known about the theoretical properties of Bayesian NVC models.

In this paper, we take a step towards addressing these methodological, computational, and theoretical gaps by introducing the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL). Our main contributions are as follows:

• We introduce a spike-and-slab approach for Bayesian estimation and variable selection in nonparametric varying coefficient models (1.1). Unlike penalized frequentist approaches, the NVC-SSL model explicitly accounts for temporal correlations by jointly estimating the unknown covariance structure for the time-varying responses. The NVC-SSL model also employs a non-separable Bayesian penalty which automatically self-adapts to ensemble information about sparsity.

• We derive an expectation/maximization (EM) algorithm to rapidly select and estimate the nonzero smooth functional components, while thresholding out the insignificant ones. Our paper appears to be the first to bypass MCMC in its implementation of a Bayesian NVC model.

• We provide the first theoretical results for Bayesian varying coefficient models in the high-dimensional regime. Specifically, we derive posterior contraction rates for the functional components when the number of time-dependent covariates $p$ grows at nearly exponential rate with the sample size $n$.

The rest of this paper is structured as follows. In Section 2, we introduce the nonparametric varying coefficient spike-and-slab lasso. In Section 3, we derive a coordinate ascent algorithm for rapidly obtaining estimates of the smooth functionals $\beta_k$ under the NVC-SSL. In Section 4, we provide asymptotic theory for the NVC model (1.1) under the NVC-SSL prior when $p$ grows nearly exponentially with $n$. In Section 5, we provide simulation studies of our method. Finally, in Section 6, we use our model to analyze a real data set.

### 1.1 Notation

We use the following notation for the rest of the paper. For two nonnegative sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \asymp b_n$ to denote $0 < \liminf_{n\to\infty} a_n/b_n \leq \limsup_{n\to\infty} a_n/b_n < \infty$. If $\lim_{n\to\infty} a_n/b_n = 0$, we write $a_n = o(b_n)$ or $a_n \prec b_n$. We use $a_n \lesssim b_n$ or $a_n = O(b_n)$ to denote that for sufficiently large $n$, there exists a constant $C > 0$ independent of $n$ such that $a_n \leq C b_n$. We write $a_n \sim b_n$ to denote $a_n/b_n \to 1$ as $n \to \infty$.

For a vector $v \in \mathbb{R}^p$, we let $\lVert v \rVert_1 = \sum_{i=1}^{p} |v_i|$ and $\lVert v \rVert_2 = (\sum_{i=1}^{p} v_i^2)^{1/2}$ denote its $\ell_1$ and $\ell_2$ norms respectively. For a symmetric matrix $A$, we let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its minimum and maximum eigenvalues respectively. For a matrix $A$ with entries $a_{ij}$, $\lVert A \rVert_F = (\sum_i \sum_j a_{ij}^2)^{1/2}$ denotes its Frobenius norm, while $\lVert A \rVert_2 = \sqrt{\lambda_{\max}(A'A)}$ denotes its spectral norm.

## 2 The Nonparametric Varying Coefficient Spike-and-Slab Lasso

### 2.1 Basis Expansion and the NVC-SSL

Following the development in [33, 34, 35], we suppose that each coefficient function $\beta_k$ in (1.1) can be approximated by $g_k$, a linear combination of $d_k$ basis functions, i.e.

$$
g_k(t) = \sum_{l=1}^{d_k} \gamma_{kl} B_{kl}(t), \qquad t \in \mathcal{T}, \; k = 1,\ldots,p, \tag{2.1}
$$

where $\{B_{kl}(\cdot) : l = 1, \ldots, d_k\}$, $k = 1, \ldots, p$, are the basis functions. Then the model (1.1) can be approximated as

$$
y_i(t_{ij}) \approx \sum_{k=1}^{p} \sum_{l=1}^{d_k} x_{ik}(t_{ij}) \gamma_{kl} B_{kl}(t_{ij}) + \varepsilon_i(t_{ij}), \tag{2.2}
$$

for $i = 1, \ldots, n$ and $j = 1, \ldots, n_i$. Let $X = (X_1, \ldots, X_p)$, with

$$
X_k = \big( x_{1k}(t_{11}), \ldots, x_{1k}(t_{1n_1}), \ldots, x_{nk}(t_{n1}), \ldots, x_{nk}(t_{nn_n}) \big)'. \tag{2.3}
$$

Further, we define the $p \times \sum_{k=1}^{p} d_k$ matrix $B(t)$ as

$$
B(t) = \begin{pmatrix}
B_{11}(t) & B_{12}(t) & \cdots & B_{1d_1}(t) & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & \cdots & 0 & \cdots & 0 & B_{p1}(t) & B_{p2}(t) & \cdots & B_{pd_p}(t)
\end{pmatrix}, \tag{2.4}
$$

and set $U = (U_{11}, \ldots, U_{1n_1}, \ldots, U_{n1}, \ldots, U_{nn_n})'$ with

$$
U_{ij}' = x'(t_{ij}) B(t_{ij}) \tag{2.5}
$$

for $i = 1, \ldots, n$, $j = 1, \ldots, n_i$, where $x(t_{ij}) = (x_{i1}(t_{ij}), \ldots, x_{ip}(t_{ij}))'$ and $U_{ij}'$ denotes the row of $U$ corresponding to the $j$th observation for the $i$th subject.

Letting $Y = (y_1(t_{11}), \ldots, y_n(t_{nn_n}))'$ and $\varepsilon = (\varepsilon_1', \ldots, \varepsilon_n')'$, the model (1.1) can then be expressed in matrix form as

$$
Y = U\gamma + \delta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}_N(0, \Sigma), \tag{2.6}
$$

where $\gamma = (\gamma_1', \ldots, \gamma_p')'$ and $\gamma_k$ is the $d_k$-dimensional vector of basis coefficients corresponding to the $k$th covariate, $\Sigma = \mathrm{diag}(\Sigma_1, \ldots, \Sigma_n)$ is an $N \times N$ block diagonal matrix, and $\delta$ is an $N \times 1$ vector of lower-order bias, i.e. the approximation error from using truncated bases of dimension $d_k$ to approximate the $\beta_k$'s.

### 2.2 Model Formulation

For the NVC model (1.1), we assume that the within-subject covariance matrices have the structure $\Sigma_i = \sigma^2 R_i(\rho)$, where $R_i(\rho)$ denotes an $n_i \times n_i$ correlation matrix determined by a single parameter $\rho$. That is, we suppose that

$$
\varepsilon_i \overset{\mathrm{ind}}{\sim} \mathcal{N}_{n_i}\big(0, \sigma^2 R_i(\rho)\big), \qquad i = 1, \ldots, n. \tag{2.7}
$$

This general form subsumes many popular choices for covariance structures. For example, if we assume a first-order autoregressive (AR(1)) structure, then the $(j, j')$th element of $R_i(\rho)$ is $\rho^{|j - j'|}$. For compound symmetry (CS), the $(j, j')$th element of $R_i(\rho)$ is $\rho$ for $j \neq j'$ and one for $j = j'$. For concreteness, we focus only on the AR(1) and CS structures in this paper, noting that our model can be generalized to more exotic correlation structures. This assumption about the structure of the $\Sigma_i$'s makes our estimation procedure more computationally efficient and less prone to overfitting, as the problem of estimating $\sum_{i=1}^{n} n_i(n_i+1)/2$ unknowns (the number of diagonal and unique off-diagonal entries in the $\Sigma_i$'s) reduces to just estimating the two unknowns $(\sigma^2, \rho)$.
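As a concrete illustration, both correlation structures are generated from the single parameter $\rho$. The following is a minimal NumPy sketch (the function names are ours, not part of the paper's implementation):

```python
import numpy as np

def ar1_corr(n_i, rho):
    """AR(1) correlation matrix: the (j, j') entry is rho^|j - j'|."""
    idx = np.arange(n_i)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def cs_corr(n_i, rho):
    """Compound symmetry: ones on the diagonal, rho off the diagonal."""
    return np.full((n_i, n_i), rho) + (1.0 - rho) * np.eye(n_i)
```

Either matrix, scaled by $\sigma^2$, gives the subject-level covariance $\Sigma_i = \sigma^2 R_i(\rho)$.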

Under the NVC-SSL model, we endow the vector of basis coefficients $\gamma$ in (2.6) with the spike-and-slab group lasso (SSGL) prior of [2],

$$
\pi(\gamma \mid \theta) = \prod_{k=1}^{p} \big[ (1-\theta)\,\Psi(\gamma_k \mid \lambda_0) + \theta\,\Psi(\gamma_k \mid \lambda_1) \big], \tag{2.8}
$$

where $\theta \in (0, 1)$ is a mixing proportion, or the expected proportion of nonzero $\gamma_k$'s, and $\Psi(\gamma_k \mid \lambda)$ denotes the group lasso density indexed by hyperparameter $\lambda > 0$,

$$
\Psi(\gamma_k \mid \lambda) \propto \lambda^{d_k} e^{-\lambda \lVert \gamma_k \rVert_2}.
$$

The group lasso prior has been considered by several other authors [2, 25, 21, 37] and can be derived as the marginal density of a multivariate normal scale mixture, $\gamma_k \mid \xi_k \sim \mathcal{N}_{d_k}(0, \xi_k I_{d_k})$, with a gamma mixing density on $\xi_k$.

The SSGL prior (2.8), which we denote as $\gamma \sim \mathrm{SSGL}(\lambda_0, \lambda_1, \theta)$ going forward, can be considered a two-group refinement of the group lasso [38]. Under the prior (2.8), the global posterior mode for $\gamma$ may be exactly sparse, thereby allowing the NVC-SSL to perform joint estimation and variable selection [2]. In the present context, if the posterior mode satisfies $\tilde{\gamma}_k = 0$, then the $k$th functional component will be estimated as identically zero and thus thresholded out of the model. We typically set $\lambda_0 \gg \lambda_1$ in (2.8), so that the first mixture component (the spike) is heavily concentrated near zero for each $\gamma_k$, while the slab stabilizes the posterior estimates of large coefficients, preventing them from being downward biased.

To model the uncertainty in $\theta$ in (2.8), we endow $\theta$ with a beta prior,

$$
\theta \sim \mathcal{B}(a, b), \tag{2.9}
$$

where $a$ and $b$ are fixed positive constants. Unlike frequentist penalties such as the group SCAD or the group lasso, this prior on $\theta$ ultimately renders our Bayesian penalty non-separable. The non-separability provides several benefits. First, the prior on $\theta$ allows the penalty to self-adapt to the true underlying sparsity. Second, with appropriate choices for the hyperparameters $(a, b)$ in (2.9), our prior performs an automatic multiplicity adjustment [32] and favors parsimonious models in high dimensions.

To complete the model specification, we place independent priors on the parameters $(\sigma^2, \rho)$ in (2.7) as

$$
\sigma^2 \sim \mathcal{IG}(c_0/2, d_0/2), \tag{2.10}
$$

where $c_0$ and $d_0$ are small positive constants, and

$$
\pi(\rho) = \sum_{h=1}^{q} q^{-1} \delta_{m_h}. \tag{2.11}
$$

That is, $\rho$ follows a discrete uniform distribution with $q$ atoms $m_1, \ldots, m_q$, where $\delta_{m_h}$ denotes a point mass at $m_h$. In our implementation, we specify the support of $\rho$ as an equispaced grid of atoms. As we describe in Section 3, placing a discrete uniform prior on $\rho$ facilitates more efficient computations (from an optimization perspective) than a continuous prior with bounded support.

### 2.3 Determining Which Covariance Structure to Use

Our model requires the user to specify the error covariance structure (AR(1) or CS). In order to determine which one to use, we can first obtain the residuals from a regression fit (e.g. a regression with the standard group lasso). Then we can construct empirical variogram plots, time-plots, or scatterplot matrices of the residuals to give us an idea of the underlying error covariance structure [5].

In particular, the empirical variogram of the residuals (see Chapter 5 of [5]) is widely used in practice to determine the appropriate covariance structure to use for modeling longitudinal and spatial data. If the residual variogram shows decaying correlation with distance, then we may specify the AR(1) structure for the NVC-SSL model. On the other hand, if the residual variogram suggests equicorrelation, then we may specify the CS structure.
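The residual-variogram diagnostic described above can be sketched as follows (our illustration of the classical estimator, not the authors' code): within each subject, half the squared differences of residuals are averaged within bins of time lag. An increasing curve that levels off suggests decaying (AR(1)-like) correlation; a flat curve suggests equicorrelation (CS).

```python
import numpy as np

def empirical_variogram(resids, times, n_bins=10):
    """Empirical variogram of within-subject residuals.
    resids, times: lists of per-subject 1-D arrays (one pair per subject).
    Returns bin centers and the mean of 0.5 * (r_j - r_j')^2 per lag bin."""
    lags, halfsq = [], []
    for r, t in zip(resids, times):
        j, k = np.triu_indices(len(r), k=1)       # all within-subject pairs
        lags.append(np.abs(t[j] - t[k]))
        halfsq.append(0.5 * (r[j] - r[k]) ** 2)
    lags, halfsq = np.concatenate(lags), np.concatenate(halfsq)
    bins = np.linspace(0.0, lags.max(), n_bins + 1)
    which = np.digitize(lags, bins[1:-1])          # bin index in 0..n_bins-1
    centers = 0.5 * (bins[:-1] + bins[1:])
    gamma = np.array([halfsq[which == b].mean() if np.any(which == b) else np.nan
                      for b in range(n_bins)])
    return centers, gamma
```

Plotting `gamma` against `centers` gives the diagnostic plot; only within-subject pairs contribute, matching the longitudinal setting.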

## 3 Computational Strategy

### 3.1 Posterior Mode Estimation

We now detail how to implement the NVC-SSL model. Rather than relying on MCMC, we target the maximum a posteriori (MAP) estimate $\tilde{\gamma}$ for the basis coefficients. We may then take as our estimates of the smooth functionals $\hat{\beta}_k(t) = \sum_{l=1}^{d_k} \tilde{\gamma}_{kl} B_{kl}(t)$, $k = 1, \ldots, p$. For simplicity, we let the basis truncation parameters satisfy $d_1 = \cdots = d_p = d$ for some positive integer $d$. In Section 3.3, we describe how to select $d$.

Let $\Xi$ denote the collection $(\gamma, \theta, \sigma^2, \rho)$. The log-posterior density for $\Xi$ (up to an additive constant) is given by

$$
\begin{aligned}
\log \pi(\Xi \mid Y) = {} & -\frac{N}{2}\log\sigma^2 - \frac{1}{2}\log|R(\rho)| - \frac{\lVert R^{-1/2}(\rho)(Y - U\gamma)\rVert_2^2}{2\sigma^2} \\
& + \sum_{k=1}^{p} \log\Big( (1-\theta)\lambda_0^d e^{-\lambda_0 \lVert \gamma_k \rVert_2} + \theta \lambda_1^d e^{-\lambda_1 \lVert \gamma_k \rVert_2} \Big) \\
& + (a-1)\log\theta + (b-1)\log(1-\theta) \\
& - \left(\frac{c_0+2}{2}\right)\log\sigma^2 - \frac{d_0}{2\sigma^2} + \log \pi(\rho).
\end{aligned} \tag{3.1}
$$

Our objective is to maximize the log-posterior with respect to $\Xi$. We first introduce latent 0-1 indicators, $\tau = (\tau_1, \ldots, \tau_p)'$. Then we reparametrize the prior (2.8) as:

$$
\pi(\gamma \mid \tau) = \prod_{k=1}^{p} \big[ (1-\tau_k)\Psi(\gamma_k \mid \lambda_0) + \tau_k \Psi(\gamma_k \mid \lambda_1) \big], \qquad
\pi(\tau \mid \theta) = \prod_{k=1}^{p} \theta^{\tau_k} (1-\theta)^{1-\tau_k}.
$$

The augmented log-posterior density for $(\Xi, \tau)$ (up to an additive constant) is now given by

$$
\begin{aligned}
\log \pi(\Xi, \tau \mid Y) = {} & -\frac{N}{2}\log\sigma^2 - \frac{1}{2}\log|R(\rho)| - \frac{\lVert R^{-1/2}(\rho)(Y - U\gamma)\rVert_2^2}{2\sigma^2} \\
& + \sum_{k=1}^{p} \log\Big( (1-\tau_k)\lambda_0^d e^{-\lambda_0 \lVert \gamma_k \rVert_2} + \tau_k \lambda_1^d e^{-\lambda_1 \lVert \gamma_k \rVert_2} \Big) \\
& + \Big(a - 1 + \sum_{k=1}^{p}\tau_k\Big)\log\theta + \Big(b - 1 + p - \sum_{k=1}^{p}\tau_k\Big)\log(1-\theta) \\
& - \left(\frac{c_0+2}{2}\right)\log\sigma^2 - \frac{d_0}{2\sigma^2} + \log \pi(\rho).
\end{aligned} \tag{3.2}
$$

It is straightforward to verify that $E[\tau_k \mid \Xi, Y] = p_k^\star(\gamma_k, \theta)$, where

$$
p_k^\star(\gamma_k, \theta) = \frac{\theta\, \Psi(\gamma_k \mid \lambda_1)}{\theta\, \Psi(\gamma_k \mid \lambda_1) + (1-\theta)\, \Psi(\gamma_k \mid \lambda_0)} \tag{3.3}
$$

is the conditional posterior probability that $\gamma_k$ is drawn from the slab distribution rather than from the spike.
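The probability (3.3) can be evaluated stably on the log scale, since the group lasso densities satisfy $\Psi(\gamma_k \mid \lambda) \propto \lambda^{d} e^{-\lambda \lVert \gamma_k \rVert_2}$ and the $\lambda$-free normalizing constant cancels in the ratio. A minimal sketch (our function, assuming this proportional form):

```python
import numpy as np

def slab_prob(gamma_k, theta, lam0, lam1):
    """E-step probability p*_k that gamma_k is drawn from the slab, eq. (3.3).
    Works on the log scale to avoid underflow when lam0 * ||gamma_k|| is large."""
    d = len(gamma_k)
    norm = np.linalg.norm(gamma_k)
    log_slab = np.log(theta) + d * np.log(lam1) - lam1 * norm
    log_spike = np.log1p(-theta) + d * np.log(lam0) - lam0 * norm
    m = max(log_slab, log_spike)  # log-sum-exp shift
    return np.exp(log_slab - m) / (np.exp(log_slab - m) + np.exp(log_spike - m))
```

With $\lambda_0 \gg \lambda_1$, $p_k^\star$ is near zero at $\gamma_k = 0$ and near one for groups with large $\ell_2$ norm, which is exactly the thresholding behavior the E-step exploits.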

With the augmented log-posterior (3.2), we may now implement an EM algorithm to find $\tilde{\Xi}$. After initializing the parameters $\Xi^{(0)}$, we iterate between the E-step and M-step until convergence. For the E-step at iteration $t$, we compute $p_k^\star = E[\tau_k \mid \Xi^{(t-1)}, Y]$, given the previous estimate $\Xi^{(t-1)}$. For the M-step, we then maximize the following objective function with respect to $\Xi$:

$$
\begin{aligned}
E\big[\log \pi(\Xi, \tau \mid Y) \mid \Xi^{(t-1)}\big] = {} & -\frac{N}{2}\log\sigma^2 - \frac{1}{2}\log|R(\rho)| - \frac{\lVert R^{-1/2}(\rho)(Y - U\gamma)\rVert_2^2}{2\sigma^2} - \sum_{k=1}^{p} \lambda_k^\star \lVert \gamma_k \rVert_2 \\
& + \Big(a - 1 + \sum_{k=1}^{p} p_k^\star\Big)\log\theta + \Big(b - 1 + p - \sum_{k=1}^{p} p_k^\star\Big)\log(1-\theta) \\
& - \left(\frac{c_0+2}{2}\right)\log\sigma^2 - \frac{d_0}{2\sigma^2} + \log \pi(\rho),
\end{aligned} \tag{3.4}
$$

where $\lambda_k^\star = \lambda_1 p_k^\star + \lambda_0 (1 - p_k^\star)$. The function (3.4) would be difficult to jointly maximize with respect to $(\gamma, \theta, \sigma^2, \rho)$ if $\pi(\rho)$ were a continuous density. However, by endowing $\rho$ with the discrete uniform prior (2.11), our optimization is much simpler. With the prior (2.11), we may fix $\rho = m_h$ and maximize (3.4) with respect to $(\gamma, \theta, \sigma^2)$ for each atom. In our implementation, each of these optimizations is performed in parallel and the log-posterior for each $m_h$, $h = 1, \ldots, q$, is evaluated. We take as our modal estimate $\tilde{\Xi}$ the candidate which maximizes the original non-augmented log-posterior (3.1).

For either fixed or random $\rho$, it is clear from (3.4) that $\theta$ has the following closed-form update in the M-step:

$$
\theta^{(t)} = \frac{a - 1 + \sum_{k=1}^{p} p_k^\star}{a + b + p - 2}. \tag{3.5}
$$

Next, we update $\gamma$, holding $(\theta, \sigma^2, \rho)$ fixed. Let $\tilde{Y} = R^{-1/2}(\rho) Y$ and $\tilde{U} = R^{-1/2}(\rho) U$. To update $\gamma$, we solve the following optimization:

$$
\gamma^{(t)} = \operatorname*{arg\,max}_{\gamma} \; -\frac{1}{2}\lVert \tilde{Y} - \tilde{U}\gamma \rVert_2^2 - \sum_{k=1}^{p} \sigma^2 \lambda_k^\star \lVert \gamma_k \rVert_2. \tag{3.6}
$$

Note that (3.6) is an adaptive group lasso problem with weights $\sigma^2 \lambda_k^\star$, and it explicitly takes temporal correlation into account (through $R^{-1/2}(\rho)$) in our estimation procedure for $\gamma$. This optimization can be solved with any standard (adaptive) group lasso algorithm [38, 13].
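One simple solver for problems of the form (3.6) is block coordinate descent with groupwise soft-thresholding. The sketch below is our illustration under the simplifying assumption that each group's columns are orthonormal (so the blockwise update is exact); any off-the-shelf group lasso routine can be substituted:

```python
import numpy as np

def group_soft_threshold(z, w):
    """argmin_g 0.5 * ||g - z||^2 + w * ||g||_2 (multivariate soft-thresholding)."""
    nz = np.linalg.norm(z)
    return np.zeros_like(z) if nz <= w else (1.0 - w / nz) * z

def adaptive_group_lasso(U_tilde, Y_tilde, groups, weights, n_iter=200):
    """Block coordinate descent for 0.5 * ||Y - U g||^2 + sum_k w_k ||g_k||_2,
    assuming each group's columns are orthonormal (U_k' U_k = I)."""
    gamma = np.zeros(U_tilde.shape[1])
    for _ in range(n_iter):
        for k, idx in enumerate(groups):
            # partial residual: add back group k's own contribution
            r = Y_tilde - U_tilde @ gamma + U_tilde[:, idx] @ gamma[idx]
            z = U_tilde[:, idx].T @ r
            gamma[idx] = group_soft_threshold(z, weights[k])
    return gamma
```

Groups whose signal falls below the weight $w_k = \sigma^2 \lambda_k^\star$ are set exactly to zero, which is how the spike thresholds out insignificant functional components.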

Finally, holding $(\gamma, \theta, \rho)$ fixed, we update $\sigma^2$, which has the following closed form:

$$
\sigma^{2(t)} = \frac{d_0 + \lVert \tilde{Y} - \tilde{U}\gamma \rVert_2^2}{N + c_0 + 2}. \tag{3.7}
$$

In order to obtain $\tilde{Y}$ and $\tilde{U}$ and evaluate the log-posterior (3.1) for each atom $m_h$, $h = 1, \ldots, q$, in (2.11), we must invert $R(\rho)$, which is an $N \times N$ matrix. However, by exploiting the block structure of $R(\rho)$, we only need to perform matrix inversions of the individual $n_i \times n_i$ correlation matrices $R_i(\rho)$, incurring total computational complexity of $O(\sum_{i=1}^{n} n_i^3)$. Evaluating the log-determinant $\log|R(\rho)| = \sum_{i=1}^{n} \log|R_i(\rho)|$ requires a similar number of operations. In high dimensions, the $n_i$'s are typically much smaller than both $n$ and $p$, so performing these operations is not particularly costly. To further improve computational efficiency, we compute the inverses and log-determinants of the $R_i(\rho)$'s in parallel. Letting $Y_i$ and $U_i$ denote the subvector of $Y$ and the submatrix of $U$ corresponding to the $i$th subject respectively, we also compute $\tilde{Y}_i = R_i^{-1/2}(\rho) Y_i$ and $\tilde{U}_i = R_i^{-1/2}(\rho) U_i$ in parallel and then combine them into a single $\tilde{Y}$ and $\tilde{U}$ respectively.
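The blockwise computation can be sketched as follows (our illustration; here $R_i^{-1/2}$ is taken to be $L_i^{-1}$ from the Cholesky factorization $R_i = L_i L_i'$, which is a valid whitening transform since $\lVert L_i^{-1} x \rVert_2^2 = x' R_i^{-1} x$):

```python
import numpy as np

def blockwise_whiten(Y_blocks, U_blocks, R_blocks):
    """Whiten each subject's data with its own n_i x n_i correlation block,
    accumulating log|R(rho)| = sum_i log|R_i(rho)| along the way."""
    Yt, Ut, logdet = [], [], 0.0
    for y_i, U_i, R_i in zip(Y_blocks, U_blocks, R_blocks):
        L = np.linalg.cholesky(R_i)                 # O(n_i^3) per subject
        Yt.append(np.linalg.solve(L, y_i))          # L^{-1} y_i
        Ut.append(np.linalg.solve(L, U_i))          # L^{-1} U_i
        logdet += 2.0 * np.sum(np.log(np.diag(L)))  # log|R_i| = 2 sum log L_jj
    return np.concatenate(Yt), np.vstack(Ut), logdet
```

Each subject's block is independent of the others, so in a production implementation the loop body can be dispatched in parallel, as described above.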

### 3.2 Initialization and Dynamic Posterior Exploration

The posterior distribution under the NVC-SSL prior will typically be multimodal when $p$ is large relative to $N$, and thus, any MAP-finding algorithm is prone to becoming entrapped at a suboptimal local mode. To partially mitigate this, we initialize our EM algorithm at $\gamma^{(0)}$, the group lasso solution that does not account for any temporal correlations. That is,

$$
\gamma^{(0)} = \operatorname*{arg\,min}_{\gamma} \; \frac{1}{2}\lVert Y - U\gamma \rVert_2^2 + \lambda \sum_{k=1}^{p} \sqrt{d}\, \lVert \gamma_k \rVert_2,
$$

where $\lambda$ is chosen by ten-fold cross-validation. The initial unknown variance $\sigma^{2(0)}$ is taken to be the residual variance from this group lasso fit. In our experience, this initialization works well in ensuring rapid convergence to a fairly good solution, even though we are not guaranteed to find the global mode. Even if we were to implement our model using MCMC, the model with the highest posterior probability may only be visited a small number of times or may never be visited at all (see, e.g. Section 4.3 of [30]).

In addition to choosing a good initialization, we employ dynamic posterior exploration to increase the chances of finding more optimal modes [31, 30, 2]. We fix $\lambda_1$ to be a small constant so that the slab density has considerable spread. Having a diffuse slab allows vectors with large coefficients to escape the pull of the spike. If we also set $\lambda_0$ to be small, then the posterior will be relatively flat. However, with both a diffuse spike and a diffuse slab density, many negligible $\gamma_k$'s will tend to be selected in our model. To eliminate these suboptimal non-sparse modes, we gradually increase $\lambda_0$ along a ladder of increasing values, $\lambda_0^{(1)} < \lambda_0^{(2)} < \cdots < \lambda_0^{(T)}$. For each $\lambda_0^{(s)}$ in the ladder, we reinitialize $\gamma$ using the MAP estimate from the previous spike parameter $\lambda_0^{(s-1)}$ as a "warm start." As we increase $\lambda_0$, the posterior becomes "spikier," with the spikes absorbing more and more negligible parameter estimates. For large enough $\lambda_0$, the algorithm eventually stabilizes, so that further increases in $\lambda_0$ do not change the solution.
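The ladder with warm starts can be sketched generically as follows (a sketch only; `fit_em` stands in for one full run of the EM algorithm at a fixed spike parameter and is an assumption of this illustration, not the paper's code):

```python
import numpy as np

def dynamic_posterior_exploration(fit_em, gamma0, lam0_ladder, tol=1e-6):
    """Warm-started sweep over the spike ladder lam0_1 < ... < lam0_T.
    `fit_em(gamma_init, lam0)` runs EM to convergence for one value of lam0."""
    gamma = gamma0
    prev = None
    for lam0 in lam0_ladder:
        gamma = fit_em(gamma, lam0)  # warm start from the previous solution
        # stop once further increases in lam0 no longer change the solution
        if prev is not None and float(np.max(np.abs(gamma - prev))) < tol:
            break
        prev = gamma
    return gamma
```

The early stop mirrors the stabilization described above: once the spike has absorbed all negligible groups, climbing further up the ladder leaves the mode unchanged.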

The complete algorithm for the NVC-SSL model is given in Algorithm 1. Let $t$ be the vector of all observation times for all subjects. Once we have obtained the final estimate $\tilde{\gamma}$, we can obtain the estimates of the smooth functionals as $\hat{\beta}_k(t) = \sum_{l=1}^{d} \tilde{\gamma}_{kl} B_{kl}(t)$, evaluated at all time points in $t$. We reiterate that Step 3 of the algorithm is also computed in parallel for each atom $m_h$ in order to accelerate computing time.

### 3.3 Selection of Degrees of Freedom

By utilizing dynamic posterior exploration, we do not need to tune the hyperparameter $\lambda_0$. However, we still need to choose the degrees of freedom $d$ (i.e. the number of basis functions to use). To do this, we use the Akaike information criterion with a correction for small sample sizes ($\mathrm{AIC}_c$) [18]. This correction ensures that if the sample size is small, the $\mathrm{AIC}_c$ will be reluctant to overfit. Let $\hat{S}$ denote the indices of the estimated nonzero subvectors of $\tilde{\gamma}$, with cardinality $\hat{s} = |\hat{S}|$. In this context, the $\mathrm{AIC}_c$ is defined as

$$
\mathrm{AIC}_c = \log\left(\frac{\lVert R^{-1/2}(\hat{\rho})(Y - U\hat{\gamma}) \rVert_2^2}{N}\right) + 1 + \frac{2(\hat{s}+1)}{N - \hat{s} - 2}, \tag{3.8}
$$

where $(\hat{\gamma}, \hat{\rho})$ are the modal estimates under the NVC-SSL prior. Note that if $\rho$ were known, then the quantity inside the logarithm in the first term of (3.8) would be the maximum likelihood estimate (MLE) for $\sigma^2$. We select the $d$ which minimizes $\mathrm{AIC}_c$ from a reasonable range of values.

In our numerical studies, we also found that simply fixing $d$ to be sufficiently large gave excellent performance under the NVC-SSL model; further tuning of $d$ using $\mathrm{AIC}_c$ provided only modest improvements. This can possibly be attributed to the fact that the dynamic posterior exploration strategy from Section 3.2 already eliminates many spurious variables. In our simulations in Section 5, we use $\mathrm{AIC}_c$ to select $d$ in order to make our comparisons with competing frequentist methods more transparent (where the degrees of freedom were similarly tuned). However, in practice, fitting the model with a large enough $d$ works quite well.
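Once the modal estimates are in hand, the criterion (3.8) is cheap to evaluate. A minimal sketch (ours; the whitening by $R^{-1/2}(\hat{\rho})$ is assumed to have been applied already, and `groups` lists the column indices of each coefficient group):

```python
import numpy as np

def aic_c(Y_tilde, U_tilde, gamma_hat, groups):
    """AICc of eq. (3.8) on whitened data; s_hat counts nonzero groups."""
    N = len(Y_tilde)
    s_hat = sum(1 for idx in groups if np.linalg.norm(gamma_hat[idx]) > 0)
    rss = np.sum((Y_tilde - U_tilde @ gamma_hat) ** 2)
    return np.log(rss / N) + 1.0 + 2.0 * (s_hat + 1) / (N - s_hat - 2)
```

The small-sample correction appears in the denominator $N - \hat{s} - 2$: at equal fit, selecting more groups inflates the criterion more aggressively than plain AIC would.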

## 4 Asymptotic Theory for the NVC-SSL

In this section, we prove several asymptotic properties of the NVC-SSL model. To the best of our knowledge, these are the first theoretical results for Bayesian NVC models in which the number of covariates $p$ is allowed to grow with the sample size. We assume that there is a "true" model,

$$
y_i(t_{ij}) = \sum_{k=1}^{p} x_{ik}(t_{ij})\beta_{0k}(t_{ij}) + \varepsilon_i(t_{ij}), \qquad i = 1,\ldots,n, \; j = 1,\ldots,n_i, \tag{4.1}
$$

where $\varepsilon_i \sim \mathcal{N}_{n_i}(0, \sigma_0^2 R_i(\rho_0))$ for fixed $\sigma_0^2$ and $\rho_0$. We let $\mathbb{P}_0$ denote the probability measure underlying the true model (4.1). As before, suppose that each true coefficient function $\beta_{0k}$ in (4.1) can be approximated by a linear combination $g_{0k}$ of basis functions,

$$
g_{0k}(t) = \sum_{l=1}^{d_k} \gamma_{0kl} B_{kl}(t), \qquad t \in \mathcal{T}, \; k = 1,\ldots,p, \tag{4.2}
$$

where $\{B_{kl}\}$ is a given basis system. Throughout this section, we assume that the basis functions are B-splines, as they are widely used in practice and the theory for B-splines is quite well-established. For simplicity, we also assume that $d_1 = \cdots = d_p = d$, noting that this can be relaxed. Our theory will continue to hold if we allow different vector sizes $d_k$, provided we add the additional constraint that the $d_k$'s grow at the same rate, as in [16]. Our results would then be expressed in terms of $\max_k d_k$ instead of $d$ (see, e.g. Corollary 1 of [35]). For each $k$, let the approximation error be given by

$$
\alpha_{0k}(t) = \beta_{0k}(t) - g_{0k}(t) = \beta_{0k}(t) - \sum_{l=1}^{d} \gamma_{0kl} B_{kl}(t), \qquad t \in \mathcal{T}, \; k = 1,\ldots,p,
$$

and so model (4.1) can be written as

$$
y_i(t_{ij}) = \sum_{k=1}^{p} \sum_{l=1}^{d} x_{ik}(t_{ij}) \gamma_{0kl} B_{kl}(t_{ij}) + \sum_{k=1}^{p} x_{ik}(t_{ij}) \alpha_{0k}(t_{ij}) + \varepsilon_i(t_{ij}). \tag{4.3}
$$

Let $\delta_0$ be an $N \times 1$ vector with entries $\sum_{k=1}^{p} x_{ik}(t_{ij}) \alpha_{0k}(t_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, n_i$. In matrix form, (4.1) can be expressed as

$$
Y = U\gamma_0 + \delta_0 + \varepsilon, \tag{4.4}
$$

where $U$ is defined as in (2.5), $\gamma_0 = (\gamma_{01}', \ldots, \gamma_{0p}')'$ where $\gamma_{0k}$ is the $d$-dimensional vector of the true basis coefficients corresponding to the $k$th covariate, and $\varepsilon \sim \mathcal{N}_N(0, \sigma_0^2 R(\rho_0))$.

For our theoretical study of the NVC-SSL model, we endow $(\gamma, \sigma^2, \rho)$ in (4.4) with the priors

$$
\gamma \sim \mathrm{SSGL}(\lambda_0, \lambda_1, \theta), \qquad \sigma^2 \sim \mathcal{IG}(c_0/2, d_0/2), \qquad \rho \sim \mathcal{U}(0, 1), \tag{4.5}
$$

where $\theta \sim \mathcal{B}(a, b)$. Our theory builds upon recent work by [2], who studied posterior contraction rates for additive regression under the SSGL prior (2.8), and [19], who introduced a unified theoretical framework for sparse Bayesian regression models. Our work differs from [2] in that we no longer assume homoscedastic, uncorrelated errors, i.e. $\Sigma_i = \sigma^2 I_{n_i}$. Our work also differs from [19] in that we use a continuous spike-and-slab prior that penalizes groups of coefficients, whereas [19] employ a point-mass spike-and-slab prior with a univariate Laplace density as the slab to penalize individual coefficients.

###### Remark 1.

For practical implementation of the NVC-SSL model, we endowed $\rho$ with a discrete uniform prior. The prior we study here, i.e. $\rho \sim \mathcal{U}(0, 1)$, can be obtained as the limiting case when the number of atoms $q$ in (2.11) tends to infinity. Our theoretical results can also be shown to hold for the discrete uniform prior by making a few minor modifications to our proofs.

### 4.1 Dimensionality Recovery for the NVC-SSL Model

We first begin with a result on dimensionality. Under our formulation (4.1), determining the number of nonzero functions $\beta_{0k}$ is equivalent to determining the number of $d$-dimensional vectors $\gamma_{0k}$ such that $\gamma_{0k} \neq 0$. Since we used a continuous spike-and-slab prior in our prior formulation (4.5), our model assigns zero mass to exactly sparse vectors $\gamma$. To approximate the model size under the NVC-SSL model, we use the following generalized notion of sparsity [2]. For a small threshold $\omega_d > 0$ which depends on $d$, we define the generalized inclusion indicator and generalized dimensionality, respectively, as

$$
\nu_{\omega_d}(\gamma_k) = \mathbb{I}\big(\lVert \gamma_k \rVert_2 > \omega_d\big) \quad \text{and} \quad |\nu(\gamma)| = \sum_{k=1}^{p} \nu_{\omega_d}(\gamma_k). \tag{4.6}
$$

For the threshold $\omega_d$, we use the following:

$$
\omega_d \equiv \omega_d(\lambda_0, \lambda_1, \theta) = \frac{1}{\lambda_0 - \lambda_1} \log\left[\frac{1-\theta}{\theta} \cdot \frac{\lambda_0^d}{\lambda_1^d}\right]. \tag{4.7}
$$

As noted in [2], any $d$-dimensional vector $\gamma_k$ satisfying $\lVert \gamma_k \rVert_2 = \omega_d$ lies at the intersection points of the spike and slab densities in the prior. For large $\lambda_0$, the threshold $\omega_d$ rapidly approaches zero, so that $|\nu(\gamma)|$ provides a good approximation to the number of truly nonzero groups in $\gamma$.
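The threshold (4.7) and the generalized dimensionality (4.6) are direct to compute; a small sketch (our functions, following the formulas above):

```python
import numpy as np

def threshold_omega(d, lam0, lam1, theta):
    """Spike/slab intersection threshold omega_d of eq. (4.7)."""
    return (1.0 / (lam0 - lam1)) * np.log(((1.0 - theta) / theta) * (lam0 / lam1) ** d)

def generalized_dim(gamma_groups, omega):
    """Generalized dimensionality |nu(gamma)| of eq. (4.6):
    counts the groups whose l2 norm exceeds the threshold."""
    return sum(bool(np.linalg.norm(g) > omega) for g in gamma_groups)
```

Numerically, one can check the behavior described above: the threshold is positive but shrinks rapidly as the spike parameter $\lambda_0$ grows, so only genuinely large groups are counted.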

We first state the following regularity assumptions. We denote by $\Theta$ the parameter space for $(\gamma, \sigma^2, \rho)$. Let $S_0$ denote the set of indices of the true nonzero functions in (4.1), with cardinality $s_0 = |S_0|$. Let $\bar{n} = \max_{1 \leq i \leq n} n_i$ denote the maximum number of within-subject observations.

1. , ,