1 Introduction
Consider the nonparametric varying coefficient (NVC) model with $p$ covariates,

(1.1) $\quad y_i(t_{ij}) = \sum_{k=1}^{p} x_{ik}(t_{ij})\,\beta_k(t_{ij}) + \varepsilon_i(t_{ij}), \qquad i = 1, \ldots, n, \ j = 1, \ldots, m_i,$
where $y_i(t_{ij})$ is the response for the $i$th subject at time point $t_{ij} \in \mathcal{T}$, $\mathcal{T}$ is the time interval on which the different measurements are taken, $x_{ik}(t_{ij})$ is a time-dependent covariate with corresponding smooth coefficient function $\beta_k$, and $\varepsilon_i(t_{ij})$ is random error. Throughout this paper, we denote $N = \sum_{i=1}^{n} m_i$ as the total number of observations. To avoid the need for an intercept, we assume the within-subject responses have been centered. We also assume that the $\varepsilon_i$'s are independent, zero-mean Gaussian processes. That is, we assume that $\varepsilon_i = (\varepsilon_i(t_{i1}), \ldots, \varepsilon_i(t_{im_i}))^\top \sim \mathcal{N}_{m_i}(0, \Sigma_i)$, where $\Sigma_i$ is the $m_i \times m_i$ variance-covariance matrix that captures the temporal correlation between the responses $y_i(t_{i1}), \ldots, y_i(t_{im_i})$ for the $i$th subject.

NVC models (1.1) arise in many real applications. A prominent example is longitudinal data analysis, where we aim to model the response for the $i$th experimental subject at different time points [14]. NVC models can also be used for functional data analysis, where we wish to model smooth functional responses varying over a continuum [29]. See [12, 9] for examples of applications of these models. We note that while we model the response as a function of time in this manuscript, the model (1.1) can also be used more broadly to model functions of any "effect modifier" variable, such as spatial location, wavelength, etc. [12].
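To make the data-generating mechanism concrete, the following minimal sketch (not the authors' code) simulates responses from (1.1) for a small synthetic example, assuming, purely for illustration, AR(1)-correlated within-subject errors of the kind introduced in Section 2.2; all dimensions and coefficient functions are invented for this example.

```python
# Minimal sketch: simulate from the NVC model (1.1) with n subjects,
# p time-varying covariates, and AR(1) within-subject Gaussian errors.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 50, 4, 6          # subjects, covariates, observations per subject
sigma2, phi = 1.0, 0.5      # illustrative error variance and AR(1) parameter

# True smooth coefficient functions beta_k(t) on T = [0, 1]; two are zero.
betas = [lambda t: np.sin(2 * np.pi * t),
         lambda t: (t - 0.5) ** 2,
         lambda t: np.zeros_like(t),
         lambda t: np.zeros_like(t)]

t = np.sort(rng.uniform(0, 1, size=(n, m)), axis=1)   # observation times
X = rng.normal(size=(n, m, p))                        # covariates x_ik(t_ij)

# AR(1) correlation between the j-th and l-th observations of a subject.
Gamma = phi ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
L = np.linalg.cholesky(sigma2 * Gamma)

Y = np.zeros((n, m))
for i in range(n):
    signal = sum(X[i, :, k] * betas[k](t[i]) for k in range(p))
    Y[i] = signal + L @ rng.normal(size=m)            # correlated errors
Y -= Y.mean()                                         # center the responses
```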
There has been extensive frequentist work on fitting NVC models. Typical approaches to fitting (1.1) use kernel/local polynomial smoothing [8, 36] or basis expansions [16, 28] to estimate the $\beta_k$'s. When the number of covariates $p$ is large, one often also wants to impose a low-dimensional structure such as sparsity. In order to perform simultaneous function estimation and model selection, many authors have applied a penalty such as the group SCAD [7] or the group lasso [38] to the vectors of basis coefficients; see, e.g., [33, 34, 35]. These frequentist penalized NVC models do not account for the within-subject temporal correlations, essentially solving objective functions with $\Sigma_i = \sigma^2 I_{m_i}$. In low-dimensional settings and without regularizing the parameter space, Krafty et al. [20] incorporated estimation of within-subject correlations into NVC models. However, to the best of our knowledge, no similar extension has been made for high-dimensional, penalized NVC models. While [33, 34, 35] show that consistent estimation of the $\beta_k$'s and model selection consistency can still be achieved for penalized NVC models, failing to account for the error variances can nevertheless lead to invalid inferences and overfitting in finite samples [24, 26]. Thus, it seems prudent to model temporal dependence in NVC models.

While there are numerous theoretical results for frequentist NVC models, work on Bayesian NVC models has been primarily methodological. For example, Liu et al. [25] endow the smooth functions $\beta_k$ with a Gaussian process prior. Biller and Fahrmeir [3] and Huang et al. [17] use splines to model the $\beta_k$'s in (1.1) and place multivariate normal priors on the groups of basis coefficients. Li et al. [22] place a scale mixture of multivariate normal priors, known as the Bayesian group lasso prior, on groups of basis coefficients. Unlike the frequentist penalized approaches, [25, 22] explicitly model the temporal dependence of the within-subject measurements, either by including subject-specific random effects or by specifying a first-order autoregressive (AR(1)) covariance structure for the error terms $\varepsilon_i$. In spite of the benefits of being able to incorporate covariance structure into variable selection, existing Bayesian approaches to NVC models predominantly rely on Markov chain Monte Carlo (MCMC) to obtain posterior estimates of the $\beta_k$'s. In high dimensions, however, MCMC can be very slow and even computationally impractical. In addition, hardly anything is known about the theoretical properties of Bayesian NVC models.
In this paper, we take a step towards addressing these methodological, computational, and theoretical gaps by introducing the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL). Our main contributions are as follows:

We introduce a spike-and-slab approach for Bayesian estimation and variable selection in nonparametric varying coefficient models (1.1). Unlike penalized frequentist approaches, the NVC-SSL model explicitly accounts for temporal correlations by jointly estimating the unknown covariance structure for the time-varying responses. The NVC-SSL model also employs a non-separable Bayesian penalty which automatically self-adapts to ensemble information about sparsity.

We derive an expectation-maximization (EM) algorithm to rapidly select and estimate the nonzero smooth functional components, while thresholding out the insignificant ones. Our paper appears to be the first to bypass MCMC in its implementation of a Bayesian NVC model.

We provide the first theoretical results for Bayesian varying coefficient models when $p > N$. Specifically, we derive posterior contraction rates for the functional components when the number of time-dependent covariates $p$ grows at nearly exponential rate with the total sample size $N$.
The rest of this paper is structured as follows. In Section 2, we introduce the nonparametric varying coefficient spike-and-slab lasso. In Section 3, we derive a coordinate ascent algorithm for rapidly obtaining estimates of the smooth functionals $\beta_1, \ldots, \beta_p$ under the NVC-SSL. In Section 4, we provide asymptotic theory for the NVC model (1.1) under the NVC-SSL prior when $p > N$. In Section 5, we provide simulation studies of our method. Finally, in Section 6, we use our model to analyze a real data set.
1.1 Notation
We use the following notation for the rest of the paper. For two nonnegative sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \asymp b_n$ to denote $0 < \liminf_{n \to \infty} a_n / b_n \le \limsup_{n \to \infty} a_n / b_n < \infty$. If $\lim_{n \to \infty} a_n / b_n = 0$, we write $a_n = o(b_n)$ or $a_n \prec b_n$. We use $a_n \lesssim b_n$ or $a_n = O(b_n)$ to denote that for sufficiently large $n$, there exists a constant $C > 0$ independent of $n$ such that $a_n \le C b_n$. We write $a_n \gtrsim b_n$ to denote $b_n \lesssim a_n$.
For a vector $v = (v_1, \ldots, v_q)^\top$, we let $\|v\|_1 = \sum_{i=1}^{q} |v_i|$ and $\|v\|_2 = (\sum_{i=1}^{q} v_i^2)^{1/2}$ denote the $\ell_1$ and $\ell_2$ norms respectively. For a symmetric matrix $A$, we let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its minimum and maximum eigenvalues respectively. For a matrix $A$ with entries $a_{ij}$, $\|A\|_F = (\sum_i \sum_j a_{ij}^2)^{1/2}$ denotes its Frobenius norm, while $\|A\|_2 = \sup_{x \ne 0} \|Ax\|_2 / \|x\|_2$ denotes its spectral norm.

2 The Nonparametric Varying Coefficient Spike-and-Slab Lasso
2.1 Basis Expansion and the NVC-SSL
Following the development in [33, 34, 35], we suppose that each coefficient function $\beta_k$ in (1.1) can be approximated by $\widetilde{\beta}_k$, a linear combination of basis functions, i.e.

(2.1) $\quad \beta_k(t) \approx \widetilde{\beta}_k(t) = \sum_{l=1}^{d_k} \gamma_{kl} B_{kl}(t),$

where $B_{kl}$, $l = 1, \ldots, d_k$, are the basis functions. Then the model (1.1) can be approximated as

(2.2) $\quad y_i(t_{ij}) \approx \sum_{k=1}^{p} \sum_{l=1}^{d_k} \gamma_{kl}\, x_{ik}(t_{ij}) B_{kl}(t_{ij}) + \varepsilon_i(t_{ij}),$
for $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$. Let $Y = (Y_1^\top, \ldots, Y_n^\top)^\top$, with

(2.3) $\quad Y_i = \left(y_i(t_{i1}), \ldots, y_i(t_{im_i})\right)^\top, \qquad i = 1, \ldots, n.$

Further, we define $u_{ij}$ as

(2.4) $\quad u_{ij} = \left(x_{i1}(t_{ij})B_{11}(t_{ij}), \ldots, x_{i1}(t_{ij})B_{1d_1}(t_{ij}), \ldots, x_{ip}(t_{ij})B_{p1}(t_{ij}), \ldots, x_{ip}(t_{ij})B_{pd_p}(t_{ij})\right)^\top,$

and set $U = (U_1^\top, \ldots, U_n^\top)^\top$ with

(2.5) $\quad U_i = \left(u_{i1}, \ldots, u_{im_i}\right)^\top,$

for $i = 1, \ldots, n$, where $u_{ij}^\top$ denotes the row of $U_i$ corresponding to the $j$th observation for the $i$th subject.
Letting $\gamma = (\gamma_1^\top, \ldots, \gamma_p^\top)^\top$ and $\varepsilon = (\varepsilon_1^\top, \ldots, \varepsilon_n^\top)^\top$, the model (1.1) can then be expressed in matrix form as

(2.6) $\quad Y = U\gamma + \delta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}_N(0, \Sigma),$

where $\gamma_k = (\gamma_{k1}, \ldots, \gamma_{kd_k})^\top$ is the $d_k$-dimensional vector of basis coefficients corresponding to the $k$th covariate, $\Sigma = \mathrm{diag}(\Sigma_1, \ldots, \Sigma_n)$ is an $N \times N$ block diagonal matrix, and $\delta$ is an $N \times 1$ vector of lower-order bias, i.e. the approximation error from using truncated bases of dimension $d_k$ to approximate the $\beta_k$'s.
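The following sketch illustrates how the design matrix $U$ in (2.6) can be assembled from the basis expansion, assuming cubic B-splines evaluated with scipy; the helpers bspline_basis and build_U are our own illustrative names, not part of any package.

```python
# Sketch of the basis expansion (2.1)-(2.6) with cubic B-splines.
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(tt, d, degree=3):
    """Evaluate d B-spline basis functions at time points tt in [0, 1]."""
    n_inner = d - degree + 1                    # clamped knot construction
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0.0, 1.0, n_inner),
                            np.ones(degree)])
    # design_matrix returns a (len(tt), d) sparse matrix of basis values
    return BSpline.design_matrix(tt, knots, degree).toarray()

def build_U(X, t, d):
    """Stack rows u_ij = (x_i1 B_1(t_ij)', ..., x_ip B_p(t_ij)') as in (2.4)."""
    n, m, p = X.shape
    U = np.zeros((n * m, p * d))
    for i in range(n):
        B = bspline_basis(t[i], d)              # (m, d) basis values
        for k in range(p):
            U[i * m:(i + 1) * m, k * d:(k + 1) * d] = X[i, :, k, None] * B
    return U
```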
2.2 Model Formulation
For the NVC model (1.1), we assume that the within-subject covariance matrices have the structure $\Sigma_i = \sigma^2 \Gamma_i(\phi)$, where $\Gamma_i(\phi)$ denotes an $m_i \times m_i$ correlation matrix determined by a single parameter $\phi$. That is, we suppose that for $i = 1, \ldots, n$,

(2.7) $\quad \varepsilon_i \sim \mathcal{N}_{m_i}\left(0, \sigma^2 \Gamma_i(\phi)\right).$

This general form subsumes many popular choices of covariance structure. For example, if we assume a first-order autoregressive (AR(1)) structure, then the $(j, l)$th element of $\Gamma_i(\phi)$ is $\phi^{|j - l|}$. For compound symmetry (CS), the $(j, l)$th element of $\Gamma_i(\phi)$ is $\phi$ if $j \ne l$ and one otherwise. For concreteness, we focus only on the AR(1) and CS structures in this paper, noting that our model can be generalized to more exotic correlation structures. This assumption about the structure of the $\Sigma_i$'s makes our estimation procedure more computationally efficient and less prone to overfitting, as the problem of estimating $\sum_{i=1}^{n} m_i(m_i + 1)/2$ unknowns (the number of diagonal and unique off-diagonal entries in the $\Sigma_i$'s) reduces to just estimating the two unknowns $(\sigma^2, \phi)$.
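As an illustration, the two working correlation structures can be generated as follows; ar1_corr and cs_corr are hypothetical names used only for this sketch.

```python
# Sketch of the two working correlation structures in (2.7).
import numpy as np

def ar1_corr(m, phi):
    """AR(1): (j, l)th entry is phi^{|j - l|}."""
    idx = np.arange(m)
    return phi ** np.abs(np.subtract.outer(idx, idx))

def cs_corr(m, phi):
    """Compound symmetry: phi off the diagonal, one on the diagonal."""
    return np.full((m, m), phi) + (1.0 - phi) * np.eye(m)

Sigma_i = 2.0 * ar1_corr(5, 0.6)   # Sigma_i = sigma^2 * Gamma_i(phi)
```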
Under the NVC-SSL model, we endow the vector of basis coefficients $\gamma = (\gamma_1^\top, \ldots, \gamma_p^\top)^\top$ in (2.6) with the spike-and-slab group lasso (SSGL) prior of [2],

(2.8) $\quad \pi(\gamma \mid \theta) = \prod_{k=1}^{p} \left[(1 - \theta)\, \Psi(\gamma_k \mid \lambda_0) + \theta\, \Psi(\gamma_k \mid \lambda_1)\right],$

where $\theta \in (0, 1)$ is a mixing proportion, or the expected proportion of nonzero $\gamma_k$'s, and $\Psi(\gamma_k \mid \lambda) \propto \lambda^{d_k} e^{-\lambda \|\gamma_k\|_2}$ denotes the group lasso density indexed by hyperparameter $\lambda$. The group lasso prior has been considered by several other authors [2, 25, 21, 37] and can be derived as the marginal density of a multivariate normal scale mixture, $\gamma_k \mid \tau_k^2 \sim \mathcal{N}_{d_k}(0, \tau_k^2 I_{d_k})$, $\tau_k^2 \sim \mathcal{G}\left((d_k + 1)/2, \lambda^2/2\right)$.
The SSGL prior (2.8), which we denote as $\mathrm{SSGL}(\lambda_0, \lambda_1, \theta)$ going forward, can be considered a two-group refinement of the group lasso [38]. Under the prior (2.8), the global posterior mode for $\gamma$ may be exactly sparse, thereby allowing the NVC-SSL to perform joint estimation and variable selection [2]. In the present context, if the posterior mode satisfies $\widehat{\gamma}_k = 0_{d_k}$, then the $k$th functional component will be estimated as $\widehat{\beta}_k \equiv 0$ and thus thresholded out of the model. We typically set $\lambda_0 \gg \lambda_1$ in (2.8), so that the first mixture component (the spike) is heavily concentrated near $0_{d_k}$ for each $\gamma_k$, while the slab stabilizes the posterior estimates of large coefficients, preventing them from being downward biased.
To model the uncertainty in $\theta$ in (2.8), we endow $\theta$ with a beta prior,

(2.9) $\quad \theta \sim \mathcal{B}(a, b),$

where $a$ and $b$ are fixed positive constants. Unlike frequentist penalties such as the group SCAD or the group lasso, this prior on $\theta$ ultimately renders our Bayesian penalty non-separable. The non-separability provides several benefits. First, the prior on $\theta$ allows the penalty to self-adapt to the true underlying sparsity. Second, with appropriate choices for the hyperparameters in $\mathcal{B}(a, b)$, namely $a = 1$ and $b = p$, our prior performs an automatic multiplicity adjustment [32] and favors parsimonious models in high dimensions.
To complete the model specification, we place independent priors on the parameters $(\sigma^2, \phi)$ in (2.7) as

(2.10) $\quad \sigma^2 \sim \mathcal{IG}(c_0, d_0),$

where $c_0$ and $d_0$ are small positive constants, and

(2.11) $\quad \phi \sim \mathrm{DU}\left(\{\phi_1, \ldots, \phi_v\}\right).$

That is, $\phi$ follows a discrete uniform distribution with $v$ atoms, where $\pi(\phi = \phi_l) = 1/v$ for $l = 1, \ldots, v$. In our implementation, we specify the support for $\phi$ as an evenly spaced grid of values in $[0, 1)$. As we describe in Section 3, placing a discrete uniform prior on $\phi$ facilitates more efficient computations (from an optimization perspective) than a continuous prior with bounded support.

2.3 Determining Which Covariance Structure to Use
Our model requires the user to specify the error covariance structure (AR(1) or CS). In order to determine which one to use, we can first obtain the residuals from a regression fit (e.g. a regression with the standard group lasso). Then we can construct empirical variogram plots, time plots, or scatterplot matrices of the residuals to give us an idea of the underlying error covariance structure [5].
In particular, the empirical variogram of the residuals (see Chapter 5 of [5]) is widely used in practice to determine the appropriate covariance structure to use for modeling longitudinal and spatial data. If the residual variogram shows decaying correlation with distance, then we may specify the AR(1) structure for the NVCSSL model. On the other hand, if the residual variogram suggests equicorrelation, then we may specify the CS structure.
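A rough sketch of the empirical variogram computation is given below, assuming residuals and observation times are stored as $(n, m)$ arrays; it follows the general recipe of half squared residual differences against time lag (see [5]) rather than any specific implementation. An increasing variogram suggests AR(1)-type decaying correlation; a flat variogram suggests equicorrelation (CS).

```python
# Sketch: empirical residual variogram, binned by within-subject time lag.
import numpy as np

def empirical_variogram(resid, t, n_bins=10):
    """resid, t: (n, m) arrays of residuals and observation times."""
    lags, v = [], []
    for r_i, t_i in zip(resid, t):
        m = len(t_i)
        for j in range(m):
            for l in range(j + 1, m):
                lags.append(abs(t_i[l] - t_i[j]))
                v.append(0.5 * (r_i[l] - r_i[j]) ** 2)   # half sq. diff.
    lags, v = np.array(lags), np.array(v)
    bins = np.linspace(0.0, lags.max(), n_bins + 1)
    which = np.digitize(lags, bins) - 1
    centers = 0.5 * (bins[:-1] + bins[1:])
    gamma = np.array([v[which == b].mean() if np.any(which == b) else np.nan
                      for b in range(n_bins)])
    return centers, gamma    # plot gamma vs. centers to judge the structure
```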
3 Computational Strategy
3.1 Posterior Mode Estimation
We now detail how to implement the NVC-SSL model. Rather than relying on MCMC, we target the maximum a posteriori (MAP) estimate $\widehat{\gamma}$ for the basis coefficients. We may then take as our estimates for the smooth functionals $\widehat{\beta}_k(t) = \sum_{l=1}^{d_k} \widehat{\gamma}_{kl} B_{kl}(t)$, $k = 1, \ldots, p$. For simplicity, we let the basis truncation parameters satisfy $d_1 = \cdots = d_p = d$ for some positive integer $d$. In Section 3.3, we describe how to select $d$.
Let $\Xi$ denote the collection $(\gamma, \theta, \sigma^2, \phi)$. The log-posterior density for $\Xi$ (up to an additive constant) is given by

(3.1) $\quad \log \pi(\Xi \mid Y) = -\frac{N}{2}\log \sigma^2 - \frac{1}{2}\log|\Gamma(\phi)| - \frac{1}{2\sigma^2}(Y - U\gamma)^\top \Gamma(\phi)^{-1}(Y - U\gamma) + \sum_{k=1}^{p} \log\left[(1 - \theta)\Psi(\gamma_k \mid \lambda_0) + \theta\Psi(\gamma_k \mid \lambda_1)\right] + (a - 1)\log\theta + (b - 1)\log(1 - \theta) - (c_0 + 1)\log\sigma^2 - \frac{d_0}{\sigma^2} + \log\pi(\phi),$

where $\Gamma(\phi) = \mathrm{diag}\left(\Gamma_1(\phi), \ldots, \Gamma_n(\phi)\right)$.
Our objective is to maximize the log-posterior with respect to $\Xi$. We first introduce latent 0-1 indicators, $z = (z_1, \ldots, z_p)$, with $z_k \mid \theta \overset{iid}{\sim} \mathrm{Bernoulli}(\theta)$. Then we reparametrize the prior (2.8) as $\pi(\gamma_k \mid z_k) = \Psi(\gamma_k \mid \lambda_1)^{z_k}\, \Psi(\gamma_k \mid \lambda_0)^{1 - z_k}$.
Let $\widetilde{\Xi} = (\gamma, \theta, \sigma^2, \phi, z)$. The augmented log-posterior density for $\widetilde{\Xi}$ (up to an additive constant) is now given by

(3.2) $\quad \log \pi(\widetilde{\Xi} \mid Y) = -\frac{N}{2}\log\sigma^2 - \frac{1}{2}\log|\Gamma(\phi)| - \frac{1}{2\sigma^2}(Y - U\gamma)^\top \Gamma(\phi)^{-1}(Y - U\gamma) + \sum_{k=1}^{p}\left[z_k \log \Psi(\gamma_k \mid \lambda_1) + (1 - z_k)\log \Psi(\gamma_k \mid \lambda_0)\right] + \left(a - 1 + \sum_{k=1}^{p} z_k\right)\log\theta + \left(b - 1 + p - \sum_{k=1}^{p} z_k\right)\log(1 - \theta) - (c_0 + 1)\log\sigma^2 - \frac{d_0}{\sigma^2} + \log\pi(\phi).$

It is straightforward to verify that $\mathbb{E}[z_k \mid \gamma_k, \theta] = p^\star(\gamma_k, \theta)$, where

(3.3) $\quad p^\star(\gamma_k, \theta) = \frac{\theta\, \Psi(\gamma_k \mid \lambda_1)}{\theta\, \Psi(\gamma_k \mid \lambda_1) + (1 - \theta)\, \Psi(\gamma_k \mid \lambda_0)}$

is the conditional posterior probability that $\gamma_k$ is drawn from the slab distribution rather than from the spike.

With the augmented log-posterior (3.2), we may now implement an EM algorithm to find $\widehat{\Xi}$. After initializing the parameters $\Xi^{(0)}$, we iterate between the E-step and M-step until convergence. For the E-step at iteration $t$, we compute $p_k^\star = p^\star(\gamma_k^{(t-1)}, \theta^{(t-1)})$, given the previous estimate $\Xi^{(t-1)}$. For the M-step, we then maximize the following objective function with respect to $\Xi$:
(3.4) $\quad F(\Xi) = -\frac{N}{2}\log\sigma^2 - \frac{1}{2}\log|\Gamma(\phi)| - \frac{1}{2\sigma^2}(Y - U\gamma)^\top \Gamma(\phi)^{-1}(Y - U\gamma) - \sum_{k=1}^{p} \lambda_k^\star \|\gamma_k\|_2 + \left(a - 1 + \sum_{k=1}^{p} p_k^\star\right)\log\theta + \left(b - 1 + p - \sum_{k=1}^{p} p_k^\star\right)\log(1 - \theta) - (c_0 + 1)\log\sigma^2 - \frac{d_0}{\sigma^2} + \log\pi(\phi),$

where $\lambda_k^\star = \lambda_1 p_k^\star + \lambda_0 (1 - p_k^\star)$. The function (3.4) would be difficult to jointly maximize with respect to $(\gamma, \theta, \sigma^2, \phi)$ if $\pi(\phi)$ were a continuous density. However, by endowing $\phi$ with the discrete uniform prior (2.11), our optimization is much simpler. With the prior (2.11), we may fix $\phi = \phi_l$ and maximize (3.4) with respect to $(\gamma, \theta, \sigma^2)$ for each atom. In our implementation, each of these optimizations is performed in parallel, and the log-posterior is evaluated for each $\phi_l$, $l = 1, \ldots, v$. We take as our modal estimate the $\widehat{\Xi}$ which maximizes the original non-augmented log-posterior (3.1).
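For concreteness, the E-step quantity (3.3) can be computed on the log scale for numerical stability, as in the following sketch (illustrative, not the authors' code); the normalizing constants of the two group lasso densities cancel except for the $\lambda^{d_k}$ factors.

```python
# Sketch of the E-step probability p*_k in (3.3), with
# Psi(gamma_k | lam) proportional to lam^{d_k} * exp(-lam * ||gamma_k||_2).
import numpy as np

def p_star(gamma_k, theta, lam0, lam1):
    d_k = gamma_k.size
    norm = np.linalg.norm(gamma_k)
    log_slab = np.log(theta) + d_k * np.log(lam1) - lam1 * norm
    log_spike = np.log(1.0 - theta) + d_k * np.log(lam0) - lam0 * norm
    # p* = slab / (slab + spike), evaluated stably on the log scale
    return 1.0 / (1.0 + np.exp(log_spike - log_slab))
```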
For either fixed or random $\phi$, it is clear from (3.4) that $\theta$ has the following closed-form update in the M-step:

(3.5) $\quad \theta^{(t)} = \frac{a - 1 + \sum_{k=1}^{p} p_k^\star}{a + b + p - 2}.$
Next, we update $\gamma$, holding $(\theta, \sigma^2, \phi)$ fixed. Let $\widetilde{Y} = \Gamma(\phi)^{-1/2} Y$ and $\widetilde{U} = \Gamma(\phi)^{-1/2} U$. To update $\gamma$, we solve the following optimization:

(3.6) $\quad \gamma^{(t)} = \underset{\gamma}{\arg\min} \left\{ \frac{1}{2\sigma^2} \left\|\widetilde{Y} - \widetilde{U}\gamma\right\|_2^2 + \sum_{k=1}^{p} \lambda_k^\star \|\gamma_k\|_2 \right\}.$

Note that (3.6) is an adaptive group lasso problem with weights $\lambda_k^\star$, and it explicitly takes temporal correlation into account (through $\Gamma(\phi)^{-1/2}$) in our estimation procedure for $\gamma$. This optimization can be solved with any standard (adaptive) group lasso algorithm [38, 13].
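As one concrete (but by no means unique) choice of solver, the following proximal gradient sketch applies group-wise soft thresholding to the whitened problem (3.6); update_gamma and group_soft_threshold are illustrative names, and any off-the-shelf group lasso routine could be substituted.

```python
# Sketch: proximal gradient descent for the adaptive group lasso (3.6).
import numpy as np

def group_soft_threshold(v, c):
    """Group-wise soft thresholding: shrink ||v||_2 by c, or zero out."""
    nv = np.linalg.norm(v)
    return np.zeros_like(v) if nv <= c else (1.0 - c / nv) * v

def update_gamma(Yt, Ut, gamma, lam_star, sigma2, d, n_iter=200):
    """Yt = Gamma(phi)^{-1/2} Y and Ut = Gamma(phi)^{-1/2} U (whitened)."""
    p = len(lam_star)
    step = sigma2 / np.linalg.norm(Ut, 2) ** 2       # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = -Ut.T @ (Yt - Ut @ gamma) / sigma2    # gradient of smooth part
        z = gamma - step * grad
        for k in range(p):                           # proximal step per group
            sl = slice(k * d, (k + 1) * d)
            gamma[sl] = group_soft_threshold(z[sl], step * lam_star[k])
    return gamma
```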
Finally, holding $(\gamma, \theta, \phi)$ fixed, we update $\sigma^2$, which has the following closed form:

(3.7) $\quad \sigma^{2(t)} = \frac{\left(Y - U\gamma^{(t)}\right)^\top \Gamma(\phi)^{-1}\left(Y - U\gamma^{(t)}\right) + 2 d_0}{N + 2 c_0 + 2}.$
In order to obtain $\widetilde{Y}$ and $\widetilde{U}$ and evaluate the log-posterior (3.1) for each atom $\phi_l$, $l = 1, \ldots, v$, in (2.11), we must invert $\Gamma(\phi_l)$, which is an $N \times N$ matrix. However, by exploiting the block structure of $\Gamma(\phi_l)$, we only need to invert the individual correlation matrices $\Gamma_i(\phi_l)$, $i = 1, \ldots, n$, incurring a total computational complexity of $O(\sum_{i=1}^{n} m_i^3)$. Similarly, evaluating the log-determinant $\log|\Gamma(\phi_l)| = \sum_{i=1}^{n} \log|\Gamma_i(\phi_l)|$ requires $O(\sum_{i=1}^{n} m_i^3)$ operations. In high dimensions, $\max_i m_i$ is typically smaller than both $n$ and $p$, so performing these operations is not particularly costly. To further improve computational efficiency, we compute the inverses and log-determinants of the $\Gamma_i(\phi_l)$'s in parallel. Letting $Y_i$ and $U_i$ denote the subvector and submatrix of $Y$ and $U$ corresponding to the $i$th subject respectively, we also compute $\Gamma_i(\phi_l)^{-1/2} Y_i$ and $\Gamma_i(\phi_l)^{-1/2} U_i$ in parallel and then combine them into a single $\widetilde{Y}$ and $\widetilde{U}$ respectively.
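A sketch of this block-wise computation is given below, assuming the per-subject blocks are stored as lists of arrays; the Cholesky factor of each $\Gamma_i(\phi_l)$ supplies both a whitening transform and the log-determinant, and the loop body could be parallelized across subjects.

```python
# Sketch: block-wise whitening via per-subject Cholesky factors,
# avoiding any N x N inversion.
import numpy as np
from scipy.linalg import solve_triangular

def whiten_by_subject(Y_blocks, U_blocks, Gamma_blocks):
    """Return Yt, Ut with Yt_i = Gamma_i^{-1/2} Y_i, plus log|Gamma|."""
    Yt, Ut, logdet = [], [], 0.0
    for Y_i, U_i, G_i in zip(Y_blocks, U_blocks, Gamma_blocks):
        L = np.linalg.cholesky(G_i)                 # G_i = L L', O(m_i^3)
        Yt.append(solve_triangular(L, Y_i, lower=True))   # L^{-1} Y_i
        Ut.append(solve_triangular(L, U_i, lower=True))   # L^{-1} U_i
        logdet += 2.0 * np.sum(np.log(np.diag(L)))  # log|G_i|
    return np.concatenate(Yt), np.vstack(Ut), logdet
```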
3.2 Initialization and Dynamic Posterior Exploration
The posterior distribution under the NVC-SSL prior will typically be multimodal when $p > n$ and $\lambda_0 \gg \lambda_1$, and thus, any MAP-finding algorithm is prone to becoming entrapped at a suboptimal local mode. To partially mitigate this, we initialize our EM algorithm at $\gamma^{(0)} = \widehat{\gamma}_{GL}$, where $\widehat{\gamma}_{GL}$ is the group lasso solution without accounting for any temporal correlations. That is,

$\quad \widehat{\gamma}_{GL} = \underset{\gamma}{\arg\min} \left\{ \frac{1}{2}\|Y - U\gamma\|_2^2 + \lambda \sum_{k=1}^{p} \|\gamma_k\|_2 \right\},$

where $\lambda$ is chosen by tenfold cross-validation. The initial variance $\sigma^{2(0)}$ is taken to be the residual variance from this group lasso fit. In our experience, this initialization works well in ensuring rapid convergence to a fairly good solution, even though we are not guaranteed to find the global mode. Even if we were to implement our model using MCMC, the model with the highest posterior probability may only be visited a small number of times or may never be visited at all (see, e.g., Section 4.3 of [30]).
In addition to choosing a good initialization, we employ dynamic posterior exploration to increase the chances of finding more optimal modes [31, 30, 2]. We fix $\lambda_1$ to be a small constant so that the slab density has considerable spread. Having a diffuse slab allows vectors with large coefficients to escape the pull of the spike. If we also set $\lambda_0$ to be small, then the posterior will be relatively flat. However, with both a diffuse spike and a diffuse slab density, many negligible $\gamma_k$'s will tend to be selected in our model. To eliminate these suboptimal non-sparse modes, we gradually increase $\lambda_0$ along a ladder of increasing values, $\lambda_0^{(1)} < \lambda_0^{(2)} < \cdots < \lambda_0^{(L)}$. For each $\lambda_0^{(l)}$ in the ladder, we reinitialize $\gamma$ using the MAP estimate from the previous spike parameter $\lambda_0^{(l-1)}$ as a "warm start." As we increase $\lambda_0$, the posterior becomes "spikier," with the spikes absorbing more and more negligible parameter estimates. For large enough $\lambda_0$, the algorithm eventually stabilizes, so that further increases in $\lambda_0$ do not change the solution.
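Schematically, dynamic posterior exploration amounts to the following warm-start loop, where run_em stands in for one run of the EM algorithm of Section 3.1 (a placeholder, not a real API) and the ladder values are purely illustrative:

```python
# Sketch: dynamic posterior exploration over an increasing lambda_0 ladder.
def dynamic_posterior_exploration(run_em, gamma_init, lambda1=1.0,
                                  ladder=(1, 5, 10, 50, 100, 500)):
    gamma, path = gamma_init, []
    for lambda0 in ladder:
        # warm start: reinitialize at the previous mode
        gamma = run_em(gamma_start=gamma, lambda0=lambda0, lambda1=lambda1)
        path.append(gamma)   # solutions stabilize once lambda0 is large
    return gamma, path
```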
The complete algorithm for the NVC-SSL model is given in Algorithm 1. Let $t = (t_{11}, \ldots, t_{nm_n})^\top$ be the vector of all observation times for all subjects. Once we have obtained the final estimate $\widehat{\gamma}$, we obtain the estimates for the smooth functionals as $\widehat{\beta}_k = B_k \widehat{\gamma}_k$, $k = 1, \ldots, p$, where $B_k$ is the matrix of the basis functions $B_{k1}, \ldots, B_{kd}$ evaluated at all time points in $t$. We reiterate that Step 3 of the algorithm is also computed in parallel for each atom $\phi_l$ in order to accelerate computing time.
3.3 Selection of Degrees of Freedom
By utilizing dynamic posterior exploration, we do not need to tune the hyperparameter $\lambda_0$. However, we still need to choose the degrees of freedom $d$ (i.e. the number of basis functions to use). To do this, we use the Akaike information criterion with a correction for small sample sizes (AIC$_c$) [18]. This correction ensures that if the sample size is small, the AIC$_c$ will be reluctant to overfit. Let $\widehat{S}$ denote the indices of the estimated nonzero subvectors of $\widehat{\gamma}$, with cardinality $\widehat{s} = |\widehat{S}|$. In this context, the AIC$_c$ is defined as

(3.8) $\quad \mathrm{AIC}_c = \log\left(\frac{(Y - U\widehat{\gamma})^\top \Gamma(\widehat{\phi})^{-1}(Y - U\widehat{\gamma})}{N}\right) + \frac{N + \widehat{s}d}{N - \widehat{s}d - 2},$

where $(\widehat{\gamma}, \widehat{\phi})$ are the modal estimates under the NVC-SSL prior. Note that if $\Gamma(\phi)$ were known, then the first term in (3.8) would be the logarithm of the maximum likelihood estimate (MLE) for $\sigma^2$. We select the $d$ which minimizes the AIC$_c$ from a reasonable range of values.
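As a sketch, the criterion (3.8) as reconstructed above can be computed as follows for a candidate $d$; aic_c is an illustrative name, and the inputs are the modal estimates obtained from a fit with that $d$.

```python
# Sketch: small-sample corrected AIC (3.8) for a candidate degrees of
# freedom d; s_hat is the number of selected (nonzero) groups.
import numpy as np

def aic_c(Y, U, gamma_hat, Gamma_inv, s_hat, d):
    N = Y.size
    r = Y - U @ gamma_hat
    sigma2_mle = r @ (Gamma_inv @ r) / N   # MLE of sigma^2 given Gamma(phi)
    k = s_hat * d                          # effective number of parameters
    return np.log(sigma2_mle) + (N + k) / (N - k - 2)

# Refit the model over a grid of d values and keep the d minimizing aic_c.
```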
Note that in our numerical studies, we also found that simply fixing $d$ to be sufficiently large gave excellent performance under the NVC-SSL model; further tuning of $d$ using the AIC$_c$ provided only modest improvements. This could possibly be attributed to the fact that the dynamic posterior exploration strategy from Section 3.2 already eliminates many spurious variables. In our simulations in Section 5, we use the AIC$_c$ to select $d$ in order to make our comparisons with competing frequentist methods more transparent (where $d$ was also similarly tuned for those methods). However, in practice, fitting the model with a large enough $d$ works quite well.
4 Asymptotic Theory for the NVC-SSL
In this section, we prove several asymptotic properties of the NVC-SSL model. To the best of our knowledge, these are the first theoretical results for Bayesian NVC models when $p > N$. We assume that there is a "true" model,

(4.1) $\quad y_i(t_{ij}) = \sum_{k=1}^{p} x_{ik}(t_{ij})\, \beta_{0k}(t_{ij}) + \varepsilon_i(t_{ij}), \qquad i = 1, \ldots, n, \ j = 1, \ldots, m_i,$
where $\varepsilon_i \sim \mathcal{N}_{m_i}\left(0, \sigma_0^2 \Gamma_i(\phi_0)\right)$ for fixed $\sigma_0^2$ and $\phi_0$. We let $\mathbb{P}_0$ denote the probability measure underlying the true model (4.1). As before, suppose that each $\beta_{0k}$ in (4.1) can be approximated by a linear combination of basis functions,

(4.2) $\quad \beta_{0k}(t) \approx \sum_{l=1}^{d_k} \gamma_{0kl} B_{kl}(t),$

where $\{B_{kl}\}$ is a given basis system. Throughout this section, we assume that the basis functions are B-splines, as they are widely used in practice and the theory for B-splines is quite well-established. For simplicity, we also assume that $d_1 = \cdots = d_p = d$, noting that this can be relaxed. Our theory will continue to hold if we allow different vector sizes $d_k$, provided that we add the additional constraint that the $d_k$'s grow at the same rate, as in [16]. Our results would then be expressed in terms of $\max_k d_k$ instead of $d$ (see, e.g., Corollary 1 of [35]). For each $k$, let the approximation error be given by

$\quad \delta_k(t) = \beta_{0k}(t) - \sum_{l=1}^{d} \gamma_{0kl} B_{kl}(t),$
and so model (4.1) can be written as

(4.3) $\quad y_i(t_{ij}) = \sum_{k=1}^{p} \sum_{l=1}^{d} \gamma_{0kl}\, x_{ik}(t_{ij}) B_{kl}(t_{ij}) + \sum_{k=1}^{p} x_{ik}(t_{ij})\, \delta_k(t_{ij}) + \varepsilon_i(t_{ij}).$
Let $\delta$ be an $N \times 1$ vector with entries $\delta_{ij} = \sum_{k=1}^{p} x_{ik}(t_{ij}) \delta_k(t_{ij})$. In matrix form, (4.1) can be expressed as

(4.4) $\quad Y = U\gamma_0 + \delta + \varepsilon,$

where $U$ is defined as in (2.5), $\gamma_0 = (\gamma_{01}^\top, \ldots, \gamma_{0p}^\top)^\top$, where $\gamma_{0k}$ is the $d$-dimensional vector of the true basis coefficients corresponding to the $k$th covariate, and $\varepsilon \sim \mathcal{N}_N\left(0, \sigma_0^2 \Gamma(\phi_0)\right)$.
For our theoretical study of the NVC-SSL model, we endow $(\gamma, \theta, \sigma^2, \phi)$ in (4.4) with the prior,

(4.5) $\quad \gamma \mid \theta \sim \mathrm{SSGL}(\lambda_0, \lambda_1, \theta), \quad \theta \sim \mathcal{B}(a, b), \quad \sigma^2 \sim \mathcal{IG}(c_0, d_0), \quad \phi \sim \pi(\phi),$

where $\pi(\phi)$ is a continuous density with bounded support. Our theory builds upon recent work by [2], who studied posterior contraction rates for additive regression under the SSGL prior (2.8), and [19], who introduced a unified theoretical framework for sparse Bayesian regression models. Our work differs from [2] in that we no longer assume homoscedastic, uncorrelated errors, i.e. $\Sigma = \sigma^2 I_N$. Our work also differs from [19] in that we use a continuous spike-and-slab prior that penalizes groups of coefficients, whereas [19] employ a point-mass spike-and-slab prior with a univariate Laplace density as the slab to penalize individual coefficients.
Remark 1.
For practical implementation of the NVC-SSL model, we endowed $\phi$ with a discrete uniform prior. The prior we study here, i.e. a continuous prior on $\phi$, can be obtained as the limiting case when the number of atoms $v$ in (2.11) tends to infinity. Our theoretical results can also be shown to hold for the discrete uniform prior by making a few minor modifications to our proofs.
4.1 Dimensionality Recovery for the NVC-SSL Model
We first begin with a result on dimensionality. Under our formulation (4.1), determining the number of nonzero functions $\beta_{0k}$ is equivalent to determining the number of $d$-dimensional vectors $\gamma_{0k}$ such that $\gamma_{0k} \ne 0_d$. Since we used a continuous spike-and-slab prior in our prior formulation (4.5), our model assigns zero mass to exactly sparse vectors $\gamma$. To approximate the model size under the NVC-SSL model, we use the following generalized notion of sparsity [2]. For a small constant $\omega > 0$ which depends on $(\lambda_0, \lambda_1, \theta)$, we define the generalized inclusion indicator and generalized dimensionality, respectively, as

(4.6) $\quad \xi(\gamma_k) = \mathbb{1}\left(\|\gamma_k\|_2 > \omega\right) \quad \text{and} \quad |\xi(\gamma)| = \sum_{k=1}^{p} \xi(\gamma_k).$

For the threshold $\omega$, we use the following:

(4.7) $\quad \omega = \frac{1}{\lambda_0 - \lambda_1} \log\left[\frac{1 - \theta}{\theta}\left(\frac{\lambda_0}{\lambda_1}\right)^{d}\right].$

As noted in [2], any $d$-dimensional vectors $\gamma_k$ satisfying $\|\gamma_k\|_2 = \omega$ represent the intersection points between the spike and the slab densities in the prior. For large $\lambda_0$, the threshold $\omega$ rapidly approaches zero as $\lambda_0$ increases, so that $|\xi(\gamma)|$ provides a good approximation to the true model size.
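For illustration, the generalized dimensionality (4.6) with the threshold (4.7) as reconstructed above can be computed as in the following sketch (illustrative function name, equal group sizes $d$):

```python
# Sketch: generalized dimensionality |xi(gamma)| from (4.6)-(4.7).
import numpy as np

def gen_dimensionality(gamma, d, theta, lam0, lam1):
    """gamma: concatenated (p*d,) coefficient vector; returns |xi(gamma)|."""
    # omega: spike/slab intersection point from (4.7)
    omega = (1.0 / (lam0 - lam1)) * np.log(((1.0 - theta) / theta)
                                           * (lam0 / lam1) ** d)
    groups = gamma.reshape(-1, d)          # one row per group gamma_k
    return int(np.sum(np.linalg.norm(groups, axis=1) > omega))
```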
We first state the following regularity assumptions. We denote $\Theta$ as the parameter space for $(\gamma, \sigma^2, \phi)$. Let $S_0$ denote the set of indices of the true nonzero functions in (4.1), with cardinality $s_0 = |S_0|$. Let $m = \max_{1 \le i \le n} m_i$ denote the maximum number of within-subject observations.
