Bayesian Modular and Multiscale Regression

09/16/2018
by Michele Peruzzi, et al.

We tackle the problem of multiscale regression for predictors that are spatially or temporally indexed, or with a pre-specified multiscale structure, with a Bayesian modular approach. The regression function at the finest scale is expressed as an additive expansion of coarse to fine step functions. Our Modular and Multiscale (M&M) methodology provides multiscale decomposition of high-dimensional data arising from very fine measurements. Unlike more complex methods for functional predictors, our approach provides easy interpretation of the results. Additionally, it provides a quantification of uncertainty on the data resolution, solving a common problem researchers encounter with simple models on down-sampled data. We show that our modular and multiscale posterior has an empirical Bayes interpretation, with a simple limiting distribution in large samples. An efficient sampling algorithm is developed for posterior computation, and the methods are illustrated through simulation studies and an application to brain image classification. Source code is available as an R package at https://github.com/mkln/bmms.


1 Introduction

Modern researchers routinely collect very high-dimensional data that are spatially and/or temporally indexed, with the intention of using them as inputs in regression-type problems. A simple example with a binary output is time-series classification; analogously, brain images can be used as diagnostic tools. In these cases, prediction of the outcome variable and interpretation of the results are the main goals, but obtaining clear interpretation and accurate prediction simultaneously is notoriously difficult. In fact, adjacent measurement locations tend to be highly correlated, with possibly huge numbers of measurements at a very high resolution.

In these settings, directly inputting such data into usual regression methods leads to poor results. In fact, methods for dimensionality reduction that do not take advantage of predictor structure can perform poorly in estimating regression coefficients and sparsity patterns when the dimension is huge and the predictors are highly correlated (see e.g. Figure 1). Theoretical guarantees in such settings typically rely on strong assumptions of sparsity, low linear dependence, and high signal-to-noise ratio (irrepresentable; wasserman; Buhlmann11; ieee17).

Figure 1: Brain regions parcellated according to gordon333 and selected in a Lasso model for a gender classification task.

The above problems can be alleviated by down-sampling the data to lower resolutions before analysis. This is an appealing option because of the potential for huge dimensionality reduction. However, any specific resolution choice might be perceived as ad hoc, and may hide patterns at different scales that a multiresolution approach could instead highlight. This problem cannot be solved by methods that partially account for predictor structure but act on a single measurement scale, such as the group Lasso of glasso, the fused Lasso of fused, the Bayesian method of LiZhang10, and data-driven projection approaches such as PCR (pls_functionaldata).

Alternatively, one can use methods for “functional predictors” (ramsaysilverman; morris; reiss), which view time- or spatially-indexed predictors as realizations of a function. Typically this involves estimating a time- or spatially-indexed coefficient function, which is assumed to be representable as a linear combination of basis functions. Multiscale interpretations, in particular, are achievable via wavelet bases. However, a discrete wavelet transform of the data will not reduce its dimensionality, making down-sampling a necessary pre-processing step for the estimation of Bayesian wavelet regression (see e.g. vannucci). Additionally, wavelets require the basis functions to be orthogonal; this brings benefits in terms of identifiability and performance in estimating the individual coefficients, but also disadvantages relative to “over-complete” specifications. In fact, wavelets cannot be used when there is uncertainty about multiple pre-specified resolutions that do not conform to a wavelet decomposition (e.g. an hour/day/week structure for time series). The same applies to ongoing research in neuroscience devoted to the development of parcellation schemes that achieve reproducible and interpretable results (neuro_whichfmri; neuro_parcellations). Researchers might be interested in using these scales jointly in a regression model to ascertain their relative contribution, but this cannot be achieved with wavelets. For classical references on wavelet regression, see donohojohnstone94; donohojohnstone95; lasso_wavelet; from a Bayesian perspective, see e.g. vannucci or jeong_vannucci_ko.

To solve these issues, we propose a class of Bayesian Modular and Multiscale regression methods (BM&Ms), which express the regression function as an additive expansion of functions of the data at increasing resolutions. In its simplest form, the regression function becomes an additive expansion of coarse to fine step functions. This implies that multiple down-samplings of the predictor are included within a single flexible multiresolution model. Our approach can be used (1) when there is a pre-determined multiscale structure, or uncertainty about a multiplicity of pre-specified resolutions, as in the case of brain atlases; or (2) with temporally- or spatially-indexed predictors, when the goal is a multiscale interpretation of single-scale data. In the first case, our method can be used directly to ascertain the contribution of the pre-determined scales to the regression function, a goal that cannot be achieved via wavelets. In the second case, BM&Ms are related to a Haar wavelet expansion but involve a simpler, non-orthogonal transformation that facilitates easy interpretation and suggests a straightforward extension to scalar-on-tensor regression. We address the identifiability issues induced by this non-orthogonal expansion by taking a modularization approach. The resulting BM&M regression is stable, well identified, easily interpretable, and provides uncertainty quantification at different resolutions.

The idea of modularization in Bayesian inference is that, instead of using a fully Bayesian joint probability model for all components of the model, one “cuts” the dependence between different components or modules (see modular; bettertogether and references therein). Modularization has commonly been applied in joint modeling contexts for computational and robustness reasons. For example, suppose one defines a latent factor model for multivariate predictors, with the goal of using the latent factors in place of the original predictors in a model for the response. Under a coherent fully Bayes model, the response will impact the posterior on the factors. This is conceptually appealing in providing some supervision, under which one may infer latent factors that are particularly informative about the response. However, it turns out that there is a lack of robustness to model misspecification, with misspecification potentially leading to factors that are primarily driven by improving the fit of the response model and are not interpretable as summaries of the predictors. Modularization solves this problem by not allowing the response to influence the posterior of the factors; motivated by the practical improvements attributable to such an approach, WinBUGS has incorporated a cut(.) function to allow cutting of dependence in routine Bayesian inference (Plummer15).
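To make the cut idea concrete, the following is a minimal toy sketch in R (not taken from the paper, and with purely illustrative names and values): module 1 learns a latent summary from the predictors alone, and module 2 regresses the response on draws of that summary, so that the response never feeds back into module 1.

```r
set.seed(1)
n <- 200
x <- rnorm(n, mean = 2, sd = 1)        # predictor data; true latent mean is 2
y <- 1.5 * 2 + rnorm(n, sd = 0.5)      # response depends on the true latent mean (beta = 1.5)

# Module 1: conjugate posterior for the latent mean mu given x only.
# The response y is deliberately excluded ("cut"), so misfit in the response model
# cannot distort inference on mu.
prior_var <- 100; obs_var <- 1
post_var  <- 1 / (1 / prior_var + n / obs_var)
post_mean <- post_var * sum(x) / obs_var
mu_draws  <- rnorm(1000, post_mean, sqrt(post_var))

# Module 2: for each draw of mu, treat it as known in a simple response model
# y_i = beta * mu + e_i with e_i ~ N(0, 0.5^2) and a flat prior on beta.
beta_draws <- sapply(mu_draws, function(mu) {
  bhat <- mean(y) / mu                                  # conditional posterior mean of beta
  rnorm(1, bhat, sd = 0.5 / (sqrt(n) * abs(mu)))        # conditional posterior draw of beta
})
quantile(beta_draws, c(0.025, 0.5, 0.975))              # modular ("cut") posterior for beta
```

The modular posterior for beta averages the module-2 conditionals over module-1 draws; this downward-only flow of information is the same mechanism BM&Ms use across resolutions.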

Our multiscale setting is different from previous work on Bayesian modularization, in that we use modules to combine information from the same data at increasingly high resolutions. ChenDunson17 recently proposed a modular Bayesian approach for screening in contexts involving massive-dimensional features, but there was no multiscale structure, allowance for functional predictors, or consideration of multiple predictors that simultaneously impact the response.

Section 2 introduces BM&Ms in a general setting, highlighting the correspondence to a data-dependent prior in a coherent Bayesian model. Section 3 focuses on linear regression. Section 4 outlines an algorithm to sample from the modular posterior. Section 5 contains applications to simulated data and to a brain imaging classification task. Proofs and technical details are included in an Appendix.

2 A modular approach for multiscale regression

We consider a regression problem linking a scalar response y_i to an input vector x_i of dimension p, for each subject i = 1, …, n. For example, a subject’s health outcome may be linked to the output from a high-resolution recording device. The goal is to predict the outcome variable and explain its variability across subjects by identifying specific patterns in the sensor recording at different resolutions.

We denote the vector of responses by y and the raw n × p data matrix by X. Each row of X can be down-sampled to get a new design matrix X_j, whose columns correspond to a lower resolution grid of dimension smaller than p. Down-sampling can be achieved by summing or averaging adjacent columns, or by subsetting them. We slightly simplify notation by identifying the finest resolution with the raw data, X_K = X. We consider the same data at increasing resolutions in an additive model:

y = f_1(X_1) + f_2(X_2) + ⋯ + f_K(X_K) + ε,    (1)

where f_j is the resolution-j contribution to the regression function. With this additive multiresolution expansion, it is difficult to disambiguate the impact of the coarse scales from the finer ones, leading to identifiability issues. One may be able to fit the data using only a fine scale component, with the coarse scales discarded. If we were to attempt fully Bayes inference by placing priors on the different component functions, large posterior dependence would lead to substantial computational problems (e.g. very poor mixing of Markov chain Monte Carlo algorithms) and it would not be possible to reliably interpret the different f_j's. This happens in particular if each f_j is linear, as seen in Section 3.

We bypass these problems by adopting a modular approach (modular; bettertogether), splitting the overall joint model (1) into components or modules which are kept partly separate from each other, to get a modular posterior. In the following, we denote by θ_j the parameters indexing f_j; a summary of the recurring notation is in Appendix A.

Definition 2.1.

Within the overall model (1), module j for the data at resolution j consists of a prior π_j for θ_j and a model M_j for y, in which f_j is estimated via θ_j while θ_1, …, θ_{j−1} are considered known and the finer-scale components are omitted. The output from module j is the (conditional) posterior for θ_j obtained by combining prior π_j and model M_j, which we denote by p_j(θ_j | y, θ_1, …, θ_{j−1}):

p_j(θ_j | y, θ_1, …, θ_{j−1}) ∝ π_j(θ_j) L_j(y | θ_1, …, θ_j),    (2)

where L_j is the likelihood under model M_j.

Thus for the full model (1) we build K modules, using increasingly higher resolution data, with each module being a refinement of the previous modules' output.

Definition 2.2.

The modular prior distribution for θ = (θ_1, …, θ_K) is the product π(θ) = π_1(θ_1) ⋯ π_K(θ_K), whereas the modular posterior distribution is

p_m(θ | y) = p_1(θ_1 | y) p_2(θ_2 | y, θ_1) ⋯ p_K(θ_K | y, θ_1, …, θ_{K−1}),

thus collecting the posteriors in (2). Each module refines the output from previous modules by using higher resolution data, and the modular posterior is obtained by aggregating all refinements. Modularity is evidenced by the direction of dependence across resolutions, which is only allowed to flow from coarse to fine, as opposed to being bidirectional as in a fully Bayes approach.

Proposition 2.1.

There exists a data-dependent prior and a likelihood whose product is proportional to the modular posterior p_m(θ | y). Specifically, the likelihood is the one corresponding to the highest-resolution model M_K, and the data-dependent prior is the original prior π(θ) reweighted by a factor involving the coarser modules' posteriors.

Proof.

See Appendix B. ∎

The above proposition implies that our modular approach to multiscale regression will resemble a fully Bayesian model if the data-dependent factor in the prior is close to one, i.e. if the modules agree on the marginal posterior of the coarse-scale coefficients, which represent the low resolution contribution to the regression function.

3 BM&Ms for linear regression

We now assume that each f_j is linear, and our goal is to study the coefficient vectors θ_1, …, θ_K in the model

y = X_1 θ_1 + X_2 θ_2 + ⋯ + X_K θ_K + ε,    (3)

where ε is a vector of independent Gaussian errors with variance σ². The model includes data up to resolution K. We also assume that each coarser design is obtained from the raw data through an aggregation matrix, X_j = X Λ_j, with Λ_K equal to the identity. This allows us to decompose β, the usual regression coefficient of y on X, as

β = Λ_1 θ_1 + Λ_2 θ_2 + ⋯ + Λ_K θ_K,    (4)

thus interpreting Λ_j θ_j as the contribution of resolution j to the regression function.
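As an illustration of the down-sampling notation (a sketch under assumed conventions; the paper's operators may differ in detail), the following R snippet builds a 0/1 aggregation matrix that sums adjacent columns, so that the coarse design is a linear function of the fine one and coarse-scale coefficients map back to the fine scale through the same matrix, as in (4).

```r
set.seed(2)
n <- 50; p <- 12; p1 <- 4                              # fine resolution p, coarse resolution p1
X <- matrix(rnorm(n * p), n, p)                        # fine-scale design

# Lambda: p x p1 matrix whose column j has ones on the fine columns forming coarse group j
Lambda <- kronecker(diag(p1), rep(1, p / p1))
X1 <- X %*% Lambda                                     # coarse design: sums of adjacent columns

# A coarse-scale coefficient theta1 contributes Lambda %*% theta1 at the fine scale,
# since X1 %*% theta1 equals X %*% (Lambda %*% theta1).
theta1 <- rnorm(p1)
max(abs(X1 %*% theta1 - X %*% (Lambda %*% theta1)))    # numerically zero
```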

However, model (3) is ill-posed, leading to problems with standard techniques for model fitting. In fact, the effective design matrix [X_1, …, X_K] is rank deficient, as the columns of the coarser matrices are linear combinations of columns of X_K. One can potentially obtain a well defined posterior through an informative prior for θ = (θ_1, …, θ_K). However, this posterior will be highly dependent on the prior, as the likelihood has many flat regions.

We solve this problem via modularization as in Section 2. The modular posterior is the product of the module posteriors, where each module's likelihood for the response treats the coarser-scale coefficients as known and the finer-scale coefficients as zero. The posterior distribution for the coarsest scale coefficient θ_1 is derived treating all the finer scale coefficients as equal to zero; this makes θ_1 identifiable and interpretable as producing the best coarse scale approximation to the regression function. In defining the posterior for θ_2, we then condition on θ_1 and set the remaining finer-scale coefficients equal to zero, and so on. Linearity allows us to write each module as a regression on the current-resolution design with the residuals from the previous modules as responses, so that each θ_j (and hence f_j) is estimated from what the coarser scales leave unexplained. Hence, the modular posterior for (3) is built using simpler, well-identified single-scale models as modules. For example, we address uncertainty on two resolutions with the following model (an overview of the recurrent notation is in Appendix A):

y = X_1 θ_1 + X_2 θ_2 + ε,    (5)

assigning Gaussian priors to θ_1 and θ_2, and using a first module that corresponds to model (5) with θ_2 = 0.

Proposition 3.1.

The modular posterior of (θ_1, θ_2) for model (5) is the product of two Gaussian densities,

(6)

whose means and variances correspond to those of single-resolution conjugate models: the first regresses y on X_1 with θ_2 set to zero, and the second regresses the residuals y − X_1 θ_1 on X_2, conditionally on θ_1. Proof in Appendix C.1.

Finally, note that we can estimate β by accumulating all scale components, i.e. using Λ_1 θ_1 + θ_2, which reconstructs the decomposition in (4).
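To fix ideas, here is a minimal R sketch of the two-module computation under assumptions stated explicitly here rather than taken from the paper: known error variance, independent N(0, tau^2 I) priors on both coefficient vectors, and a coarse design built by summing adjacent columns as above. Module 1 uses the coarse data only; module 2 regresses the module-1 residuals on the fine data; accumulating the two posterior means gives an estimate of the full-resolution coefficient beta.

```r
set.seed(3)
n <- 400; p <- 12; p1 <- 4
sigma2 <- 1; tau2 <- 100

X      <- matrix(rnorm(n * p), n, p)
Lambda <- kronecker(diag(p1), rep(1, p / p1))          # p x p1 aggregation matrix
X1     <- X %*% Lambda                                 # coarse design
beta_true <- rep(c(1, -1, 0.5, 0), each = p / p1) + rnorm(p, sd = 0.2)
y      <- X %*% beta_true + rnorm(n, sd = sqrt(sigma2))

# Module 1: conjugate posterior for theta1, with the fine-scale coefficients set to zero
V1 <- solve(crossprod(X1) / sigma2 + diag(p1) / tau2)
m1 <- V1 %*% crossprod(X1, y) / sigma2

# Module 2: regress the module-1 residuals on the fine-scale design
r  <- y - X1 %*% m1
V2 <- solve(crossprod(X) / sigma2 + diag(p) / tau2)
m2 <- V2 %*% crossprod(X, r) / sigma2

# Accumulate the multiresolution components to reconstruct beta, cf. eq. (4)
beta_hat <- Lambda %*% m1 + m2
round(cbind(beta_true, beta_hat), 2)
```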

3.1 Asymptotics of BM&Ms in linear regression

We now assume that the data were generated according to a process in which the scaled Gram matrix of the predictors converges to a positive definite matrix and y = X θ* + ε, where θ* has dimension not depending on the sample size. We consider a two-scale linear model as in (5). We assume the two designs are obtained as coarsenings of X, and assign priors on θ_1 and θ_2 that have positive density on a neighborhood of the corresponding pseudo-true values. Finally, we assume σ² is known for each module.

Proposition 3.2.

The modular posterior distribution of (θ_1, θ_2) in model (5) is approximated in large samples by a Normal distribution centered at the modules' pseudo-true parameter values, with covariance given by the inverse second derivatives of the module log-likelihoods; the pseudo-true value of each module parameter is the value that best approximates the data-generating process under that module's model.

Proof.

See Appendix C.2. ∎

This is also the asymptotic distribution of the two-stage least-squares estimator in which the first stage regresses y on X_1 and the second stage regresses the resulting residuals on X_2. Hence, in large samples, BM&Ms correspond to a sequential least squares procedure that regresses the residuals of coarser models on higher resolution data. Unlike many two-stage procedures, our approach propagates uncertainty across the stages in finite samples (see e.g. murphytopel for a treatment of two-stage estimators in econometrics).

Corollary 3.1.

The large sample distributions of the accumulated modular components and of the finest-scale least-squares estimator are approximately Normal and coincide. In other words, accumulating the modular posterior mean components up to the finest resolution results in a consistent and asymptotically efficient estimator for β.

Proof.

This is a direct consequence of the properties of the Normal distribution. ∎
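This correspondence is easy to check numerically (an illustration on our own simulated design, not an excerpt from the paper): when the coarse design lies in the column space of the fine one, the sequential least-squares procedure reproduces the one-stage least-squares coefficient at the finest scale exactly, while the Bayesian modular posterior mean matches it only approximately in finite samples because of the priors.

```r
set.seed(6)
n <- 500; p <- 12; p1 <- 4
X      <- matrix(rnorm(n * p), n, p)
Lambda <- kronecker(diag(p1), rep(1, p / p1))
X1     <- X %*% Lambda                                 # coarse design nested in col(X)
y      <- drop(X %*% rnorm(p)) + rnorm(n)

theta1_hat <- coef(lm(y ~ X1 - 1))                     # stage 1: least squares on coarse data
r          <- drop(y - X1 %*% theta1_hat)              # coarse-scale residuals
theta2_hat <- coef(lm(r ~ X - 1))                      # stage 2: residuals on fine data

beta_seq  <- drop(Lambda %*% theta1_hat) + theta2_hat  # accumulated two-stage estimator
beta_full <- coef(lm(y ~ X - 1))                       # one-stage least squares at fine scale
max(abs(beta_seq - beta_full))                         # zero up to numerical error
```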

4 Computation of the modular posterior

We sample from the modular posterior of Definition 2.2 by sequentially sampling from each module.

for t = 1, …, T do
       Start: Draw a sample of θ_1 from the module-1 posterior p_1(θ_1 | y)
       for j = 2, …, K do
             Draw a sample of θ_j from the module-j posterior p_j(θ_j | y, θ_1, …, θ_{j−1})
(7)
       end for
end for
Algorithm 1 Sampling from the modular posterior

Obtaining a sample from the modular posterior depends on the form of each module posterior, which is determined by the module choice. In a multiscale linear regression with conjugate priors as in Section 3, we can easily sample from each individual module taking advantage of Eq. (6). In more complex cases where the module posterior is approximated via MCMC, we can use the fact that the posterior distribution of each module corresponds to the full conditional distribution of the overall modular model; hence sampling from a module's posterior can be seen as a “modular” Gibbs step. This also means that the computational complexity of BM&Ms is of the same order of magnitude as that of each component module, since the number of modules K is taken to be small.
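A minimal sketch of Algorithm 1 for the conjugate two-module linear case (under the same assumptions as the earlier sketch, namely known sigma^2 and N(0, tau^2 I) priors, neither of which is dictated by the paper): the module-1 posterior does not depend on theta2 and is computed once, while the module-2 posterior mean is recomputed for each draw of theta1, so each iteration samples coarse to fine.

```r
set.seed(4)
n <- 400; p <- 12; p1 <- 4; sigma2 <- 1; tau2 <- 100; n_iter <- 1000
X  <- matrix(rnorm(n * p), n, p)
X1 <- X %*% kronecker(diag(p1), rep(1, p / p1))
y  <- drop(X %*% rnorm(p)) + rnorm(n)

# Module-1 posterior (independent of theta2): theta1 | y ~ N(m1, V1)
V1 <- solve(crossprod(X1) / sigma2 + diag(p1) / tau2)
m1 <- drop(V1 %*% crossprod(X1, y)) / sigma2
R1 <- chol(V1)

# Module-2 posterior, conditional on theta1: theta2 | y, theta1 ~ N(m2(theta1), V2)
V2 <- solve(crossprod(X) / sigma2 + diag(p) / tau2)
R2 <- chol(V2)

theta1_s <- matrix(NA_real_, n_iter, p1)
theta2_s <- matrix(NA_real_, n_iter, p)
for (t in 1:n_iter) {
  th1 <- m1 + drop(crossprod(R1, rnorm(p1)))                  # draw from module 1
  m2  <- drop(V2 %*% crossprod(X, y - X1 %*% th1)) / sigma2
  th2 <- m2 + drop(crossprod(R2, rnorm(p)))                   # draw from module 2 given theta1
  theta1_s[t, ] <- th1
  theta2_s[t, ] <- th2
}
colMeans(theta2_s)   # modular posterior means of the fine-scale component
```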

5 Applications

5.1 Simulation study

We generate observations from a linear regression model y = Xβ + ε, where β is a vector obtained by discretizing the Doppler, Blocks, HeaviSine, Bumps, and Piecewise Polynomial functions of donohojohnstone94; donohojohnstone95; nason_silverman on a regular grid. Notice that the covariates are highly correlated and the sample size is relatively low, suggesting that the regression function may be challenging to estimate at a fine resolution. Our goals are thus: (1) dimension reduction and multiscale decomposition of β; (2) estimation and uncertainty quantification of the relative contributions of different scales; (3) out-of-sample prediction. Each module, conditionally on the coarser ones, is specified as a piecewise-constant regression whose coefficient vector is constant between unknown split locations, as in

(8)

the goal being estimation of the step heights and the split locations. We use 3 modules, fixing the number of split locations allowed at each scale. This allows us to estimate an unknown, hierarchical, 3-scale decomposition of β. MCMC approximations of the modular posterior decomposition of β in the HeaviSine and Blocks cases are in Figure 2.

Figure 2: Model-averaged BM&Ms decomposition of β. The total effect is the sum of the intermediate scale contributions. In red, the true unknown β. For HeaviSine, Scale 3 has little to no contribution to the overall estimate, as evidenced by 95% credible bands that include 0. For Blocks, Scale 3 refines the previous scales in a few select locations.
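The split-location parametrization used by these modules can be sketched as follows (an illustration of the idea, not the paper's MCMC implementation): each scale contributes a step function over the grid, coarse scales use few splits, and later scales add splits fitted to what the previous scales leave unexplained.

```r
# True coefficient function: HeaviSine of Donoho & Johnstone on a regular grid
p  <- 256
tg <- (1:p) / p
beta_true <- 4 * sin(4 * pi * tg) - sign(tg - 0.3) - sign(0.72 - tg)

# A scale's contribution: a step function with given interior split locations and heights
step_contrib <- function(splits, heights, grid) {
  heights[findInterval(grid, c(0, splits, 1), rightmost.closed = TRUE)]
}

# Coarse scale: few splits; heights set to segment means of beta_true (a least-squares fit)
splits1  <- c(0.3, 0.72)
seg1     <- findInterval(tg, c(0, splits1, 1), rightmost.closed = TRUE)
heights1 <- tapply(beta_true, seg1, mean)
f1 <- step_contrib(splits1, heights1, tg)

# Finer scale: more splits; heights fitted to the residual beta_true - f1
splits2  <- seq(0.1, 0.9, by = 0.1)
seg2     <- findInterval(tg, c(0, splits2, 1), rightmost.closed = TRUE)
heights2 <- tapply(beta_true - f1, seg2, mean)
f2 <- step_contrib(splits2, heights2, tg)

# f1 + f2 is a two-scale piecewise-constant approximation of the coefficient function
c(coarse_mse = mean((beta_true - f1)^2), two_scale_mse = mean((beta_true - f1 - f2)^2))
```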

Within the above setup, BM&Ms compare favorably to the other models in almost all cases (Figure 3), owing to an (implicit) multiscale structure (Doppler, Blocks) and/or non-smoothness (Blocks, Bumps); details on each implemented model are in Appendix D.

Figure 3: Mean Square Error in estimating β, and Mean Absolute Prediction Error, across 100 simulated datasets, relative to Fourier regression on the same data (blue dashed line).

5.2 Gender differences in multiresolution tfMRI data

Brain activity and connectivity data play a central role in neuroscience research today, but increasingly high-resolution medical imaging devices make the management and analysis of such data challenging. We use data from the Human Connectome Project (HCP) (hcp_overview; hcp_pre1), considering a sample of HCP subjects, with the goal of classifying subjects' genders using brain activation data recorded during task-based functional Magnetic Resonance Imaging (tfMRI). Gender differences have been observed in neuroscience and linked to brain morphology and connectivity (hcp_gender2; hcp_gender4), or task-based activity patterns (hcp_gender1; hcp_gender3). For an overview see neurosci101. We use BM&Ms on tfMRI data recorded during the social task in HCP, preprocessed according to the Gordon333 (gordon333) parcellation. This hierarchical partitioning splits the brain into a multiscale structure of 333 regions, 26 lobes, and 2 hemispheres.

While we expect each region to have low explanatory power on its own, it is unclear whether grouping regions to form a coarser structure might improve predictive power. In particular, while the coarse-scale 26 lobes are easily interpretable given their connection to known brain functions, they might not be an efficient coarsening for our predictive task. We thus implement BM&Ms in two ways, following the two multiscale points of view of Section 1: (1) we consider the regions-lobes multiscale structure as specified by gordon333; (2) we use the regions' centroid information to adaptively collapse them into coarser groups. In both cases, we consider a binary regression model with probit link, and hence assume that Pr(y_i = 1) = Φ(Σ_j x_{i,j}' θ_j), where the binary outcome y_i corresponds to subject i's gender, and x_{i,j} is the same subject's data at resolution j.

We adapt the Gibbs sampling algorithm of albertchib93 to BM&Ms. We thus introduce a latent variable z_i for each subject, and alternate between sampling the latent variables given the coefficients and sampling the coefficients given the latent variables. Each z_i is drawn from the usual truncated Normal distribution, centered at the linear predictor, with unit variance, and truncated at zero so that z_i is positive if y_i = 1 and negative if y_i = 0. The coefficients, in turn, are drawn from the modular posterior of a linear model as above, with the latent variables as responses.
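The latent-variable step can be sketched in R as follows (a generic Albert and Chib update written for illustration, not code from the bmms package): given the current linear predictor, each latent variable is drawn from a unit-variance Normal truncated to the positive half-line when the outcome is 1 and to the negative half-line when it is 0, via the inverse-CDF method.

```r
# Draw latent probit variables z given binary y and linear predictor eta (Albert & Chib, 1993)
draw_latent_z <- function(y, eta) {
  u  <- runif(length(y))
  lo <- pnorm(0, mean = eta, sd = 1)          # P(z < 0 | eta)
  # y = 1: N(eta, 1) truncated to (0, Inf);  y = 0: truncated to (-Inf, 0)
  ifelse(y == 1,
         qnorm(lo + u * (1 - lo), mean = eta, sd = 1),
         qnorm(u * lo,            mean = eta, sd = 1))
}

# Example: one update on simulated data
set.seed(5)
n   <- 100
eta <- rnorm(n)                               # current linear predictor sum_j x_ij' theta_j
y   <- rbinom(n, 1, pnorm(eta))               # binary outcomes (e.g., subject gender)
z   <- draw_latent_z(y, eta)
all(z[y == 1] > 0) && all(z[y == 0] < 0)      # truncation respected
```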

In implementation (1), the design matrix collects subjects' brain activation data at the resolutions corresponding to regions and lobes. We use Bayesian Spike & Slab modules to estimate the scale-specific coefficients. We are interested in estimating their posterior distributions, and in understanding whether lobes provide additional information compared to the baseline of regions. The estimated posterior means are shown at the top of Figure 4. In implementation (2), we fix the number of scales in advance to decompose the original higher-resolution parcellation. Implementation of each module follows the lines of (8), but using two-dimensional spatial information, as detailed in Appendix D.2. The estimated posterior means are shown at the bottom of Figure 4.

Figure 4: Posterior means estimated with BM&Ms. Top: lobes-regions multiscale structure. Bottom: multiscale decomposition (left) of an estimated coefficient image (right) in a scalar-on-image regression setting.
Figure 5: Coefficients on brain regions estimated via four alternative single-scale models.
Model            Accuracy   AUC
BM&Ms (1)        0.806      0.889
BM&Ms (2)        0.810      0.889
Spike & Slab     0.800      0.883
Ridge            0.792      0.871
Lasso            0.749      0.824
Functional PCR   0.781      0.856
Table 1: Correct classification rate (Accuracy) and area under the ROC curve (AUC) on random test samples, averaged across 200 resamples of the data.

We compare BM&Ms to competing models on the same data. Table 1 reports the out-of-sample performance of all tested models, whereas Figure 5 shows the estimation output of the competitors; refer to Appendix D for details on the implemented models. BM&Ms perform at least as well as competing models, while providing additional information on the multiscale structure. On one hand, ridge regression achieves good out-of-sample performance, but the resulting estimates prevent a clear understanding of the results. Lasso regression would lead researchers to believe that some regions might account for gender differences, yet its performance is the worst among tested models. Spike & Slab priors result in many regions being infrequently selected, and this may lead researchers to believe that coarser scale information from lobes might be useful. However, applying BM&Ms on the regions-lobes multiscale structure results in little to no improvement in predictive power, with lobes being selected rarely. Finally, when considering the spatial structure of the brain regions, the estimated coefficient images of BM&Ms and FPCR are similar. However, the BM&Ms image can be decomposed into coarse-to-fine contributions (shown in Figure 4, bottom left), which aid the interpretation of the final estimate.

6 Discussion

In this article, we introduced a Bayesian modular approach that builds an overall model through the sequential application of increasingly complex component modules. Our approach can be applied to multiscale regression problems in two common scenarios: (1) when multiple resolutions of the data could be used to model the regression relationship, to assess their contribution to the regression function; (2) when the focus of the analysis is a multiscale interpretation of results. Compared to established methods for multiscale regression such as wavelets, our method is more flexible and potentially easier to interpret. Both simulations and the real data analysis show that this is not achieved at the expense of performance.

The modular posterior of BM&Ms is the product of each module's posterior. This implies that our method inherits the properties of the chosen component modules. Choosing component modules requires clarity on the objective of the analysis. For example, we showed in Section 5 that we can use variable selection modules when the resolution structure is pre-specified, and changepoint-detection modules to find a multiscale interpretation. However, other module choices can be explored for different objectives, and overall models with mixed-type modules might offer additional interpretation opportunities. Additionally, our approach can also be used in non-parametric regression settings by considering the identity matrix as the high-resolution design matrix.

References

Appendix A Notation

output vector of size
resolution of the data,
unknown “true” resolution
design matrix at resolution , size
data matrix corresponding to
“true” coefficient vector, such that
matrix such that
matrix such that
coefficient vector in single-resolution model
posterior mean for in the single-resolution model
LS estimate of in the single-resolution model
variance of
coarsening operator such that
coarsening operator such that
th component of the multiresolution parameter
multiresolution parameter for a K-resolution model
multiresolution decomposition of the coefficient vector
prior for
posterior for obtained with module

Appendix B Appendix to Section 2

Proposition B.1.

There exists a data-dependent prior and a likelihood whose product is proportional to the modular posterior p_m(θ | y). Specifically, the likelihood is the one corresponding to the highest-resolution model M_K, and the data-dependent prior is the original prior π(θ) reweighted by a factor involving the coarser modules' posteriors.

Proof.

We consider without loss of generality the case with two modules, based on the data at the coarse and fine resolutions respectively. Here, module 1 corresponds to the reduced model in which the fine-scale coefficient is set to zero, while module 2 corresponds to the full two-resolution model M_2. We build the modular posterior using the modules' posteriors, and multiply and divide by the marginal posterior of θ_1 under the full model:

p_m(θ_1, θ_2 | y) = p_1(θ_1 | y) p_2(θ_2 | y, θ_1) = [ p_1(θ_1 | y) / p(θ_1 | y, M_2) ] p(θ_1 | y, M_2) p_2(θ_2 | y, θ_1).

We thus interpret p(θ_1 | y, M_2) p_2(θ_2 | y, θ_1) as the prior times the likelihood of the full model, divided by the normalizing constant. This means the modular posterior can be obtained via the full model, where both θ_1 and θ_2 are unknown, together with a prior on (θ_1, θ_2) which depends on the data through the factor p_1(θ_1 | y) / p(θ_1 | y, M_2). We then use the fact that, according to model M_2, p(θ_1, θ_2 | y, M_2) ∝ π_1(θ_1) π_2(θ_2) L_2(y | θ_1, θ_2), and hence we obtain

p_m(θ_1, θ_2 | y) ∝ [ p_1(θ_1 | y) / p(θ_1 | y, M_2) ] π_1(θ_1) π_2(θ_2) L_2(y | θ_1, θ_2).

Since the term in brackets depends only on θ_1 and the data, we can write the modular posterior as the product of a data-dependent prior and the likelihood of the full model.

Appendix C Appendix to Section 3

C.1 Two-scale BM&Ms posterior

The modular posterior of (θ_1, θ_2) for the BM&Ms of model (5) is given by

(9)

where the module means are the posterior means we would obtain from single-resolution models of the form

(10)

when we assign the corresponding Gaussian priors.

We build module 1, with a prior π_1 for θ_1 and a model M_1 in which θ_2 is set to zero. From this we obtain the posterior for θ_1. Conditioning on θ_1, we then build module 2, whose model regresses y − X_1 θ_1 on X_2, and hence obtain the conditional posterior for θ_2. We find the modular posterior distribution as the product of the posteriors of modules 1 and 2. The end result follows from the properties of the Normal distribution.

C.2 Asymptotics for BM&Ms

The goal of this section is to show that the modular posterior in linear models is approximately normal in large samples, with a mean that is a composition of the (rescaled) pseudo-true regression coefficients at different resolutions. In order to do so, it will be sufficient to show that each module results in approximately normal (conditional) posteriors. This will ensure normality of the overall modular posterior, by the properties of the normal distribution.

We consider the response vector y and data matrix X, and following geweke2005 we assume that the data were generated according to a process in which the scaled Gram matrix of the predictors converges to a positive definite matrix and y = X θ* + ε, where θ* has dimension not depending on the sample size and σ² is known. We consider a finite set of predetermined resolutions, corresponding to design matrices X_1, …, X_K, such that each coarser design can be obtained from a finer one through coarsening operators that perform partial sums of adjacent columns and hence reduce the data dimensionality. In other words, we assume that X_K, the data at the finest available resolution, could be obtained as a coarsening of the true-model data, and that each of the intermediate resolutions can be obtained as a coarsening of higher resolutions. Given these assumptions, the scaled Gram matrix of each coarsened design also converges to a positive definite matrix.

Note that we do not assume that the true resolution necessarily coincides with one of the predetermined ones. In other words, if p* denotes the resolution of the data corresponding to the true regression coefficient θ*, it may differ from all of the available resolutions. The goal is to find the “best approximation” of θ* at the different predetermined resolutions.

We consider without loss of generality the overall two-resolution model (5). We will assume that the prior for each θ_j, when using component module j, has positive density on a neighborhood of the corresponding pseudo-true parameter value. At the first stage, we use the model that regresses y on X_1 with θ_2 = 0. For now, we assume σ² is known. In large samples, at this resolution and with our assumptions, the Bayesian posterior will be approximately Normal, centered at the pseudo-true value of the regression coefficients at the low resolution, with precision given by the second derivative of the log-likelihood of this module (this corresponds to results in bda3; geweke2005; kleijn2012). At the second stage, we fix θ_1 and consider the model that regresses y − X_1 θ_1 on X_2. Here, we assume both θ_1 and σ² are known. In large samples, at this resolution and with our assumptions, the Bayesian posterior of θ_2 will be approximately Normal, centered at the difference between the pseudo-true value of the regression coefficients at the finer resolution and the appropriately rescaled θ_1, with precision given by the second derivative of the log-likelihood of this module.

The modular posterior is by Definition 2.2 the product of the two component modules' posteriors. Since both are approximately Normal in large samples, the modular posterior will also be approximately Normal in large samples, with mean given by the modules' pseudo-true values and covariance determined module-wise by the corresponding log-likelihood second derivatives.