## 1 Introduction

In this paper, we discuss linear regression models for two- or three-dimensional image responses, image covariates, or both, in which the signal is assumed to be continuous, sparse and piecewise smooth. The methodological development is motivated by a study of multiple sclerosis using magnetic resonance imaging (Sweeney et al., 2016; Pomann et al., 2016; Mejia et al., 2016), in which subjects with multiple sclerosis (MS) are imaged repeatedly over multiple hospital visits, and the objective is to identify the brain regions that are damaged over time. Although a healthy brain would not change much during the study period, a diseased brain is expected to exhibit changes in a small number of regions of interest that are associated with the disease. This is an example of image-on-scalar regression in which the signal is desired to be continuous, sparse and piecewise smooth.

Modeling a continuous, sparse and piecewise smooth signal for high-dimensional data poses several challenges: 1) the data exhibit complex spatial dependence and the signal is piecewise smooth; 2) the signal is expected to be simultaneously sparse and continuous, where sparsity is defined in terms of the number of non-zero smooth pieces that define the signal; and 3) the dimension of such a signal is often very large.

For sparse estimation, several traditional approaches exist. In a frequentist framework, lasso-type penalization (Tibshirani, 1996) induces sparsity but cannot ensure smooth transitions from zero to non-zero subregions. The fused lasso (Tibshirani et al., 2005) ensures both sparsity and smoothness in the estimation, but it imposes a heavy computational demand, and neither of these approaches allows for quantification of estimation uncertainty. In a Bayesian framework, parameter sparsity is modeled using the traditional spike-and-slab prior (Mitchell and Beauchamp, 1988), the horseshoe prior (Carvalho et al., 2010), the normal-gamma prior (Griffin and Brown, 2010), the double-Pareto prior (Armagan et al., 2013) or the Dirichlet-Laplace prior (Bhattacharya et al., 2015). However, none of these priors ensures a smooth spatial structure, and in the context of high-dimensional data they add computational challenges as well. Reducing computational demand is one issue that we address in this paper.

We review some of the research on sparse and spatially smooth parameter estimation for different types of image regression. For image-on-scalar regression, Yan and Liu (2017) and Chen et al. (2016) tackled a similar problem: the first paper considers a Laplacian-type penalty and the second introduces a fused SCAD-type penalty to account simultaneously for spatial smoothness and sparsity. In the context of functional magnetic resonance imaging (fMRI) studies, Zhang et al. (2016) and Musgrove et al. (2017) also considered similar regression models. Their estimation approach uses a spike-and-slab prior to induce sparsity and considers spatial smoothness for the selection parameter; however, the approach does not ensure that the estimated signal is smooth. In scalar-on-image regression, there is limited work on sparse and piecewise smooth signal estimation. Goldsmith et al. (2014) and Li et al. (2015) proposed priors that account separately for spatial dependence and sparsity. For the same problem, Wang and Zhu (2017) proposed a penalty based on total variation, but spatial dependence is still not fully incorporated in this approach. In Kang et al. (2016), the proposed soft-thresholded Gaussian process prior accounts for both spatial dependence and sparsity simultaneously; the thresholding parameter is expected to vary spatially to capture signals at different spatial locations, but that would make the method computationally very expensive. In the context of image responses and image predictors, research is more limited (Morris et al., 2011; Jog et al., 2013; Sweeney et al., 2013). Noh and Park (2010) and Tang et al. (2013) considered varying-coefficient models that account for sparsity but not smoothness. To the best of our knowledge, only Boehm-Vock et al. (2015) and Jhuang et al. (2018) consider both smoothness and sparsity for image-on-image regression. Their methodology captures the spatial dependence using copulas, which is computationally expensive for large datasets. In contrast, our prior has a conjugacy structure that leads to computational advantages.

In this paper, we propose a novel prior that can be used to estimate continuous, sparse and piecewise smooth functions. We construct this prior as the location-wise product of independent Gaussian processes with smooth covariance kernels. The proposed prior has both high mass around zero, which creates sparsity in the estimation, and a smooth covariance kernel that ensures a large support for the spatially varying function. To handle the heavy computational burden associated with this kind of prior, we propose a transformation that decorrelates the data. Specifically, we use the fast Fourier transform (FFT), which not only has data-reduction advantages but also decorrelates the stationary part of the response. The FFT algorithm requires regularly spaced input data; in practice, however, datasets are often not on a regular grid. To bypass this issue, we propose a fast imputation technique that transforms the data onto a regular grid. If the dimension of the dataset is manageable for computation in the spatial domain, one can exploit the conjugacy structure of our prior to obtain the full conditional distributions of the parameters, provided the error process is Gaussian. We analyze the performance of our prior relative to the commonly used Gaussian process (GP) prior in different linear image regressions with signals that are sparse, piecewise smooth and continuous.

We organize the remainder of the paper as follows. In the next section, we describe the image-on-scalar regression model along with the new sparse prior process. We discuss the application of our new prior to other image regression models in Section 3. In Section 4, we describe other computational techniques that we use for faster computation. In Section 5, we provide several simulation results to evaluate the performance of this new prior for different image regression models. We apply our method to longitudinal magnetic resonance imaging (MRI) data in Section 6 and end with some concluding remarks in Section 7.

## 2 The modeling framework

Our research is motivated by a longitudinal study of multiple sclerosis (MS) via magnetic resonance imaging (MRI) images. We introduce the main ideas for the case when we have images collected at multiple time points for a single subject. Specifically, let

be the intensity of the MRI image collected at time and at a 3-dimensional voxel for an MS subject. Consider the following linear image-on-scalar regression model:

(1)

where is a spatially-varying intercept function and is an unknown continuous function that quantifies the effect of time; is additionally assumed to be piecewise smooth and sparse. The error is a spatially-correlated mean-zero error process, independent across visits. We propose to model as a product of independent Gaussian processes (GPs). We formally describe its properties and the error process in the remainder of this section.

### 2.1 PING process

Let be independent and identically distributed GPs with mean , variance , and covariance kernel Cov{} = for . The zero-mean Product of INdependent Gaussian (PING) stochastic process is defined as the point-wise product of independent GPs, where is a scale parameter. The stochastic process constructed in this way is denoted .

#### 2.1.1 Properties of the marginal distribution
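The construction above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: since the symbols were lost in extraction, we write the number of components as `L`, use a squared-exponential kernel as one example of a smooth covariance kernel, and give each component variance `sigma2**(1/L)` so the product has total variance `sigma2`.

```python
import numpy as np

def simulate_ping(grid, L=3, sigma2=1.0, range_par=0.2, seed=0):
    """Simulate a zero-mean PING(L) process on a 1-D grid as the
    point-wise product of L independent mean-zero GPs sharing a smooth
    (here squared-exponential, for illustration) covariance kernel."""
    rng = np.random.default_rng(seed)
    h = np.abs(grid[:, None] - grid[None, :])
    # per-component variance sigma2**(1/L) so the product has variance sigma2
    K = sigma2 ** (1.0 / L) * np.exp(-(h / range_par) ** 2)
    chol = np.linalg.cholesky(K + 1e-10 * np.eye(len(grid)))
    z = chol @ rng.standard_normal((len(grid), L))  # L independent GP draws
    return np.prod(z, axis=1)                       # point-wise product

grid = np.linspace(0.0, 1.0, 200)
x = simulate_ping(grid, L=3)
```

Draws from this sketch are smooth but hover near zero over most of the domain, with occasional excursions, which is the qualitative behavior the PING prior is designed to exploit.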

We first discuss the distribution of the PING process at a single location . The theoretical properties of the marginal distribution of have been studied by Stojanac et al. (2017) and Gaunt (2018). Gaunt (2018) provides detailed results on the characteristic function and proposes estimates for the tail behavior of products of normals. We briefly review some of these properties here for completeness. The marginal density function

for the product of standard normals is given by , where denotes the Meijer G-function (Stojanac et al., 2017). The marginal moment is , where is the product of all numbers from 1 to that have the same parity (i.e., odd or even) as

. The density is unimodal and symmetric about zero; thus, all the odd-order moments are zero. The variance is . The marginal kurtosis is equal to , which is an increasing function of . As a result, the marginal density has a thicker tail and a sharper peak at zero for larger . This is depicted in the first row of Figure 1. Furthermore, as goes to infinity, , and thus the tail is heavier than Gaussian for and becomes heavier as increases (see Gaunt (2018)).

#### 2.1.2 Properties of the bivariate distribution
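The growth of the marginal kurtosis can be checked by Monte Carlo. For a product of L independent standard normals, the fourth moment is 3^L and the variance is 1, so the kurtosis is 3^L; the sketch below (writing the stripped component count as `L`) verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000
for L in (1, 2, 3):
    x = rng.standard_normal((n, L)).prod(axis=1)  # product of L iid N(0,1)
    kurt = np.mean(x**4) / np.mean(x**2) ** 2     # sample kurtosis
    print(L, kurt)  # approaches 3**L: about 3, 9, 27
```

The rapid growth of the kurtosis with L is exactly the mechanism that concentrates prior mass near zero while keeping heavy tails.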

Next, we study the bivariate properties of the PING process at a pair of locations and . From the construction of the PING process with components, this bivariate distribution is in fact the distribution of a product of bivariate normals. Simple calculations show that its mean is for , and its covariance is , implying a correlation coefficient that is smaller than the correlation of each individual Gaussian component and that further decays with the number of components, . In particular, if is the powered exponential correlation kernel, , the PING covariance is . Therefore, while the covariance decreases with for a fixed kernel function, strong spatial correlation can be maintained for large by simply increasing the parameter with . The smoothness of the product process is the same as that of its individual components in these powered exponential cases. We expect that this separation of sparsity and spatial dependence holds for other kernel functions as well. To quantify the shrinkage properties, we study the kurtosis of this product distribution. The kurtosis of a multivariate random variable

of dimension , with mean and covariance matrix , is defined as (Mardia, 1970). The kurtosis of a general product of bivariate normal random variables is summarized in the following theorem.

###### Theorem 1.

Let be such that for and . The mean and the covariance matrix of are and

The kurtosis is , with and it increases with .

Here “BVN” stands for bivariate normal and denotes the element-wise product. We allow varying variances for the individual components because the kurtosis does not depend on the variances. See the Supplementary Materials for details. The “increasing” property of the kurtosis results from an application of the inequality of arithmetic and geometric means. The distribution of for two locations and under the PING process has the above-mentioned properties with for . Since it is a unimodal symmetric distribution, the higher kurtosis suggests a heavier tail and higher peakedness at zero as increases. Figure 1 depicts the joint density function of and for and , and for different correlations. In this plot, we observe that the mass at zero increases with while all cases share the same covariance structure with unit variances. Figure 1 of the Supplementary Material shows the conditional density of at an arbitrary location given , with . The conditional density at one location tends to have a shorter peak as the value at the other location moves away from zero. Also, the conditional densities tend to be more positively skewed as we condition on higher values at the other location.
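The bivariate correlation decay described above can also be checked numerically. Writing the stripped symbols as `L` components and per-component correlation `rho` (both illustrative names), the product of L independent bivariate normal pairs with unit variances has correlation rho**L, since the covariance factorizes across independent components.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L, rho = 1_000_000, 3, 0.8
# L independent bivariate normal pairs, each with correlation rho
z1 = rng.standard_normal((n, L))
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal((n, L))
# component-wise products give a draw from the bivariate PING margin
x, y = z1.prod(axis=1), z2.prod(axis=1)
emp = np.corrcoef(x, y)[0, 1]
print(emp, rho**L)  # empirical correlation close to rho**L = 0.512
```

This is the attenuation that motivates rescaling the range parameter with L (for a powered exponential kernel, inflating the range by a factor of L to the power 1/alpha restores the original correlation).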

#### 2.1.3 Multivariate properties

Let be a random vector of length , where MVN and MVN, and has diagonal entries equal to one. Then we have for all and , as increases to infinity, where is the identity matrix. The distribution of for a finite set of locations has the above-mentioned properties as the have the same covariance kernel with unit total variance. In general, the kurtosis in a multivariate setup is very difficult to obtain, but we can prove the following result.

###### Theorem 2.

The multivariate kurtosis of increases with .

This can be proved using the method of induction and Theorem 1. The proof is in the Supplementary materials.

In summary, for the PING process is the standard GP; as increases, the mass near zero and the tail probabilities increase, and with an appropriate rescaling of the spatial correlation parameters, the process maintains the smoothness properties of the original GP. Therefore, the PING process is an attractive model for sparse and smooth signals. The number of terms clearly plays a major role in the application of the PING process, and we recommend selecting this tuning parameter using cross-validation.

### 2.2 Error distribution and Matern correlation

Next, we discuss the error process . To account for both large- and small-scale spatial deviations of , we consider the following decomposition, similar to Reich et al. (2018):

(2)

where the first term, a linear combination of known basis functions with unknown coefficients for the -th visit and -th basis, captures the large-scale deviation. The second term captures small-scale deviations. We assume that is a mean-zero GP with stationary and isotropic Matern covariance function:

(3)

where and is the modified Bessel function of the second kind. The Matern covariance has four parameters, , representing the variance of the non-spatial error (nugget), the variance of the spatial process (partial sill), the spatial range and the smoothness of the correlation function, respectively.

The large-scale spatial structure is described by the random-effect covariates . Among the many possible choices for the ’s, we consider outer products of B-spline basis functions. The error in approximating a non-stationary covariance function using B-splines decreases at a rate of (Shen and Ghosal, 2015), where is the number of B-spline basis functions and is the regularity of the process. Thus, with large , this construction can approximate any covariance function. We assume the random effects are normally distributed, i.e., , where is the covariance matrix. The nonstationary component of the covariance is

(4)
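The outer-product construction of the basis can be sketched as follows. As a stand-in for the paper's unspecified B-spline degree, this illustration uses degree-1 (tent) B-splines built with `np.interp`; the outer-product step is the same for higher degrees, and the knot count is arbitrary here.

```python
import numpy as np

def tent_basis(x, knots):
    """Degree-1 B-spline (hat) basis on a 1-D grid: column j is the
    piecewise-linear function equal to 1 at knot j and 0 at the others."""
    B = np.empty((len(x), len(knots)))
    for j in range(len(knots)):
        e = np.zeros(len(knots))
        e[j] = 1.0
        B[:, j] = np.interp(x, knots, e)
    return B

# 2-D basis functions as outer products of two 1-D bases
x = np.linspace(0, 1, 30)
knots = np.linspace(0, 1, 6)
B1 = tent_basis(x, knots)               # 30 sites x 6 basis functions
B2d = np.einsum('ij,kl->ikjl', B1, B1)  # basis (j,l) evaluated at site (i,k)
B2d = B2d.reshape(30 * 30, 6 * 6)       # rows: grid sites, cols: 36 tensor bases
```

Each row of `B2d` gives the values of all 36 tensor-product basis functions at one grid location; the large-scale term in (2) is a linear combination of these columns.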

## 3 Extension to other image regression models

The model in the previous section is designed for image-on-scalar regression. Our sparse prior can easily be adapted to other image regression models, as described below.

### 3.1 Image-on-image regression

Consider the case of a linear image-on-image regression model (see for example Gelfand et al. (2003); Morris et al. (2011); Jog et al. (2015, 2017); Hazra et al. (2017))

(5)

where is the image response and the ’s are the image predictors for subject . Here is an unknown intercept as before and are spatially varying piecewise smooth and sparse covariate effects, and is the error process.

We put the PING prior on each of the for sparse and smooth estimation. This yields local variable selection, as the subset of covariates whose coefficients are shrunk towards zero changes with .

### 3.2 Scalar-on-image regression

Finally consider the case of a scalar-on-image regression model (see Wang and Zhu (2017); Kang et al. (2016); Goldsmith et al. (2014); Li et al. (2015)). This model is

(6)

where is the scalar response and is an image with spatial locations for subject . Here is a spatially varying, piecewise smooth and sparse covariate effect, and is the error, which follows N(). We again put a PING prior on for sparse and smooth estimation; its performance is studied in Section 5.

## 4 Computational details

The intercept () and each component of the PING prior (the ’s that comprise the PING prior) are assumed to be mean-zero GPs with stationary and isotropic Matern covariance functions:

(7)

(8) |

No nugget variance is assumed for the components of PING to ensure smoothness.

For small and moderate datasets, standard Markov chain Monte Carlo (MCMC) algorithms apply to the PING model and computation is straightforward. One advantage of the PING prior is that the elements of each component have a multivariate Gaussian full conditional distribution given the other GPs, and thus Gibbs steps can be used to update the PING process parameters. For large , however, these updates become slow and we use spectral methods, described in the remainder of this section.

### 4.1 The model in the spectral domain

Similar to Reich et al. (2018), we partially decorrelate the data using the discrete Fourier transform (DFT). Denote the spectral representations of the processes and by and for frequency . Since the DFT preserves linearity, the spatial model in (2) can be written in the spectral domain as

(9)

(10) |

The notation denotes convolution. The Gaussian processes and are stationary and defined over a discrete spatial domain. To avoid computationally expensive Bessel function evaluations and spectral aliasing calculations, we use the quasi Matern spectral density (Guinness and Fuentes, 2017), which mimics the flexibility of the Matern spectral density for and ,

(11) |

where is the dimension, and . More specifically,

(12)

where if and otherwise. All the parameters in have the same interpretation as in (3). For , the nugget variance is zero.
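A sketch of a quasi-Matern spectral density is below. Since the paper's formula was lost in extraction, the form used here is an assumption: the squared frequency norm in the usual Matern spectral density is replaced by its periodic analogue, four times the sum of sin^2 of half the frequencies, and the constants (`sigma2`, `phi`, `nu`) are illustrative rather than the paper's exact parametrization.

```python
import numpy as np

def quasi_matern_sd(omega, sigma2=1.0, phi=5.0, nu=1.0):
    """Assumed quasi-Matern spectral density on the d-dimensional torus:
    positive everywhere, polynomially decaying in frequency, and free of
    Bessel function evaluations and aliasing sums."""
    omega = np.atleast_2d(omega)
    d = omega.shape[1]
    # periodic analogue of ||omega||^2
    q = 4.0 * np.sum(np.sin(omega / 2.0) ** 2, axis=1)
    return sigma2 * (phi ** -2 + q) ** -(nu + d / 2.0)

# the density is positive and decays monotonically along a frequency axis
w = np.stack([np.linspace(0, np.pi, 50), np.zeros(50)], axis=1)
f = quasi_matern_sd(w)
```

The appeal of this family is that it is evaluated directly at the discrete Fourier frequencies, so no aliasing correction is needed when working on a grid.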

### 4.2 Imputation method

Each spectral element is a function of for all , and thus spectral methods require complete data. However, in practice, data are often not collected on a complete regular grid, and thus the response is missing at many locations when we embed it in a regular grid. For example, for brain images we take the complete regular grid to be the 3D cube in which the skull is inscribed; from this perspective the medical images involve missingness. Missing values are handled naturally in a Bayesian context within a Gibbs sampler that draws the missing values from their conditional distribution given the observed data and the other parameters. Because imputation is applied during each MCMC iteration to account for imputation uncertainty, this step must be computationally efficient.

Denote the conditional mean by

and define to be the vector of observed data for subject and to be the vector representing the missing values. Likewise, let and be the corresponding vectors of means. The conditional distribution of given all of the other parameters is

(13) |

and thus the conditional distribution of given and the rest of the parameters is normal with mean and covariance .

For large datasets, directly sampling from this distribution is infeasible. The limiting computational task in computing the conditional mean is solving a linear system with . Since is symmetric and positive definite, this can be achieved with a preconditioned conjugate gradient (PCG) algorithm (Golub and Van Loan, 2012), an iterative method for solving the linear system . The goal of an iterative linear solver is to generate a sequence that converges to . Such algorithms generally require computing at each iteration to check for convergence and to generate the next vector in the sequence, and thus they are fast when this forward multiplication can be computed quickly. In this case, forward multiplications with can be computed in time and memory with circulant embedding algorithms (Wood and Chan, 1994), as can forward multiplications with . This is because and can be embedded in the larger circulant matrix , that is,

and the fast Fourier transform can be exploited to compute forward multiplications with the (nested block) circulant matrix , since (nested block) circulant matrices are diagonalizable by the (-dimensional) DFT. The preconditioned conjugate gradient algorithm uses an approximate inverse of , called a preconditioner, to encourage the sequence to converge to in a small number of iterations.

Completing the imputation step requires simulating a residual vector with covariance matrix . To accomplish this, we first simulate a vector with mean zero and covariance as in (13), which is again efficient with circulant embedding. Then we form and the residual , which has the desired covariance and can be computed in the same fashion as the conditional mean. Further computational details for the conditional draws can be found in Stroud et al. (2016) and Guinness and Fuentes (2017).
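The two ingredients above, an O(n log n) circulant matrix-vector product via the FFT and a conjugate gradient solve that only needs that product, can be sketched in one dimension. This is a minimal illustration, not the paper's 3-D implementation: the covariance (an exponential kernel on circular distances plus a nugget) and its parameters are placeholders.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# A stationary covariance on a circle of n sites is circulant, so it is
# fully determined by its first column and diagonalized by the DFT.
n = 512
lags = np.minimum(np.arange(n), n - np.arange(n))  # circular distances
base = np.exp(-lags / 20.0) + 0.1 * (lags == 0)    # first column (with nugget)
eig = np.fft.fft(base).real                        # circulant eigenvalues

def matvec(v):
    # Sigma @ v in O(n log n) via FFT diagonalization
    return np.fft.ifft(eig * np.fft.fft(v)).real

Sigma = LinearOperator((n, n), matvec=matvec)

rng = np.random.default_rng(3)
b = rng.standard_normal(n)
# conjugate gradients solve Sigma x = b using only forward multiplications;
# a preconditioner (the "P" in PCG) would be supplied via the M argument
x, info = cg(Sigma, b)
```

The same pattern scales to the nested block circulant matrices arising from 2-D and 3-D grids, with multidimensional FFTs replacing the 1-D transform.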

### 4.3 Sampling

As in (10), the fast Fourier transform (FFT) of the PING process is the convolution of the transforms of its components. Conducting a full conditional Gibbs update, even in the spectral domain, is computationally expensive, and existing Metropolis techniques for the joint update of large coefficient vectors, such as gradient-adjusted Metropolis-Hastings (Roberts and Rosenthal, 1998) or Hamiltonian Monte Carlo (Duane et al., 1987), mix very slowly. Thus we introduce a new sampling technique. Our proposed technique performs a Gibbs step based on an approximated model, and a function of the Gibbs update is then treated as the candidate for a Metropolis step corresponding to the original model. We explain the technique for the two-component case; it can be generalized to any number of components, . Let , where and are GPs; then the model in (1) can be re-written as

(14) |

Denote the estimated values at the -th stage of the MCMC iteration as (sampled using PCG), , and . We can calculate the error at the -th stage as .

We can rewrite our model in (14) as

Except for and the last , all other values are replaced by the ones from the previous step. The model then becomes

(15) |

We now perform a Gibbs step to obtain an update, , of from the model (15) in the spectral domain, given the values from the previous step. We then consider a Metropolis step corresponding to the original model in (14). The candidate for this step is ; here acts as a tuning parameter and denotes the norm, defined as . Smaller values of generate a higher acceptance rate and vice versa. In summary, with a Gibbs update from the reduced model, we identify the direction of the highest posterior probability region from the current value of the parameter under the reduced model, and then we make a Metropolis move in that direction from the current value under the actual model. The reduced model becomes similar to the actual model when the error does not change much over the MCMC iterations, which is expected to happen when the chain converges; this step is described in detail in Algorithm 1. In our simulations, we adjust to maintain an acceptance rate of around 0.6 for this scheme to ensure good mixing: if the acceptance rate is lower than 0.55, we lower the value of , and if it is higher than 0.70, we increase it, under the restriction that . For PING with more than two components, when updating one component we treat the element-wise product of all other components as a single component, which reduces the update to the two-component case. Among the Matern parameters, all except the total variances are updated using Metropolis sampling, and the total variances are updated from their inverse-gamma full conditional distributions.
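The acceptance-rate tuning rule alone can be illustrated on a toy problem. This sketch is not the paper's Gibbs-then-Metropolis update: it runs a plain random-walk Metropolis sampler on a standard normal target (an assumption made purely for illustration) and adapts the step size in blocks using the 0.55/0.70 thresholds described above.

```python
import numpy as np

rng = np.random.default_rng(4)
log_target = lambda x: -0.5 * x * x  # toy target: standard normal

x, c, accepts = 0.0, 1.0, []
for it in range(20000):
    prop = x + c * rng.standard_normal()
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x, acc = prop, 1
    else:
        acc = 0
    accepts.append(acc)
    if (it + 1) % 200 == 0:              # adapt every 200 iterations
        block_rate = np.mean(accepts[-200:])
        if block_rate < 0.55:
            c *= 0.8                     # smaller moves raise acceptance
        elif block_rate > 0.70:
            c *= 1.25                    # larger moves lower acceptance
rate = np.mean(accepts[-5000:])
```

In practice such adaptation is confined to the burn-in period so that the final chain uses a fixed kernel and the usual MCMC convergence theory applies.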

We use this spectral method for all image-on-scalar regressions in this paper. For the simulated image-on-image and scalar-on-image regressions in Section 5 the datasets are small and we use Gibbs sampling for the PING process parameters. For larger problems, the Metropolis scheme explained above could also be adapted to image-on-image and scalar-on-image regressions.

## 5 Simulation results

In this section, we present simulation results for all three regression setups: image-on-scalar regression, image-on-image regression and scalar-on-image regression. We compare the results in terms of mean squared error (MSE), true positive rate (TP) and false positive rate (FP) for different signal-to-noise ratios (SNR). We calculate confidence bands of the estimates to obtain the proportions TP and FP. TP is defined as the proportion of locations at which the confidence band does not include zero given that the true parameter value is non-zero. FP is defined as the proportion of locations at which the confidence band does not include zero given that the true parameter is zero.
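The TP/FP computation can be sketched from posterior draws. The function below is an illustration of the definitions above (the 95% level, array shapes and toy data are assumptions, not the paper's settings): a location is "selected" when its pointwise interval excludes zero, and TP/FP are the selection rates over truly non-zero and truly zero locations.

```python
import numpy as np

def tp_fp(samples, truth, level=0.95):
    """TP/FP proportions from posterior samples of a spatial signal.
    samples: (n_draws, n_locations); truth: (n_locations,)."""
    alpha = 1.0 - level
    lo = np.quantile(samples, alpha / 2, axis=0)
    hi = np.quantile(samples, 1 - alpha / 2, axis=0)
    selected = (lo > 0) | (hi < 0)          # interval excludes zero
    tp = selected[truth != 0].mean()        # detection rate on the signal
    fp = selected[truth == 0].mean()        # false detections on the nulls
    return tp, fp

# toy check: strong signal at the first 50 of 200 locations
rng = np.random.default_rng(5)
truth = np.r_[np.full(50, 2.0), np.zeros(150)]
samples = truth + 0.1 * rng.standard_normal((1000, 200))
tp, fp = tp_fp(samples, truth)
```

With a strong signal and small noise, TP is near 1 and FP near 0, matching the intended reading of Tables 1 through 3.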

### 5.1 Image-on-scalar regression

Here we consider the image-on-scalar regression model in Section 2 for images of dimension with visits. The model is

(16) |

where with , and the ’s are 20 equidistant points such that and , obtained by standardizing the times . The true signal is zero at most spatial locations but has non-zero subregions. Let and , . The true signal is if and otherwise. A plot of the true slope is in the Supplementary Material (Figure 2). The error process is assumed to be a GP with stationary Matern covariance function. The true reparametrized Matern parameters for the intercept process are , and the last three parameters for the error are . After generating the data on the grid, we treat the values outside of the inner grid as missing, and we use the imputation technique to impute those missing values in our estimation.

We put a Gaussian process prior with Matern covariance function on the intercept process and the PING prior (Section 2.1) on the slope. We denote by , and the Matern parameters for the error, the intercept and the first component of the PING process prior on the slope, respectively. For the other components of the PING prior, the Matern parameters are . We reparametrize the Matern parameters to as and . Here is called the total variance. The total variance of the error is set to 0.09, 0.017 and 0.009 to achieve the different SNRs reported in Table 1. All the results compiled in Table 1 are based on 50 replications and 10000 post-burn-in samples after a burn-in of 10000 samples.

We fit the model with and priors: Gamma(0.1, 0.1); N(0,1) and N(0,1). For the PING process, we set (as the nugget variance is zero) and N(0,1). The priors are the same for the next two regressions as well.

From the values in Table 1, we infer that for lower SNR, more components in the PING process prior lead to better estimation. Figure 3 of the Supplementary Material compares the estimates of one slice of the 3-D slope across the different methods along with the true . The Gaussian process prior overestimates the regions where the true value is zero, as shown in Figure 3 of the Supplementary Material. This results in a higher false positive rate and higher MSE at locations where the true value is zero. All methods have high power here.

### 5.2 Image-on-image regression

We consider image-on-image regression model as in Section 3

(17) |

on data collected over 100 locations, selected at random in , with observations at each location. The ten spatially varying predictors (the ’s) are generated using reparametrized Matern parameters that are themselves generated randomly: a random vector of four elements is generated from N(0, 1); we exponentiate the first, third and fourth elements and take the inverse logit transformation of the second element to obtain the reparametrized Matern parameters for each predictor. These predictors are generated only once for the whole simulation. We change the domain of the images from the previous case and consider data at only 100 locations to construct a dataset of manageable dimension for easier computation.

The error process is assumed to be a GP with stationary Matern covariance function, independent over , and for . The remaining five ’s have the structures plotted in Figure 4 of the Supplementary Material; they are zero at most locations with some non-zero subregions. To generate , we divide the whole space into a grid. We then generate a random number in and a set of coordinates in ; let these be , and define . Then if and otherwise. The other four ’s are generated similarly. The true reparametrized Matern parameters for the intercept are , and the last three parameters for the error are . The total variance of the error is set to 0.57, 0.11 and 0.06 to achieve the different SNRs reported in Table 2. We report MSE, power, false positive rate and coverage averaged over . All the results compiled in Table 2 are based on 50 replications and 5000 post-burn-in samples after a burn-in of 5000 samples.

In Table 2, we see that the PING process prior always gives better estimates in terms of MSE. The GP prior overestimates the regions where the true value is zero, which results in a higher false positive rate for the GP prior. All methods have similar power here. As the SNR increases, the results using PING improve even further relative to those using the Gaussian prior.

### 5.3 Scalar-on-image regression

Finally, we replicate the simulation from Kang et al. (2016) with 100 observations. For each observation, there is a two-dimensional image of dimension with an exponential covariance structure having range parameter 3. The model is

(18) |

Here the coefficient is a matrix of dimension . The true is generated in such a way that it has five peaks. Let and and . Then, mathematically, the true beta is if and otherwise. Only in this setup is the number of observations much smaller than the number of parameters to be estimated. We consider three choices of in generating the data: 0.1, 1 and 1.5.

We fit the model with . The prior for is . The rest of the parameters have the same priors as in the previous subsections. In this subsection we also compare our method with the fused lasso (Tibshirani et al., 2005) and functional principal component analysis (fPCA) (Jones and Rice, 1992). Fused lasso estimates are computed using an archived version of the previously available genlasso package in R. For fPCA, after smoothing the images using the fbps function of the refund package, an eigendecomposition of the sample covariance is computed, and then lasso-regularized principal components regression is performed. The leading eigenvectors that explain 95% of the variation in the sample images are used to obtain the final estimate.

We report MSE, true positive rate, false positive rate and coverage for the estimation of the slope matrix. All the results compiled in Table 3 are based on 50 replications and 5000 post-burn-in samples after a burn-in of 5000 samples. We see that the estimates from the PING process prior are superior to those from the GP, fused lasso and functional PCA (fPCA) in all metrics. In particular, the fPCA results are very noisy. In Figure 2, the estimated parameters from PING are less noisy than all other estimates.

We also compare PING with the STGP method (Kang et al., 2016). When comparing with STGP, we consider a low-rank approximation of each component of PING, incorporated following the kernel convolution work of Higdon et al. (1999) and Nychka et al. (2015), as in Kang et al. (2016). Due to this modification, all the results change from the previous table, and the comparison results with STGP are provided separately in Table 4. We only show results using PING-8, as they are the best in terms of MSE over a grid of choices of ranging from 2 to 20. A plot of the MSEs for the different choices of is given in the Supplementary Materials.

We see that the estimates from the PING process prior are far better than those from STGP in terms of overall MSE, true positive rate and coverage, except for the MSE on the subregion where the truth is zero. The STGP and PING estimates are visually almost indistinguishable; due to thresholding, the STGP estimates are more conservative. Thus, a pictorial comparison would not be very informative and we have not presented one.
