In neuroimaging studies, it is of great interests to learn the potential factors that may influence the brain activity and identify the consistent brain activation patterns across multiple subjects. The regression models have been adopted as a useful tool to achieve this goal, where the outcome images can be the structural magnetic resonance imaging (MRI) data (e.g. T1 weighted image), the contrast map from the task functional MRI, and the voxel-level summary statistics from the resting fMRI data (e.g. the weighted degree and fALFF). The outcome image typically consists of the imaging measurements from different subjects at a large number of voxels. The potential predictors in the model may include multiple scalars such as demographic information, clinical characteristics and other non-imaging measurements. We refer to this type of regression models as the image-on-scalar regression model.
The main challenges of analyzing neuroimaging data using the image-on-scalar regression come from the complex spatial dependence of the brain signals, the heterogeneity in the brain activation patterns, the large variability of noises due to the measurement errors, and the limited number of subjects. First, according to the principle of the imaging acquisition techniques and the characteristics of neural activity, the brain signals are likely to be sparsely distributed in the brain and clustering in several spatially contiguous regions with relatively sharp edges. For example, some brain regions such as left and right amygdala are often activated for a feeling of fear, which can be detected by the emotion task related fMRI. The associated contrast maps tend to be substantially stronger in the gray matter of these regions than those in other regions. Second, brain imaging signals among multiple subjects appear to be heterogeneous. The subject-specific brain signals not only depend on the observed demographical variables such as age and gender, but also may be related to unobserved variables such as the underlying medical conditions and irrelevant thought or emotion at scan. Third, due to the machine artifacts and the imperfect prepossessing steps, neuroimaging data may suffer high levels of noises, resulting in relatively low signal-to-noise ratios. Finally, the number of subjects in a typical neuroimaging study is rather limited, with usually tens to hundreds of subjects. There have recently emerged some aggregated neuroimaging databases, but the number of subjects is still limited to a few thousands. This is a far smaller sample size for the usual deep neural network (DNN) modeling, which partially explains relatively scant success of DNN in neuroimaging applications.
. To address the above challenges, we propose a neural network-based image-on-scaler regression mode (NN-ISR), in which the main and individual effects are constructed through deep neural networks. Compared with the existing solutions, NN-ISR is more flexible in learning complex patterns, can better capture the noisy and heterogeneous patterns, and is capable of handling small sample sizes. For parameter estimation, we develop an efficient voxel-batch hybrid gradient descent (VHGD) algorithm, which can be easily implemented using common deep learning software packages. We establish the theoretical bounds for the estimation error and selection sign error when the sample size is fixed and the number of voxels goes to infinity. Finally, we demonstrate the efficacy of our method by comparing it to existing methods in extensive simulation studies and the analysis of two real-world datasets from the Autism Brain Imaging Data Exchange (ABIDE)(Di Martino et al., 2014) and the Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018).
2 Related work
The most straightforward method for fitting the image-on-scalar regression is the mass univariate analysis (MUA). This approach fits a general linear model (GLM) at each voxel and obtains voxel-wise test statistics to identify brain regions that are significantly associated with a specific predictor after the multiple comparisons correction. One standard procedure is to calculate the family-wise error rate (FWER) in light of the random field theory for statistical parametric maps(Friston et al., 1995). Another approach is to control the false discovery rate (FDR) using the observed -values (Benjamini and Yekutieli, 2001). As a major limitation, MUA does not borrow information from the spatial dependence across voxels and thus lose power to detect brain activation signals and potentially may increase the false positive rate. In practice, the neuroimaging data are usually pre-processed by a spatial smoothing procedure using a kernel convolution approach. Performing MUA on these pre-smoothed data may lead to inaccuracy and low efficiency in terms of estimating and testing the predictor effects (Chumbley et al., 2009). Recent development in adaptive smoothing methods for preprocessing (Yue et al., 2010) and estimation (Polzehl and Spokoiny, 2000; Qiu, 2007)
may improve the performance in terms of reducing noise and preserving features. It is especially powerful to detect delicate features such as jump discontinuities, which is commonly encountered in neuroimaging data.
Zhu et al. (2014) recently developed a systematic approach referred as spatially varying coefficient model (SVCM) that incorporates both spatial smoothness and jump discontinuities. General SVCMs have been extensively investigated for different applications in environmental heath, epidemiology and ecology as demonstrated by (Cressie and Cassie, 1993; Diggle et al., 1998; Gelfand et al., 2003). The SVCM encompasses a wide range of regression models with the outcome variable observed over space and the regression coefficients modeled as functions varying spatially. We refer to this type regression coefficients as spatially varying coefficient functions (SVCFs). SVCFs are commonly assumed to be smooth functions or continuously differentiable functions up to certain degrees. (Zhu et al., 2014) extended the general SVCMs by introducing jump discontinuities into the SVCFs, making the model especially useful for neuroimaging data analysis. Based on the step-wise estimating procedures and the asymptotic Wald tests, (Zhu et al., 2014)
’s SVCM also can identify brain regions that are significantly associated with the given covariates, although it is not developed particularly for feature selection. Other recent methods include imposing a penalty function(Chen et al., 2016), approximation by bivariate splines over triangulation (Li et al., 2020) simultaneous confidence corridor based on spline smoothing (Gu et al., 2014), isotropic Gaussian process priors (Bussas et al., 2017), and thresholded multiscale Gaussian processes (Shi and Kang, 2015). However, the spline-based approaches are not directly applicable for 3D or 4D imaging data and the Bayesian approaches suffer the high computational burden.
3.1 Image-on-scalar regression
Let denote the
-dimensional Euclidean vector space. Suppose the imaging measurements and covariates are collected fromsubjects. Let be the space for measuring brain activity. For subject and any location , let represent the imaging measurement of brain activity, and let be a vector of the covariates that may influence the brain activity. Given , we assume that is generated from the following image-on-scalar regression model:
where is a vector of SVCFs of that represents the main effects of the covariates on brain activity at location , and is a scalar SVCF of that represents the subject-specific individual brain activity variation from . We assume that is sparse and piecewise continuous on a finite connected partition of . This assumption is weaker than a similar assumption made by SVCM (Zhu et al., 2014), which requires the main effects to be piecewise smooth and does not allow the partition to contain isolated points. Write as a vector of individual effect functions. Note that was treated as a stochastic random process in SVCM (Zhu et al., 2014), whereas it is a fixed function in our model. The term is the random noise at location
with mean zero and variance. We assume that ’s are independent across individuals and locations. Note that we allow the variance to be heterogeneous over space. In creftypecap 1, is the parameter of interests while and are nuisance parameters.
3.2 Neural network-based image-on-scalar regression
We propose a neural network-based approach for estimating the SVCFs in creftypecap 1
. Consider a general feed-forward neural network withhidden layers where the th layer () has hidden units, and and represent the input and output dimensions, respectively. We represent this neural network as , which is a vector-valued function with architecture and parameters . where ’s are the biases and ’s are the kernel weights from all the layers, and
represents the parameter space. The activation functions in all the neural networks are pre-specified.
Instead of directly fitting creftypecap 1, we consider a neural network-based image-on-scalar regression (NN-ISR) model as a working model for parameter estimation and feature selection. In particular, to approximate and in creftypecap 1, we introduce two neural network functions and with architectures and , respectively. For all the subjects, suppose the imaging measurements are collected at a standard template consisting of voxels, denoted as , where represents the center of voxel . Let and . The NN-ISR working model is
where , (with being the identity matrix), and ’s are independent across voxels. Write including all the variance parameters. Note that the model is completely determined by the parameters , and .
where and are tuning parameters. This in effect turns the number of voxels as the “sample size", which is key for our proposal to tackle a limited number of subjects. Then we estimate in model creftypecap 1 by
In practice, we suggest setting for a fixed as a rule of thumb to avoid overfitting the individual effects, and can be selected by cross validation, for which LASSO (Tibshirani, 1996) can be used in lieu of NN-ISR to reduce the computation burden.
Algorithm. We develop a voxel-batch hybrid gradient descent (VHGD) algorithm (Algorithm 1) to solve the optimization problem in creftypecap 3. Instead of sub-sampling the subjects, VBGD creates mini-batches by sub-sampling the voxels to implement the stochastic gradient descent (SGD) algorithm. In the beginning, we update the parameters by using SGD with a small voxel-batch size and a high learning rate. After SGD has converged, we increase the batch size, decrease the learning rate, and run SGD again. We repeat this procedure until the batch size has reached the total number of voxels, in which case SGD becomes the standard gradient descent. This hybrid of stochastic and deterministic gradient descent algorithm by using voxel-batches with an increasing batch size can be interpreted as learning the imaging patterns with gradually improving imaging resolutions, leading to a fast decrease of the loss function in the beginning and fine-tuned results at the end. For notations, let denote all the imaging data across voxels, and let be the objective function in creftype 3 for a voxel-batch with indices .
Selection. To select important brain activation regions that are strongly associated with each covariate, we adopt a hard thresholding procedure. Define . Let be a vector of positive thresholding values. For covariate , the selected activation region is . In theory, for any arbitrarily stringent sign error requirement, there always exists a threshold to guarantee the sign error bound (see Theorem 2 in Section 3.3
). In practice, we may equivalently threshold the quantiles of, and the thresholding value can be chosen based on the proportion of signals detected by MUA.
3.3 Theoretical Properties
For the proposed NN-ISR method, we study the mean squared error (MSE) for parameter estimation and probability of large sign error (PLSE) for feature selection. For a finite sample size problem, we show that both MSE and PLSE can be made arbitrarily small provided that the imaging resolution is sufficiently large and the neural network is sufficiently flexible. Moreover, in the case of one-hidden-layer neural networks with the number of hidden nodes large enough, the errors converge linearly (up to a logarithmic term) with respect to the number of voxels.
Theorem 1 shows that NN-ISR’s main effect estimation convergence rate is bounded above by the convergence rate of training a general neural network of the same architecture with the sample size proportional to the of the number of voxels. Theorem 2 proves that the selection PLSE converges at the same rate. Theorem 3 establishes the convergence rate for general neural networks with one hidden layer and one output, based on the results for random designs in (Barron, 1991, 1994; McCaffrey and Gallant, 1994). Finally, NN-ISR’s convergence rate for the special case of the independent one-hidden-layer neural networks is presented in Corollary 4. The regularity conditions and proof of each theorem are shown in the Supplementary Materials. We begin with the following notations and definitions.
Suppose are independent samples from the model for , where is the function of interest and . To estimate , train a neural network with the architecture and obtain the weight parameter estimate where and is a regularization function satisfying . Define as an error bound of training to estimate , that is,
We first show that for the NN-ISR estimator, the number of the voxels is the “sample size” for neural network training. We proved the following theorem by showing that the optimal neural network weight parameters that minimizes creftypecap 5 is close to the parameters that best fit the MUA estimate, whose converge rate depends on the number of voxels and the network architecture.
In fact, the same error bound holds for model selection PLSE. Let be the event indicator ( if occurs and otherwise). Define the hard thresholding function for any and . Let . Define .
For any there exists a such that for any with ,
Theorems 2 and 1 show the relation between NN-ISR’s estimation and selection convergence rates to the convergence rate of training neural networks of the same architecture in general, and most importantly, they provide a formula for calculating the sample size. This result can be applied to neural networks of any architecture with a proven error bound. For example, the following theorem establishes the convergence rate of training a general one-hidden-layer neural network, which consequently provides a specific error bound of NN-ISR for this particular case.
For a one-layer neural network with architecture , the error bound is
(Barron, 1991, 1994; McCaffrey and Gallant, 1994) first showed this convergence rate for random design models. In their works, the neural network’s input samples are randomly drawn from an unknown distribution, instead of being fixed a priori, and the error is integrated over all the input space instead of being averaged over the observed input samples. Based on the proof in (Barron, 1991), we proved that the same convergence rate holds for our fixed design model in creftypecap 1 and creftypecap 2, if we change the continuous measure of error to the discrete version in Definition 1, which is equivalent to a Riemann sum of the former. The difference between the two types of error can be shown to be negligible as the number of voxels goes to infinity, if we add further assumptions, such as an error bound on the second-order Taylor series expansion of .
4.1 Synthetic data analysis
Setup. We synthesized 50 or 100 images of dimension from 3 covariates according to creftypecap 1, where and . For the main effects and individual effects, we designed two patterns: 1) the piecewise continuous (PT) design where the activation regions involve mixtures of sharp edges and smooth transitions, and 2) the piecewise constant (PS) design involving many discontinuous jumps, which was adopted by (Zhu et al., 2014). Details are provided in the Supplementary Materials. Example slices of the true main effects of the two designs are shown in Figure 1
. We evaluated the performance of NN-ISR compared with MUA and SVCM. Estimation accuracy was measured by the median estimation MSE of the main effects, and the inter-quadrantic range (IQR) of the MSE was recorded to evaluate method stability. Selection accuracy was measured by the average receiver operating characteristic curve (ROC) and its area under curve (AUC). For NN-ISR, we used neural networks of architecturefor each main effect and for each individual effect. The regularization parameters for the main and individual effects are and , respectively.
Results. NN-ISR produced the least noisy estimate compared with MUA and SVCM, as seen in the example in Figure 1. Quantitatively, NN-ISR achieved the lowest MSEs and narrowest IQRs for estimating the main effects in all experimental settings (Table 1). For selection accuracy, NN-ISR had the highest AUCs in all the cases, where a selection of the ROC plots are shown in Figure 2
. The advantage of NN-ISR was the greatest in the setups with the most noisy and skewed data.
|M = 50||M = 100||M = 50||M = 100|
4.2 Analysis of fMRI data
Objectives. Our work was motivated by the needs of analyzing fMRI data from two large neuroimaging consortia: ABIDE (Di Martino et al., 2014) and ABCD (Casey et al., 2018). Both studies collected fMRI data from multiple experimental sites across the United States, along with rich clinical data including the cognitive ability assessment scores. Our analysis focused on the association between brain activity and cognitive ability. We selected the consistent activation regions where the variation of the brain activity could be explained by the cognitive ability across multiple subjects. We also assessed the reliability on the model inferences across the experimental sites.
ABIDE focused on understanding the neural bases of autism and the associated cognitive behaviors. In Phase I, ABIDE aggregated 20 resting-state fMRI datasets from 17 experimental sites involving 1,112 subjects. In our analysis, we used the standard pipeline (Craddock et al., 2013) that has been widely adopted (He et al., 2019) to prepossess the fMRI data. After removing the missing values, the dataset consisted of 821 subjects. The response image was the weighted degrees (WDC) from the resting fMRI data, which measured local brain functional connectivity and identified the most connected voxels by counting the number of direct connections (edges) to all other voxels. Cognitive ability was measured by the full-scale intelligence quotient (FIQ) score. Other covariates included age, sex and the autism diagnostics.
ABCD aimed at studying the association between cognitive behaviors and brain development. This study recruited more than 11,800 children of age 9-10, collecting a comprehensive set of cognitive assessments and multimodal neuroimaging data from 21 experimental sites. Our analysis used minimally preprocessed -back fMRI data of 1991 subjects from the curated ABCD annual release 1.1, whose details were described in (Hagler Jr et al., 2019). The response image was the contrast map (CM, 0-back) from the -back task fMRI data. In this task, reliable brain activation engaging memory regulation processes and cognitive functions were found across subjects. The 0-back contrast map reflected brain activity under low memory load conditions. Cognitive ability was measured by the general cognitive ability component score (Sripada et al., 2019). Age and sex were also included in the covariate list.
Cognitive ability-associated brain activation. We applied NN-ISR to both datasets and estimated the consistent brain activation that were strongly associated with cognitive ability across multiple subjects. To efficiently choose the optimal regularization weight for NN-ISR, we ran LASSO with 5-fold cross validation on the training samples and used the weight with the lowest prediction MSE as the regularization weight for NN-ISR. Four brain functional networks had high proportion of selected voxels: visual (Calcarine_L, Cuneus_R, Lingual_L&R) default mode (Occipital_Mid_R, Temporal_Mid_R, Frontal_Med_Orb_L&R), dorsal attention (Temporal_Mid_R, Parietal_Sup_L, Postcentral_L), and ventral attention (Temporal_Mid_L, Temporal_Sup_L&R) as shown in Figure 4. These networks have been identified to be associated with intelligence in existing literature (Hearne et al., 2016; Wu et al., 2013; Hilger et al., 2017; Van Den Heuvel et al., 2009). In comparison, MUA and SVCM were unable to detect signals for the default mode and dorsal attention networks in ABIDE.
Model inference reliability comparisons. To compare the reliability of parameter estimation and region selection between NN-ISR, MUA and SVCM, we considered a cross-site validation approach for each study, where the corresponding samples were split by their experimental sites. Each time we selected only one site for training and used the rest for testing. We evaluated reliability based on two criteria: estimation accuracy and selection contagiousness. Estimation accuracy was measured by the prediction R-squared of the test images, as reported in Figures 2(c) and 2(a). NN-ISR achieved the highest medians and narrowest IQRs compared with the other two methods. The difference is more distinct in ABIDE than that in ABCD. Selection contiguousness was evaluated by the number and average size of contiguous activation regions. In neuroimaging studies, larger activation regions with stronger signals are more reliable and reproducible. Figures 2(d) and 2(b) show the number of contiguous slection clusters versus the log mean cluster sizes. The NN-ISR selected fewer regions with larger sizes compared to MUA and SVCM. These results indicate that NN-ISR extracted more useful spatial location information than the other methods for model inferences.
In this work, we have presented a novel method (NN-ISR) that utilizes neural networks to estimate SVCFs and select activation regions in image-on-scalar regression. The functional space of the neural network-based estimator is sufficiently large to approximate a wide range of sparse and piecewise-continuous SVCFs. Our proposed algorithm (VHGD) combines the deterministic and stochastic gradient descent updating schemes, leading to a fast decrease in the loss function in the beginning and fine-tuned estimates at the end. As a key advantageous theoretical property, the mean squared error and the probability of large sign error of the NN-ISR estimator can be reduced by improving the imaging resolution without increasing the sample size. In a special case of one-hidden-layer neural networks, both error shrink almost linearly toward zero as the number of voxels goes to infinity. We have demonstrated that NN-ISR outperforms the existing alternatives via extensive experiments on synthetic data and the analyses of the fMRI data in two large neuroimaging consortium studies. In addition, our model and algorithm are conceptually straightforward and can be easily implemented and modified on common deep learning platforms. The source code and the software will be publicly available.
This work porposes a novel model and algorithm that enable researchers to study the effects of scalar variables on functional data with higher accuracy. For the machine learning and statistics community, our framework connects simple statistical model with deep neural networks and opens a door for the development of interpretable deep learning methods. For the cognitive science and neuroscience community, we provide a powerful tool to neuroscientists, psychitrists, and psychologists for studying brain functions and structures.
Researchers who study mental health problems, such as neurlcognitive disorders and substance abuse, may adopt our method to facilitate imaging-based disease diagnostics and therapy. They need to setup the model correctly, specify the tuning parameters appropriretely, and interpret the fitting results accurately. Mis-using our method or mis-interpreting its results may potentially lead to inccorrect disease diagnostics and adverse treatment effects. Moreover, the quality of the inferences made by our model highly dependends on the quality of the training data. If the model is trained by samples with systematic bias, the inference results will also be biased and can lead to suboptimal clinical decisions. To address these concerns, we need to provide user-friendly software, detailed documentations, and even rigorous eduction training program in the future.
- Complexity regularization with application to artificial neural networks. In Nonparametric functional estimation and related topics, pp. 561–576. Cited by: §3.3, Remark.
- Approximation and estimation bounds for artificial neural networks. Machine Learning 14 (1), pp. 115–133. Cited by: §3.3, Remark.
- The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, pp. 1165–1188. Cited by: §2.
Varying-coefficient models for geospatial transfer learning. Machine Learning 106 (9-10), pp. 1419–1440. Cited by: §2.
- The adolescent brain cognitive development (abcd) study: imaging acquisition across 21 sites. Developmental Cognitive Neuroscience 32, pp. 43–54. Cited by: §1, §4.2.
- Local region sparse learning for image-on-scalar regression. arXiv preprint arXiv:1605.08501. Cited by: §2.
- False discovery rate revisited: fdr and topological inference using gaussian random fields. NeuroImage 44, pp. 62–70. Cited by: §2.
- Towards automated analysis of connectomes: the configurable pipeline for the analysis of connectomes (c-pac). Frontiers in Neuroinformatics (42). External Links: Cited by: §4.2.
- Statistics for spatial data. Vol. 900, Wiley New York. Cited by: §2.
- The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry 19 (6), pp. 659–667. Cited by: §1, §4.2.
- Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics) 47 (3), pp. 299–350. Cited by: §2.
- Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping 2 (4), pp. 189–210. Cited by: §2.
- Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association 98 (462), pp. 387–396. Cited by: §2.
- A simultaneous confidence corridor for varying coefficient regression with sparse functional data. TEST 23 (4), pp. 806–843. Cited by: §2.
- Image processing and analysis methods for the adolescent brain cognitive development study. NeuroImage 202, pp. 116091. Cited by: §4.2.
- A selective overview of feature screening methods with applications to neuroimaging data. Wiley Interdisciplinary Reviews: Computational Statistics 11 (2), pp. e1454. Cited by: §4.2.
- Functional brain networks related to individual differences in human intelligence at rest. Scientific Reports 6, pp. 32328. Cited by: §4.2.
- Efficient hubs in the intelligent brain: nodal efficiency of hub regions in the salience network is associated with general intelligence. Intelligence 60, pp. 10–25. Cited by: §4.2.
- Sparse learning and structure identification for ultra-high-dimensional image-on-scalar regression. Journal of the American Statistical Association (just-accepted), pp. 1–27. Cited by: §2.
- Convergence rates for single hidden layer feedforward networks. Neural Networks 7 (1), pp. 147–158. Cited by: §3.3, Remark.
- Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (2), pp. 335–354. Cited by: §2.
- Jump surface estimation, edge detection, and image restoration. Journal of the American Statistical Association 102 (478), pp. 745–756. Cited by: §2.
- Thresholded multiscale gaussian processes with application to bayesian feature selection for massive neuroimaging data. arXiv preprint arXiv:1504.06074. Cited by: §2.
- Prediction of neurocognition in youth from resting state fmri. Molecular Psychiatry, pp. 1–9. Cited by: §4.2.
- Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §3.2.
- Efficiency of functional brain networks and intellectual performance. Journal of Neuroscience 29 (23), pp. 7619–7624. Cited by: §4.2.
- Topological organization of functional brain networks in healthy children: differences in relation to age, sex, and intelligence. PloS One 8 (2). Cited by: §4.2.
- Adaptive spatial smoothing of fmri images. Statistics and its Interface 3, pp. 3–13. Cited by: §2.
- Spatially varying coefficient model for neuroimaging data with jump discontinuities. Journal of the American Statistical Association 109 (507), pp. 1084–1098. Cited by: §2, §3.1, §4.1.