1 Introduction
There has been significant research activity aimed at the association between image data and other scalar variables (e.g. cognitive score, diagnostic status) in the study of neurodegenerative and neuropsychiatric diseases, such as Alzheimer’s disease (AD)[10]
. The growing public threat of AD has raised the urgency to discover ROI of magnetic resonance images (MRI) that may identify subjects at greatest risk for future cognitive decline and accelerate the testing of preventive strategies. Machine learning methods have been developed and the penalized optimization is popular in the framework of the empirical risk minimization plus a penalty. However, spatially heterogeneous smoothness and local region selection greatly complicates the image analysis. To address these challenges, several regularization methods have been proposed to impose sparsity on both pixel values and their spatial derivatives. For instance, GraphNet
[5] combines the Lasso penalty and an penalty of image gradients, and TV [4, 9] uses a weighted combination of the Lasso penalty and the TV penalty.It is wellknown that both Lasso and TV models have the inherent bias and often lead to less stable predictions [16]. For example, the spatially adaptive TV model [15] was proposed to remove the inherent bias in the TV model by utilizing a spatially varying weight function that is inversely proportional to the magnitude of image derivatives. It is a twostep procedure where the weight function obtained from the first step using standard TV is then used to guide smoothing in the second step. It is of interest to note that, in the statistical literature, the Smoothly Clipped Absolute Deviation (SCAD) penalty [6]
has been proposed in the context of highdimensional linear regression to address the shortcomings of Lasso (which is not consistent in variable selection). SCAD has some desired properties of the estimator such as continuity, asymptotic unbiasedness, sparsity, and the socalled oracle property (which behaves the same as when the zero coefficients are known in advance). There are a few papers on the use of SCAD for image analysis
[3, 9]. None of them consider the local region learning. We will adapt the SCAD penalty for our local region selection problem in the framework of imageonscalar regression.In this paper, we propose a novel regularization method called SCAD2TV, which combines the SCAD regularization, enforcing sparsity, and the SCAD of TV regularization, enforcing spatial contiguity, into one group, which segments contiguous spatial regions against zerovalued background. This paper makes three main contributions:

The new penalty, SCAD2TV, forces zeros on coordinates and spatial derivative jointly, which makes it easy to identify ROI for the imageonscalar regression model. It solves the bias issue inherent in LASSO or TV methods.

Our proposed algorithms are based on ADMM, which decomposes a nonconvex problem with the nonconvex penalty into two iterative optimization problems with explicit solutions. The divide and conquer learning algorithm is also developed, thereby allowing scaling to large images.

Compared with GraphNet and TV, SCAD2TV has better or competitive performance in either prediction or selection errors.
2 ImageonScalar Regression and SCAD2TV
2.1 ImageonScalar Regression
Regression models with image responses and scalar predictors are routinely encountered in many applications [2, 7]. Consider an imageonscalar regression model with varying coefficients: , where is the covariate, is the image response at pixel (a 2D or 3D domain), and
is the coefficient image vector. Here
is a zeromean spatial field which characterizes the spatial correlation, andis the white noise with mean zero and variance
. In this paper, we focus on a 2D image. Extension to 3D images is straightforward. The objective is to identify ROI in the response image which are associated with the corresponding covariate by estimating the coefficient images . The available data are image and covariate pairs for subjects, , . We obtain the estimator by minimizing(1) 
where is a penalty function which favors estimators according to certain criteria. Our purpose is to recover nonzero active regions of . The main challenges are that we need to impose sparsity of pixel values and extract active regions simultaneously.
2.2 Existing Regularizers
TV and SCAD. The TV analysis plays a fundamental role in various image analyses since the pathbreaking works [13, 12]. We focus on the anisotropic version of TV. For , define the discrete gradient is defined by
The TV norm is just . The isotropic induced TV norm is , which is equivalent to the anisotropic induced TV norms up to a factor of .
The SCAD penalty is more conveniently defined its derivative
and . We use by convention. Consider a penalized least squares problem: minimize . The solution is unique, explicit, and , where is the thresholding function. Figure 1 displays the thresholding function for SCAD and the soft thresholding function for Lasso with and . The SCAD penalty shrinks small coefficients to zero while keeping the large coefficients without shrinkage.
GraphNet and TV. GraphNet and TV have been successful applied to medical images. GraphNet is the weighted average of an penalty on all coordinates and a squared penalty on the discrete gradient, while TV is the weighted average of an penalty and a TV penalty:
For both penalties, is the smoothing parameter which controls the strength of regularization and is another smoothing parameter controlling the tradeoff between pixel sparsity and spatial regularity.
2.3 A New Penalty: SCAD2TV
For each coordinate , the discrete gradient involves three coordinate values . Let . The SCAD2TV penalty is defined by
(2) 
where and are two tuning parameters, and is the SCAD function. The first term in the penalty allows adaptive estimation of the coefficient image and the second one enforces sparsity on coordinate values. One may also consider the functional version of (2). After some rescaling, (2) is equivalent to
(3) 
The SCAD2TV solves the bias problem inherent in the TV and Lasso models. Note that this penalty function, unlike the penalty used in Lasso, is not convex, so that (1) is a nonconvex objective function. We solve this problem based on the ADMM and convert it into two subproblems with closedform solutions. In general, ADMM has successful applications to convex problems. The behavior of ADMM applied to nonconvex problems has been a mystery. Recently, the global convergence of ADMM in nonconvex optimization is discussed in [17], which shows that several ADMM algorithms including SCAD are guaranteed to converge.
3 Local Region Learning by SCAD2TV
3.1 Algorithm based on ADMM
Our proposed algorithm is based on ADMM [1]. We may write (1) as the matrix form by an abuse of notation:
(4) 
where is the veritorized response image for subject , is the fixed extended design matrix related to the covariate for subject , and is the concatenated vectorized unknown coefficient image. Furthermore, one of the advantages of SCAD2TV in (2) is that we can write as for a fixed by matrix depending only on , which greatly facilitates the efficiency of our algorithm. This fact can be easily seen since the elements involved in the term in (2) are
for a fixed matrix . So is the concatenated version of .
Problem (4) is equivalent to
We form the augmented Lagrangian as
The ADMM consists of the iterations
(5)  
(6)  
(7) 
It should be emphasized that both (5) and (6) have the explicit solutions. Specifically, (5
) is a ridge regression problem and (
6) is a penalized least squares problem with the identity design matrix. The closedform solutions for (5) and 6 are, respectively,The details of the algorithm is summarized in Algorithm 1.
The output is either or . Note that is a sparse solution and may not be sparse. In practice, we extract the coefficient image estimator from the output to obtain the sparse estimator.
3.2 Divide and Conquer Learning Algorithm for Large Image Size
To address the big data issue, a divide and conquer (D&C) algorithm is a solution by recursively breaking down a problem into two or more subproblems of the same or related type, until these become simple enough to be solved directly. The solutions to the subproblems are then combined to give a solution to the original problem. In the above discussion, we assume that, for each subject , all image coordinate values are used together to infer the coefficient images . However, in many applications may be large and such a “batch” procedure is undesirable. In order to solve this issue, we develop a D&C algorithm for large image size.
Image data have their intrinsic structure and we need proceed the divide step with extra caution. For example, we may just partition each image as nonoverlap subimages , with the data processed sequentially. Due to the TV term in SCAD2TV, we lose the boundary information for all subimages and this will give poor estimates of the boundary for all subimages. We propose to partition with overlapped subimages . For instance, Figure 2 displays a image, and it is straightforward to partition them into nine nonoverlapped subimages. For our purpose, we extend each subimage in four directions. Specifically, on the left up corner, we include both the light blue part and the light yellow part, which makes our first subimage a image. In the center, we take the inside orange part together with the brown part, which makes it a image. After this partition, we obtain nine overlapped subimages. We perform Algorithm 1 on each subimage and update the coefficient images. For each update we only keep the estimate for the original subimage.
The above D&C algorithm can be executed sequentially in a single machine. This algorithm is naturally adapted for execution in multiprocessor machines, especially sharedmemory systems where the communication of data between processors does not need to be planned in advance, because distinct subproblems can be executed on different processors.
4 Empirical Results
4.1 Synthetic data
We design a synthetic data example to compare the performance among three approaches: SCAD2TV, GraphNet, and TV in terms of both prediction and selection errors.
Data Generation. In our setting, where each is a image (See the left panel of Figure 3). The covariate is and each
is generated from a uniform distribution between
and . The spatial field is generated from a zero mean Gaussian random field. The error process is the white noise with mean zero and variance . Two noise levels are adopted at . The sample size is .Applying SCAD2TV. In order to examine the performances of three methods, SCAD2TV, GraphNet, and TV, we have generated datasets for each setting. For each dataset, we obtain the coefficient image estimates from these three methods. The selection rate is define as
(8) 
and the mean squared error is defined as
(9) 
where is the total number of pixels and the are the predicted images.
Practical Consideration. Smoothing parameters can be selected by using the Kfold crossvalidation (CV). However, its computational time can be long even under current computing facilities. In our experiment, we have tested a few different values for the tuning parameters such as and . We find is a good balance for the estimation. The value of is related to our expectation of ROI. If the ROI has a sharp boundary and the values do not change much inside ROI, we can use a large . Otherwise, a smaller would be preferred. We choose , and .
Results. For each setting, the experiments using SCAD2TV, GraphNet, TV are repeated times. Figure 3 displays the estimates of coefficient images from one realization. We note that SCAD2TV provides the solution almost exact the same as the truth. TV can keep the sharp boundary but provide biased estimates inside the active zone. GraphNet displays blurred estimates for both active zone and zero subregions. The average of the selection rates and the MSEs are reported in Table 1. It is noted that, in terms of the selection rate, SCAD2TV performs better consistently than the other two methods. On the other hand, in terms of the prediction error, SCAD2TV and TV are similar to each other, and GraphNet gives the highest MSE.
SCAD2TV  TV  GraphNet  

SR  MSE  SR  MSE  SR  MSE  
= 1  0.820  1.97  0.783  1.98  0.534  1.98  
0.728  0.740  0.583  
0.677  0.672  0.508  
= 0.1  0.9995  0.020  0.9675  0.024  0.8188  0.044  
0.9983  0.9417  0.8210  
0.9990  0.9070  0.8472 
4.2 Hippocampus Data
Dataset. To illustrate the usefulness of our proposed model, consider anatomical MRI data collected at the baseline by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, which is a large scale multisite study collecting clinical, imaging, and laboratory data at multiple time points from healthy controls, individuals with amnestic mild cognitive impairment, and subjects with Alzheimer’s disease. Given the MRI scans, hippocampal substructures were segmented with FSL FIRST [11] and hippocampal surfaces were automatically reconstructed with the marching cube method [8]. We adopted a surface fluid registration based hippocampal subregional analysis package [14], which uses isothermal coodinates and uid registration to generate onetoone hippocampal surface registration for surface statistics computation.
In the dataset, we have total observations. For each subject, it includes a 2D representation of left hippocampus and covariates: gender (female=0 and male=1), age (5592), disease status (control=0 and AD=1), and behavior score (136). The goal is to identify local regions of the response image associated with each covariate.
Applying SCAD2TV. We have applied our D&C learning algorithm to this dataset. We divide each response image into overlapped subimages. We execute the algorithm sequentially in a single machine. The algorithm spends about 10 secs for each partition, and takes 25 minutes to get the final estimation of ’s. The estimated coefficient images are presented in the top panel of Figure 4.
Results. Our purpose is to identify the local regions where the response hippocampus image is associated with each individual covariate. Among all of the covariates, we are particularly interested in the association between the response hippocampus image and the disease status (Control vs AD). From the top panel of Figure 4, it is interesting to notice that gender has no effect on the response image, and for other three covariates SCAD2TV has been successfully identify the local active regions.
We take a close investigation on the coefficient image corresponding to the disease status. The left panel of Figure 5 displays the estimate of . The subregions in red indicate the active zone and the region in orange is the zero subregion. In general, the AD patients have lower pixel values in the response hippocampus image. The right two panels are the mean response images for both health control and AD. The mean difference is consistent with our estimation of .
We extract the pixels within the ROI for health controls and AD, and apply hypothesis testing to test if their difference is significant. By applying hypothesis testing on each pixel in the ROI, all of them are different between health controls and AD at the significance level , and of the pixels in the ROI are different at the significance level . The result justifies our ROI selection is indeed the region to differentiate between health controls and AD.
Comparison with TV and GraphNet. We obtain the estimates of the coefficient images for TV and GraphNet, which are presented in the middle and bottom panels of Figure 4. These three methods overall detects similar regions. Both TV and GraphNet display more blocky active regions, whereas SCAD2TV keep the active zone with sharp boundaries. We also divide our dataset into 5 parts to compare the prediction performance where each dataset contains around 80 observations. Every time we use 4 of them as the training data, and make prediction on the remaining testing data. The averages of the MSEs are computed for each methods, which are reported in Table 2. The MSEs are similar to each other and SCAD2TV displays a slightly better prediction power.
SCAD2TV  TV  Graphnet  

MSE  0.5476  0.5482  0.5491 
5 Conclusion
We have introduced a new regionselecting sparse nonconvex penalty, SCAD2TV, which enforces large regions of zero subimages and extracts nonzero active zones simultaneously. Efficient algorithm and the distributed algorithm have been developed. Numerical examples are presented and the experimental results are superior or competitive with other stateoftheart approaches such as GraphNet and TV. We have sofar focused on 2D images. It should be noted that our method works for 3D images as well. We are currently implementing our algorithm in the distributed platform such as Apache Spark. We have discussed the application for imageonscalar regression models. This new framework may also be applied to the image clustering and image classification problems, which assume that only small regions of the images have significant effects on clustering and classification.
References
 Boyd et al. [2011] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
 Chen et al. [2015] Y. Chen, J. Goldsmith, and T. Ogden. Variable selection in functiononscalar regression. 2015.
 Chopra and Lian [2010] A. Chopra and H. Lian. Total variation, adaptive total variation and nonconvex smoothly clipped absolute deviation penalty for denoising blocky images. Pattern Recognition, 43(8):2609–2619, 2010.

Dohmatob et al. [2014]
E. D. Dohmatob, A. Gramfort, B. Thirion, and G. Varoquaux.
Benchmarking solvers for tvℓ 1 leastsquares and logistic regression in brain imaging.
In Pattern Recognition in Neuroimaging, 2014 International Workshop on, pages 1–4. IEEE, 2014.  Eickenberg et al. [2015] M. Eickenberg, E. Dohmatob, B. Thirion, and G. Varoquaux. Grouping total variation and sparsity: Statistical learning with segmenting penalties. In Medical Image Computing and ComputerAssisted Intervention–MICCAI 2015, pages 685–693. Springer, 2015.
 Fan and Li [2001] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.

Goldsmith et al. [2015]
J. Goldsmith, V. Zipunnikov, and J. Schrack.
Generalized multilevel functiononscalar regression and principal component analysis.
Biometrics, 71(2):344–353, 2015.  Lorensen and Cline [1987] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In ACM siggraph computer graphics, volume 21, pages 163–169. ACM, 1987.
 Mehranian et al. [2013] A. Mehranian, H. S. Rad, A. Rahmim, M. R. Ay, and H. Zaidi. Smoothly clipped absolute deviation (scad) regularization for compressed sensing mri using an augmented lagrangian scheme. Magnetic resonance imaging, 31(8):1399–1411, 2013.
 Mu and Gage [2011] Y. Mu and F. H. Gage. Adult hippocampal neurogenesis and its role in alzheimer’s disease. Molecular neurodegeneration, 6(1):1–9, 2011.
 Patenaude et al. [2011] B. Patenaude, S. M. Smith, D. N. Kennedy, and M. Jenkinson. A bayesian model of shape and appearance for subcortical brain segmentation. Neuroimage, 56(3):907–922, 2011.
 Rudin et al. [1992] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation noise removal algorithm. Phys D, 60:259–268, 1992.
 Rudin and Osher [1994] L. I. Rudin and S. Osher. Total variation based image restoration with free local constraints. In Image Processing, 1994. Proceedings. ICIP94., IEEE International Conference, volume 1, pages 31–35. IEEE, 1994.
 Shi et al. [2013] J. Shi, P. M. Thompson, B. Gutman, Y. Wang, A. D. N. Initiative, et al. Surface fluid registration of conformal representation: Application to detect disease burden and genetic influence on hippocampus. Neuroimage, 78:111–134, 2013.
 Strong et al. [1997] D. M. Strong, P. Blomgren, and T. F. Chan. Spatially adaptive localfeaturedriven total variation minimizing image restoration. In Optical Science, Engineering and Instrumentation’97, pages 222–233. International Society for Optics and Photonics, 1997.
 Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 Wang et al. [2015] Y. Wang, W. Yin, and J. Zeng. Global convergence of admm in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2015.