# Robust High-Dimensional Regression with Coefficient Thresholding and its Application to Imaging Data Analysis

It is of importance to develop statistical techniques to analyze high-dimensional data in the presence of both complex dependence and possible outliers in real-world applications such as imaging data analyses. We propose a new robust high-dimensional regression with coefficient thresholding, in which an efficient nonconvex estimation procedure is proposed through a thresholding function and the robust Huber loss. The proposed regularization method accounts for complex dependence structures in predictors and is robust against outliers in outcomes. Theoretically, we analyze rigorously the landscape of the population and empirical risk functions for the proposed method. The fine landscape enables us to establish both statistical consistency and computational convergence under the high-dimensional setting. The finite-sample properties of the proposed method are examined by extensive simulation studies. An illustration of real-world application concerns a scalar-on-image regression analysis for an association of psychiatric disorder measured by the general factor of psychopathology with features extracted from the task functional magnetic resonance imaging data in the Adolescent Brain Cognitive Development study.

## Authors

• 8 publications
• 139 publications
• 16 publications
• 12 publications
• 29 publications
07/05/2021

### Multivariate functional group sparse regression: functional predictor selection

In this paper, we propose methods for functional predictor selection and...
10/26/2021

### Phase I Analysis of High-Dimensional Multivariate Processes in the Presence of Outliers

One of the significant challenges in monitoring the quality of products ...
05/17/2019

### ACE of Space: Estimating Genetic Components of High-Dimensional Imaging Data

It is of great interest to quantify the contributions of genetic variati...
12/18/2018

### Robust functional ANOVA model with t-process

Robust estimation approaches are of fundamental importance for statistic...
11/12/2021

### High-Dimensional Functional Mixed-effect Model for Bilevel Repeated Measurements

The bilevel functional data under consideration has two sources of repea...
07/02/2018

### Generative discriminative models for multivariate inference and statistical mapping in medical imaging

This paper presents a general framework for obtaining interpretable mult...
12/25/2017

### Overcomplete Frame Thresholding for Acoustic Scene Analysis

In this work, we derive a generic overcomplete frame thresholding scheme...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Regression analysis of high-dimensional data has been extensively studied in a number of research fields over the last three decades or so. To overcome the high-dimensionality, statistical researchers have proposed a variety of regularization methods to perform variable selection and parameter estimation simultaneously. Among these, the

regularization enjoys the oracle risk inequality (Barron et al., 1999) but it is impractical due to its NP-hard computational complexity. In contrast, the regularization (Tibshirani, 1996) provides an effective convex relaxation of the regularization and achieves variable selection consistency under the so-called irrepresentable condition (Zhao and Yu, 2006; Zou, 2006; Wainwright, 2009). The adaptive regularization (Zou, 2006) and the folded concave regularization (Fan and Li, 2001; Zhang, 2010) relax the irrepresentable condition and improve the estimation and variable selection performance. The folded concave penalized estimation can be implemented through a sequence of the adaptive penalized estimations and achieves the strong oracle property (Zou and Li, 2008; Fan et al., 2014).

Despite these important advances, existing methods, including the (adaptive) regularization and folded concave regularization, do not work well when predictors are strongly correlated, which is the case especially in scalar-on-image regression analysis (Wang et al., 2017; Kang et al., 2018; He et al., 2018). This paper is motivated by the needs of analyzing the -back working memory task fMRI data in the Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018). The task-invoked fMRI imaging measures the blood oxygen level signal that is linked to personal neural activities when performing a specific task. The -back task is a commonly used approach to making assessment in psychology and cognitive neuroscience with a focus on working memory. One question of interest is to understand the association between the risk of developing psychiatry disorder and features related to functional brain activity. We use the 2-back versus 0-back contrast map statistics derived from the -back task fMRI data as imaging predictors. We aim at identifying important imaging biomarkers that are strongly associated with the general factor of psychopathology (GFP) or “p-factor”. GFP is a psychiatric disorder outcome used to evaluate the overall mental health of a subject. In this application, it is expected that the irrepresentable condition of the regularization can be easily violated by strong dependence among high dimensional imaging predictors from fMRI data. To illustrate the presence of strong dependence among imaging predictors, in Figure 1, we plot the largest absolute value of correlation coefficients and the number of correlation coefficients that are or between brain regions. Obviously, there exists strong dependence among imaging predictors, so that existing methods may not have a satisfactory performance in the scalar-on-image analysis. See the simulation study in Section 4 for more details.

To effectively address potential technical challenges in the presence of such strongly correlated predictors, we consider a new approach based on the thresholding function technique. The rationale behind our idea is rooted in attractive performances given by various recently developed thresholding methods, including the hard-thresholding property of the regularization shown in Fan and Lv (2013). They showed that the global minimizer in the thresholded parameter space enjoys the variable selection consistency. Thus, with proper thresholding of coefficients, it is possible to significantly relax the irrepresentable condition while to address the strong dependence in the scalar-on-image regression analysis. Recently, manifested by the potential power of the thresholding strategy, Kang et al. (2018) studied a new class of Bayesian nonparametric models based on the soft-thresholded Gaussian prior, and Sun et al. (2019) proposed a two-stage hard thresholding regression analysis that applies a hard thresholding function on the initial -penalized estimator. To achieve the strong oracle property (Fan et al., 2014), Sun et al. (2019) required the initial solution is reasonably close to the true solution in aspect of

norm with high probability.

Robustness against outliers occurring from heavy-tailed errors is of great importance in the scalar-on-image analysis. Due to various limitations of used instruments and quality control in data preprocessing, fMRI data often involves many potential outliers(Poldrack, 2012)

, compromising the stability and reliability of standard regression analyses. fMRI indirectly measures neural activity by assessing blood-oxgen-level-dependent signals and its signal-to-noise ratio is often low

(Lindquist and others, 2008). Moreover, the complexity of fMRI techniques limits the capacity of unifying fMRI data preprocessing procedures (Bennett and Miller, 2010; Brown and Behrmann, 2017) to identify and remove outliers effectively. Standard regression analysis with contaminated data may lead to a high rate of false positives in inference, as shown in many empirical studies (Eklund et al., 2012, 2016). It is loudly advocated that potential outliers should be taken into account in the study of brain functional connectivity using fMRI data (Rosenberg et al., 2016). In the lens of robustness, many works have been proposed to study the high dimensional robust regression problem. El Karoui et al. (2013)

studied the consistency of regression with a robust loss function such as the least absolute deviation (LAD). In a high-dimensional robust regression,

Loh (2017) showed that the use of a robust loss can help achieve the optimal rate of regression coefficient estimation with independent zero-mean error terms. In addition, Loh (2018) showed that by calibrating with a scale estimator in the Huber loss, the regularized robust regression estimator can be further improved.

In the current literature of the high dimensional scalar-on-image regression, Goldsmith et al. (2014) introduced a single-site Gibbs sampler that incorporates spatial information in a Bayesian regression framework to perform the scalar-on-image regression. Li et al. (2015) introduced a joint Ising and Dirichlet process prior to develop a Bayesian stochastic search variable selection. Wang et al. (2017) proposed a generalized regression model in which the image is assumed to belong to the space of bounded total variation incorporating the piece-wise smooth nature of fMRI data. Motivated by these works, in this paper we first introduce a new integrated robust regression model with coefficient thresholding and then propose a penalized estimation procedure with provable theoretical guarantees, where the noise distribution is not restricted to be sub-Gaussian. Specifically, we propose to use a smooth thresholding function to approximate the discrete hard thresholding function and tackle the strong dependence of predictors together with the use of the smoothed Huber loss (Charbonnier et al., 1997) to achieve desirable robust estimation. We design a customized composite gradient descent algorithm to efficiently solve the nonconvex and nonsmooth optimization problem. The proposed coefficient thresholding method is capable of incorporating intrinsic group structures of high-dimensional imaging predictors and dealing with their strong spatial and functional dependencies. Moreover, the proposed method effectively improves robustness and reliability.

The proposed regression with the coefficient thresholding method results in a nonconvex objective function in optimization. In the current literature, it becomes an increasingly important research topic to obtain the statistical and computational guarantees for nonconvex optimization methods. The local linear approximation (LLA) approach (Zou and Li, 2008; Fan et al., 2014, 2018) and the Wirtinger flow method (Candes et al., 2015; Cai et al., 2016) directly have enabled to analyze the computed local solution. The restricted strong convexity (RSC) condition (Negahban et al., 2012; Negahban and Wainwright, 2012; Loh and Wainwright, 2013, 2017) and the null consistency condition (Zhang and Zhang, 2012) were used to prove the uniqueness of the sparse local solution. However, it still remains non-trivial to study theoretical properties of the proposed robust regression with coefficient thresholding. The nonconvex optimization cannot be directly solved by the LLA approach, and the RSC condition does not hold. Alternatively, following the seminal paper by Mei et al. (2018), we carefully study the landscape of the proposed method. We prove that the proposed nonconvex loss function has a fine landscape with high probability and also establish the uniform convergence of the directional gradient and restricted Hessian of the empirical risk function to their population counterparts. Thus, under some mild conditions, we can establish key statistical and computational guarantees. Let be the sample size, be the dimension of predictors, and the size of the support set of true parameters. Specifically, we prove that, with high probability, (i) any stationary solution achieves the oracle inequality under the norm when ; (ii) the proposed nonconvex estimation procedure has a unique stationary solution that is also the global solution when ; and (iii) the proposed composite gradient descent algorithm attains the desired stationary solution. We shall point out that both statistical and computational guarantees of the proposed method do not require a specific type of initial solutions.

The rest of this paper is organized as follows. Section 2 first revisits the irrepresentable condition and then proposes the robust regression with coefficient thresholding and nonconvex estimation. Section 3 studies theoretical properties of the proposed method, including both statistical guarantees and computational guarantees. The statistical guarantees include the landscape analysis of the nonconvex loss function and asymptotic properties of the proposed estimators, while the computational guarantees concern the convergence of the proposed algorithm. Simulation studies are presented in Section 4 and the real application is demonstrated in Section 5. Section 6 includes a few concluding remarks. All the remaining technical details are given in the supplementary file.

## 2 Methodology

We begin with revisiting the irrepresentable condition and its relaxations in Subsection 2.1. After that, Subsection 2.2 proposes a new high-dimensional robust regression with coefficient thresholding.

### 2.1 The Irrepresentable Condition and its Relaxations

In the high-dimensional linear regression

is an

-dimensional response vector,

is an deterministic design matrix, is a -dimensional true regression coefficient vector and is an error vector with mean zero. We wish to recover the true sparse vector , whose support satisfies that its cardinality . We use as the shorthand for the support set .

The irrepresentable condition is known to be sufficient and almost necessary for the variable selection consistency of the penalization (Zhao and Yu, 2006; Zou, 2006; Wainwright, 2009). Specifically, the design matrix should satisfy that

 maxj∈Sc |xTjXS(XTSXS)−1sgn(β⋆S)|≤(1−γ),

for some incoherence constant , This condition requires that the variables in the true support are weakly correlated with other variables that are not in the true support.

There are several versions of relaxation of the irrepresentable condition in the current literature. Zou (2006) used the adaptive weights ’s in the penalization and relaxed the irrepresentable condition as:

 maxj∈Sc 1^wj|xTjXS(XTSXS)−1^wS∘sgn(β⋆S)|≤(1−γ),

where denotes the Hadamard (or componentwise) product of two vectors. The folded concave penalization  (Fan and Li, 2001; Zhang, 2010) also relaxed the irrepresentable condition. Let be the folded concave penalty function. Given the local linear approximation and the current solution , the folded concave penalization relaxed this condition as:

 maxj∈Sc 1|˙pλ(~βj)||xTjXS(XTSXS)−1|˙pλ(~βS)|∘sgn(β⋆S)|≤(1−γ),

where is the subgradient of the function . Both the adaptive penalization and folded concave penalization utilized the differential penalty functions to relax the restrictive irrepresentable condition. The procedure of adaptive lasso and solving folded concave penalized problem using local linear approximation (LLA) both require a good initialization. Their dependence on the initial solution is inevitably affected by the irrepresentable condition. As a promising alternative, we consider the assignment of adaptive weights on the design matrix , which down-weights the unimportant variables. Consider the ideal adaptive weight function , where goes to when and goes to , otherwise. We propose the following nonconvex estimation procedure:

 minβ12nn∑i=1{yi−p∑j=1xijg(βj)βj}2+λ∥β∥1.

Let . The irrepresentable condition is now relaxed in this paper as follows:

 maxj∈Sc |g(βj)⋅xTjZS(ZTSZS)−1∘sgn(β⋆S)|≤(1−γ).

Given the ideal adaptive weight function , the proposed approach further relaxes the irrepresentable condition in comparison to those considered by existing methods. Unlike existing methods, we propose an integrated nonconvex estimation procedure without depending on initial solutions.

### 2.2 The Proposed Method

The weight function plays an important role in relaxing the irrepresentable condition. If we knew the oracle threshold , the hard thresholding function would be the ideal choice for the weight function . However, the discontinuity of is challenging for associated optimization. To overcome, we consider a smooth approximation of given by

 gτ,η(u)=hτ(u−η)+hτ(−u−η),

where . Since as when , we have as when . Figure 2 illustrates the smooth approximation of to when is small (e.g.,

Given the above smooth thresholding function , we propose the robust regression with coefficient thresholding:

 minβ1nn∑i=1L(yi−p∑j=1xijβjgτ,η(βj))subject to∥β∥2≤r, (1)

where following Mei et al. (2018) and Loh and Wainwright (2017) we assume that the regression coefficients are bounded in the Euclidean ball . Here, to tackle possible outliers, we choose to be a differentiable and possibly nonconvex robust loss function such as the pseudo Huber loss (Charbonnier et al., 1997) or Tukey’s biweight loss. For the ease of presentation, throughout this paper, we use the pseudo Huber loss of the following form:

 L(a)=ω2{√1+(a/ω)2−1},a∈R,ω∈R. (2)

Note that provides a smooth approximation of the Huber loss (Huber, 1964) and bridges over the loss and the loss. Specifically, is approximately an loss when is small and an loss with slope when is large. It is worth pointing out that the utility of the pseudo Huber loss essentially adds a weight on each observation by in the following form:

 w(xi)=√1+(yi−^yi)2/w2(yi−^yi)2,

where is a fitted value. Hence, outliers are down-weighted to alleviate potential estimation bias.

To deal with a large number of predictors, we propose the following penalized estimation of high-dimensional robust regression with coefficient thresholding:

 minβ{1nn∑i=1L(yi−p∑j=1xijβjgτ,η(βj))+pλ(|β|)}subject to∥β∥2≤r, (3)

where is a chosen penalty function such as the penalization or folded concave penalization. In the scalar-on-image analysis, often we need to incorporate some known group information into . Suppose that the coefficient vector is divided into separate subvectors . We can use the group penalty (Yuan and Lin, 2006) or sparse group penalty (Simon et al., 2013) in the penalized estimation (3). With , the sparse robust regression with coefficient thresholding and group selection can be written as follows:

 minβ{1nn∑i=1L(yi−p∑j=1xijβjgτ,η(βj))+λB∑b=1∥βb∥2}subject to∥β∥2≤r, (4)

which includes the penalization as a special example with . The penalized robust regression with coefficient thresholding (3) and (4) can be efficiently solved by a customized composite gradient descent algorithm with provable guarantees, and the details will be presented in Subsection 3.3.

###### REmark 1.

Both thresholding function and robust loss work together to relax the restrictive irrepresentable condition in the presence of possible outliers. Define as the weight matrix with . The irrepresentable condition for the -penalized robust loss can be written as

 maxj∈Sc |wj∘xTjAS(ATSAS)−1⋅sgn(β⋆s)|≤(1−γ),

where . We impose the weights on the row vectors of design matrix . Recall that the thresholding function adds the weights to the column vectors of . Now, with in (3), we add the entry-specific weight to with

 ~wij=w(xij)∘g(βj).

Thus, the irrepresentable condition is now relaxed as follows:

 maxj∈Sc |~wj∘xTjZS(ZTSZS)−1⋅sgn(β⋆s)|≤(1−γ),

where . By assigning column-wise weight, we allow strong dependence between variables in the true support and other variables; By assigning row-wise weight, we reduce the weights for outliers and enhance the robustness.

###### REmark 2.

The proposed method can be considered as the simultaneous estimation of regression coefficients and adaptive weights to improve the adaptive penalization (Zou, 2006), whose weights are usually solved from the initial solution or iterated solution. Let . Since is a continuous and injective function of (say, ), there is a unique such that holds. Hence, we may write . The minimization in the proposed high-dimensional robust regression with coefficient thresholding can be rewritten as follows:

 minξ{1nn∑i=1L(yi−p∑j=1xijξj)+λp∑j=1|ξj|gτ,η(G−1(ξ)j)}. (5)

However, it does not fall into the folded-concave penalized estimation (Fan and Li, 2001; Fan et al., 2014). The proposed method aims to address the presence of highly correlated predictors, whereas the nonconvex penalized estimation was motivated by correcting the bias of the -penalized estimation. Also, solving (5) is extremely challenging, since the denominator can be arbitrarily small when . In comparison, the thresholding function is bounded and the non-convex optimization in (4) is computationally tractable.

###### REmark 3.

The proposed method is also related to Bayesian methods (Nakajima and West, 2013b, a, 2017; Kang et al., 2018; Ni et al., 2019; Cai et al., 2019). In the scalar-on-image regression, Kang et al. (2018) proposed a soft-thresholded Gaussian process (STGP) that enjoys posterior consistency for both parameter estimation and variable selection. In this Bayesian framework, is assumed to follow an STGP, where the soft thresholding function is defined as

 gη(β)=sgn(β)(|β|−λ)1{|β|>λ}. (6)

Thus the regression model can be written as follows:

 y=Xsgn(β)(|β|−λ)1{|β|>λ}(β)+ε,

where is a realization of a Gaussian process. Compared to Kang et al. (2018), our proposed method is more robust to possible outliers, and uses a very different approach to incorporate the thresholding function that down-weights unimportant variables and achieves sparsity. The proposed method also enjoys both statistical and computational guarantees as detailed in Section 3, whereas the convergence rate of the posterior computation algorithm in Kang et al. (2018) is still unclear.

## 3 Theoretical Properties

This section studies the theoretical properties of our proposed method. After establishing the connection of our proposed method with the thresholded parameter space in Subsection 3.1, we present the landscape analysis and asymptotic properties in Subsection 3.2, and the computational guarantee for an efficient composite gradient descent algorithm in Subsection 3.3.

### 3.1 Connection with the Thresholded Parameter Space

It is known that -penalization enjoys the oracle risk inequality (Barron et al., 1999). Fan and Lv (2013) proved that the global solution of -penalization, denoted by , satisfies the hard thresholding property that is either or larger than a positive threshold whose value depends on and and tuning parameter . Specifically, Fan and Lv (2013) studied the oracle properties of regularization methods over the thresholded parameter space defined as

 Bη={β∈Rp: ∥β∥2

Given the thresholded parameter space , we introduce the high-dimensional regression with coefficient hard-thresholding as follows:

 minβ{1nn∑i=1L(yi−p∑j=1xijβjI{|βj|≥η})+pλ(|β|)}subject to∥β∥2≤r. (8)

This thresholded parameter space guarantees that the hard thresholding property is satisfied with proper and norm upper bounds. Suppose that global solutions to both (3) and (8) exist. It is intuitive that if is very small, an optimal solution to (3) should be “close” to that of (8). Apparently, converges to pointwisely almost everywhere as . However it is known that almost surely pointwise convergence does not guarantee the convergence of minimizers (e.g., (Rockafellar and Wets, 2009, Section 7.A)). Here we provide the following proposition to show that the global solution of the proposed method (3) converges to the minimizer of coefficient hard-thresholding in (8) with additional assumptions.

###### PRoposition 1.

Suppose that is a sequence of scale parameters in the weight function such that goes to as . Let be a global minimizer of (3) with . If

• and

• For any , there exists , such that

 max1≤j≤pP(||^βkj|−η|<δ)<ϵ,

then with arbitrary high probability, the sequence enjoys a cluster point , which is the global minimizer of the high-dimensional regression with coefficient hard-thresholding in (8).

###### REmark 4.

Conditions (i) and (ii) of Proposition 1 are mild in our setting. Condition (i) assumes that the threshold level is smaller than the minimal signal level and the norm of true regression coefficients is bounded. Condition (ii) assumes the magnitude of the estimated is bounded away from the threshold level with high probability, that is, , . Note that the true regression coefficients ’s are either zero or larger than the threshold level . As long as the estimator is consistent (see Theorem LABEL:theorem5), Condition (ii) can be satisfied.

Given Proposition 1, we can prove that the total number of falsely discovered signs of the cluster point , i.e., the cardinality , converges to zero for the folded concave penalization under the squared loss and sub-Gaussian tail conditions. Recall that the folded concave penalty function satisfies that (i) it is increasing and concave in , (ii) , and (iii) it is differentiable with for some . Define as the smallest possible positive integer such that there exists an submatrix of

having a singular value less than a given positive constant

. is named robust spark in (Fan and Lv, 2013) to ensure model identifiability. Note that is an upper bound on the total number of false positives and false negatives. The following proposition implies the variable selection consistency of the cluster point .

###### PRoposition 2.

Suppose we re-scale each column vector of the design matrix for each predictor to have -norm , and the error satisfies the sub-Gaussian tail condition with mean 0. and for some constant . And there exists constant such that . Let be the support of and minimal signal strength . Define as the global minimizer of the following optimization problem:

 minβ{1nn∑i=1(yi−p∑j=1xijβjI{|βj|≥η})2+pλ(|β|)}subject to∥β∥2≤r,∥β∥0≤κc/2. (9)

If the penalization parameter is chosen such that for some constant , then with probability , we have

 card({j:sgn(^β)≠sgn(β⋆)})≤Csλ2η−2/(1−Cλ2η−2),

where and are constants.

Propositions 1 and 2 indicate that the cluster point enjoys the similar properties as those of the global solution given by the -penalized methods. The proposed method mimics the -penalization to remove irrelevant predictors and improve the estimation accuracy as .

In the sequel, we will show that with high probability our proposed method has a unique global minimizer (See Theorem 2(b)). In this aspect, our proposed regression with coefficient thresholding estimator provides a practical way to approximate the global minimizer over the thresholded space.

### 3.2 Statistical Guarantee

After presenting technical conditions, we first provide the landscape analysis and establish the uniform convergence from empirical risk to population risk in Subsection 3.2.1 and then prove the oracle inequalities for the unique minimizer of the proposed method in Subsection 3.2.2.

Now we introduce the notation to simplify our analysis. Let be a shorthand of , and let on . Given fixed , we claim that is third continuously differentiable on its domain. After the direct calculations given in Lemma 1 of the supplementary file, we have the explicit upper and lower bounds for the derivatives of , i.e.,

 k0–––Ip×p≼DG(β)≼¯¯¯¯¯k0Ip×p, D2G(β)≼¯¯¯¯¯¯¯m0Ip×p×p, D3G(β)≼¯¯¯¯¯s0Ip×p×p×p,

where , , and . Here, and are uniform constants independent of . And represents that is semi positive definite.

For the group lasso penalty, denote by the non-zero groups and by the zero groups, respectively. Following the structure of classical group lasso, let and be the corresponding length of vector and , respectively. Let , , and , where is finite, not diverging with and .

We make the following assumptions on the distribution of predictor vector , the true parameter vector and the distribution of random error .

###### ASsumption 1.
• The predictor vector is -sub-Gaussian with mean zero and continuous density in , namely and for all .

###### THeorem 1.

Let be the -th iterated solution of Algorithm 1. There exist constants , and are independent of such that when we choose and , we can find one iteration that can be upper bounded by , such that there exists a subgradient we denote as :

 ∥∇ˆRn(^β(k))+λB∑b=1g(^β(k)b)∥2≤ϵ

where denotes the sub-differential of the group penalty function (Rockafellar, 2015).

###### REmark 5.

Theorem 3 provides a theoretical justification of the algorithmic convergence rate. More specifically, the proposed algorithm always converges to an approximate stationary solution (a.k.a. -stationary solution) at finite sample sizes. In other words, for the solution solved after iterations, the subgradient of the objective function is bounded by under the norm at finite sample sizes. When increases, the proposed algorithm will find the stationary solution that satisfies the subgradient optimality condition as . To better understand the algorithmic convergence in practice, we also provide the convergence plots and computational cost of the proposed algorithm in simulation studies in Section C of the supplementary file. Combining Theorems 2 and 3, we may obtain the convergence guarantee with high probability. From both theoretical and practical aspects, the proposed algorithm is computationally efficient and achieves the desired computational guarantee. Observing that the nice empirical gradient structure proved in Theorem 2, in Appendix Section C, we further provide Theorem 4, which is an extension of Theorem 3. It further shows that given the solution is sparse, the composite gradient descent algorithm guarantees a linear convergence rate of , with respect to the number of iterations.

## 4 Simulation Studies

This section carefully examines the finite-sample performance of our proposed method in simulation studies. To this end, we compare the proposed robust estimator with coefficient thresholding (denoted by RCT) to Lasso, Adaptive Lasso (denoted by AdaLasso), SCAD and MCP penalized estimators in eight different linear regression models (Models 1–8) and with Lasso, Group Lasso (denoted by GLasso), and Sparse Group Lasso (denoted by SGL) in two Gaussian process regression models (Models 9 and 10). Models 9 and 10 mimic the scalar-on-image regression analysis. We implemented the Lasso and the Adaptive lasso estimators using R package ‘glmnet’, and chose the weight for adaptive lasso estimator based on the Lasso estimator. Group lasso is implemented using the method in (Yang, 2015). SCAD and MCP penalized estimators are implemented using R package ‘ncvreg’. And we also verify that the estimation results are consistency with R package ‘Picasso’.

We compare estimation performance based on the root-mean-square error (RMSE, that is ) and variable selection performance based on both false positive rate (FPR) and false negative rate (FNR). , and , where TN, TP, FP and FN represent the numbers of true negative, true positive, false positive and false negative respectively. Each measure is computed as the average over independent replications.

Before proceeding to simulation results, we discuss the choice of tuning parameters for the proposed method. The proposed method involves tuning over the penalization parameter , coefficient thresholding parameter , step size , and radius of the feasible region. Here, is chosen to be relatively small such that the algorithm does not encounter overflow issues. To simplify the tuning process, we set to be a small constant in our empirical studies; alternatively, can also be chosen according to an acceleration process in Nesterov (2013) to further achieve faster convergence. Also, is chosen to be relatively large to specify the appropriate feasible region, and can be specified as any arbitrary quantity that is smaller than the minimal signal strength. In addition, is chosen to be properly small to make our soft-thresholding function mimic the true thresholding function. It is important to point out that the algorithmic convergence and numerical results are generally robust to the choices of , , and . In our simulation studies, we choose , , . With these prespecified , , and , we choose the penalization parameter and using the -folded cross-validation based on prediction error. Additionally, can also be chosen as the quantile of the absolute values of non-zero coefficients of Lasso.

### 4.1 Linear Regression

First, we simulated data from the linear model where . We consider different correlation structures for in the following six simulation models:

 Model 1:σij=0.5|i−j|, AR1(0.% 5) Model 2:σij=0.6|i−j|, AR1(0.% 6) Model 3:σij=0.7|i−j|, AR1(0.% 7) Model 4:σij=0.4+0.6I(i=j), CS(0.4) Model 5:σij=0.5+0.5I(i=j), CS(0.5) Model 6:σij=0.6+0.4I(i=j), CS(0.6).

Models 1–3 have autoregressive (AR) correlation structures, in which the irrepresentable condition holds for Model 1 but fails for Models 2 and 3. Models 4-6 have the compound symmetry (CS) correlation structures and the irrepresentable condition fails in all these models.

We assume a mixture model for the error, , where is set much larger than . For Models 1–3, set and or in case (a), (b) or (c), respectively. For Models 4–6, set and , or in case (a), (b) or (c), respectively. For all the models, we choose , and to create a high dimensional regime.

Tables 1 and 2 summarize the simulation results for Models 1–6. Several observations can be made from this table. First, in Models 1–3, as the auto correlation increases, it becomes clearer that our robust estimator with coefficient thresholding (RCT) outperforms both Lasso ,Adaptive Lasso (AdaLasso) and nonconvex penalized estimators (SCAD and MCP). Nonconvex penalized estimators do not work well on all these settings, since they tend to choose too-large penalization, and it results in a much higher false negative rate than other methods. The Lasso estimator misses many true signals due to the high correlated , leading to that adaptive lasso also cannot perform well given the Lasso initials. Especially in Model 3, when the auto correlation is high, our thresholding estimator has much smaller FPR and FNR compared to lasso-type methods. Second, in case (a), where the effects of outliers are stronger than (b) and (c), our estimator achieves a relatively better performance than other two estimators thanks to the use of the pseudo Huber loss. Third, in more challenging Models 4–6, our estimator is able to identify more true predictors and well controls false positives. In summary, the proposed estimator is more favorable in high dimensional regression setting, especially when the predictors are highly correlated, such as AR1(0.7) model.

As we introduced in previous sections, nonconvex penalization such as MCP penalized regression does not require irrepresentable condition. However, in our simulation, we found that it does not work well in our setting. We find the major reason is that it does not work not as well as our estimator when dimension is too large comparable the sample size: , in our case. Here we summarize the regression simulation settings for major nonconvex penalized regression papers. Breheny and Huang (2011) has and

for linear regression setting. It has larger dimension of features under logistic regression setting, but nonconvex penalized estimator performs no better than

penalized estimator in the aspect of prediction. Zhang (2010) has and , with features being generated independently, in which the irrepresentable condition is not violated. Fan et al. (2014) used a setting of and with AR1(0.5) covariance matrix, which does not violate the irrepresentable condition. Loh and Wainwright (2017) used and various different sample sizes larger than . Wang et al. (2014) used a setting of and . In comparison to these reported simulation studies, our setting of ,

appears to be more challenging, in which larger variances in the generative model add additional difficulty in the analysis. Therefore MCP/SCAD penalized estimators do not perform well in these settings considered in our simulation studies, which clearly demonstrate the advantage of our proposed method.

### 4.2 Gaussian Process Regression

We now report simulation results to mimic the scalar-on-image regression. In the simulation, we consider a two-dimensional image structure and the whole region is generated by Gaussian processes. In these models, let and , and consider the following covariance structures:

 Model 7 (GP1(10)):σij=exp(−∥si∥2−∥sj∥2−10∥si−sj∥2), Model 8 (GP1(5)):σij=exp(−∥si∥2−∥sj∥2−5∥si−sj∥2),

where and are two points in the rectangle

. And the coefficients with nonzero effects are in a circle of the graph, with radius 0.1 for the true parameters. The values of the nonzero effects are generated by a uniform distribution on

, similar to the setting given in He et al. (2018). The errors follow the same mixture model, . For Models 7–8, set , and refer to case (a), (b) and (c) with and , respectively. A typical realization of for Model 7 is illustrated in Figure 3(a).

In this section, we also evaluate the soft-threholded Gaussian Process(STGP) estimator(Kang et al., 2018), which is a state-of-art scalar-on-image regression method. STGP assumes a spatial smootheness condition, which is satisfied by Model 7-10. The original STGP estimator requires knowledge of whether any two predictors are adjacent. Such information may be unavailable or cannot be implemented in many cases. For example, when the number of voxels are too large to be fully processed, we may have to preselect a subset of voxels or use certain proper dimension reduction techniques to overcome related computational difficulties. Given that all other methods do not need such adjacency information, to fairly compare STGP with other methods, we consider two STGP estimators in our comparison. That is, STGP represents the original STGP estimator; STGP(no-info) represents STGP estimator without using adjacency information.

As shown in Table 3, Lasso, Adaptive Lasso, SCAD and MCP fail to identify most of the true predictors and have a very high FNR, while our proposed estimator is consistently the best among all these models. It also indicates that the thresholding function helps us deal with these very complicated covariance structures of predictors. RCT also outperforms STGP(no-info), and has compatible performance with original STGP estimator, despite our estimator are not restricted with adjacency information.

We consider two more cases when the group lasso penalty is necessary to be applied. In Models 9–10, we partition the whole image space into 25 sub-regions with equal numbers of predictors in each region. The covariance structure inside each region is the same as Models 7–8. Correlations between different regions are 0.9. Then we randomly assign two regions with each non-zero regions of radius 0.13, which makes around 1/3 of the points within the selected region has non-zero effects. We assign all non-zero effects as 2. The covariance structures can be summarized as:

 Model 9 (GP25(10)): x∼N(μb,Σ), σij=exp(−∥si∥2−∥sj∥2−10∥si−sj∥2), Model 10 (GP25(5)):x∼N(μb,Σ), σij=exp(−∥si∥2−∥sj∥2−5∥si−sj∥2),

where is the mean for region , and , . For the noise term, we still set and as case (a),(b) and (c), respectively, and . For these two models, we compare performances of the Lasso, the group Lasso (GLasso), the sparse Group Lasso (SGL), STGP and the proposed estimator (RCT) in Table 4. As a key difference from Models and , Models and impose the group structure on both and . Figure 3(b) shows an illustration of the simulated imaging predictor in Models 7 and 9.

In Table 4, we include the region level false positive rate (Region FPR) and region level false negative rate (region FNR) to measure the variable selection accuracy within each region. They are computed based on whether there is at least one variable in the region is selected. Compared with the group penalty, the proposed method identified almost all the correct groups with zero false negatives and achieved a more satisfying FPR and the loss. The proposed method is less likely to over-select either predictors or regions than the sparse group Lasso. It implies that the thresholding function effectively achieved a lower FPR. Again, RCT has compatible performance with original STGP estimator, despite our estimator are not restricted with adjacency information.

## 5 Application to Scalar-on-Image Analysis

This section applies the proposed method to analyze the 2-back versus 0-back contrast maps derived from the -back task fMRI imaging data in the Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018). Our goal is to identify the important imaging features from the contrast maps that are strongly associated with the risk of psychairtric disorder, measured by the general factor of psychopathology (GFP) or “p-factor”. After the standard fMRI prepossessing steps, all the images are registered into the 2 mm standard Montreal Neurological Institute (MNI) space consisting of 160,990 voxels in the 90 Automated Anatomical Labeling (AAL) brain regions. With the missing values being removed, the data used in our analysis consists of 2,070 subjects. To reduce the dimension of the imaging data, we partition 90 AAL regions into 2,518 sub-regions with each region consisting of an average of 64 voxels. We refer to each subregion as a super-voxel. For each subject, we compute the average intensity values of the voxels within each super-voxel as its intensity. We consider those 2,518 super-voxel-wise intensity values as the potential imaging predictors.

There are several challenging issues in the scalar-on-image regression analysis of this dataset. First, the correlations between super-voxels across the 90 AAL regions can be very high and the correlation patterns are complex. In fact, there are 151,724 voxel pairs across these regions having a correlation larger than 0.8 (or less than ), and 9,038 voxel pairs with a correlation larger than 0.9 (or less than ). Figure 1 visualizes the region-wise correlation structures, where panel (a) shows the highest correlations between regions; and panel (b) counts the voxel pairs that have a correlation hig