# Multivariate functional group sparse regression: functional predictor selection

In this paper, we propose methods for functional predictor selection and the estimation of smooth functional coefficients simultaneously in a scalar-on-function regression problem under high-dimensional multivariate functional data setting. In particular, we develop two methods for functional group-sparse regression under a generic Hilbert space of infinite dimension. We show the convergence of algorithms and the consistency of the estimation and the selection (oracle property) under infinite-dimensional Hilbert spaces. Simulation studies show the effectiveness of the methods in both the selection and the estimation of functional coefficients. The applications to the functional magnetic resonance imaging (fMRI) reveal the regions of the human brain related to ADHD and IQ.

10/08/2020


## 1 Introduction

In the past decades, functional data analysis (FDA), in which an entire function is treated as a single observation, has received great attention. Ramsay2005 introduced a general framework of FDA, and many other researchers have investigated estimation and inference methods for functional data; see Yao2005, Yao2005a, Horvath2012, and wang2016functional. More recently, FDA has been extended to multivariate functional data, which treats multiple functions as a single observation; see CHIOU2016301 and happ2018multivariate. However, the sparseness of functional predictors in the multivariate model has not been studied as well as in the univariate case. Hence, we aim to develop theories and algorithms for sparse functional regression methods with functional predictor selection when we have scalar data as response values and high-dimensional multivariate functional data as predictors.

Under the multivariate setting, numerous sparse models have been studied with the introduction of the ℓ1-penalty. The least absolute shrinkage and selection operator (LASSO) adds a penalty term to the least-squares cost function and performs both variable selection and shrinkage LASSO. LASSO-type penalties, such as the Elastic Net elast, the smoothly clipped absolute deviation (SCAD) SCAD, and their modifications (the adaptive LASSO zou2006adaptive and the adaptive Elastic Net adelas), were developed to overcome the lack of theoretical support and the practical limitations of the LASSO, such as saturation. These methods enjoy asymptotic properties as the sample size increases, such as the estimation consistency and the selection consistency, together known as the oracle property.

Recently, sparse models have been extended to functional data. Initially, a majority of the literature sought sparseness in the time domain; examples include james2009functional and related articles for univariate functional data, and blanquero2019variable for multivariate functional data. On the other hand, Robust proposed a model considering sparseness in the functional predictors under the multivariate functional data setting. In particular, they introduced a model based on the least absolute deviation (LAD) and the group LASSO in the presence of outliers in functional predictors and responses. Their numerical examples and data application show its effectiveness in practice, but theoretical properties and a detailed algorithm have not been explored. To this end, we develop methods for a scalar-on-function regression model which allow sparseness of the functional predictors and simultaneous estimation of the smooth functional coefficients. To implement them with actual data, we derive two algorithms, one for each of the optimization problems. Finally, we show both the functional predictor selection consistency and the estimation consistency.

One motivating example for our methods is the application to functional magnetic resonance imaging (fMRI). The dataset consists of functional signals of brain activities measured by the blood-oxygen-level-dependent (BOLD) contrast, which detects hemodynamic changes based on the metabolic demands that follow neural activities. There are pre-specified regions of the brain, and the BOLD signals associated with the multiple voxels in each region are integrated into one signal for that region. Thus, the fMRI data are considered to be multivariate functional data in which each functional predictor represents the signals from a region of the brain. In Section 8, we regress the ADHD index on the regional BOLD activities of the fMRI of human subjects. Our methods reduce the regions of the brain in the data to 41 regions with significantly lower errors than the linear functional regression. Figure 1 displays the regions of the brain atlas that are identified by our method. It shows that the methods simplify the data analysis and provide a clear representation while keeping the crucial information. The analysis shows that there is an urgent need for new methods in the fields of medical and life sciences as well as other related areas. The following quote from bandettini2020fmri further motivates the study of applications of sparse multivariate functional regression in the field of fMRI.

Think of the challenge of the fMRI with the analogous situation one would have if, when flying over a city at night, an attempt is made to determine the city activities in detail by simply observing where the lights are on. The information is extremely sparse, but with time, specific inferences can be drawn.

— Peter A. Bandettini, fMRI, 2020

The rest of the paper is organized as follows. In Section 2, we illustrate the general framework of our methods along with the notations used in this paper. In Section 3, we describe the model and the optimization problem that we consider. Then, we develop an explicit solution to the optimization problem and illustrate a detailed procedure using the alternating direction method of multipliers (ADMM) in Section 4. We also derive another algorithm, called groupwise-majorization-descent (GMD), along with the strong rule for faster computation in Section 5. In Section 6, we develop asymptotic results, including the consistency of our methods and the oracle property. In Section 7, we show the effectiveness of the methods by conducting simulation studies. In Section 8, we apply the methods to a resting-state fMRI dataset. Concluding discussions are made in Section 9. Finally, the appendix includes all of the proofs and the list of the regions of the brain associated with the ADHD and the IQ scores. We created an R package MFSGrp for the computation, and it is available at https://github.com/Ali-Mahzarnia/MFSGrp.

## 2 Preliminary and notation

Let (Ω, F, P) be a probability space. Let Tj be a compact set in ℝ for j = 1, …, p. Let Hj be separable Hilbert spaces of functions from Tj to ℝ with an inner product ⟨·,·⟩Hj. Let H = H1 × ⋯ × Hp be endowed with the inner product

 ⟨f,g⟩H=⟨f1,g1⟩H1+…+⟨fp,gp⟩Hp,

for any f = (f1,…,fp) ∈ H and g = (g1,…,gp) ∈ H. Then, H is also a separable Hilbert space. Let X: Ω → H be a measurable function with respect to B(H), where B(H) is the Borel σ-field generated by the open sets in H.

Let X be a random element in H. If E∥X∥H < ∞, then the linear functional f ↦ E⟨X,f⟩H is bounded. By the Riesz representation theorem, there is a unique element in H, say E(X), such that ⟨E(X),f⟩H = E⟨X,f⟩H for any f ∈ H (Conway1990). We call E(X) the mean element of X, or the expectation of X. If we can further assume E∥X∥2H < ∞, the operator ΓXX: H → H,

 ΓXX=E[{X−E(X)}⊗{X−E(X)}], (1)

exists and is a Hilbert–Schmidt operator, where ⊗ indicates a tensor product computed in a way that (f⊗g)h = ⟨g,h⟩H f for f, g, h ∈ H (hsing2015theoretical).

Let Y be a random element in ℝ. Subsequently, we can define the covariance operator between Y and X by

 ΓYX=E[{Y−E(Y)}⊗{X−E(X)}],

which maps from H to ℝ. ΓXY can be similarly defined. For convenience, throughout this paper, we assume that E(X) = 0 and E(Y) = 0 without loss of generality. Hence, the regression model is

 Y=⟨X,β⟩H+ϵ,

where β ∈ H is the unknown coefficient function, and ϵ is an error term which is a mean-zero random variable independent of X. Consider Y as a scalar random variable. We can rewrite X = (X1, …, Xp) and

 ⟨X,β⟩H=∑pj=1⟨Xj,βj⟩Hj.

## 3 Model description

We are interested in the situations where the predictors are multivariate functions but only a few functional predictors affect the response. That is, a random variable Y and random functions X1, …, Xp have the following relation,

 Y=∑j∈A⟨Xj,βj⟩Hj+ϵ, (2)

where A ⊆ {1, …, p} is an unknown active set of indices involved in this regression model, and ϵ is a mean-zero error term that is independent of X.

Assume that we have a random sample of size n from the model (2). To estimate β and the active set A, we propose the following objective function,

 L(β;λ1n)=12En(Y−⟨X,β⟩H)2+λ1n∑pj=1∥βj∥Hj,β∈H, (3)

where En is the expectation with respect to the empirical distribution. We add the group-LASSO-type penalty so that each group includes one functional component in the infinite-dimensional Hilbert space Hj, j = 1, …, p. Note that the norm in the penalty term is the Hj-norm, which makes the objective function convex. In addition, we propose an alternative objective function to gain a more stable solution path,

 L(β;λ1n,λ2n)=12En(Y−⟨X,β⟩H)2+λ1n∑pj=1∥βj∥Hj+λ2n∑pj=1∥βj∥2Hj,β∈H, (4)

The quadratic term allows us to have a stable solution path and encourages further grouping effects. It is similar to the Elastic Net proposed by elast, but differs in that the norm in the first penalty term is the Hj-norm rather than the ℓ1-norm, and both penalties are applied group-wise. The group-wise second penalty also gives us a huge computational advantage.

Furthermore, we also consider a smoothing penalty on the functional coefficients by adding the curvature term λder∑pj=1∥β′′j∥2Hj to the objective functions (3) and (4). It allows us to estimate smooth functional coefficients and to select the functional predictors simultaneously. In addition, it provides a better interpretation of the functional coefficients in this linear functional regression model.

## 4 Estimation: ADMM

In this section, we develop the algorithm for solving the optimization problems introduced in Section 3 via the alternating direction method of multipliers (ADMM), which is popularly used in general convex optimization problems; see ADMM. Consider the following optimization problem.

 argminβ,γ f(β)+g(γ) (5) s.t. β−γ=0,

where γ is a duplicate variable in H, f is the smooth loss part, and g is the penalty part of the objective function. The blocks γj are associated with their counterpart blocks βj. The augmented Lagrangian with its parameter ρ > 0 is

 Lρ(β,γ,η)=f(β)+g(γ)+⟨η,β−γ⟩H+ρ2∥β−γ∥2H, (6)

where the Lagrange multiplier is η ∈ H. The ADMM update rules are

 βnew:=argminβLρ(β,γ,η)γnew:=argminγLρ(βnew,γ,η)ηnew:=η+ρ(βnew−γnew). (7)

For computational convenience, it is a usual practice to consider the scaled dual parameter of the ADMM. Let U = η/ρ. It is straightforward to verify that the update rules (7) with the scaled dual parameter are equivalent to

 βnew:=argminβ(f(β)+ρ2∥β−γ+U∥2H)γnew:=argminγ(g(γ)+ρ2∥βnew−γ+U∥2H)Unew:=U+βnew−γnew. (8)
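As a concrete illustration of the scaled form (8), the sketch below runs a generic scaled ADMM given proximal operators for f and g. The `prox_f`/`prox_g` interface and the toy quadratic problem in the test are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def admm_scaled(prox_f, prox_g, dim, rho=1.0, n_iter=200, tol=1e-8):
    """Generic scaled-form ADMM for min f(beta) + g(gamma) s.t. beta = gamma.

    prox_f(v, rho) must return argmin_beta f(beta) + (rho/2)||beta - v||^2,
    and prox_g likewise for g (a hypothetical interface for this sketch).
    """
    beta = np.zeros(dim)
    gamma = np.zeros(dim)
    u = np.zeros(dim)  # scaled dual variable U = eta / rho
    for _ in range(n_iter):
        beta = prox_f(gamma - u, rho)        # beta-update
        gamma_new = prox_g(beta + u, rho)    # gamma-update
        u = u + beta - gamma_new             # scaled dual ascent
        done = (np.linalg.norm(gamma_new - gamma) < tol
                and np.linalg.norm(beta - gamma_new) < tol)
        gamma = gamma_new
        if done:
            break
    return beta, gamma
```

For instance, with f(β) = ½∥β−a∥² and g(γ) = ½∥γ−b∥², both proximal maps are affine and the iterates converge to the joint minimizer (a+b)/2.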

### 4.1 Coordinate representation of functional data

Our method is based on the basis-expansion approach to the functional data. Suppose that we have n random copies from the model (2), denoted by (Yi, Xi), i = 1, …, n, and we observe Xij on Tj for each i = 1, …, n and j = 1, …, p.

At the sample level, we assume that Hj is spanned by a given set of basis functions, bj1, …, bjmj. Thus, for any f ∈ Hj, there exists a unique vector a ∈ ℝmj such that f = ∑mjk=1 ak bjk. We call the vector a the coordinate of f and denote it [f]Bj. We also assume that Hj is constructed with the L2-inner product with respect to the Lebesgue measure,

 ⟨f,g⟩Hj=∫Tjf(t)g(t)dt, for any f,g∈Hj.

Let Gj be the mj × mj matrix whose (i,k)-th entry is ⟨bji,bjk⟩Hj, and let G be the M × M block-diagonal matrix whose j-th block is Gj, where M = ∑pj=1 mj. Consequently, for any f, g ∈ H,

 ⟨f,g⟩H=∑pj=1∑mji=1∑mjk=1([fj]Bj)i([gj]Bj)k⟨bji,bjk⟩Hj=∑pj=1[fj]TBjGj[gj]Bj=[f]TBG[g]B,

where [f]B and [g]B are the M-dimensional vectors obtained by stacking [fj]Bj and [gj]Bj, respectively. We use the basis-expansion approach for each functional covariate Xij for i = 1, …, n and j = 1, …, p, which is also used in song2021; li-song-2018. Without loss of generality, we assume m1 = ⋯ = mp = m and M = pm.
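As a quick numerical illustration of the coordinate formula ⟨f,g⟩H = [f]TBG[g]B, the sketch below builds a small monomial basis on [0, 1] (an illustrative choice; the paper allows any basis system such as B-splines), forms the Gram matrix by quadrature, and checks the identity against direct integration.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal quadrature, kept explicit so the sketch is self-contained."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t) / 2.0))

# Illustrative basis on T = [0, 1]: the first m monomials (an assumption).
m = 4
t = np.linspace(0.0, 1.0, 2001)
B = np.vstack([t**k for k in range(m)])            # rows are b_k(t)

# Gram matrix: (G)_{ik} = <b_i, b_k> = int_T b_i(t) b_k(t) dt.
G = np.array([[trapezoid(B[i] * B[k], t) for k in range(m)] for i in range(m)])

# Coordinates [f]_B and [g]_B of two functions in span{b_1, ..., b_m}.
f_coord = np.array([1.0, -2.0, 0.5, 0.0])
g_coord = np.array([0.0, 1.0, 1.0, -1.0])

# Coordinate formula <f, g>_H = [f]^T G [g] versus direct integration.
ip_coord = f_coord @ G @ g_coord
ip_direct = trapezoid((f_coord @ B) * (g_coord @ B), t)
```

The two values agree because the quadrature is linear in each factor, exactly mirroring the double-sum derivation above.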

Suppose that A is a linear operator from H1 to H2, in which the basis for H1 is B1 and the basis for H2 is B2. Then, we define the coordinate representation of the operator to be the matrix, say B2[A]B1, such that [Af]B2 = B2[A]B1[f]B1 for any f ∈ H1. For notational convenience, if the basis system is obvious in the context, we remove the subscripts of the coordinate representation throughout this paper. The following lemma provides a further simplification for easy computations.

###### Lemma 1.

Let [X1:n]B be the M × n matrix, the i-th column of which is [Xi]B. Then

 B[^ΓXX]B=n−1[X1:n]BQ[X1:n]TBG=n−1[~X1:n]B[~X1:n]TBG,

where Q = In − n−11n1Tn is the centering matrix and [~X1:n]B = [X1:n]BQ. In addition, let Y be the n-dimensional vector, the elements of which are the observations Y1, …, Yn. Then

 [^ΓYX]=n−1YT[~X1:n]TBG.

### 4.2 Orthogonalization

To achieve computational efficiency, we orthonormalize the basis system via the Karhunen–Loève expansion of the covariance operator of each of the functional predictors. For each j = 1, …, p, define ^Γjj to be the sample covariance operator of Xj. Consequently, we have the following lemma.

###### Lemma 2.

Let (λjk, ϕjk), k = 1, …, m, be the pairs of eigenvalues and eigenfunctions of ^Γjj with λj1 ≥ ⋯ ≥ λjm ≥ 0. Then, the Karhunen–Loève expansion of ^Γjj is

 ^Γjj=∑mk=1λjkϕjk⊗ϕjk.

Define a matrix

 Φj=([ϕj1]Bj⋯[ϕjm]Bj).

Since the ϕjk's are the eigenfunctions of a self-adjoint operator, they are orthonormal. Thus, for any x ∈ Hj,

 x(⋅)=∑mk=1⟨x,ϕjk⟩Hjϕjk(⋅)=∑mk=1[x]TBjGj[ϕjk]Bjϕjk(⋅).

Define Cj = {ϕj1, …, ϕjm} to be the new basis system for Hj. Then, we have

 [Xji]Cj=(Φj)TGj[Xji]Bj,i=1,…,n,j=1,…,p.

We assume that the coordinate of Xi is based on the orthonormal basis system Cj throughout this section. Thus,

 [^ΓXX]=diag(λ11,…,λ1m,…,λp1,…,λpm),[X1:n]=diag(ΦT1,…,ΦTp)G[X1:n]B,

and ⟨f,g⟩H = [f]T[g] for any f, g ∈ H.
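The orthonormalization above can be sketched numerically: with a Gram matrix G and coordinates [Xi]B, the G-orthonormal eigenfunctions of the sample covariance operator come from a symmetric eigenproblem, and the transformed coordinates [Xi]C = (Φj)TGj[Xi]Bj have a diagonal sample covariance. The synthetic Gram matrix and coordinates below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5

# Assumed setup: n curves observed through their coordinates [X_i]_B
# (columns of Xb) in a non-orthonormal basis with SPD Gram matrix G.
A = rng.normal(size=(m, m))
G = A @ A.T + m * np.eye(m)
Xb = rng.normal(size=(m, n))
Xb -= Xb.mean(axis=1, keepdims=True)           # center the sample

# The coordinate matrix of the covariance operator is n^{-1} Xb Xb^T G.
# Its G-orthonormal eigenfunctions solve the symmetric problem
# G^{1/2} S G^{1/2} w = lambda w with S = n^{-1} Xb Xb^T, [phi]_B = G^{-1/2} w.
wG, VG = np.linalg.eigh(G)
G_half = VG @ np.diag(np.sqrt(wG)) @ VG.T
G_half_inv = VG @ np.diag(1.0 / np.sqrt(wG)) @ VG.T

S = Xb @ Xb.T / n
lam, W = np.linalg.eigh(G_half @ S @ G_half)
lam = lam[::-1]                                 # descending eigenvalues
Phi = G_half_inv @ W[:, ::-1]                   # columns are [phi_k]_B

# New coordinates [X_i]_C = Phi^T G [X_i]_B; their sample covariance is
# diagonal, i.e. the Karhunen-Loeve scores are uncorrelated.
Xc = Phi.T @ G @ Xb
cov_C = Xc @ Xc.T / n
```

The check Phi.T @ G @ Phi = I confirms the eigenfunctions are orthonormal in Hj, matching the claim that the Gram matrix becomes the identity under the new basis.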

### 4.3 Estimation

Using the representation, we can express the optimization (8) as follows.

 [βnew]:=argminβ∈H(f(β)+ρ2([β]−[γ]+[U])T([β]−[γ]+[U]))[γnew]:=argminγ∈H(g(γ)+ρ2([βnew]−[γ]+[U])T([βnew]−[γ]+[U]))[Unew]:=[U]+[βnew]−[γnew], (9)

where

 f(β)=12En(Y−⟨X,β⟩H)2=(2n)−1∑ni=1{Y2i−2(Yi⊗Xi)β+⟨β,(Xi⊗Xi)β⟩H}=12^σYY−^ΓYXβ+12⟨β,^ΓXXβ⟩H=12^σYY−[^ΓYX][β]+12[β]T[^ΓXX][β],

and g(γ) = λ∑pj=1∥γj∥Hj.

Under the finite-dimensional representation of a functional element in H, one can see that the optimization in (9) is a convex optimization problem.

###### Theorem 1.

The solution to the optimization problem (3) can be achieved by iterating over the following update rules.

 [βnew]=([~X1:n][~X1:n]T+nρIM)−1([~X1:n]Y+nρ([γ]−[U])) [(γj)new]=SHjλ/ρ([(βj)new]+[Uj]) j=1…p (10) [Unew]=[U]+[βnew]−[γnew],

where M = pm, [Uj] and [γj] are the blocks corresponding to [βj], and SHjκ(z) = (1 − κ/∥z∥)+z is the group-wise soft-thresholding operator on Hj for j = 1, …, p.

If we did not consider the orthogonalization, Theorem 1 would contain the Gram matrix G in the updates. In that case, the proof of the numerical convergence of the update rules would be slightly different from that of the ADMM. However, due to the orthogonalization, the proof of the numerical convergence of the updates in Theorem 1 to the solution of the optimization problem (3) is identical to that of the ADMM in ADMM; hence, it is omitted.
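A minimal sketch of the update rules (10) for the MFG-LASSO, assuming orthonormalized coordinates and the standard group soft-threshold form Sκ(z) = (1 − κ/∥z∥)+z; the function names and the synthetic test data are illustrative, not the MFSGrp implementation.

```python
import numpy as np

def group_soft_threshold(z, kappa):
    """S_kappa(z) = (1 - kappa/||z||)_+ z, the group-wise shrinkage
    assumed in the gamma-update of Theorem 1."""
    nz = np.linalg.norm(z)
    return max(0.0, 1.0 - kappa / nz) * z if nz > 0 else z

def mfg_lasso_admm(Xt, Y, p, m, lam, rho=1.0, n_iter=500):
    """Iterates (10): Xt is the M x n matrix of (orthonormalized)
    coordinates [X_tilde_{1:n}], Y the n-vector of responses,
    with p coefficient groups of size m each."""
    n = Y.shape[0]
    M = p * m
    beta, gamma, U = np.zeros(M), np.zeros(M), np.zeros(M)
    K = Xt @ Xt.T + n * rho * np.eye(M)          # fixed beta-update system
    for _ in range(n_iter):
        beta = np.linalg.solve(K, Xt @ Y + n * rho * (gamma - U))
        for j in range(p):                        # group-wise gamma-update
            blk = slice(j * m, (j + 1) * m)
            gamma[blk] = group_soft_threshold(beta[blk] + U[blk], lam / rho)
        U += beta - gamma                         # scaled dual update
    return gamma                                  # sparse group estimate
```

With a sufficiently large λ the soft threshold zeroes out every group, while for a moderate λ the groups carrying the signal retain larger norms, which is the selection behavior the updates are designed to produce.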

### 4.4 Different penalty terms

In this section, we investigate different penalty terms in two directions: one for the functional predictor selection, and the other for the smoothness of the coefficient functions βj.

#### 4.4.1 Multivariate Functional Group Elastic Net

The LASSO does not always provide a unique solution. In order to achieve uniqueness and overcome the saturation property, the Elastic Net penalty was introduced by elast for multivariate data, combining the ℓ1-norm and the ℓ2-norm. Functional data are intrinsically infinite-dimensional objects. Thus, we propose a multivariate functional version of the Elastic Net optimization problem by grouping each functional predictor as follows,

 12En(Y−⟨X,β⟩H)2+λ(1−α)∑pj=1∥βj∥Hj+αλ∑pj=1∥βj∥2Hj, (11)

where λ ≥ 0 and α ∈ [0,1] are the tuning parameters.

This optimization problem still follows the structure of the ADMM algorithm in (5), with g(γ) = λ(1−α)∑pj=1∥γj∥Hj + αλ∑pj=1∥γj∥2Hj. It can be easily shown that the only difference from the original version is the γ-update in Theorem 1. Hence, we have the following update rules.

###### Theorem 2.

The solution to the optimization problem (11) can be achieved by iterating over the following update rules.

 [βnew]=([~X1:n][~X1:n]T+nρIM)−1([~X1:n]Y+nρ([γ]−[U])) [(γj)new]=(ρ/(ρ+2αλ))SHjλ(1−α)/ρ([(βj)new]+[Uj]) j=1…p (12) [Unew]=[U]+[βnew]−[γnew].

The regularization parameters can be adjusted through a net-search cross validation.

#### 4.4.2 Smoothness of functional coefficients β

In our simulations, we found that the previous algorithm provides wiggly estimates of the functional coefficients most of the time. This might be acceptable if we were only interested in prediction, but it is not the case here, because we consider the linear functional regression. We propose an algorithm which controls the roughness of the βj's simultaneously, to avoid over-fitting and to obtain smooth functional coefficients. In particular, we impose a penalty on the curvature of the coefficients by adding λder∑pj=1∥β′′j∥2Hj to the objective function in (8). We include this term in the function f in the ADMM structure. Finally, the first update rule of Theorem 1 becomes

 [βnew]:=([~X1:n][~X1:n]T+nρIM+λderG′′)−1([~X1:n]Y+nρ([γ]−[U])), (13)

where G′′ is the M × M block-diagonal matrix whose j-th block is the m × m matrix with (i,k)-th entry ⟨(ϕji)′′,(ϕjk)′′⟩Hj, for j = 1, …, p.

For each j, the entries ⟨(ϕji)′′,(ϕjk)′′⟩Hj can be derived from the second-derivative Gram matrix of the original basis, say (Bj)′′, whose (ℓ,ℓ′)-th entry is ⟨(bjℓ)′′,(bjℓ′)′′⟩Hj. Note that

 [ϕji]Bj=(Gj)−1((Φj)−1)T[ϕji]Cj=(Gj)−1((Φj)−1)Tei,

where ei is the i-th standard basis vector in ℝm. Then,

 ⟨(ϕji)′′,(ϕjk)′′⟩Hj=⟨∑mℓ=1([ϕji]Bj)ℓ(bjℓ)′′,∑mℓ=1([ϕjk]Bj)ℓ(bjℓ)′′⟩Hj=eTi(Φj)−1(Gj)−1(Bj)′′(Gj)−1((Φj)−1)Tek.
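The identity underlying the curvature penalty, ∥f′′∥2 = [f]T(B′′)[f], can be checked numerically; the monomial basis on [0, 1] below is an illustrative assumption, standing in for the paper's basis system bj1, …, bjm.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal quadrature, kept explicit so the sketch is self-contained."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t) / 2.0))

# Monomial basis b_k(t) = t^k on [0, 1] (illustrative choice); second
# derivatives vanish for k = 0, 1 and are k(k-1) t^{k-2} otherwise.
t = np.linspace(0.0, 1.0, 4001)
m = 4
B2 = np.vstack([k * (k - 1) * t**(k - 2) if k >= 2 else np.zeros_like(t)
                for k in range(m)])

# Second-derivative Gram matrix: (B'')_{ik} = <b_i'', b_k''>.
Bpp = np.array([[trapezoid(B2[i] * B2[k], t) for k in range(m)]
                for i in range(m)])

f_coord = np.array([0.3, -1.0, 2.0, 0.5])        # coordinates [f]_B
penalty_coord = f_coord @ Bpp @ f_coord          # [f]^T B'' [f]
penalty_direct = trapezoid((f_coord @ B2)**2, t)  # int (f'')^2 dt
```

Here f′′(t) = 4 + 3t, so both quantities equal ∫(4+3t)² dt = 31 up to quadrature error, confirming that penalizing [β]T through (Bj)′′ penalizes curvature.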

#### 4.4.3 Tuning

The initial values for [γ] and [U] are zero, and the initial [β] is the ridge regression estimate from the first update rule of Theorem 1. We set the augmented parameter, or the step size, ρ, to be a fixed constant throughout the algorithm. Different values of ρ only change the value of the optimal λ on the grid or the optimal α on the net. The larger the ρ, the smaller the optimized regularization parameter of the soft-threshold operator. In some practices of the augmented Lagrangian, it is possible to choose a small step size and increase it gradually in each iteration. It is also stated in ADMM why a fixed ρ is a suitable choice in the ADMM algorithm.

We use the k-fold cross validation for choosing the mixing parameter α, the regularization parameter of the second-derivative penalty λder, and the main regularization parameter λ. In particular, for each α and each λder on the net, we search for the optimal λ. In order to pick the initial grid of λ's, we first find the ridge estimate with parameter ρ. We then compute the norm of each of the groups of functional coefficients, ∥βj∥Hj. Note that in the second update of Theorem 1, the soft-threshold operator would eliminate all blocks if λ/ρ were slightly higher than the maximum of these norms. On the other hand, this update would keep all the coefficients if λ/ρ were slightly lower than the smallest norm. Therefore, a reasonable procedure is to design a grid of λ's between a number slightly lower than the minimum norm of the blocks and a number slightly higher than the maximum norm of these block coefficients.
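The grid design just described can be sketched as follows; the shrink factor `eps`, the log-spaced grid, and the ridge parameter choice are illustrative assumptions rather than the package's exact defaults.

```python
import numpy as np

def lambda_grid(Xt, Y, p, m, rho=1.0, n_grid=50, eps=0.9):
    """Sketch of the lambda-grid construction: fit a ridge estimate, take
    the group norms of its coefficients, and place a log-spaced grid from
    slightly below the smallest norm to slightly above the largest."""
    n = Y.shape[0]
    M = p * m
    # Ridge estimate, matching the beta-update with gamma = U = 0.
    beta_ridge = np.linalg.solve(Xt @ Xt.T + n * rho * np.eye(M), Xt @ Y)
    norms = np.array([np.linalg.norm(beta_ridge[j * m:(j + 1) * m])
                      for j in range(p)])
    lo = eps * norms.min()          # slightly below the smallest group norm
    hi = norms.max() / eps          # slightly above the largest group norm
    return np.geomspace(lo, hi, n_grid)
```

By construction the largest grid value kills every group and the smallest keeps them all, so the grid spans the full selection path.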

## 5 Estimation: GMD

In this section, we derive the groupwise-majorization-descent (GMD) algorithm for solving the objective functions in Section 3. Unlike the ADMM, this algorithm is geared toward the objective function with group-wise penalty terms. Motivated by yang2015fastgglasso, we derive the GMD algorithm under our setting. In addition, we do not force the basis functions to be orthogonal, which allows us to have more flexibility. Thus, throughout this section, we use the coordinate system based on the original basis without orthogonalization.

### 5.1 Algorithm

The MFG-Elastic Net problem without the orthogonalization is

 argminβ12∥Y−[~X1:n]TG[β]∥22+λder2[β′′]TG[β′′]+λ(1−α)∑pj=1∥βj∥Hj+αλ∑pj=1∥βj∥2Hj, (14)

where the coordinates are associated with the original basis B. This optimization problem and the algorithm derived below also include, as special cases, the steps that solve the MFG-LASSO and the ridge regression. In the equation (14), we removed the factor n−1 for computational convenience; it will be adjusted when we seek the λ and λder in the grid construction. We define the loss function as follows.

 L([β])=12∥Y−[~X1:n]TG[β]∥22+λder2[β′′]TG[β′′]. (15)

Consequently, the objective function (14) is L([β]) + g(β), where g(β) = λ(1−α)∑pj=1∥βj∥Hj + αλ∑pj=1∥βj∥2Hj.

###### Lemma 3.

The loss function (15) satisfies the quadratic majorization (QM) condition with H = G[~X1:n][~X1:n]TG + λderB′′. In other words, for any [β] and [β∗],

 L([β])≤L([β∗])+([β]−[β∗])T∇L([β∗])+12([β]−[β∗])TH([β]−[β∗]), (16)

where

 ∇L([β∗])=G[~X1:n]([~X1:n]TG[β∗]−Y)+λderB′′[β∗]. (17)

In addition to Lemma 3, it is straightforward to see that if H is replaced by any matrix H∗ for which H∗ − H is positive definite, we have the strict inequality

 L([β])<L([β∗])+([β]−[β∗])T∇L([β∗])+12([β]−[β∗])TH∗([β]−[β∗]) for [β]≠[β∗]. (18)

Thus, it leads to the strict descent property of the updating algorithm. Let [β∗] be the current solution to the optimization problem and [β] be the next update. Assume that we update the coefficients of each functional predictor j in turn. In other words, [β]−[β∗] has the form (0,…,0,[βj]−[(β∗)j],0,…,0)T, which leads to a simplification of the objective function in the new optimization problem. Let Uj be the sub-vector of −∇L([β∗]) with the indices corresponding to the j-th group, and let Hj be the j-th diagonal block of H. Then, (16) is

 L([β])≤L([β∗])−([βj]−[(β∗)j])TUj+12([βj]−[(β∗)j])THj([βj]−[(β∗)j])≤L([β∗])−([βj]−[(β∗)j])TUj+12γj([βj]−[(β∗)j])T([βj]−[(β∗)j]),

where γj is a value slightly larger than the largest eigenvalue of Hj, which further relaxes the upper bound. In practice, we take γj = (1+ε∗)ξj with a small ε∗ > 0, where ξj is the largest eigenvalue of Hj. Finally, the update rule for [βj] is the solution to the following optimization problem,

 argminβj∈Hj −([βj]−[(β∗)j])TUj+12γj([βj]−[(β∗)j])T([βj]−[(β∗)j])+gj(β), (19)

where gj(β) = λ(1−α)∥βj∥Hj + αλ∥βj∥2Hj is the j-th term of the penalty g(β). We have a closed-form solution to this problem using a similar trick to that of Lemma 6,

 [βj]new=(γj+2αλ)−1(1−λ(1−α)/∥γj[(β∗)j]+Uj∥2)+(γj[(β∗)j]+Uj).
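A sketch of the resulting block update, assuming the Euclidean coordinate norm in gj (consistent with the quadratic majorization bound above); the function name and test values are illustrative.

```python
import numpy as np

def gmd_block_update(beta_j, U_j, gamma_j, lam, alpha):
    """One GMD block update solving (19) in closed form: a group
    soft-threshold of z = gamma_j * beta_j + U_j, scaled down by the
    group-wise ridge term (a sketch under the Euclidean-norm assumption)."""
    z = gamma_j * beta_j + U_j
    nz = np.linalg.norm(z)
    shrink = max(0.0, 1.0 - lam * (1.0 - alpha) / nz) if nz > 0 else 0.0
    return shrink * z / (gamma_j + 2.0 * alpha * lam)
```

One can verify optimality directly: for a nonzero output b, the stationarity condition of (19), −Uj + γj(b − βj∗) + λ(1−α)b/∥b∥ + 2αλb = 0, holds exactly, and a large enough λ(1−α) zeroes the whole block.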