Nonparametric Group Variable Selectionwith Multivariate Response forConnectome-Based Prediction of Cognitive Scores

In this article, we study possible relations between the structural connectome and cognitiveprofiles using a multi-response nonparametric regression model under group sparsity. The aim is to identify the brain regions having significant effect on cognitive functioning. In this article, we consider nine different attributesfor each brain region as our predictors. Here, each node thus correspond to a collectionof nodal attributes. Hence, these nodal graph metrics may naturally be grouped togetherfor each node, motivating us to introduce group sparsity for feature selection. We proposeGaussian RBF-nets with a novel group sparsity inducing prior to model the unknown meanfunctions. The covariance structure of the multivariate response is characterized in termsof a linear factor modeling framework. For posterior computation, we develop an efficientMarkov chain Monte Carlo sampling algorithm. We show that the proposed method per-forms overwhelmingly better than all its competitors. Applying our proposed method to aHuman Connectome Project (HCP) dataset, we identify the important brain regions andnodal attributes for cognitive functioning, as well as identify interesting low-dimensionaldependency structures among the cognition related test scores.

Authors

• 11 publications
• Bayesian Regression with Undirected Network Predictors with an Application to Brain Connectome Data

03/28/2018 ∙ by Sharmistha Guha, et al. ∙ 0

• Predicting Phenotypes from Brain Connection Structure

This article focuses on the problem of predicting a response variable ba...
10/06/2019 ∙ by Subharup Guha, et al. ∙ 0

• Capturing Between-Tasks Covariance and Similarities Using Multivariate Linear Mixed Models

We consider the problem of predicting several response variables using t...
12/10/2018 ∙ by Aviv Navon, et al. ∙ 0

• Unsupervised learning with GLRM feature selection reveals novel traumatic brain injury phenotypes

Baseline injury categorization is important to traumatic brain injury (T...
11/30/2018 ∙ by Aaron J. Masino, et al. ∙ 0

• Bayesian Nonparametric Covariance Regression

Although there is a rich literature on methods for allowing the variance...
01/11/2011 ∙ by Emily Fox, et al. ∙ 0

• Nonparametric Matrix Response Regression with Application to Brain Imaging Data Analysis

With the rapid growth of neuroimaging technologies, a great effort has b...
03/31/2019 ∙ by Dehan Kong, et al. ∙ 0

Recent technological advancements have enabled detailed investigation of...
04/01/2021 ∙ by Shariq Mohammed, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Structural connectome of the human brain is a network of white matter fibers connecting grey matter regions of the brain (Park and Friston, 2013). Our everyday activities have been found to be influenced by these brain connectomes (Bjork and Gilman, 2014; Smith et al., 2015; Toschi et al., 2018; Zhang et al., 2019; Guha and Rodriguez, 2021). These studies have found variations in human connectomes as there are changes in different personal traits. Functional and structural connectomes are routinely analyzed to find their associations with the human traits. Leveraging on recent technological developments in non-invasive brain-imaging, several data repositories such as the Human Connectome Project (HCP) (WU-Minn, 2017), the UK Biobank (Miller et al., 2016) are created to store huge collection of brain data.

Xu et al. (2020) recently applied group variable selection method to identify important brain regions and nodal metrics for predicting disease state among healthy and mildly cognitive impaired subjects using functional connectivity networks. In this article, we study the association between structural connectomes and seven cognition related test scores using nodal attributes as predictors. Most existing works consider the edge-wise data of connectivity matrices as predictors while studying association between human connectomes and behavioral attributes. However, if the inferential problem is to identify important brain regions associated with that attribute, it is difficult to infer such results from edge-wise data. Feature selection from nodal attributes can help in such inferential problems. Our specific choices of nodal attributes are nodal degree, local efficiency, betweenness centrality, within-community centrality, node impact, the shortest path length, eccentricity (which is the maximal shortest path length between a node and other nodes), leverage centrality, and participant coefficient, which are described in detail in the next section.

In this study, we consider the seven cognition related test scores as response from NIH Toolbox Cognition Battery which are described at https://www.healthmeasures.net/explore-measurement-systems/nih-toolbox/intro-to-nih-toolbox/cognition. Furthermore, we only consider age adjusted scores as they are comparable across the population. Further details on the dataset are in Section 2.2 and Section 5. Our primary inferential goals include 1) identifying the low-dimensional dependency structures among the cognition related test scores after adjusting for the connectomic structures, and 2) identifying the brain regions, and nodal features of the structural connectome, having important effect on brain cognition. We develop nonparametric statistical model focusing on these two inferential goals.

The nodal attributes can naturally be clustered into groups according to the corresponding nodes. In cancer genomics, the adjacent DNA mutations on the chromosome are grouped together as they have a similar genetic effect (Li and Zhang, 2010). Based on chromosomes, SNPs are also grouped together (Kim and Xing, 2012; Liquet et al., 2017)

. Other applications may include multi-level categorical predictors which are commonly characterized by a group of dummy variables. Thus, variable selection methods under group sparsity are proposed in various modeling frameworks.

Yuan and Lin (2006) proposed the group lasso method with an penalty on the predictors within each group and an penalty across the groups. Subsequently, several extensions are proposed for different modeling frameworks (Ma et al., 2007; Meier et al., 2008; Zhou et al., 2010; Li et al., 2015). In addition, one may refer to Huang et al. (2012) for a complete review on this topic. In Bayesian framework, several approaches are proposed for group variable selection (Li and Zhang, 2010; Rockova and Lesaffre, 2014; Curtis et al., 2014; Xu and Ghosh, 2015; Chen et al., 2016; Liquet et al., 2017; Greenlaw et al., 2017; Ning et al., 2020).

Most existing methods imposing group sparsity consider a linear relationship between the predictors and the response. In many occasions, this might be too restrictive. We thus model the mean of cognitive scores as nonparametric functions of the predictors, the nodal attributes. In this paper, we consider radial basis networks using multivariate Gaussian Radial Basis Function (RBF) to model the nonparametric mean functions. RBF net is an artificial neural networks (ANN) that uses radial basis functions as activation functions. The architecture of the proposed RBF-net also allows imposing group sparsity priors. One difficulty in using RBF-net or ANN in general, is the selection step of the number of hidden units,

. Mostly, the selection relies on computationally expensive cross-validation based methods. We, however, propose a computationally efficient approach in selecting , which performed remarkably well both in simulations and real-data applications.

Studies suggest that any cognitive task requires collective resources from different parts of the brain (Haxby et al., 2001; Poldrack et al., 2009; Zhang et al., 2013; Muhle-Karbe et al., 2017; Ito et al., 2017), motivating us to assume the performances in above cognitive tasks to be correlated. To characterize the joint covariance, we rely on a latent factor modeling framework. Instead of assuming an unstructured covariance, latent factor modeling helps to infer about low-dimensional dependencies of the data. The latent factor model can naturally relate the measured multivariate cognitive scores to lower-dimensional characteristics, while reducing the number of parameters needed to represent the covariance. Following Roy et al. (2020)

, we consider the heteroscedastic latent factor model, which is shown to outperform traditional latent factor model on recovering true loading structure. For a mean-zero

-variate data , the heteroscedastic latent factor model is given by where are latent factors, is a factor loading matrix, the diagonal covariance of latent factors , and

, a diagonal matrix of residual variances. Under this model, marginalizing out the latent factors

induces the covariance . In terms of both prediction and variable selection, our the proposed method performs overwhelmingly better than all the competing methods. Our proposed method is also amenable to gradient-based Langevin Metropolis Carlo sampling (Girolami and Calderhead, 2011) that ensures excellent mixing of the Markov Chain Monte Carlo (MCMC) chain. Prior to applying our proposed method on HCP data, we implement a screening method. Since, the predictors, in our application, can easily be clustered into groups, we propose a novel screening technique for predictors, divided into groups. The prediction performance of our proposed method is again shown to be overwhelmingly better in HCP data. Furthermore, we obtain interesting results in dependency patterns among the cognitive scores, and in effects of nodal attributes.

The rest of the article is organized as follows. In the next section, we describe the dataset and introduce of the graph theoretic attributes to be considered in our analysis and show some exploratory results on the HCP dataset. Section 3 presents our proposed multi-response nonparametric group variable selection method with complete prior specifications, and details on posterior computation. In Section 4, we compare our proposed method with other competing methods. We illustrate our HCP data application in Section 5. In Section 6, we discuss possible extensions, and other future directions.

2 Data description

2.1 Preliminaries on graph theoretic attributes

Graph theoretic attributes are routinely analyzed while studying connectivity structures of human brain (Tian et al., 2011; Wang et al., 2011; Medaglia, 2017; Masuda et al., 2018; Farahani et al., 2019). We consider nine local graph theoretic attributes based on above-mentioned studies in our analysis. Local attributes are node-specific attributes. The selected attributes are commonly studied for brain network analysis. In this paper, the terms network and graph are used interchangeably. We rely on two R packages to compute the nodal attributes in our analysis, namely NetworkToolbox (Christensen, 2018), and brainGraph (Watson, 2020).For clarity, we describe the definitions and characteristics of the nodal attributes below.

For a graph with nodes, let us denote the adjacency matrix by . Then, the degree of node is .The degree refers to the strength of each node in a graph. To define local efficiency, we first need to define global efficiency. Global efficiency of the graph is measured as , where the shortest path length between the nodes and . The nodal efficiency is defined by , where is the subgraph of neighbors of and stands for the number of nodes in that subgraph. The corresponding betweenness centrality is , where stands for the total number of the shortest paths from node to node and denotes the number of those paths that pass through . Nodal impact quantifies the impact of each node by measure the change in average distances in the network upon removal of that node (Kenett et al., 2011). Local shortest path length of each node is defined by the average shortest path length from that node to other nodes in the network. It is used to quantify mean separation of a node in a network. Eccentricity is the maximal shortest path length between a node and any other node. Thus, it captures how far a node is from its most distant node in the network. Participant coefficient requires detecting communities first within the network. While using the R function participation of R package NetworkToolbox, we use the walktrap algorithm to detect communities (Pons and Latapy, 2005). Participation coefficient quantifies distribution of a node’s edges among different communities in the graph. It is zero if all the edges of a node are entirely restricted to its own community, and it reaches to its maximum value of 1 when the node’s edges are evenly distributed among all the communities. Within-community centrality is used to describe how central a node’s community is within the whole network. We again use the walktrap algorithm to detect the communities. The final centrality measure that we include in our study is leverage centrality, introduced by Joyce et al. (2010). Leverage centrality is the ratio of the degree of a node to its neighbors, and it thus measures the reliance of its immediate neighbors on that node for information. In Tables 1 to 9 of the supplementary materials, we show exploratory analyses on predictive performances of the above nodal attributes. We divide the data 70/30 into training and testing for this analysis. Each table corresponds to a prediction model taking one type of nodal attribute for all the brain regions as predictors. There is no single nodal attribute which shows uniformly better predictive performances. In general, all the methods perform slightly better while predicting the cognitive scores related to processing speed and flanker task. In this work, we do not conduct any sensitivity analysis for the nodal metric computation algorithms.

2.2 Cognitive scores

We collect behavioral data of each participant on cognitive test scores from HCP. These scores are on oral reading recognition, picture vocabulary, processing speed (pattern completion processing speed), working memory (list sorting), episodic memory (picture sequence memory), executive function (dimensional change card sort), and flanker task. Due to the space constraint, detail descriptions of these tasks are provided in the supplementary materials. These tasks test different aspects of cognitive processing.

As a preliminary analysis, we explore the dependency pattern among the cognitive scores using the linear latent factor model of Roy et al. (2020)

. We fit the model on mean-centered and normalized data on the cognitive scores. Applying this model, we obtain the estimated loading matrix, illustrated in Figure

3 Modeling

To model nonparametric functions, RBF networks are extensively popular, first proposed in Broomhead and Lowe (1988). In terms of a radial basis , the basic formulation of the function with

many hidden neurons is

(Lee et al., 1999), , where , the radial basis maps to , ’s are weights, ’s are the centers in for the basis along with the band-width parameter

which may be a scalar or a vector depending on the formulation. The notation

stands for the distance metric, whose common choice is the Mahalanobis distance. Common choices for the function are linear, cubic, thin plate spline, multiquadric, inverse multiquadric, Gaussian, etc. More applications of the Bayesian RBF net can be found in Barber and Schottky (1998); Holmes and Mallick (1998); Ghosh et al. (2004); Ryu et al. (2013) using different radial basis kernels. In this paper, we primarily focus on ’s to be Gaussian, which is the most popular choice in RBF-net-based modeling. Besides, the architecture of Gaussian RBF-net shares commonalities with the extremely popular nonparametric density estimation method using location-scale mixture normals.

We model the distance metric of the above formulation as . Here is a diagonal matrix of dimension and ’s are correlation type matrices of the same dimension. The diagonal entries of represent ‘importance’ of different variables in . Note that to ensure positive-definiteness of , we do not require any restriction on . Hence, the entries in are kept unrestricted. We can show that becomes a constant function of if and only if . We thus impose the group sparsity prior on for our group variable selection.

To model the joint covariance of the multivariate response, we adopt the factor modeling framework. We however consider the heteroscedastic latent factor model from Roy et al. (2020) which is shown to be more efficient than the traditional latent factor model in capturing the underlying true loading structure. Under the heteroscedastic setting, the prior variance for is replaced by a diagonal heteroscedastic covariance matrix. Let us assume we have many groups denoted by . Here stands for the cardinality or the number of elements in the set. The complete set of predictors for - individual is Our proposed hierarchical model for a -dimensional response as a function of a -dimensional predictor is

 yi= f(xi)+\boldmathΛ% \boldmathηi+ϵi, (1) f(xi)= {f1(xi),…,fv(xi)}, (2) fℓ(xi)= (3) \boldmathηi∼ Normal(0,\boldmathΣ1),% \boldmathϵi∼Normal(0,\boldmathΣ2), (4)

where and the space stands for the space of correlation matrices. The two covariance matrices and are diagonal. For simplicity, we denote and where and . The latent factors, ’s are -dimensional. Thus, the loading matrix is dimensional. The covariance matrices and are and dimensional, respectively. Here is the complete set of group-specific parameters where with , the cardinality of - group. Note that . A schematic representation of proposed RBF-net is in Figure 2. In above model, .

We have the universal approximation theorem for RBF-net (Park and Sandberg, 1991) that ensures approximation of any truth using the above with arbitrarily small approximation error for large enough . Under some assumptions on the smoothness of , we can compute the upper bound to the approximation error for a given (Maiorov, 2003; Hangelbroek and Ron, 2010; DeVore and Ron, 2010; Hamm and Ledford, 2019). The matrix in (3) is non-negative definite for any and , the space of correlation matrices. Thus is an well-defined distance between and .

The diagonal matrix is shared across all the functions. It is easy to see that if in (3), then the function becomes a constant function of for all . Thus, the vector controls which set of predictors to be included in the model. We refer to this vector as the ‘importance coefficient’. We can prove following result for

Lemma 1.

The partial derivative for all if and only if , where assuming for at least one .

The proof follows directly from the partial derivatives and is in the supplementary materials. The Lemma suggests that the importance coefficient controls the inclusion or exclusion of a predictor from the model. Thus, in order to do group variable selection, we put our group sparsity prior on which is discussed in the next subsection. Our proposed model hence selects the features that are important for all the components in .

Our model for the correlation-type matrix is based on the spherical coordinate representation of Cholesky factorizations of s (Zhang et al., 2011; Sarkar et al., 2020). In this representation, the correlation matrix is written as . Then - row of is , for . The rest of entries are zero. Thus is a lower triangular matrix. The angles are supported in for and ’s are supported in .

As we complete our description of the proposed Gaussian RBF-net based regression model, the next subsection will introduce our group sparsity inducing prior. The rest of this section deals with prior specification for rest of the parameters, and their posterior computation strategies.

3.1 Group sparsity prior

There are several choices for group sparsity inducing priors in the literature. In our proposed architecture, we find that exact sparsity is required to achieve efficient estimation accuracy. We thus first reparametrize , where is the binary indicators denoting inclusion of - group and the additional indicator parameter controls inclusion of - predictor from - group. Predictors from the get a chance to be included in the model if . However, the final selection of the individual predictors in will then depend on ’s. We put following priors on the parameters.

• The coefficient: We set and .

• Group selection indicator: We take , with .

• Within group selection indicator: We consider , with .

Here

may be referred as the inclusion probability for a group in the model. Likewise,

may be referred as the inclusion probability for the predictors within - group. One may set these probabilities at

which would be a non-informative prior choice as they assign equal prior probabilities to all submodels. Since, we do not have any prior information on inclusion probabilities, we assign independent non-informative uniform prior on

’s, and additionally, we set , where is the number of groups. The choice for is motivated to select important brain regions.

3.2 Prior specification for rest of the model parameters

In this section, we discuss our priors for rest of the model parameters. We put conjugate priors whenever appropriate for computational ease. Most of our priors are weakly informative. The priors of the full set of parameters are described below in detail.

1. Loading matrix: We put the sparse infinite factor model prior of Bhattacharya and Dunson (2011) which set and

2. Variance parameters: We set both

, where Ga stands for the gamma distribution for

, and .

3. RBF-net coefficients (): We put data driven prior on following Chipman et al. (2010). We set , where and with -th response for .

4. Centers (’s) of RBF-net: We set .

5. Polar angles: We consider uniform prior for the polar angles as well, , and for .

The choices of hyperparameters for the Gamma distribution are

to ensure weak informativeness. With the above prior of , we have induced prior mean of bounded between and with probability 95%. Following Roy et al. (2020), we set and .

The motivation behind considering the sparse infinite factor model prior is to automatically select the rank. The above prior tends to shrink ’s in the later columns, leading to in those columns. The corresponding factors are thus effectively discarded from the model. In practice, one can conduct a factor selection procedure by removing factors having all their loadings within a small neighborhood of zero. To obtain a few interpretable factors, we follow this approach.

3.3 Selection of K

The model in (3) uses many RBF bases to model all the nonparametric regression functions. Here is an unknown parameter which needs to be either pre-specified or estimated. One may put a discreet prior on and implement reversible jump MCMC (Green, 1995) based computational algorithm. However, computation using such prior suffers from poor mixing and slow convergence. Cross-validation based on methods are also common to optimally pre-select for such cases. Instead of implementing computationally expensive cross-validation based methods, we propose an alternative mechanism that works remarkably well in simulations and data application.

Our selection strategy is based on Gaussian RBF kernel based relevance vector machine (RVM) (Tipping, 2001) which shares some commonalities with proposed RBF-net. For each univariate response, we separately apply the rvm function of R package kernlab (Karatzoglou et al., 2004) which is extremely fast. We can then set the maximum number of relevance vectors, combining all the outputs of rvm. However, in high dimension this strategy may not be directly applicable as the number of relevance vectors grows with the data dimension . We, therefore, select a subset of important predictors using a nonparametric screening method inspired from Liang et al. (2018)

. The strategy builds on the result that the correlation coefficient of bivariate normal random variables is zero if they are independent.

Liang et al. (2018) suggested using the Henze-Zirklers’s multivariate normality test to detect dependence after applying nonparanormal transformations (Liu et al., 2009) on the data. However, the R program from the package mvnTest did not work in HCP dataset due to some numerical issues. We thus implement the following simplified strategy. We first compute the nonparanormal transformations for , where stands for the response vector with , and also evaluate the same for predictors in case of - predictor. Here stands for the normal CDF and is the CDF for . Similarly, the CDF for is denoted by . As ’s are unknown, they are required to be estimated. We then test whether Pearson correlation between and is zero or not using the R function cor.test for each pair of and and store corresponding p-values. Define if the corresponding p-value for the test “ correlation between and is 0” is less than or equal to some cutoff . Smaller p-value implies greater dependence. In this paper, we find works well both in simulations and real-data application. Thus, this is our default choice. Let us denote the screened set of predictors as . The next step is to apply rvm on the selected subset of predictors to obtain . This strategy works remarkably well both in simulation and real-data applications. Note that the above screening step is only to select . Our posterior sampling algorithms will fit the full model with all the predictors.

3.4 Posterior Computation

We now discuss our posterior sampling algorithms for the different parameters. The log-likelihood of the proposed model is given by,

 C−p∑j=1n∑i=1{yi,j−fj(xi)−∑rk=1Λj,kβi}22σ22,j−n2log(σ22,j)−p∑k=1K∑j=1λ2k,j2v21 −G∑g=1mg∑ℓ=1β2ℓ,g2s2g+∑ℓγℓlogqG+G∑ℓ=1(1−γℓ)log(1−qG)+∑ℓρℓ,glogbg +G∑ℓ=1(1−ρℓ,g)log(1−bg)+p∑j=1{(c1+n/2−1)log(σ−22,j)−c2σ−22,j} +G∑g=1{(c1+mg/2−1)log(s−2g)−c2s−2g} +r∑j=1{(c+n/2−1)log(σ−21,j)−c2σ−21,j},

where the constant

involves the parameters of the hyperpriors. The parameters

’s, ’s and ’s enjoy conjugacy, and thus they are sampled directly from their full conditionals. We implement Bayesian backfitting type MCMC algorithm for sampling ’s (Hastie and Tibshirani, 2000). To sample , we implement a gradient-based Langevin Monte Carlo (LMC) sampling algorithm. Gradient based samplers are more efficient in complex hierarchical Bayesian model. For polar angles and ’s, we consider a random walk-based MH step. We update the ’s and ’s using a Gibbs sampler as in Kuo and Mallick (1998). We start our computation setting and for all and , and keep that fixed in the initial part of the chain. Further details on parameter initialization and computational steps are postponed to the supplementary materials.

4 Simulation

In this section, we study empirical efficacy of the proposed method. The true regression functions have both linear and nonlinear parts. The true loading matrix is shown in Figure 3. It has 3 true factors. Our simulations investigate both prediction efficiency and recovery of the true loading structure. Furthermore, we also compare selection efficacy with few other non-parametric selection methods.

We generate 50 replicated datasets for each simulation scenario. For each replication, we generate 100 observations. Next, we divide those 70/30 into training and testing. We consider two scenarios, with the number of predictors being or . Based on these predictors, we form groups with members following our motivating HCP data application. Thus, when , there are groups and for , we have . Hence, the - group will include predictors with index . We estimate the model parameters using training data and calculate prediction mean squared errors (MSE) on the test set for all the methods. The prediction MSE is defined as , where stands for the test set for - response.

We compare our proposed RBF-net with SoftBART, Dirichlet Additive Regression Tree (DART), Random Forest (RF), Relevance Vector Machine (RVM), LASSO, Adaptive LASSO (ALASSO), Adaptive Elastic Net (AEN), SCAD, and Multivariate LASSO. For SoftBART and DART, we consider 3 choices, 20, 50, and 200 for the number of trees as in

Linero (2018). For all the Bayesian methods, we collect 5000 postburn samples after burning in 5000 initial samples. We use the following R packages for fitting the competing methods gglasso (Yang et al., 2020) e1071R (Meyer et al., 2019), glmnet (Friedman et al., 2010), BART(McCulloch et al., 2019), ncvreg(Breheny and Huang, 2011), gcdnet(Yang and Zou, 2017), randomForest (Liaw and Wiener, 2002), kernlab (Karatzoglou et al., 2004), and the R package for softBART is from https://github.com/theodds/SoftBART. In addition, we attempted to fit multi-response linear group variable selection method of MBSGS (Liquet et al., 2017). However, it was difficult to compute prediction MSEs with this method when the response data has missing entries at random locations. Our naive modification of their code produced very bad predictions. Thus, this method is not included in our analysis. For our method, the predictive value of the test data is estimated by averaging the predictive mean conditionally on the parameters and training data over samples generated from the training data posterior. We generate the data based on the following regression functions:

 f1(x) =10sin(πx1x2)+20(x3−0.5)2+(10x4+5x5)+x73:77\boldmathξ1, f2(x) =x73:77\boldmathξ2, f3(x) =0.1exp(4x1)+4/(1+exp(−20(x2−0.5)))+3x3+2x4+x5 +x73:77\boldmathξ3, f4(x) =10sin(πx1x2)+5sin(x3x4)+x5+x73:77\boldmathξ4, f5(x) =5x2/(1+x21)+3x3+2x4+x5, f6(x) =0.1exp(4x1)+4/(1+exp(−20(x2−0.5)))+20(x3−0.5)2 +(10x4+5x5)+x73:77\boldmathξ6, f7(x) =5x2/(1+x21)+20(x3−0.5)2+2x4+x5+x73:77\boldmathξ7,

where stands for the array of first 5 predictors from 9- group. The regression coefficients ’s are generated from . Additionally, does not have any contribution from the 1- group and group 9 does not contribute to . While generating the predictors ’s, we first generate from standard normal and transform those to as . Due to this rescaling step, the generated data is identical for any where . The same transformation is also applied on the nodal attributes in Section 5. The response matrix is generated as , where we generate latent factors as . The true loading matrix, is illustrated in Figure 3. The non-zero entries are generated from . We then mean-center the data before applying all the prediction methods.

Note that except for ‘Multivariate LASSO’ all the other methods are fitted for each response separately as the available software packages only support univariate response. The R package glmnet

for multivariate response with ‘family=“mgaussian”’ does not execute if the response matrix has missing entries at random location. We thus implement an expectation maximization (EM) algorithm based optimization program applying

glmnet in M-step and predict in E-step. For all the Bayesian methods, we collect 5000 MCMC samples after burn-in 5000 samples.

Prediction performances across all the methods are presented in Table 1. We compute prediction MSEs for each response separately and report the integrated average for all the other methods except for multivariate LASSO, as their current R implementations only support univariate responses. Performance of the proposed nonparametric regression model, referred to as RBF-net is overwhelmingly better than all other competitors. Among all the competitors, multivariate LASSO performs the best, which is yet not as good as our proposed non-parametric regression model.

Figure 3 illustrates estimated loading matrices on applying our proposed method (supervised case) and also estimates on applying factor model without adjusting for the covariates (unsupervised case). As in Roy et al. (2020), we plot to maintain the scale. For both of two cases of , recovery of the true loading structure under some permutation of columns is overwhelmingly better on using the supervised model, which is reasonable as the data is generated from a supervised model. Thus, it is important to adjust for the predictors for better estimation.

4.1 Selection comparison

We now compare selection performance of the proposed method with all the other competitors from the previous subsection using receiver operating characteristic (ROC) curves. In addition, we also include results using another nonparametric selection algorithm, called EARTH (Doksum et al., 2008). For forest and regression tree based methods, we collect importance measures and number of time a predictor is selected for a node. For group LASSO and multivariate LASSO, we collect estimated coefficients. For the EARTH algorithm, we apply evimp

function on the output to compute variable importance using the RSS criterion. These measures are computed for all seven response variables and averaged out to get a single vector of importance measures. For LASSO, we take the average over modulus of the estimated coefficients. These summaries estimates are then used to compute the ROC curves for each case. For our method, the ROC curve is prepared using the posterior inclusion probabilities,

. Figure 4 illustrates the ROC curves for the two choices of number of groups, . For SoftBART and DART, we consider 50 regression trees which produced the best selection performance. Figure 4 shows that the selection performance of our proposed method is overwhelmingly better.

5 Application to Human Connectome Project

We collected the raw data at https://connectomedb.com. To process the data, we use the Brainsuite software (Shattuck and Leahy, 2002). Using this software, we also construct structural connectivity matrices that compute the number of white fibers connecting two brain regions. In such processing, we require both T1-weighed and Diffusion-weighted imaging (DWI) images for each individual. The processing takes several standard steps. We provide a summarized description of the steps. In the first step, we process the T1-weighted images and register different brain regions based on the USCBrain atlas (Joshi et al., 2020) from the Brainsuite software. The second step computes the white matter tracts after aligning the DWI-images on the T1-images and subsequently, constructs the structural connectivity matrices.Our response variables are seven cognition related test scores as described in Section 1. Before applying all the methods, we mean-center and normalize all the responses. In this paper, we consider a sample of 100 individuals for computational convenience.

We then apply different functions from the R packages NetworkToolbox and brainGraph to compute the nine nodal attributes as described in Section 2.1 for each node. There are nodes in this atlas, leading to 1431 predictors. Instead of flooding the model with nodal attributes of all the brain regions, we consider a hybrid mechanism for the data application as in Liang et al. (2018). We first pre-select a set of brain regions and then apply our proposed method in the reduced set for our final inference. Since, the predictors are in groups, we modify the screening technique of Liang et al. (2018) and propose following alternative approach. Note that the following procedure also differs from Section 3.3 in Step (b).

• Compute the nonparanormal transformations (Liu et al., 2009) of the data for , where stands for the response vector with . Similarly, evaluate for with the same notational meanings from Section 3.3.

• For each group

, we fit a multiple linear regression model for

on using R function lm and obtain the p-value corresponding to the F-statistic. Define if the p-value is less than or equal to 0.05.

• Define the set .

The multiple linear regression step helps to identify if there is any linear dependence between and . We use the R package huge (Jiang et al., 2020) to compute a nonparametric estimate of ’s, ’s, and get the transformation. Applying above screening mechanism, our final set of selected nodes consists of 23 nodes which give us predictors.

Based on the posterior inclusion probabilities of each group, we list the 23 brain regions in decreasing order. Below are the top ten regions from that list along with some citations in brackets that studied or discussed its role in cognitive tasks. The regions are left anterior cingulate (Stevens et al., 2011), left pre central (Padmala and Pessoa, 2010), left lateral occipital (Shiohama et al., 2019), left posterior cingulate (Burgess et al., 2000), left fusiform (Tsapkini and Rapp, 2010), right inferior temporal (Schnellbächer et al., 2020), right temporal lobe (Chan et al., 2009), right superior temporal sulcus (Westphal et al., 2013), left lateral orbitofrontal (Jonker et al., 2015; Nogueira et al., 2017), and right supra marginal (Hartwigsen et al., 2010). There are six regions from left hemisphere and four from the right hemisphere. In the above list, The regions such as left anterior cingulate, left lateral orbitofrontal are known to be associated with decision-making. Among the other regions, left lateral occipital, left fusiform, right inferior temporal, right superior temporal sulcus, and left lateral orbitofrontal are responsible for different aspects of visual recognition. Left posterior cingulate is an important component of default mode network.

To draw inference on the nodal attributes, we look at the pulled effect of each attribute which is defined as , for the - attribute where and are the posterior inclusion probabilities of - group and - attribute in - group respectively. The posterior estimate stands for the posterior median of . Among the nodal attributes, we order the attributes in decreasing order of pulled effect based on ’s. The ordering follows local shortest path length of each node, efficiency, nodal impact, eccentricity of the maximal shortest path length, within-community centrality, betweenness centrality, degree, leverage centrality, and participation coefficient. Thus, top four most influential nodal attributes are all based on shortest paths between different pairs of ROIs. Effect of the shortest paths has been found to be influential for cognitive studies (Uehara et al., 2014). Path lengths represent the amount of processing required for transferring information among the brain regions (Rubinov and Sporns, 2010), and thus having shorter path lengths is advantageous for expeditious communications (Kaiser and Hilgetag, 2006). Similarly, we also compute pulled attribute effects applying the group LASSO method and illustrate the result in Table 10 of the supplementary materials. Efficiency, nodal impact and between-ness centrality are the top 3 most influential attributes based on aggregated effects. Thus, efficiency and nodal impact are identified as influential both by our method and group LASSO.

6 Discussion

In this article, we studied association between cognitive functioning and structural connectivity properties. For cognitive functioning, we analyzed the data on seven cognition related tasks. Structural connectivity properties are illustrated in terms of nine nodal attributes. Our primary inferential goals are to analyze the dependency patterns among the cognitive test scores, and study the effect of nodal attributes in cognition. Due to the inherent grouping among the nodal attributes, we considered a group variable selection approach. We developed a novel nonparametric regression model with group sparsity for multivariate response using a novel RBF-net architecture. Our proposed RBF-net based means share an importance coefficient, which governs the variable selection. The required group variable selection is thus performed by imposing a group sparsity prior on . We obtain excellent numerical results and show vast superiority in performance over several other competing methods. The loading patterns in Figure 5 reveal interesting dependency structures among the cognitive test scores. Based on the pulled effect of the nodal attributes, the shortest path length based attributes are found to more important than others in cognitive processing.

We may add additional nodal attributes, and even incorporate global attributes such as network small-worldness in the model as well as other subject specific covariates such as gender, age, etc. The correlation matrices ’s may be varied for different responses to ensure greater flexibility. Another direction will be to have more than one layer and develop a deep architecture. Extending our proposed architecture to a deep neural net is straightforward. Prediction performance with a deep neural net is also expected to be better at a higher computational expense. Future research will focus on improving upon the computational issues and devise a deep architecture that can improve both selection and prediction performances. To maintain simplicity in our analysis, we have considered the adjacency matrices to be binary while computing the nodal attributes. Alternatively, the nodal attributes could be evaluated taking into account the edge-weights of structural connectivity matrices. In our connectome application, greater weight implies stronger connectivity. Such connectivity pattern is also observed, for instance in collaboration networks among the scientists. Newman (2001) computed different nodal attributes taking inverse of the edge-weights as cost while studying a scientific collaboration network. We may take such approaches in our connectome application, taking number of connecting white matter fibers as weights.

We have not established any theoretical consistency properties of our Bayesian model. Such result may further strengthen our analysis. In variable selection problem, selection consistency is a desired theoretical property. Under some model identifiability condition, as in Liang et al. (2018), we may be able to establish variable selection consistency properties. Additionally, computational speed may be improved further with stochastically computed gradients. The newly developed stochastic gradient based MCMC methods from Song et al. (2020) may be incorporated in our computation to improve computational efficiency.

Beyond the presented application in HCP data, our method may be useful in other contexts. The key features of our prediction model include taking nodal attributes as predictors, and hence, imposing a group sparsity structure for feature selection. The model is further developed for multivariate response data. These innovations are potentially useful for any network based prediction problem, and may be beneficial in inference.

Supplementary materials

The supplementary materials contain proof of Lemma 1, detail to posterior computation steps, further exploratory analysis and R scripts to fit the models. Data on the normalized and mean-centered cognitive scores and nodal attributes are also provided, along with the names of the screened brain region names and names of the nodal attribute. The data on nodal attribute is already transformed using from Page 21. In the codes.R file, detail description of the attached data is provided. With this data Figure 1, Figure 5, prediction MSE due to RBF from Table 2 and all other inferences from Section 5 can be reproduced.

S.1 Proof of Lemma 1

The partial derivative of with respect to ,

 f′ℓ(x)=∂f(x)∂xℓ=−2K∑j=1λjdℓ\boldmathΩℓ⋅(x−\boldmathμj)Dexp{−(x−\boldmathμj)TD\boldmathΩD(x−\boldmathμj)}

The ‘if’-part is easy to verify. For the ‘only if’-part, we use continuity of . Note that . Due to continuity, there exists a such that . Since our assumption is that for all , we then must have .

S.2 Computation

The number of components, is selected as discussed in Section 2.3. Due to the binary parameters ’s and ’s, the candidates for the random-walk MH-steps, and computations of associated acceptance probabilities are modified. If, for example , the candidates for for all are generated from the prior . The same holds for for all . If , but , again the corresponding entries in ’s and ’s are updated similarly.

Our sampler iterates between the following steps.

1. Updating : The full conditional distribution for is given by .

2. Updating : As in the backfitting algorith of Hastie and Tibshirani (2000), we update ’s one-by-one. The full conditional of is where and we also have with

3. Updating : As discussed in Section 2.4, we update ’s using random walk M-H afterwards. We provide the candidate for the random walk M-H step.

• M-H sampling step: The negative log-likelihood is the same as above. The proposals are generated from truncated normals with the current value as the mean and variance are tuned to achieve a pre-specified range of acceptance. Specifically, the candidate is generated as and ’s are tuned after each 100 iterations such that the rate of acceptance is between 0.1 to 0.3.

• If indicators are zero: For all such that , the candidates for are generated from truncated normal as before. However, for all such that , we generate candidates for

from the uniform distribution which is the prior. The acceptance probabilities thus do not include the

’s and ’s such that as they are generated from the prior.

4. Updating polar angles ’s: For the random walk M-H update we generate candidate as for and . The negative log-likelihood is again

5. Updating : We update using HMC. Thus we again provide the negative log-likelihood and its derivative with respect to . The negative log-likelihood is given by,

 n∑i=1v∑j=1(yi,j−fj(xi))2/(2σ22,j)+∑ℓ∈mgβ2g,ℓ/(2v21).

The derivative is given by,

 K∑j=1n∑i=12λj(yi−f(xi)){% \boldmathΩjD(xi−\boldmathμj)}{(xi−\boldmathμj)}/σ2+d/v21.
6. Updating : The full conditional distribution for is given by .

Updating steps for , , and remain the same in the high dimension. The sampling steps for other parameters are also almost identical except for generating the candidates. Specifically, when for some , the corresponding parameters do not contribute to the model likelihood. Thus, they are generated from the prior. Specific steps are given below. We start the chain setting for all . The updating of begins only after iterations and from that stage onwards, we only consider random-walk-based M-H updates for all the other parameters. We update after each 10- iteration to speed up the computation without sacrificing the performance. We first run randomForest to preselect a set of predictors and run the low-dimensional model. Based on the estimates of the low-dimensional model, we initialize our model parameters in high-dimension. This ensures faster convergence.

1. Updating : Following Kuo and Mallick (1998), we update ’s in random ordering. When we update , we let , where . Here is the model likelihood setting and alternatively, is the model likelihood setting .

2. Updating : For all such that , the candidates for are generated from truncated normal as before. However, for all such that , we generate candidates for from the uniform distribution which is the prior. The acceptance probabilities thus do not include the ’s and ’s such that as they are generated from the prior.

3. Updating polar angles ’s: For all such that , the candidates for are generated from truncated normal as before for all . However, for all such that , we generate candidates for from either