I Introduction
Functional magnetic resonance imaging (fMRI) has been commonly used to noninvasively detect human brain activity by measuring the blood oxygenationlevel dependent (BOLD) changes. In the past decades, a large number of models have been developed to analyze brain functional connectivity (FC) using fMRI. By incorporating graph theory, the brain can be described as a graph, where a node represents a welldefined region of interest (ROI) and an edge represents a functional interconnection between the nodes (ROIs) it connects. Then, the brain FC can be mathematically estimated and quantified. Based on the representation form of the graph, methods for FC estimation can be roughly divided into two categories: undirected graphical models, and directed acyclic graph (DAG) models. The undirected graphical models characterize the brain FC through the association coefficients among various ROIs, which reflect their spatial relationships. For instance, the Pearson correlation is a typical measurement in undirected graphical models from static to dynamic FC estimations
[1, 2], which describes the linear correlation between the ROIs. In a complex system like the brain, the Pearson correlation is much weaker marginally [3]. That is, all nodes (ROIs) are directly or indirectly correlated and it is difficult to distinguish significant connections through a dense network constructed by Pearson correlations. To this end, partial correlations have been proposed to explore direct associations between two nodes, removing the effect of other random variables
[4, 5]. Approaches to characterize statistical associations are often used for estimating brain network interactions, but fail to reveal the direction of information flow behind these interactions. It is natural to shift the focus of FC from association to causal interactions for indepth research [6].Directed acyclic graph (DAG) models, also known as Bayesian networks, are designed to model causal relationships in complex systems. Causality models pinpoint the key connectivity characteristics and simultaneously remove some redundant features for diagnosis. Methods for DAG identification can be divided into four categories: constraintbased methods, scorebased methods, nonGaussian based methods, and hybrids of these categories. Constraintbased methods, such as the prominent PC algorithm [7], first learn the skeleton of the DAG from conditional independence relationships and then orient the edges. Scorebased methods, on the other hand, posit a scoring criterion for each possible DAG model, and then search for the graph with the highest score given the observations. An example is the greedy equivalence search (GES) algorithm proposed by Chickering [8], which greedily optimizes the penalized likelihood such as the Bayesian Information Criterion (BIC). Both the constraintbased and the scorebased methods are not optimal for predicting the direction of causal relationships but are accurate in identifying the causal skeleton (graph structure without directions) in fMRI study [9]. The nonGaussian based methods, which refer to the Linear NonGaussian Acyclic Model (LiNGAM), aim to estimate linear Bayesian networks for continuous data using the nonGaussian information. The key aspect of LiNGAM is the use of nonGaussian data, which makes it possible to identify more of the graph structure than the traditional Gaussian setting. Several methods have been proposed in the literature, such as ICALiNGAM [10], direct LiNGAM [11], and pairwise LiNGAM [12]. However, compared to the other two categories of methods, LiNGAM requires a larger number of data points in the relevant dimension to converge to the true graph.
In biomedical applications, we are often faced with small sample size problems where the number of variables/nodes greatly exceeds the number of samples/observations, i.e., high dimensional cases. As a result, both computational difficulty and convergence issue often arise based only on the observations. This is especially the case for the estimation of DAG models, which is in general nondeterministic polynomial time hard (NPhard) [13]. To this end, we propose a learning incorporated linear nonGaussian acyclic model (LiNGAM), particularly for highdimensional cases. The learning method [3] was first proposed to fast estimate the undirected graph with the equivalent partial correlations,i.e., the correlation. The method works well especially for highdimensional cases, with successful applications to gene regulatory network inference [14] and brain FC analysis by us recently [4]. Herein, we apply the learning method to refine the set of likely causal connectivity, facilitating causal inferences with fast computation. Following the same idea of LiNGAM (e.g., [11]), we assume that the observations are nonGaussian and continuous, and incorporate prior information from direct LiNGAM into the DAG estimation. To acquire the prior knowledge, we estimate the undirected graph first, since the skeleton of the directed graph is always included in it. To be more specific, we apply the nonparanormal transformation [15] first, which converts the problem to a Gaussian graphical model (GGM), and then adopt the learning to identify the undirected graph structure. The contributions of this paper are generally twofolds. Mathematically, we overcome the highdimensionality difficulty of the LiNGAM methods and increase the convergency speed to the true DAG, even under the condition of small sample size. Biomedically, we apply the proposed model to the restingstate fMRI (rsfMRI) data from PNC, aiming to explain the cognitive variance through the directed functional connectivity differences. Unlike previous studies using association models [1, 4], our work is able to orient the causal directions and identify three types of hub structures: inhub, outhub and sumhub from the directed graphs.
The remainder of this paper is organized as follows. In Section II, we first introduce some background knowledge of directed graphs and then describe the proposed LiNGAM method. Both simulation studies and the rsfMRI analysis using PNC data are shown in Section III, followed by some discussions and concluding remarks in the last two sections.
Ii Methods
In this section, we first introduce some basic concepts of DAGs and the relationship between directed and undirected graphs in Section II.A. Then we describe the LiNGAM and the method for prior knowledge estimation in Section II.B and II.C, respectively. Finally, we summarize the LiNGAM algorithm in Section II.D.
Iia Background: directed graph
A graph is composed of a node set and an edge set . In our setting, the nodes in set
correspond to the components of a random vector
. An edge is directed if but , and we denote it as and node is called a parent of node . A directed acyclic graph (DAG) means all the edges in are directed and there is no circle in graph (see Fig. 1 (a)). An edge is undirected if and , and we denote it as . An undirected graph is composed of undirected edges. The skeleton of a DAG , , is the undirected graph obtained from by substituting undirected edges for directed edges (see Fig. 1 (b)).The two graphical structures, directed and undirected graphs, express different independence properties in a system. Fig. 1 (c) gives an illustration of a directed/ undirected graph whose independence properties cannot be expressed by each other. A directed graph can be converted into an undirected graph, called a moral graph [16]. In general, we first add additional undirected links between all pairs of parents for each node in the graph and then drop the arrows on the original links to give the moral graph . The relationship among the skeleton of the DAG , the moral graph and the undirected graph is .
IiB LiNGAM
The linear nonGaussian acyclic model (LiNGAM) was first proposed by Shimizu et al. [10], which is a Bayesian network (BN) using a structural equation model (SEM) for nonGaussian variants. LiNGAM assumes that the casual relationships of the variables can be represented graphically by a DAG , where the node set represents the corresponding variables and denotes the directed edges. Let be the weighted adjacency matrix specifying the edge weights of the underlying DAG . The observed random vector is assumed to be generated from the following linear SEM,
where is a continuous random vector; the ’s,
have nonGaussian distributions with nonzero variances, and are independent of each other.
A property of acyclicity is that there exists at least one permutation of variables such that , . In other words, the weight matrix can be reordered to a strictly lower triangular matrix according to the permutation . The goal of LiNGAM is to find the correct permutation and estimate the weight matrix . Since the components of are independent and nonGaussian, Shimizu et al. [10]
first proposed the independent component analysis (ICA) based algorithm known as ICALiNGAM. However, like most ICA algorithms, the convergency of ICALiNGAM is affected by the initial state, and an appropriate parameter selection is not straightforward. The direct LiNGAM
[11] method has been developed without the need for initial guess or algorithmic parameters, which also has the guaranteed convergence. However, it requires a large number of samples to reach the true graph and the computational time is much longer than the ICALiNGAM. Fortunately, the direct LiNGAM is flexible to incorporate prior knowledge. With a proper selection of priors, both the convergence and the computation can be largely accelerated due to the significant shrinkage of searching space.IiC Prior knowledge estimation
We have discussed the relationship between directed graphs and undirected graphs in Section II.A. The undirected graph can be treated as the upper bound of the DAG, which can be estimated first as the prior information. Our goal in this step is to eliminate as many confounding edges as possible and only keep the direct associated edges, and therefore we select partial correlation coefficients as the measure of dependencies. Gaussian graphical models (GGMs) have become a popular tool for the estimation of undirected graphs using partial correlation coefficients due to their mathematical simplicity. To be more specific, let denote a dimensional random vector following a multivariate Gaussian distribution , where and denote the unknown mean and the covariance matrix, respectively. It has been proven that the partial correlation between and on the condition of all other variables can be expressed as
(1) 
where is the precision matrix (i.e., the inverse of the covariance matrix ), denotes the (,)th entry of and denotes the index set of variables [17], [18]. Thus, the construction of undirected graphs using GGMs is equivalent to the estimation of the precision matrices.
In our model, the variables are assumed to be nonGaussian continuous, thus, the original GGMs with Gaussian assumption cannot be directly applied. To this end, we adopt the nonparanormal transformation proposed by Liu et al. [15], known as a semiparametric Gaussian copula model, to render data normal. Subsequently, we adopt the learning method [3] to infer the GGM structure due to its computational efficiency and flexible framework as demonstrated by our previous work [4].
IiD LiNGAM
We now illustrate the LiNGAM we proposed for high dimensional DAG estimation. First, we follow the same definition of the prior matrix as in [11], which is given as follows:
(2) 
The procedure of the LiNGAM algorithm is summarized in Algorithm 1. The nonparanormal transformation can be implemented through the R package huge. The R package equSA is designed for the learning method. Since we intend to find the upper bound of the DAG, we set a broader significance level with . The code for direct LiNGAM is publicly available at https://sites.google.com/site/sshimizu06/lingam.
Iii Results
Iiia Simulation studies
In this section, we evaluated our proposed model through a series of simulation studies. We simulated the random DAG through the R package pcalg
with the edge probability
, where is an edge degree parameter and is the total number of variables. Given , we assigned uniformly random weights to the edges to obtain the weighted adjacency matrix : , if ; otherwise . Given , we generated from two nonGaussian noise selections: Exponential (Exp) and Chisquared (Chisq) noises. The exponential noise is set to have rate , i.e., ,, and the chisquared noise is set to be central with degree of freedom
, i.e. , . We illustrated the performance of the proposed LiNGAM under the cases of small sample size and high dimensionality. We set and sampled the random vectors for each noise selection with and the degree parameter . For each scenario, we simulated 10 datasets independently. We assessed the performance of the four methods through the true positive rate (TPR), false discovery rate (FDR), and structural hamming distance (SHD) [19]. SHD is a commonly used metric based on the number of operations needed to transform the estimated DAG into the true graph [20]. In simple terms, SHD counts the total number of edge insertions, deletions or flips in the transformation.Fig. 2 shows the performance of LiNGAM under small sample size and highdimensional settings with (i.e., ). As the graph gets denser ( increases) or larger ( increases), although the TPR descends the FDR remains at a low range. To further demonstrate the superiorty of the LiNGAM, we compared it with existing methods, which were the PC algorithm [7], GES [8], the ICALiNGAM [10] and the direct LiNGAM [11] with (i.e., ) in Appendix.A. (The two LiNGAM methods cannot be applied for highdimensional cases.) From the results between LiNGAM and direct LiNGAM, LiNGAM has significantly solved the convergence speed issue. In fact, the LiNGAM has outperformed the other methods under each setting, especially under large variable numbers and/or low degree parameter settings (see Appendix.A Figure A, B). Compared to direct LiNGAM, the computational complexity of LiNGAM is also favorable. The mean CPU times of the two methods under various variable sizes ’s on a 4 GHz computer are shown in TABLE I.
Method  

LiNGAM  
DirectLiNGAM 
IiiB Brain network analysis based on fMRI
IiiB1 Data description
In this work, we intend to investigate the relationship between the variance of cognitive abilities and the brain functional activity. The dataset we used is the publicly available Philadelphia Neurodevelopmental Cohort (PNC), which is a largescale collaborative study of the Brain Behavior Laboratory at the University of Pennsylvania and the Children’s Hospital of Philadelphia [21]. The data contains (among other fMRI modalities [22]) rsfMRI for nearly subjects aged from to years old. All MRI scans were acquired on a single 3T Siemens TIM Trio wholebody scanner. Blood oxygenation leveldependent (BOLD) fMRI was acquired using a wholebrain, singleshot, multislice, gradientecho (GE) echoplanar (EPI) sequence of volumes (372s) with the following parameters TR/TE=3000/32 ms, ﬂip , FOV mm, matrix , slice thickness/gap . The resulting nominal voxel size was mm [22]. Standard preprocessing steps were applied using SPM12 (https://www.fil.ion.ucl.ac.uk/spm/software/spm12/), including motion correction, spatial normalization to standard Montreal Neurological Institute (MNI) space, and spatial smoothing with a mm full width at half max (FWHM) Gaussian kernel. The effect of head motion ( parameters) was further regressed out and BOLD time series went through a bandpass filter ( Hz to Hz) to suppress physiological artifacts [23]. Finally, we reduced the dimensionality of the data by employing the standard ROIs template proposed by Power et al. [24] with a mm sphere radius. For the ROIs, we divide them into 12 functional network (FN) modules, including sensory/somatomotor network (SSN), cinguloopercular task control network (CON), auditory network (AUD), default mode network (DMN), memory retrieval network (MRN), visual network (VN), frontoparietal task control network (FPN), salience network (SN), subcortical network (SCN), ventral attention network (VAN), dorsal attention network (DAN) and cerebellum network (CERE).
Beyond imaging, all subjects underwent a hour long computerized neurocognitive battery (CNB) adapted from tasks applied in functional neuroimaging studies to evaluate a broad range of cognitive domains [25]. For this work, we relied on performance scores (ratio of total correct responses) from the wide range achievement test (WRAT), which measures an individual’s learning ability (reading, spelling, and mathematics) [26], and provides an estimate of intelligence quotient (IQ). To mitigate the influence of age over the final results, we excluded subjects whose ages were below 18 years old [27]
. We then converted the WRAT scores to zscores based upon each subject’s raw score and the sample mean in order to provide a standard metric. We only kept subjects whose absolute zscore values were above 0.5, i.e.,
, to better compare and explain the brain cognitive differences. Finally, subjects were left for analysis and divided into 2 groups according to the WRAT scores (high WRAT group: subjects/low WRAT group: subjects). The detailed demographic information is summarized in TABLE II.Group  Mean WRAT (SD)  Age Range  Sex Ratio (M:F) 

High  
Low 
Characteristics of the subjects in this study. Here SD denotes the standard deviation, M and F represent male and female respectively.
IiiB2 Results
First of all, we calculated the Anderson–Darling score as suggested in [28] to ensure that the rsfMRI data meet our nonGaussianity assumption. For each subject, we then applied the LiNGAM method, and obtained the corresponding weighted directed adjacency matrix at the significance level (with FDR correction). We extracted three network statistics, as suggested in [5], i.e., density, transitivity, and global efficiency (GE), to reflect various aspects of the directed brain functional network. Density is the fraction of present connections to all possible connections, which also shows the mean network degree. The degree of a node is defined as the number of edges connected to the node. For a directed graph, the degree of a node is composed of two elements: indegree and outdegree, which are named after the directions of the edges. Transitivity calculates the ratio between the number of triangles and the number of connected node triples. Transitivity is a measure of the brain functional segregation, i.e., the ability for specialized processing to occur within densely interconnected groups of brain regions [29]. GE calculates the average inverse shortest path length, which indicates the ability to combine specialized information from distributed brain regions, also known as the functional integration ability. The network statistics are calculated through the Brain Connectivity Toolbox (https://sites.google.com/site/bctnet/). In TABLE III, we find that although the brain network density of the low WRAT group is much larger than that of the high WRAT group, there are no big differences between them according to transitivity and global efficiency.
Group  Density  Transitivity  Global Efficiency 

High  
Low 
To identify group structures, we conducted a onesample ttest for each group to select the significant edges at significance level
and set threshold of the weights (i.e., the mean weights should be above ), which guarantee the results to be both statistically significant and numerically noticeable. Finally, and pairs of directed brain connections were identified for the high and low WRAT groups, respectively (see Fig. 3). To gain more insights into the obtained networks, we analyzed the inferred hub nodes in three ways: inhubs, outhubs and sumhubs. Here we define hubs as the nodes with degrees at least two standard deviation higher than the mean degrees ([30]). The inhubs (centers that receive information) and outhubs (central nodes that convey out information) are selected through the in and out degrees, respectively, while the sumhubs are the ones that do not belong to the in and out hubs but their sum degrees are high, i.e., transitional centers. As shown in TABLE IV, there is one inhub ROI located in the right inferior frontal gyrus, triangular (IFGT.R) and one outhub ROI located in the right middle occipital gyrus (MOG.R) for the high WRAT group. For the low WRAT group, inhub ROIs have been selected, which are located in the right middle frontal gyrus (MFG.R), right precentral gyrus (PREG.R), right inferior temporal gyrus (ITG.R), and right middle temporal gyrus (MTG.R); outhub ROI is located at the right superior temporal gyrus (STG.R). There are no sumhubs identified for both groups.High WRAT Group  
Type  Index  MNI  AR  FN  Degree 
Inhub  175  IFGT.R  FPN  
Outhub  157  MOG.R  VN  
Low WRAT Group  
Type  Index  MNI  AR  FN  Degree 
Inhub  206  MFG.R  SN  
205  PRE.R  SN  
11  ITG.R    
84  MTG.R    
Outhub  235  STG.R  VAN 
To further compare the differences between the high and low WRAT groups, we conducted a Welch two sample ttest at significance level and Cohen’s d effect size statistics [31]. According to Cohen [32], the difference is negligible when . Thus, we chose and selected significant edges that perform differently between the two groups. The numbers of edges (features) for various strength levels of difference are given in TABLE V. The hub ROIs discovered with have been summarized in TABLE VI. Among them, inhub ROIs are located at the right postcentral gyrus (POST.G), left precuneus (PQ.L), right precuneus (PQ.R) and left middle occipital gyrus (MOG.L); outhub ROIs are located at the left rolandic operculum (RO.L) and right putamen (PUT.R); sumhub ROIs are located at the right lingual gyrus (LING.R), left calcarine sulcus (CAL.L), and right thalamus (THA.R).
# of edges 

Type  Index  MNI  AR  FN  Degree 
Inhub  25  POST.R  SSN  
90  PQ.L  DMN  
251  PQ.R  DAN  
147  MOG.L  VN  
Outhub  70  RO.L  AUD  
229  PUT.R  SCN  
Sumhub  148  LING.R  VN  
146  CAL.L  VN  
225  THA.R  SCN 
IiiB3 Support Vector Machine based Classification
With the available phenotype information (WRAT score high/low), we did classification based on our selected features (significantly different connections) as a validation to our discovery. Specifically, we applied linear support vector machine (SVM) with default parameter value
[33] and compared the results with various inputs, which are given as follows:
Pearson correlation ([1]): Calculate the Pearson correlation matrix for each subject and use the upper triangular elements of the matrix as the input.

Partial correlation ([3]): Apply the learning method to calculate the highdimensional partial correlation matrix and use the upper triangular elements of the matrix as the input.

PC ([7]): Apply the PCalgorithm and acquire the adjacency matrix of the completed partially directed acyclic graph (CPDAG), which contains both undirected and directed edges. Use all the elements from the adjacency matrix as the input.

LiNGAM: Apply the proposed LiNGAM method to each subject and acquire the weighted adjacency matrix . Use all the elements from as the input.

: Apply the proposed LiNGAM method to each subject first, then compare the differences between the two groups as described in III.B 2). Set
as the threshold for the feature selection (significantly different connections) and only choose the elements of the selected features from
as the input. 
: Same as e., except for setting as the threshold.

: Same as e., except for setting as the threshold.

: Same as e., except for setting as the threshold.
All the algorithms were implemented in R. The learning method is available in the R package equSA. The PCalgorithm was implemented through the R package pcalg. The SVM algorithm was implemented through the R package e1071. A 5fold crossvalidation (CV) procedure was implemented to evaluate the classification performance in all experiments. All subjects were randomly partitioned into 5 disjoint subsets with similar class distributions and size
. Each subset in turn was used as the test set and the remaining 4 subsets were used to train the SVM classifier. The classifier accuracy was estimated by comparing against the groundtruth labels on the test set. We then averaged the 5 individual accuracy measures as one test result from the CV. The whole process was repeated 20 times to acquire the final mean accuracy rate (ACC) for each input. In TABLE VII, we find that generally using the directed information (PC,
LiNGAM) as inputs is better than using the undirected associations, however, the improvement is not very big. We then selected important features based on the different connections and set various thresholds to compare classification results. After feature selection from the LiNGAM result, the ACC has been improved. Especially, when we set and chose features as the input, the ACC has increased dramatically, which suggests that these features captured the major differences of brain connectivity between the high and low WRAT groups. The details of the causal connections are shown in Fig. 4 and TABLE VIII. When , i.e., features were selected, the ACC was the highest; it indicates that the selected features can explain the most cognitive variance between the two groups.Input  Pearson  Partial  PC  LiNGAM 

ACC  
Input  
ACC 
ROI  Direction  ROI  
Index  MNI  AR (L/R)  FN  Index  MNI  AR (L/R)  FN  
6  Parahippocampus (L)    134  Precuneus (L)  MRN  
234  Thalamus (R)  SCN  
7  Parahippocampus (R)    10  Inferior temporal gyrus (R)    
98  Superior frontal gyrus,  DMN  95  Precuneus (R)  DMN  
medial (L)  
154  Inferior occipital lobe (L)  VN  
100  Middle frontal gyrus (L)  DMN  151  Lingual gyrus (L)  VN  
109  Superior frontal gyrus,  DMN  85  Insula (R)    
medial orbital (L)  
132  Inferior frontal gyrus,    183  Cerebellum    
orbital (L)  crus 1 (L)  
145  Calcarine (R)  VN  94  Middle cingulate gyrus (L)  DMN  
180  Superior frontal gyrus,  FPN  13  Precuneus (L)  SSN  
orbital (R)  
186  Precentral gyrus (R)  FPN  175  Inferior frontal gyrus,  FPN  
triangular (R)  
235  Superior temporal gyrus (R)  VAN  120  Middle temporal gyrus (L)  DMN  
239  Middle temporal gyrus (R)  VAN  236  Middle temporal gyrus (L)  VAN  
247  Fusiform gyrus (R)    105  Superior frontal gyrus,  DMN  
medial (R)  
256  Superior parietal gyrus (R)  DAN  257  Middle temporal gyrus (R)  DAN  
262  Inferior temporal gyrus (L)  DAN  263  Superior parietal gyrus (L)  DAN 
IiiB4 Summary
By applying the LiNGAM method, we obtained each subject’s weighted adjacency matrix . From the mean network statistics extracted from of each group, we found that although the mean densities of the two groups were significantly different (low high), there were no big differences in terms of transitivity and GE. In other words, the ability to combine and process specialized information was similar between the low and high WRAT groups; however, populations in the low WRAT group used more brain connections. The groupwise directed connectivity is shown in Fig. 3 and the detailed hub information is given in TABLE IV. We found that both groups had an outhub whose degree was relatively large, respectively. To be more specific, for the high WRAT group, the ROI located at the right middle occipital gyrus (MOG.R) has significant influences on other areas at rest, while the ROI located at the right superior temporal gyrus (STG.R) has sent extensive information to other ROIs. To compare the differences between groups, as suggested from the classification results, the selected features explained the most cognitive variance when setting . We then identified inhubs, outhubs and sumhubs (see TABLE VI) from the different connections. Combining all the results above, functional modules (the VN, SN and SCN) have been mostly involved in the cognitive differences. Besides, we concluded that the left middle occipital gyrus (MOG.L), precuneus (PQ), and the right postcentral gyrus (POST.G) are influenced differently; left rolandic operculum (RO.L) and right putamen (PUT.R) have aberrant outgoing performance; the right lingual gyrus (LING.R), left calcarine sulcus (CAL.L), and right thalamus (THA.R), as the transitional areas, perform variously at different cognitive levels. TABLE VIII presents most important pairs of causal flows.
Iv Discussion
In the procedure of identifying group brain connectivity patterns and feature selection (significantly different edges), we have used both significance tests (onesample and twosample ttests) and effect size statistics (regression parameters and Cohen’s d statistics). The reasons we did both are described as follows. Firstly, the sample sizes of the two groups are different, while the tvalue and the corresponding pvalue are dependent on the sample size, which makes the statistical results not comparable. The effect size is not dependent on the sample size, hence comes to be a rational choice. Secondly, effect size statistics provide a uniform standard, but the significance of the value stays unknown. In our study, we applied both to make sure that the results were statistically significant and numerically meaningful.
Some of our findings have been already validated in the literature. For instance, considerable prior studies have implicated that the precuneus (PQ), as a central node in the human brain, exhibits heightened connectivity within both DMN and taskpositive networks while at rest [34], and is important for supporting complex cognition and behavior [35]. The study in [36] showed that precuneus played an important role in integrating both internally and externally driven information. Kilory et al. [37] have found positive relationships between PQ and intelligence quotient (IQ). All the evidence agrees with our finding that PQ, as the inside of the causal connections, is affected differently at rest for various cognitive ability (IQ) ranges. The postcentral gyrus (POST), otherwise known as the primary somatosensory cortex, has been reported to correlate with cognitive abilities [38, 39]. Our results give a possible explanation that POST, as an inhub of different causal connectivity in various cognitive groups, receives signals aberrantly from other ROIs. In this sense, the outhub features (centers that convey out information differently) should attract more attention. Among them, Putamen (PUT) has been found to relate to many regions of the cerebral cortex, as well as the thalamus, claustrum, and others through various pathways. In many studies, it has become apparent that the putamen is involved in various types of learning, such as reinforcement and implicit learning [40] and category learning [41], and thus relates to IQ [42]. Our conclusion agrees with the previous studies and suggests that putamen is a center for sending out information that causes the cognitive variance. The superior temporal gyrus (STG), has been discovered to be an important structure in the pathway consisting of the amygdala and prefrontal cortex, which are all involved in cognition [43, 44]. It is essential to the function of language in individuals who may have an impaired vocabulary, or are developing a sense of language. Several fMRI studies have suggested a link between insight based problem solving and activity in the right anterior superiortemporal gyrus (STG.R) [45], which was identified as the outhub ROI of the low WRAT group in our analysis. The thalamus has been generally believed to act as a relay station, relaying information between different subcortical areas and the cerebral cortex [46], which strongly correlates with level of intelligence [47].
We have discovered some hub nodes in the occipital lobe, more precisely at the middle occipital gyrus (MOG), left rolandic operculum (RO.L) and the right lingual gyrus (LING.R). The lingual gyrus is a structure in the visual cortex, which plays an important role in the identification and recognition of words [48], and has been found to correlate with cognitive variance [39]. The roles of MOG and RO.L in cognitive differences are new discoveries, which may suggest potential functional basis of cognition variances but still need further validations.
V Conclusion
In this paper, we propose a learning incorporated linear nonGaussian Acyclic model (LiNGAM) to study the casual interactions in human brain. As suggested by its name, the proposed method assumes nonGaussianity of the data and believes that the nonGaussian information can help to identify the direction of the edge. The main contributions of our work can be summarized as follows. First, the proposed method well integrates both the undirected graph and the directed acyclic graph, in the sense that we incorporate the undirected graph estimation as prior information into the direct LiNGAM model to perform DAG construction. Since the DAG is a subset of the undirected graph, the first step of prior screening can mitigate the irrelevant information but still maintain the true subset, which speeds up both numerical convergence and computation. Second, the simulation results in III. A show that the proposed method is stable with different settings and demonstrates improved performance in graph structure detection compared with other competing methods. Therefore, LiNGAM is a more solid statistical model for DAG estimation. Third, the proposed method has been applied to rsfMRI data from PNC, to explain brain functions through directed FC differences. We have identified significant causal connections that perform differently between the high and low WRAT groups. Three types of hub structures were found (see TABLE VI). Several of our hub ROIs have already been validated to have correlations with the cognitive variance. Our findings can further illustrate the roles of the hub ROIs in the brain causal flow. Finally, the classification results in III.B 3) prove the reliability of our discoveries, which also indicate that it can be widely applied to feature selection tasks in many other neuroimaging studies.
Future work for this line of research includes the following. First, for more reliable biomedical discovery, more datasets are needed to cross validate our findings. Second, the PNC data contains nearly subjects of age from late childhood to young adulthood. In this work, we only used the subjects over years old. It would also be interesting to track functional causal brain connectivity development over various age periods, which requires a joint DAG estimation model such as [49] or timeseries DAG model. We are currently working on the expansion of the LiNGAM method for causal inferences from multiple functional connectivity networks at different brain development stages, which will be reported elsewhere.
Vi Acknowledgment
The work has been funded by NIH (R01GM109068, R01MH104680, R01MH107354, P20GM103472, 2R01EB005846, 1R01EB006841, R01MH121101, R01MH103220), and NSF (#1539067).
References
 [1] B. Cai, P. Zille, J. M. Stephen, T. W. Wilson, V. D. Calhoun, and Y.P. Wang, “Estimation of dynamic sparse connectivity patterns from resting state fmri,” IEEE transactions on Medical Imaging, vol. 37, pp. 1224–1234, 2018.

[2]
G. Zhang, B. Cai, A. Zhang, J. M. Stephen, T. W. Wilson, V. D. Calhoun, and Y.P. Wang, “Estimating dynamic functional brain cconnectivity with a sparse hidden markov model,”
IEEE transactions on medical imaging, vol. 39, pp. 488–498, 2019.  [3] F. Liang, Q. Song, and P. Qiu, “An equivalent measure of partial correlation coefficients for high dimensional gaussian graphical models,” J. Amer. Statist. Assoc., vol. 110, pp. 1248–1265, 2015.
 [4] A. Zhang, J. Fang, F. Liang, V. D. Calhoun, and Y. Wang, “Aberrant brain connectivity in schizophrenia detected via a fast gaussian graphical model,” IEEE Journal of Biomedical and Health Informatics, vol. 23 (4), pp. 1479–1489, 2018.
 [5] A. Zhang, B. Cai, W. Hu, B. Jia, F. Liang, T. Wilson, J. Stephen, V. Calhoun, and Y. Wang, “Joint bayesianincorporating estimation of multiple gaussian graphical models to study brain connectivity development in adolescence,” IEEE transactions on medical imaging, vol. 39, pp. 357–365, 2019.
 [6] A. T. Reid, D. B. Headley, R. D. Mill, R. SanchezRomero, L. Q. Uddin, D. Marinazzo, D. J. Lurie, P. A. ValdésSosa et al., “Advancing functional connectivity research from association to causation,” Nature neuroscience, vol. 1(10), 2019.
 [7] P. Spirtes, C. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson, Causation, prediction, and search. MIT press, 2000.

[8]
D. Chickering, “Optimal structure identification with greedy search,”
Journal of machine learning research
, vol. 3, pp. 507–554, 2002.  [9] S. Smith, K. Miller, G. SalimiKhorshidi, M. Webster, C. Beckmann, T. Nichols, J. Ramsey, and M. Woolrich, “Optimal structure identification with greedy search,” Network modelling methods for FMRI. Neuroimage, vol. 54(2), pp. 875–891, 2011.
 [10] S. Shimizu, P. Hoyer, A. Hyvärinen, and A. Kerminen, “A linear nongaussian acyclic model for causal discovery,” Journal of Machine Learning Research, vol. 7, pp. 2003–2030, 2006.
 [11] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. Hoyer, and K. Bollen, “Directlingam: A direct method for learning a linear nongaussian structural equation model,” Journal of Machine Learning Research, vol. 12, pp. 1225–1248, 2011.
 [12] A. Hyvärinen and S. Smith, “Pairwise likelihood ratios for estimation of nongaussian structural equation models,” Journal of Machine Learning Research, vol. 14, pp. 111–152, 2013.
 [13] D. Heckerman, D. Geiger, and D. Chickering, “Learning bayesian networks: The combination of knowledge and statistical data,” Machine learning, vol. 20(3), pp. 197–243, 1995.
 [14] B. Jia, S. Xu, G. Xiao, V. Lamba, and F. Liang, “Learning gene regulatory networks from next generation sequencing data,” Biometrics, vol. 73, no. 4, pp. 1221–1230, 2017.
 [15] H. Liu, J. Lafferty, and L. Wasserman, “The nonparanormal: Semiparametric estimation of high dimensional undirected graphs,” Journal of Machine Learning Research, vol. 10, pp. 2295–2328, 2009.
 [16] C. M. Bishop, Pattern recognition and machine learning. springer, 2006.
 [17] A. P. Dempster, “Covariance selection,” Biometrics, vol. 28, pp. 157–175, 1972.
 [18] S. Lauritzen, Graphical Models. Oxford: Oxford University Press, 1996.
 [19] I. Tsamardinos, L. Brown, and C. Aliferis, “The maxmin hillclimbing bayesian network structure learning algorithm,” Machine learning, vol. 65, pp. 31–78, 2006.
 [20] M. Kalisch and P. Bühlmann, “Estimating highdimensional directed acyclic graphs with the pcalgorithm,” Journal of Machine Learning Research, vol. 8, pp. 613–636, 2007.
 [21] T. D. Satterthwaite et al., “The philadelphia neurodevelopmental cohort: A publicly available resource for the study of normal and abnormal brain development in youth,” Neuroimage, vol. 124, pp. 1115–1119, 2016.
 [22] T. D. Satterthwaite, M. A. Elliott, K. Ruparel, J. Loughead, K. Prabhakaran, M. E. Calkins, R. Hopson, C. Jackson, J. Keefe, M. Riley, and F. D. Mentch, “Neuroimaging of the philadelphia neurodevelopmental cohort,” Neuroimage, vol. 86, pp. 544–553, 2014.
 [23] B. Cai, G. Zhang, W. Hu, A. Zhang, P. Zille, Y. Zhang, J. M. Stephen, T. W. Wilson, V. D. Calhoun, and Y.P. Wang, “Refined measure of functional connectomes for improved identifiability and prediction,” Human brain mapping, vol. 40, pp. 4843–4858, 2019.
 [24] J. D. Power, D. A. Fair, B. L. Schlaggar, and S. E. Petersen, “The development of human functional brain networks,” Neuron, vol. 67(5), pp. 735–748, 2010.
 [25] T. Moore, S. Reise, R. Gur, H. Hakonarson, and R. Gur, “Psychometric properties of the penn computerized neurocognitive battery,” Neuropsychology, vol. 29(2), p. 235, 2015.
 [26] G. S. Wilkinson and G. J. Robertson, WRAT 4: wide range achievement test. Psychological Assessment Resources Lutz, FL, 2006.
 [27] L. Xiao, J. Stephen, T. Wilson, V. Calhoun, and Y.P. Wang, “Alternating diffusion map based fusion of multimodal brain connectivity networks for iq prediction,” IEEE Transactions on Biomedical Engineering, vol. 66 (8), pp. 2140–2151, 2018.
 [28] J. Ramsey, R. SanchezRomero, and C. Glymour, “Nongaussian methods and highpass filters in the estimation of effective connections,” Neuroimage, vol. 84, pp. 986–1006, 2014.
 [29] M. Rubinov and O. Sporns, “Complex network measures of brain connectivity: uses and interpretations,” Neuroimage, vol. 52, pp. 1059–1069, 2010.
 [30] J. Fang et al., “Fast and accurate detection of complex imaging genetics associations based on greedy projected distance correlation,” IEEE Transactions on Medical Imaging, vol. 37, pp. 860–870, 2018.
 [31] J. Cohen, Statistical power analysis for the behavioral sciences. Abingdon. United Kingdom: Routledge, 1988.
 [32] ——, “A power primer,” Psychological bulletin, vol. 112(1), p. 155, 1992.
 [33] B. Schölkopf, A. J. Smola, F. Bach et al., Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
 [34] R. Leech, S. Kamourieh, C. F. Beckmann, and D. J. Sharp, “Fractionating the default mode network: distinct contributions of the ventral and dorsal posterior cingulate cortex to cognitive control,” Journal of Neuroscience, vol. 31(9), pp. 3217–3224, 2011.
 [35] A. V. Utevsky, D. V. Smith, and S. A. Huettel, “Precuneus is a functional core of the defaultmode network,” Journal of Neuroscience, vol. 34(3), pp. 932–940, 2014.
 [36] A. E. Cavanna and M. R. Trimble, “The precuneus: a review of its functional anatomy and behavioural correlates,” Brain, vol. 129(3), pp. 564–583, 2006.
 [37] E. Kilroy et al., “Relationships between cerebral blood flow and iq in typically developing children and adolescents,” Journal of cognitive science, vol. 12(2), p. 151, 2011.
 [38] W. E. Brown, S. R. Kesler, S. Eliez, I. S. Warsofsky, M. Haberecht, and A. L. Reiss, “A volumetric study of parietal lobe subregions in turner syndrome,” Developmental medicine and child neurology, vol. 46, no. 9, pp. 607–609, 2004.
 [39] W. Johnson, R. E. Jung, R. Colom, and R. J. Haier, “Cognitive abilities independent of iq correlate with regional brain structure,” Intelligence, vol. 36, no. 1, pp. 18–28, 2008.
 [40] H. Yamada, N. Matsumoto, and M. Kimura, “Tonically active neurons in the primate caudate nucleus and putamen differentially encode instructed motivational outcomes of action,” Journal of Neuroscience, vol. 24, no. 14, pp. 3500–3510, 2004.
 [41] S. W. Ell, N. L. Marchant, and R. B. Ivry, “Focal putamen lesions impair learning in rulebased, but not informationintegration categorization tasks,” Neuropsychologia, vol. 44, no. 10, pp. 1737–1751, 2006.
 [42] P. A. MacDonald, H. Ganjavi, D. L. Collins, A. C. Evans, and S. Karama, “Investigating the relation between striatal volume and iq,” Brain imaging and behavior, vol. 8, no. 1, pp. 52–59, 2014.
 [43] R. Adolphs, “Is the human amygdala specialized for processing social information?” Annals of the New York Academy of Sciences, vol. 985, no. 1, pp. 326–340, 2003.
 [44] E. D. Bigler, S. Mortensen, E. S. Neeley, S. Ozonoff, L. Krasny, M. Johnson, J. Lu, S. L. Provencal, W. McMahon, and J. E. Lainhart, “Superior temporal gyrus, language function, and autism,” Developmental neuropsychology, vol. 31, no. 2, pp. 217–238, 2007.
 [45] M. JungBeeman, E. M. Bowden, J. Haberman, J. L. Frymiare, S. ArambelLiu, R. Greenblatt, P. J. Reber, and J. Kounios, “Neural activity when people solve verbal problems with insight,” PLoS biology, vol. 2, no. 4, p. e97, 2004.
 [46] M. Gazzaniga and R. B. Ivry, Cognitive Neuroscience: The Biology of the Mind: Fourth International Student Edition. WW Norton, 2013.
 [47] G. Pergola, C. Bellebaum, B. Gehlhaar, B. Koch, M. Schwarz, I. Daum, and B. Suchan, “The involvement of the thalamus in semantic retrieval: a clinical group study,” Journal of cognitive neuroscience, vol. 25, no. 6, pp. 872–886, 2013.
 [48] A. Mechelli, G. W. Humphreys, K. Mayall, A. Olson, and C. J. Price, “Differential effects of word length and visual contrast in the fusiform and lingual gyri during,” Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 267, no. 1455, pp. 1909–1913, 2000.
 [49] Y. Wang, S. Segarra, and C. Uhler, “Highdimensional joint estimation of multiple directed gaussian graphical models,” arXiv preprint, vol. arXiv:1804.00778, 2018.