1 Introduction
Since the beginning of the 21st century, the European Union (EU) and the United States (US) have launched costly brain research projects: the Human Brain Project (HBP) in the EU and the BRAIN Initiative in the US. Guided by the achievements of neuroscience research and aided by computers, the HBP will build a unified multiscale brain model to simulate the brain at all levels of structure and function, so that neuroscientists can connect genes, molecules, and cells with cognitive behavior. By furthering our understanding of the brain, it will open new avenues for the treatment of nervous system diseases and for the development of information technology [Amunts2016The].
The BRAIN Initiative is considered comparable to the Human Genome Project. Its goal is to record the activity of all neurons in the brain, draw detailed maps of neural circuits at the mesoscopic scale, and explore the relationships among neurons, neural circuits, and brain function [Martin2016The]. Subsequently, a worldwide surge of brain science research was launched. To date, major countries and organizations around the world have announced their own brain projects [Poo2016China; Committee2016Australian; Okano2016Brain; Jeong2016Korea], and Neuron, a leading journal of neuroscience, devoted a special issue to the global brain projects. Despite their different emphases, all these projects indicate that neuroscience increasingly focuses on functional integration across spatial scales; that is, the field has shifted from functional specialization to functional integration in complex neural networks
[bressler1995large; Rubinov2010Complex; sporns2011human; Siegel2012Spectral; Helfrich2016Spectral; Valk2017Socio]. Likewise, as we come to appreciate the complexity of the neural network of the human brain, and in light of prior knowledge such as the laws of thermodynamics, isolated, functionally specialized small networks in the human brain are unlikely to exist [Sporns2004Organization; Bullmore2009Complex]. Subnetworks must therefore connect to other networks in order to function within the complex whole. In short, establishing the neural circuits between networks is crucial for current neuroscience, and the directionality of causal networks needs to be explored more effectively [strogatz2001exploring; sporns2013making].

Since the concept of Granger causality was originally introduced by Wiener and Granger [wiener1956theory; Granger1969Investigating], Granger causality analysis (GCA) methods have become widely applied in neuroscience to diverse sources of data, including electroencephalography (EEG) [David2008Lexical; Olivier2008Identifying; van2014alpha; Bastos2015Visual], local field potentials (LFP) [brovelli2004beta; Wang2008Estimating; Dhamala2008Analyzing; Bastos2015Visual], magnetoencephalography (MEG) [David2008Lexical; Ploner2010Functional; florin2010effect], and functional magnetic resonance imaging (fMRI) [Roebroeck2005Mapping; sridharan2008critical; Menon2010saliency; Ryali2011Multivariate]. In these studies, the time series data are interpreted as reflecting neural activity from a source, and GCA is used to characterize the directionality, directness, and dynamics of influence between sources. Owing to its data-driven nature and simple mathematical form, GCA offers advantages for exploring directionality in complex neural networks, and the resulting causal connectivity network can be clarified more conveniently and simply. At the same time, GCA integrates historical information and does not require a priori specification of a network model.
These properties also make GCA well suited to identifying causal connectivities in the large-scale network studies that follow. However, causal connection analysis remains complicated and relatively immature. In current GCA research, the model orders of the historical information in model space are determined by AIC/BIC theory, while the causal effect of the external information is quantified by the F-statistic [Deshpande2010Multivariate; Guo2010Granger; Bressler2011Wiener; Li2010A]. This raises several concerns: possible inconsistencies between the selection criteria of the different mathematical theories, the subjectivity of the significance level in the F-statistic, and the complexity of the procedure. From a generalized perspective, this two-stage scheme can be viewed as a process that maps different information into the selected model spaces of GCA for quantitative comparison. Thus in conventional GCA, whether one combines AIC/BIC to regress historical information or regresses the exogenous information of external variables via the F-statistic, the two-stage scheme is a regression or model selection problem. Therefore, we need a consistent model selection method for GCA within the paradigm of the general regression model.
To keep model selection consistent in GCA, we propose a code-length-guided framework based on the MDL principle, which Rissanen first proposed to quantify the parsimony of a model [Rissanen1978Modeling], to map the two-stage scheme into the same model space. MDL has intellectual roots in the algorithmic or descriptive complexity theory of Kolmogorov, Chaitin, and Solomonoff [Li2008introduction]. By considering only the probability distribution as the basis for generating descriptions, Rissanen endowed MDL with a rich information-theoretic interpretation [shannon1948mathematical; Rissanen1983A; rissanen1984universal; Rissanen1986Stochastic; Rissanen1987Stochastic; rissanen1996fisher]. Owing to these characteristics, causality analyzed with the help of the MDL principle may accord better with physiological models. Moreover, model selection under a single mathematical principle throughout GCA, a point that is easily overlooked, ultimately determines the inferred working pattern of the human brain. The causal connectivity obtained by conventional GCA can therefore be misleading, and the main reason may be the splitting of model selection into a two-stage scheme. Our proposal incorporates the endogenous and the exogenous information into a unified model selection process, in which both kinds of information are quantified by conversion into code length to obtain causality. The two-stage scheme based on two different theories is thereby unified under a single mathematical framework, the MDL principle, which guarantees a single benchmark for the GCA methodology. Above all, selecting models for endogenous and exogenous information with our code-length-guided framework is meaningful for further understanding causal connections in the human brain.

At the same time, we explore the expression of the MDL principle applicable to different function spaces. Generally, models describing causal connectivities are selected in the time domain, where the resulting causal connections can appear complicated, whereas causal connectivities in the frequency domain may reveal the underlying working pattern more essentially. Oscillation is a ubiquitous feature of neurophysiological systems and neuroscience data. Building on Geweke's work [John1982Measurement], the Granger-Geweke method seems to provide neuroscientists with exactly what they want: an assessment of direct, frequency-dependent, functional influence between time series [brovelli2004beta; Ding2006Granger; Guo2008Uncovering; seth2010matlab; Barnett2014The; Stokes2017A].
Following this initial idea, we here analyze causality in the frequency domain with the help of our code-length-guided framework, which may be more consistent with physiological models of the human brain. Causal connections at specific frequencies between regions could be helpful for future clinical applications.
2 Material and methods
2.1 Problem Statement
Let $X_t$ and $Y_t$ be two stochastic and stationary time series. Now consider the following pair of (restricted and unrestricted) regression models

$X_t = \sum_{j=1}^{p} a_j X_{t-j} + \varepsilon_t$   (1)

$X_t = \sum_{j=1}^{p} a'_j X_{t-j} + \sum_{j=1}^{q} b_j Y_{t-j} + \eta_t$   (2)

where $p$ and $q$ are the model orders (the numbers of time lags) in (1) and (2), respectively, $a_j$, $a'_j$, and $b_j$ are the model coefficients, and $\varepsilon_t$ and $\eta_t$ are the residuals of the models. The order of the history predictor is usually determined with the AIC [Akaike1974new] or the BIC [Schwarz1978estimating].
The conventional GCA requires statistical significance to determine whether the unrestricted model provides a better prediction than the restricted model. The hierarchical F statistic, based on the extra sum-of-squares principle [Casella], can be used to evaluate significant predictability, given as

$F = \dfrac{(\mathrm{RSS}_r - \mathrm{RSS}_u)/q}{\mathrm{RSS}_u/(N - p - q - 1)}$

where $\mathrm{RSS}_r$ and $\mathrm{RSS}_u$ represent the sums of squared residuals of the restricted and the unrestricted model, respectively, and $N$ is the total number of observations used to estimate the unrestricted model. The $F$ statistic approximately follows an $F$ distribution with $(q, N - p - q - 1)$ degrees of freedom. If $F$ exceeds the critical value at the chosen significance level, the variability of the residual of the unrestricted model is significantly less than that of the restricted model; there is then an improvement in the prediction of $X$ due to $Y$, and we refer to this as a causal influence from $Y$ to $X$. In this bivariate model, however, spurious connections emerge frequently.
In order to remove spurious connections caused by indirect causalities between nodes, GCA also provides a measure of conditional causal connection by introducing another variable $Z_t$ into Eqs. (1) and (2):

$X_t = \sum_{j=1}^{p} a_j X_{t-j} + \sum_{j=1}^{q} b_j Y_{t-j} + \sum_{j=1}^{r} c_j Z_{t-j} + \varepsilon_t$   (3)

Then the causal influence from $Y$ to $X$, conditional on $Z$, is defined by the $F$ statistic comparing Eq. (3) with the corresponding restricted model that omits the lagged terms of $Y$:

$F_{Y \to X \mid Z} = \dfrac{(\mathrm{RSS}_{XZ} - \mathrm{RSS}_{XYZ})/q}{\mathrm{RSS}_{XYZ}/(N - p - q - r - 1)}$   (4)
Consider three model spaces, where each space comprises three models, and let $p$ and $q$ denote the model orders regarding the endogenous information and the exogenous information from other variables, respectively. We further assume that the competing models have the same residual variance. In the first space, the models specify only the regression model orders of the endogenous information by AIC/BIC, and the causal effect of the endogenous information is then specified by the F-statistic. The models in the second space specify the regression model orders of the exogenous effect by AIC/BIC, and then specify the causal effect of the exogenous information by the F-statistic. The models in the third space specify the regression model orders of the endogenous and the exogenous information separately, and then specify the causal effect between them by the F-statistic. In every case, model selection in conventional GCA is split into a two-stage scheme: the model orders are determined by AIC/BIC, and the causal effect is then quantified by the F-statistic. It is clear that the final inference might differ with the rules applied, even though the three classes are completely equivalent from a model description standpoint.

As we stated above, specifying the effects of the endogenous and the exogenous information in conventional GCA is split into a two-stage scheme. Specifying the regression model orders of the historical information, which covers the regression of endogenous information in Eq. (1) and the regression of both endogenous and exogenous information in Eq. (2), rests mainly on AIC/BIC theory, after which the causal effect of the exogenous information is specified by the F-statistic. However, specifying the two effects is the same kind of model selection problem, equivalent from the mathematical perspective. Different theories may generate different benchmarks in the two stages of model selection and therefore degrade the performance of GCA.
Aside from theoretical considerations, pairwise F statistics also raise several potential problems that might lead to misleading or unreasonable inferences in connectivity analysis. Firstly, the model comparison with F statistics is performed under a specific significance level, and the assignment of that level is a subjective matter in the conventional GCA process [Miao_PLoS11; Deshpande_NeuroImage11]. A significance level that is too low could admit false connections originating from noise, whereas one that is too high could erase actual connections. With no rule to justify the assignment of the significance level, F statistics will yield different connectivity results depending on the level chosen.
Secondly, the results of selection by pairwise F statistics depend heavily on the initially selected model and on the search path through the model space. Consider a collection of three models $M_1$, $M_2$, and $M_3$, and let $\sigma_1^2 > \sigma_2^2 > \sigma_3^2$ denote their residual variances, where the differences between adjacent variances fall within the interval satisfying statistical significance. The aim of the F statistic is to find, within the model space, the model appropriate to the observations, which is evidently $M_3$ in this case. One search path will arrive at the optimal model $M_3$, whereas if we start at $M_1$ and follow the path $M_1 \to M_2$, we reach the suboptimal model $M_2$; that is, the optimal model $M_3$ is never considered. The distinct results obtained by using F statistics for model selection along different search paths and from different initial models are given explicitly in Fig. 1. In fact, the F statistic cannot discriminate models whose residual variances lie within a specific range relative to the chosen significance level and the noise level. On the other hand, the F statistic uses the extra sum-of-squares principle to identify a better model, which means that only models with a nested relationship can be compared. Competing models must then be compared pairwise in an indirect way, through a reference model, namely the unrestricted model of the F statistic. Therefore, the comparison between two arbitrary models is not available in practical applications, and the search path relies on the structure of all candidate models. In such a situation the optimal model is not always guaranteed.
The third potential problem relates to computational complexity. Consider the simplest case, in which conditional causal connections are not taken into account. If there are several candidate models for any one directional causality in a network of many nodes, a sequence of pairwise tests must be performed for every directed pair of nodes, so the total number of comparisons grows rapidly with both the number of candidate models and the number of nodes. The computational burden is compounded by conditional GCA, and it becomes intractable when investigating large networks [Zou_BMCB10].
Although different approaches may be applied for model selection in GCA [Kaminski; Li_TMI11], the two stages remain unchanged and are generally based on two different mathematical theories in most GCA applications. As mentioned above, however, determining the model orders by AIC/BIC and quantifying the causal effects by the F-statistic can both be considered generalized model selection problems from a modeling standpoint. Since all the issues we discuss can be attributed to model selection, a more practical model selection method is needed.
The MDL is an information criterion that provides a generic solution to the model selection problem [bryant2000model]. As a broad principle, the MDL represents a completely different approach to model selection from traditional statistical approaches such as F statistics and the AIC or BIC. Compared with the AIC/BIC, the use of the MDL does not require any assumptions about the data generation mechanism; in particular, a prior probability distribution does not need to be assigned to each model class. The objective of model selection under the MDL is not to estimate an assumed but unknown distribution, but to find models that represent the data more realistically [Grunwald].

In general, the MDL principle strikes a balance between the complexity of a model and its fitting error, and discriminates between competing model classes based on the complexity of each description [Hansen_DECI99; Hansen2001Model]. Our proposal constructs a code-length-guided framework in which the model selection process of GCA is converted into a comparison of the code length of each model. The endogenous information and the exogenous information are uniformly encoded as code lengths, and MDL selects the model with the minimum total code length as the optimal model. To solve the potential problems of conventional GCA listed above, we take the MDL principle as the single mathematical framework for generalized model selection in GCA, ensuring consistency, objectivity, and parsimony.
2.2 The minimum description length principle
The principle of parsimony is the soul of model selection. To implement it, one must quantify the parsimony of a model relative to the available data. Building on the work of Kolmogorov [kolmogorov1965three; kolmogorov1968logical] and of Wallace and Boulton [Wallace1968An], Rissanen formulated MDL as a broad principle governing statistical modeling in general. At the beginning of modeling, all we have is the data. With the help of the algorithmic or descriptive complexity theory of Kolmogorov, Chaitin, and Solomonoff, MDL regards a probability distribution as a descriptive standpoint. MDL has connections with AIC and BIC and sometimes behaves similarly to them [Rissanen; Hansen2001Model]; the difference is that MDL fixes attention on the length function rather than on the code system. Therefore, many kinds of probability distribution can be compared in terms of their descriptive power [Cover2012elements]. Our code-length-guided framework can be used for generalized model selection as long as the model has a probability distribution.
2.2.1 Probability and idealized code length
In order to describe the MDL principle explicitly, we derive the formulas of MDL in different cases. In the model selection process of MDL, we compare the code lengths obtained from probability distributions, so it is essential to understand the relationship between probability and code length [Hansen2001Model; grunwald2007minimum].
A code $C$ on a set $\mathcal{X}$ is simply a mapping from $\mathcal{X}$ to a set of codewords. Let $\mathcal{X}$ be a finite set and let $P$ denote a probability distribution over the elements of $\mathcal{X}$. The idealized code length of $x \in \mathcal{X}$ is $L(x) = -\log_2 P(x)$, the negative logarithm of $P(x)$. For example, the Huffman code is one of the algorithms that constructs this relationship between probability and idealized code length [Cover2012elements]. Suppose that the elements of $\mathcal{X}$ are generated according to a known distribution $P$. Given a code $C$ on $\mathcal{X}$ with length function $L_C$, the expected code length of $C$ with respect to $P$ is defined as

$E_P[L_C(X)] = \sum_{x \in \mathcal{X}} P(x)\, L_C(x)$   (5)

If no codeword is the prefix of any other, unique decodability is guaranteed; any code satisfying this condition is called a prefix code. As is well known, for a prefix code the length function is equivalent to $-\log_2 Q(x)$ for some distribution $Q$ on $\mathcal{X}$.

By Shannon's source coding theorem, for any prefix code $C$ on $\mathcal{X}$ with length function $L_C$, the expected code length is bounded below by the entropy of $P$. That is,

$E_P[L_C(X)] \ge H(P) = -\sum_{x \in \mathcal{X}} P(x)\log_2 P(x)$   (6)

where equality holds if and only if $L_C(x) = -\log_2 P(x)$; in other words, the expected code length is minimized when the code is matched to $P$.
2.2.2 Crude twopart code MDL
In our view, modeling is a process of finding the regularity in data and compressing it. In model selection under the MDL principle, what we have to do is select a suitable model based on the probability distribution of the object. In general, a model picked without guidance is either overfitted or too simple; with model selection guided by the MDL principle, however, both the complexity term and the fitting-error term are incorporated into the code-length-guided framework, which ensures the objectivity of the procedure.
Several forms of the MDL principle now exist for polynomial and other types of hypothesis and model selection. In its original form, however, MDL divides the modeling of a data set into two parts: one part describes the model's self-information, and the other describes the data set with the help of the probability model chosen in the first part. Consequently, we first introduce the most common implementation of the idea, the two-part code version of MDL [grunwald2007minimum; Hansen2001Model].
Suppose we observe binary data $x^n = (x_1, \ldots, x_n)$ with $x_i \in \{0, 1\}$. The two-part code selects the hypothesis, a probability model $H$, minimizing the total description length

$L(H) + L(x^n \mid H)$   (7)

The aim is to select a reasonable model for $x^n$ that makes good predictions of future data coming from the same source; following [grunwald2007minimum], we model the data using the class of all Markov chains.

The first part
To get a better feel for the code, we consider two examples. First, let $P_\theta$ be a Bernoulli distribution with parameter $\theta \in [0, 1]$, and let $L(x^n \mid \theta) = -\log_2 P_\theta(x^n)$. Since the maximum likelihood estimate $\hat\theta$ is equal to the frequency of 1s in $x^n$, the first part is given as

$L(x^n \mid \hat\theta) = -n_1 \log_2 \hat\theta - n_0 \log_2 (1 - \hat\theta)$   (8)

where $n_j$ denotes the number of occurrences of symbol $j$ in $x^n$. Second, let $k \ge 1$; the $k$-th-order Markov chain model, denoted by $B^{(k)}$, is defined by parameters $\theta_{1 \mid s}$, the probability of observing a 1 after the context $s \in \{0, 1\}^k$. For all $s$,

$P_\theta(x^n) = P(x_1, \ldots, x_k) \prod_{i=k+1}^{n} \theta_{x_i \mid x_{i-k} \cdots x_{i-1}}$   (9)

and

$L(x^n \mid \theta) = L(x_1, \ldots, x_k) - \sum_{i=k+1}^{n} \log_2 \theta_{x_i \mid x_{i-k} \cdots x_{i-1}}$   (10)

Here $L(x_1, \ldots, x_k)$ is the number of bits needed to encode the first $k$ outcomes of $x^n$. The maximum likelihood parameters $\hat\theta_{1 \mid s}$ are equal to the conditional frequencies of 1 prefixed by $s$.
The second part

In order to describe a hypothesis $H$, we have to describe the pair $(k, \theta)$. We encode the parameters of the distribution model by first encoding $k$ using some prefix code $L_1$, and then encoding $\theta$ with the help of a prefix code $L_2$. The resulting code length for the model itself is then defined by $L(H) = L_1(k) + L_2(\theta \mid k)$. First, for the natural number $k \in \mathbb{N}$, a standard prefix code gives

$L_1(k) = 2\log_2 k + 1$   (11)

Since the parameter set $\Theta_k$ is an uncountably infinite set, we have to discretize it first. More precisely, we restrict $\Theta_k$ to some finite set $\ddot\Theta_k^{(d)}$ of parameters that can be described using a finite precision of $d$ bits per parameter. Since the $k$-th-order Markov model has $2^k$ parameters, the number of elements of $\ddot\Theta_k^{(d)}$ is $2^{d \cdot 2^k}$; thus $d \cdot 2^k$ bits are needed to describe any particular $\theta \in \ddot\Theta_k^{(d)}$, and $d$ may be called the precision used to encode a parameter. Letting $d$ be the smallest precision such that $\ddot\Theta_k^{(d)}$ contains a parameter sufficiently close to the maximum likelihood estimate, this gives

$L_2(\theta \mid k) = d \cdot 2^k$   (12)
Finally, the crude two-part code MDL for Markov chain hypothesis selection is given by

$\min_{k,\; \theta \in \ddot\Theta_k^{(d)}} \left[\, L_1(k) + d \cdot 2^k - \log_2 P_\theta(x^n) \,\right]$   (13)
2.3 Code length guided model selection in causality analysis
As stated above, conventional GCA splits the whole process into two stages, which in fact model the endogenous information and the exogenous information separately. Since MDL is closely related to information theory, causality analysis with MDL is also convincing and suitable here. The regression of the endogenous information can be converted to a code length, and the relative effect of the exogenous variables can likewise be quantified by the code-length-guided framework, so that the whole model selection process for GCA is unified in the same model space.
The following shows MDL-guided model selection in causality analysis [Rissanen1978Modeling]. A variable $x_t$ is given by the linear regression model

$x_t = \sum_{j=1}^{k} \beta_j z_{t,j} + \varepsilon_t$   (14)

where the parameter vector consists of the regression coefficients $\beta_1, \ldots, \beta_k$ and the variance parameter $\sigma^2$ of the zero-mean Gaussian distribution model for $\varepsilon_t$, and $n$, which can be any number larger than $k$ to keep the solution determined, is the data length. In order to describe the data, we turn to the Gaussian distribution of the residuals and arrive at $-\log P(x \mid \hat\beta, \hat\sigma^2) = \frac{n}{2} \log \frac{\mathrm{RSS}}{n} + \mathrm{const}$, where RSS denotes the residual sum of squares corresponding to the estimation in the model. Since this is a Gaussian distribution model, applying the two-part form of MDL gives the total code length as

$L = \frac{n}{2} \log \frac{\mathrm{RSS}}{n} + \frac{k}{2} \log n$   (15)

where the precision used to encode each parameter is chosen optimally, of order $1/\sqrt{n}$ [Rissanen1983A; Rissanen1989stochastic; Hansen2001Model]. Terms that are constant across models should be ignored.
2.3.1 Timedomain formulation
Combining the above formulas, we can proceed to causality analysis with the code-length-guided framework in the time domain. Given two time series $X_t$ and $Y_t$, we take two different models, $M_1$ and $M_2$ (Eq. (16) and Eq. (18)), to describe $X_t$. We have the autoregressive representations

$X_t = \sum_{j=1}^{p} a_{1j} X_{t-j} + \varepsilon_{1t}$   (16)

$Y_t = \sum_{j=1}^{p} d_{1j} Y_{t-j} + \eta_{1t}$   (17)

The bivariate regressive representations are given as

$X_t = \sum_{j=1}^{p} a_{2j} X_{t-j} + \sum_{j=1}^{p} b_{2j} Y_{t-j} + \varepsilon_{2t}$   (18)

$Y_t = \sum_{j=1}^{p} c_{2j} X_{t-j} + \sum_{j=1}^{p} d_{2j} Y_{t-j} + \eta_{2t}$   (19)

and their contemporaneous covariance matrix is

$\Sigma = \begin{pmatrix} \mathrm{var}(\varepsilon_{2t}) & \mathrm{cov}(\varepsilon_{2t}, \eta_{2t}) \\ \mathrm{cov}(\eta_{2t}, \varepsilon_{2t}) & \mathrm{var}(\eta_{2t}) \end{pmatrix}$

The fitting residuals $\varepsilon_{1t}$ and $\varepsilon_{2t}$ have Gaussian distributions with mean 0 and unknown variance, as do $\eta_{1t}$ and $\eta_{2t}$. Therefore, the distribution of the residual terms can serve as the descriptive standpoint of each model within MDL. At the same time, the code lengths obtained for models $M_1$ and $M_2$ can be compared to identify the causal influence between $X$ and $Y$; the whole model selection process of GCA is thus mapped into the unified code-length-guided framework. According to Eq. (15), the code lengths of model $M_1$ and model $M_2$ can be computed respectively. Following the definition of Granger causality, the influence from $Y$ to $X$ is defined in our code-length-guided framework as

$F_{Y \to X} = L(M_1) - L(M_2)$   (20)

where $L(M_1)$ denotes the code length of the optimal model in Eq. (16), and $L(M_2)$ denotes the code length of the optimal model in Eq. (18). If $F_{Y \to X} > 0$, a causal influence from $Y$ to $X$ exists; otherwise, there is no causal influence from $Y$ to $X$. As represented above, our proposal unifies the two-stage scheme in the code-length-guided framework, which avoids both the inconsistency of two different mathematical theories and the subjectivity of the F-statistic in model selection for GCA.
For comparison with conditional GCA, we consider the influence from $Y$ to $X$ while controlling for the effect of the conditional node $Z$ on $X$. Firstly, the joint autoregressive representation of $X$ and $Z$ is given by

$X_t = \sum_{j=1}^{p} a_{3j} X_{t-j} + \sum_{j=1}^{p} c_{3j} Z_{t-j} + \varepsilon_{3t}$   (21)

$Z_t = \sum_{j=1}^{p} e_{3j} X_{t-j} + \sum_{j=1}^{p} g_{3j} Z_{t-j} + \gamma_{3t}$   (22)

The noise covariance matrix for this model can be represented as

$\Sigma_3 = \begin{pmatrix} \mathrm{var}(\varepsilon_{3t}) & \mathrm{cov}(\varepsilon_{3t}, \gamma_{3t}) \\ \mathrm{cov}(\gamma_{3t}, \varepsilon_{3t}) & \mathrm{var}(\gamma_{3t}) \end{pmatrix}$

Then the autoregressive representation involving all three variables can be written as follows:

$X_t = \sum_{j=1}^{p} a_{4j} X_{t-j} + \sum_{j=1}^{p} b_{4j} Y_{t-j} + \sum_{j=1}^{p} c_{4j} Z_{t-j} + \varepsilon_{4t}$   (23)

$Y_t = \sum_{j=1}^{p} d_{4j} X_{t-j} + \sum_{j=1}^{p} e_{4j} Y_{t-j} + \sum_{j=1}^{p} f_{4j} Z_{t-j} + \eta_{4t}$   (24)

$Z_t = \sum_{j=1}^{p} g_{4j} X_{t-j} + \sum_{j=1}^{p} h_{4j} Y_{t-j} + \sum_{j=1}^{p} k_{4j} Z_{t-j} + \gamma_{4t}$   (25)

and the contemporaneous covariance matrix is

$\Sigma_4 = \begin{pmatrix} \mathrm{var}(\varepsilon_{4t}) & \mathrm{cov}(\varepsilon_{4t}, \eta_{4t}) & \mathrm{cov}(\varepsilon_{4t}, \gamma_{4t}) \\ \mathrm{cov}(\eta_{4t}, \varepsilon_{4t}) & \mathrm{var}(\eta_{4t}) & \mathrm{cov}(\eta_{4t}, \gamma_{4t}) \\ \mathrm{cov}(\gamma_{4t}, \varepsilon_{4t}) & \mathrm{cov}(\gamma_{4t}, \eta_{4t}) & \mathrm{var}(\gamma_{4t}) \end{pmatrix}$
By the definition of conditional GCA, the causal influence from $Y$ to $X$ conditioned on $Z$, if it exists, is defined as

$F_{Y \to X \mid Z} = L(M_{XZ}) - L(M_{XYZ})$   (26)

where $L(M_{XZ})$ is the code length of the optimal model in Eq. (21) and $L(M_{XYZ})$ is the code length of the optimal model in Eq. (23). Likewise, the causal influence from $Z$ to $X$ conditioned on $Y$, if it exists, is given as

$F_{Z \to X \mid Y} = L(M_{XY}) - L(M_{XYZ})$   (27)

where $L(M_{XY})$ is the code length of the optimal bivariate model in Eq. (18). Clearly, in our code-length-guided framework, the code lengths of different models do not require repeated pairwise comparison. Unlike the traditional method, in which the causal connection is obtained by repeated pairwise comparisons between models, our method maps all candidate models into the same model space and obtains the conditional causal influence directly. That is, both conditional influences exist together when

$F_{YZ \to X} = \min\{F_{Y \to X \mid Z},\; F_{Z \to X \mid Y}\} > 0$   (28)

If $F_{YZ \to X} > 0$, both $Y$ and $Z$ have a direct influence on $X$. If $F_{YZ \to X}$ is less than 0, there are two cases: either only $F_{Y \to X \mid Z} > 0$, meaning that only $Y$ has a direct influence on $X$, or only $F_{Z \to X \mid Y} > 0$, meaning that $Z$ impacts $X$ directly. In the unified model space, multiple candidate models can be compared directly by code length, which reduces the complexity of the algorithm. In this way, our proposal is more in line with Occam's razor, the principle of parsimony.
2.3.2 Frequencydomain formulation
With the help of Geweke's work [John1982Measurement], the total interdependence between two time series $X_t$ and $Y_t$ can be decomposed into three components: two directional causal influences due to their interaction patterns, and the instantaneous causality due to factors possibly exogenous to the system (e.g., a common driving input) [Guo2008Uncovering; Guo2010Granger]. Here we need another form of the regressive representations; we first rewrite Eq. (18) and Eq. (19) as

$\begin{pmatrix} a(L) & b(L) \\ c(L) & d(L) \end{pmatrix} \begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} \varepsilon_{2t} \\ \eta_{2t} \end{pmatrix}$   (29)

where $a(L)$, $b(L)$, $c(L)$, and $d(L)$ are polynomials in the lag operator $L$, with $L^j X_t = X_{t-j}$. Performing the Fourier transform on both sides of Eq. (29) leads to

$\begin{pmatrix} a(\omega) & b(\omega) \\ c(\omega) & d(\omega) \end{pmatrix} \begin{pmatrix} X(\omega) \\ Y(\omega) \end{pmatrix} = \begin{pmatrix} E_x(\omega) \\ E_y(\omega) \end{pmatrix}$   (30)

To remove the instantaneous correlation between the noise terms, we left-multiply the normalization matrix

$P = \begin{pmatrix} 1 & 0 \\ -\Sigma_{xy}/\Sigma_{xx} & 1 \end{pmatrix}$   (31)

on both sides of Eq. (30); rewriting the resulting equation in transfer-function form, the normalized equations yield

$\begin{pmatrix} X(\omega) \\ Y(\omega) \end{pmatrix} = \begin{pmatrix} H_{xx}(\omega) & H_{xy}(\omega) \\ H_{yx}(\omega) & H_{yy}(\omega) \end{pmatrix} \begin{pmatrix} \tilde{E}_x(\omega) \\ \tilde{E}_y(\omega) \end{pmatrix}$   (32)

where $H(\omega)$ is the transfer function and the normalized noise terms are uncorrelated. The spectral matrix is given by

$S(\omega) = H(\omega)\, \Sigma\, H^*(\omega)$   (33)

where $*$ denotes complex conjugate and matrix transpose. The spectrum of $X_t$ is found to be

$S_{xx}(\omega) = H_{xx}(\omega) \Sigma_{xx} H_{xx}^*(\omega) + H_{xy}(\omega) \tilde{\Sigma}_{yy} H_{xy}^*(\omega)$   (34)

where $\tilde{\Sigma}_{yy} = \Sigma_{yy} - \Sigma_{xy}^2 / \Sigma_{xx}$. The first term in Eq. (34) represents the intrinsic influence, and the second term the causal influence on $X$ due to $Y$ at frequency $\omega$. Based on this decomposition, the causal influence from $Y$ to $X$ at frequency $\omega$ is

$f_{Y \to X}(\omega) = \ln \dfrac{S_{xx}(\omega)}{H_{xx}(\omega) \Sigma_{xx} H_{xx}^*(\omega)}$   (35)
In conventional GCA, the model orders of the historical information in the frequency domain are again determined by AIC or BIC. Distinct from conventional GCA, we obtain the causal connectivities between nodes at frequency $\omega$ through our code-length-guided framework.
3 Result
3.1 Synthetic data experiment
To verify the performance of the MDL principle on synthetic data, we generated two data sets containing 3 nodes and 5 nodes, with the relationships shown in Fig. 2 and Fig. 6. Each node's time series contained 1000 data points; the initial value of each data set was 1, and Gaussian noise terms were added to ensure the fidelity of the data. The 3-node network was generated by
(36) 
As with the 3-node network, the 5-node network was generated as follows; it is worth mentioning that the first two time lags of each node were generated randomly,
(37) 
where the noise variance in Eq. (36) (Eq. (37)) varied from 0.15 (0.15) to 0.3 (0.35). At the same time, to keep the synthetic data stationary, we removed the first 700 data points and retained only the last 300 for each node.
The fitted Gaussian distribution of the residuals. (a): comparison of the distribution of the data to the normal distribution. (b): a histogram of the data values with a fitted normal density function.
3.2 Time domain case
Firstly, we considered the causal connections in the 3-node network of Fig. 2. To ensure the rigor of our proposal, we examined the distribution of the residuals between the selected model and the observed data. It closely followed a Gaussian distribution with mean 0, as seen in Fig. 3, which validates our initial descriptive standpoint. The code length as a function of the time lag in the autoregressive model is shown in Fig. 4; it drops to its minimum at a time lag of 2, in agreement with the generating model in Eq. (36).

We then validated the effectiveness of our proposal on the simple 3-node network of Fig. 2. To ensure the rigor of the simulation, we used 1000 samples. The causal connectivities identified by our proposal, obtained by comparing code lengths according to Eq. (15), are shown in Fig. LABEL:fig:3node_MDL, and the causal connectivities obtained by conventional GCA are shown in Fig. LABEL:fig:3node_cGCA. The causal influences from node 1 to node 2 and node 3 were identified both by GCA and by our proposal. For the other four possible causal influences, our proposal achieved almost 99% accuracy, whereas conventional conditional GCA achieved only 95%. At the same time, we evaluated the accuracy of the causal connections at each node and of the overall true model; the comparison between conventional GCA and our proposal is shown in Table 1. Clearly, our proposal was robust at different noise levels: whether identifying a single node's connections or the overall network, it maintained an accuracy above 96.5%.
Noise level   Method      Node 1   Node 2   Node 3   Total
Low           GCA (α₁)    82.3     71.8     72.1     65.4
              GCA (α₂)    90.9     85.5     85.5     82.0
              GCA (α₃)    98.1     96.0     96.2     95.2
              MDL         98.4     97.3     97.2     96.5
Moderate      GCA (α₁)    81.4     72.5     71.8     64.8
              GCA (α₂)    89.4     84.6     84.7     79.9
              GCA (α₃)    97.6     96.2     96.3     95.1
              MDL         98.5     97.7     97.4     96.8
High          GCA (α₁)    80.3     72.0     71.7     64.5
              GCA (α₂)    89.5     83.6     83.7     79.1
              GCA (α₃)    98.0     96.7     96.7     95.7
              MDL         98.8     98.0     97.6     97.2

Accuracy (%) of causal connection identification in the 3-node network. The variance of the low-noise data ranges from 0.15 to 0.35, the moderate level from 0.25 to 0.45, and the high level from 0.35 to 0.55. α₁, α₂, and α₃ denote the three significance levels of the F-test used in GCA. The accuracy of node i (i = 1, 2, 3) measures only the causal connectivities involving node i, not the connections between the other two nodes. The total accuracy is the rate at which the true model, with causal influences only from node 1 to node 2 and node 3, is recovered.
On the other hand, conventional GCA was very sensitive to the choice of confidence interval, especially in the identification of the overall network, although its performance was stable across noise levels. Comparing the accuracy on single nodes with that on the overall network at the same noise level and the same confidence interval, conventional GCA was unevenly distributed, whereas our proposal performed relatively well both on the whole network and on single edges. It is worth noting that the results obtained by conventional GCA at the most stringent significance level of the F-statistic were close to those of our proposal. This is in line with our expectations, because our proposal accounts for the complexity of the model more thoroughly. As we emphasized above, a single mathematical principle guides model selection throughout the GCA process, so our proposal is the more rigorous and more robust approach. When we instead varied the noise variance from (0.15–0.35) to (0.35–0.55), the resulting causal networks did not change, remaining the same as in Fig. 5.

To further verify validity and robustness, we analyzed the causal connections of the 5-node network with our proposal and compared it with conventional GCA at different significance levels. The connections between nodes in Fig. 6 are more complicated: for example, node 4 and node 5 influence each other directly, and node 1 influences nodes 3, 4, and 5 only indirectly, with no influence in the opposite direction. A more complicated network challenges the robustness of the methods. As in Eq. (36), the noise terms in Eq. (37) were Gaussian with mean 0; to confirm robustness, the noise variance in our simulation ranged from 0.15 to 0.5.
Edge               GCA                       Our proposal
true connection    1000/1000  (1000/1000)   1000/1000
                   1000/1000  (1000/1000)   1000/1000
                   1000/1000  (1000/1000)   1000/1000
3 -> 4             507/1000   (456/1000)    1000/1000
                   1000/1000  (1000/1000)   1000/1000
5 -> 3             929/1000   (780/1000)    1000/1000
false connection   36/1000    (8/1000)      4/1000
                   43/1000    (1/1000)      1/1000
                   11/1000    (0/1000)      0/1000
                   54/1000    (12/1000)     15/1000
                   58/1000    (14/1000)     15/1000
                   44/1000    (17/1000)     7/1000
                   15/1000    (2/1000)      2/1000
                   83/1000    (29/1000)     5/1000
                   30/1000    (8/1000)      14/1000
                   6/1000     (0/1000)      2/1000
                   5/1000     (1/1000)      1/1000
                   53/1000    (18/1000)     16/1000
                   17/1000    (1/1000)      2/1000
                   41/1000    (8/1000)      11/1000
Results over 1000 synthetic data samples from the 5-node network with the causal connectivities shown above. Causal connectivities between nodes were identified under low-variance noise (0.15–0.3) at the significance levels used for the F-test. "True connection" denotes a causal influence that truly exists in Fig. 6; "false connection" denotes one that does not.
The results for the 20 relationships between the 5 nodes are shown in Fig. 7; the causal influences identified by code length were largely consistent with the relationships in Fig. 6. The causalities identified by conventional GCA between the 5 nodes are also shown in Fig. 7. As in the 3-node network, our proposal showed 100% consistency with the relationships in Fig. 6 for directly connected nodes, as seen in Table 2, and for the other relationships in the 5-node network its accuracy was never below 98.4%. Conventional GCA did not show the same robustness: causal connectivities between directly related nodes were not well identified. For example, only 507 of 1000 samples were identified as carrying a causal influence from node 3 to node 4, and the causal influence from node 5 to node 3 was identified with only 92.9% accuracy. For the relationships without direct causal connectivity in the 5-node network, the accuracy of conventional GCA was also poorer than that of our proposal.
Meanwhile, in conditional causality analysis, our approach reduced the complexity of the algorithm, since it does not need repeated pairwise comparisons between models, while ensuring the accuracy of the results. When the target network in the simulation was more complicated, our proposal showed more desirable behavior in the time domain, while conventional GCA frequently made mistakes. Our proposal performed very well regardless of whether relationships between nodes existed, even in the more complicated network. More importantly, for both the causal influence from node 3 to node 4 and that from node 5 to node 3, the significance of the connection was not sufficient for identification, even at a stringent significance level in the F-statistics. This demonstrates that our proposal is not simply equivalent to conventional GCA with a high significance threshold in the F-statistics. As in the 3-node network, the resulting causal network was unaffected by varying the noise variance from (0.15–0.3) to (0.35–0.5).
3.3 Frequency domain case
Oscillations in neurophysiological systems and neuroscience data are thought to constrain and organize neural activity within and between functional networks across a wide range of temporal and spatial scales. Geweke–Granger causality analysis has demonstrated that oscillations at specific frequencies are associated with the activation or inactivation of different brain regions. In conventional GCA in the frequency domain, however, the model for the history information is selected by AIC or BIC, whereas in our code length guided framework the selected models for the history information regress into the model space automatically. Due to its intellectual roots in descriptive complexity and its close ties with information theory, our proposal may be more capable of identifying causality in the frequency domain. As in the time domain, the causal connectivities over frequency in the 3-node network obtained by our proposal are shown in Fig. 8.
In particular, the causal connectivities in the frequency domain were obtained between pairs of nodes, which means we did not introduce conditional GCA in the frequency domain. This is mainly because there still seems to be some obfuscation in applying the concept of conditional GCA in the frequency domain. Therefore, as shown in Fig. 8, some nonexistent connections between nodes were often misjudged, apart from the truly existing causal influences. We also found that the causal connectivities between node 2 and node 3 had a greater chance of being misjudged at low frequencies (0–10 Hz). In fact, compared with the results obtained by conventional GCA in the time domain, causalities between two nodes were more legible in the frequency domain, in the sense that only the direct causal influences showed consistently stable and significant results in our analysis.
Subsequently, we identified the causal connectivities in the 5-node network in the frequency domain with our proposal, as seen in Fig. 9. As in the 3-node network, the causal connection networks were identified without introducing conditional GCA, and they showed regular characteristics. Both direct and indirect causal influences in the 5-node network were identified more stably in the frequency domain. Similarly, since conditional GCA was not considered, other nonexistent causal connectivities had a chance of being identified, and in the 5-node network the possibility of misjudgment was even greater. Therefore, it is necessary to introduce conditional GCA to distinguish direct from indirect influences between system components in the frequency domain. At the same time, we found that removing the noise frequency components is an obstacle to causality analysis; the main reason is that we have no prior knowledge of which component in the raw data is noise, so the noise terms are hard to eliminate in the frequency domain.
3.4 Real data experiment
3.4.1 Data preparing
In this study we let ten subjects perform simple one-digit (consisting of 1–10) serial addition (SSA) and complex two-digit (consisting of 1–5) serial addition (CSA) under visual stimuli and simultaneously measured their brain activity with fMRI. Additionally, we asked the subjects to perform the same tasks under auditory stimuli and then compared the results of the two modalities. Nine right-handed healthy subjects (four female, 24 ± 1.5 years old) and one left-handed healthy female subject (24 years old) participated. One subject's (a right-handed male) data were discarded due to excessive head motion. All subjects volunteered to participate in this study and gave informed written consent.
The causal connection networks of subjects performing the same task should be the same or similar, which is also the basic assumption of group analysis. For mental calculation under different stimuli, the causal connection network of the same subject may differ at the input node of the stimulus, but we have reason to assume that the causal connections inside the mental calculation network should be the same, or at least similar. Thus, we compare the similarity of the causal connection networks for the same mental calculation task under different stimuli. More directly, in order to quantify which method is more robust, we compare the similarity of the resulting networks under auditory and visual stimuli for each subject. To this end, we defined a measure to quantify the similarity between networks,
$$S(A, B) = \frac{\sum_{i,j}\left(A_{ij} \wedge B_{ij}\right)}{\sum_{i,j}\left(A_{ij} \vee B_{ij}\right)} \qquad (38)$$

where $A$ and $B$ are the connection matrices of the mental calculation networks. The numerator represents the sum of the intersections of the connected edges, and the denominator the sum of their unions.
3.4.2 Mental calculation network
We have already verified the validity of MDL on simulated data, but behavior on real data determines the true robustness of the methods. The activation regions for mental calculation under the different stimuli are shown in Fig. 10, and, as assumed above, the activation regions of the two modalities were identical. The similarities for the 4-node connection network and the 6-node full connection network are shown in Fig. 11. In the 4-node network, the causal connection networks of mental calculation under visual and auditory stimuli were very similar for most subjects, except for subject 6: for seven of the nine subjects the similarities were above 0.6. For conventional GCA, only three subjects had causal connection networks with a similarity above 0.6 between the two stimuli. Only for subject 1 and subject 9 did the similarity from conventional GCA exceed that of our proposal, and even for these subjects the similarities of our proposal were close to those of conventional GCA, especially for subject 1. The similarities for the 6-node full connection network are also shown in Fig. 11. Due to the difference at the input node of the stimulus, the similarities of the two methods were mostly below 0.5, but for three subjects the similarities from our proposal were still above 0.5, especially for subject 7. Clearly, whether within the mental calculation network or across the 6-node full network, the similarities identified by our proposal were higher than those of conventional GCA for most subjects.
The causal connection networks of two subjects during mental calculation are shown in Fig. 12 and Fig. 13 respectively. As stated above, the input node of the stimulus should not be included in the networks to be compared, so we removed the causal connections between the stimulus node and the other four nodes of the mental calculation network. Our proposal also showed desirable robustness on the fMRI data: as seen in Fig. 12 and Fig. 13, for subject 2 and subject 7 the mental calculation networks under the two different stimuli were almost identical. The causal connection networks of the same two subjects identified by conventional conditional GCA are also shown in Fig. 12 and Fig. 13. Compared with the networks identified by our proposal, the results obtained by conventional GCA were more irregular. Consistent with the simulations, our proposal was more robust in identifying the causal connection network, especially in complex networks.
4 Discussion
As we have stated above, in conventional GCA the AIC or BIC criterion determines the regression model order for the historical information, while the causal effect of exogenous information is established by comparing the residuals of the two candidate models with a statistical F-test. This raises several potential problems. (1) The BIC/AIC theory and the statistical F-test belong to different mathematical theories. Different frameworks impose different selection criteria, so the two benchmarks may select completely inconsistent models. (2) The choice of confidence interval in statistical tests is highly subjective. In principle, the discovery and verification of scientific laws call for different confidence intervals, and the criteria for choosing them differ across research fields. Unfortunately, neuroimaging studies lack such criteria, so the choice of confidence interval is largely arbitrary and varies widely between studies. (3) The optimal model is selected by repeated pairwise comparisons between models, and a larger candidate global model space needs to be searched to avoid dependence on the initially selected model and the search path, all of which increases computational complexity. Essentially, regressing the historical information with AIC/BIC maps the historical information into a model space for quantitative comparison, and the F-statistic likewise quantitatively compares the information about the causal effect of exogenous variables in another model space. As the formulas below show, the two mathematical theories each regress the information into their own framework to obtain causality; that is, GCA is a model selection process over different kinds of information.
$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}$$

$$F = \frac{(\mathrm{RSS}_r - \mathrm{RSS}_u)/(p_u - p_r)}{\mathrm{RSS}_u/(n - p_u - 1)}$$

where $\hat{L}$ represents the likelihood of the tested model evaluated at the maximum likelihood estimate $\hat{\theta}$ of its parameters, $k$ denotes the number of model parameters, and $n$ is the sample size. $\mathrm{RSS}_r$ and $\mathrm{RSS}_u$ represent the sums of squared residuals of the restricted and unrestricted models respectively, and $p_r$ and $p_u$ represent the maximum lag orders of the restricted and unrestricted models respectively.
Therefore, we hold to a single mathematical framework for model selection in GCA to distill the considerations above. Our code length guided framework converts the conventional two-stage scheme into a unified model space, which keeps the mathematical theory of model selection consistent. Compared with conventional GCA, model selection in the causal connection network automatically regresses to the optimal solution of the model space by comparing the code lengths of models, unlike the model selected subjectively via the F-statistics of conventional GCA. Meanwhile, our proposal specifies the absolute quality of each model, so repeated pairwise comparisons of models are unnecessary. Consequently, the causality analysis of our proposal is stricter and more rigorous.
Finally, the above method of quantifying similarity is just one of many possible metrics; it only measures similarity in a specified space. Moreover, it only measures the similarity of single connected edges in a causal network and does not consider deeper similarity across multiple connected edges. In other words, two networks with the same similarity score may still differ in structure. We applied this method because of its simplicity, but a more self-contained metric space would be desirable.
5 Conclusion
The feasibility and efficacy of the proposed strategy have been demonstrated with simulated and measured fMRI data, in comparison with conventional GCA. Regardless of whether the network is simple or complex, our proposal can always identify the causal connections between nodes and is more robust. However, our proposal seems to require the introduction of conditional GCA in the frequency domain, because the current method cannot there distinguish direct connections from indirect or spurious ones, even though our proposal showed high consistency with the true connections in the network. Our proposal also showed good robustness on real data, except for misjudged connectivities in a few samples. The noise term is one reason for these misjudgments. Moreover, the regions of interest (ROIs) we picked for mental calculation are regions of the cerebrum rather than point-like nodes as in our simulations; regions cover a larger extent, so hidden nodes between regions are more likely. On the whole, however, our method showed superior performance in identifying causal connectivities and should be desirable for determining neural circuits.
In most statistical work, modeling and estimation are about extracting information from often chaotic-looking data to learn what makes the data behave the way they do, and the different information extracted by different models is the key to describing a data set. The same holds for the model selection process in GCA; we therefore picked the MDL principle to guide model selection for GCA, since the MDL principle provides a unified code length guided framework for identifying causal connections. Unlike model selection in conventional GCA, we emphasized that a single mathematical theory should hold throughout GCA. Fundamentally, MDL has its roots in the descriptive complexity theory of Kolmogorov and developed heavily with the help of Shannon's information theory. In general, our proposal can guarantee the consistency of the mathematical theory, the objectivity of the criteria and the parsimony of the procedure, which means it may be more likely to conform to the physiological model of how the brain works. Indeed, it is worth noting the effect of inconsistent mathematical theories on experimental results throughout data analysis and statistical modeling.
Acknowledgments
This work was supported in part by the grant from the National Key R&D Program of China (2018YFA0701400), in part by the National Basic Research Program of China under Grant 2013CB329501, in part by the National Natural Science Foundation of China under Grant 81271645, in part by the Public Projects of the Science Technology Department of Zhejiang Province under Grant 2013C33162, in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LY12H18004, and in part by the Fundamental Research Funds for the Central Universities, 2019QNA5026.
References
 (1) K. Amunts, C. Ebell, J. Muller, M. Telefont, A. Knoll, T. Lippert, The human brain project: Creating a european research infrastructure to decode the human brain, Neuron 92 (3) (2016) 574–581.
 (2) C. L. Martin, M. Chun, The brain initiative: Building, strengthening, and sustaining, Neuron 92 (3) (2016) 570–573.
 (3) M. M. Poo, J. L. Du, N. Y. Ip, Z. Q. Xiong, B. Xu, T. Tan, China brain project: Basic neuroscience, brain diseases, and braininspired computing, Neuron 92 (3) (2016) 591–596.
 (4) A. B. A. S. Committee, Australian brain alliance, Neuron 92 (3) (2016) 597–600.
 (5) H. Okano, E. Sasaki, T. Yamamori, A. Iriki, T. Shimogori, Y. Yamaguchi, K. Kasai, A. Miyawaki, Brain/minds: A japanese national brain project for marmoset neuroscience, Neuron 92 (3) (2016) 582–590.
 (6) S. J. Jeong, H. Lee, E. M. Hur, Y. Choe, J. W. Koo, J. C. Rah, K. J. Lee, H. H. Lim, W. Sun, C. Moon, Korea brain initiative: Integration and control of brain functions, Neuron 92 (3) (2016) 607–611.
 (7) S. L. Bressler, Large-scale cortical networks and cognition, Brain Research Reviews 20 (3) (1995) 288–304.
 (8) M. Rubinov, O. Sporns, Complex network measures of brain connectivity: Uses and interpretations, Neuroimage 52 (3) (2010) 1059–1069.
 (9) O. Sporns, The human connectome: a complex network, Annals of the New York Academy of Sciences 1224 (1) (2011) 109–125.
 (10) M. Siegel, T. H. Donner, A. K. Engel, Spectral fingerprints of large-scale neuronal interactions, Nature Reviews Neuroscience 13 (2) (2012) 121–134.
 (11) R. F. Helfrich, H. Knepper, G. Nolte, M. Sengelmann, P. König, T. R. Schneider, A. K. Engel, Spectral fingerprints of large-scale cortical dynamics during ambiguous motion perception, Human Brain Mapping 37 (11) (2016) 4099–4111.
 (12) S. L. Valk, B. C. Bernhardt, A. Böckler, M. Trautwein, T. Singer, Sociocognitive phenotypes differentially modulate largescale structural covariance networks, Cerebral Cortex 27 (2).
 (13) O. Sporns, D. R. Chialvo, M. Kaiser, C. C. Hilgetag, Organization, development and function of complex brain networks, Trends in Cognitive Sciences 8 (9) (2004) 418–425.
 (14) E. Bullmore, O. Sporns, Complex brain networks: graph theoretical analysis of structural and functional systems, Nature Reviews Neuroscience 10 (3) (2009) 186–198.
 (15) S. H. Strogatz, Exploring complex networks, Nature 410 (6825) (2001) 268.
 (16) O. Sporns, Making sense of brain network data, Nature Methods 10 (6) (2013) 491.
 (17) N. Wiener, The theory of prediction, Modern Mathematics for Engineers.
 (18) C. W. J. Granger, Investigating causal relations by econometric models and cross-spectral methods, Econometrica 37 (3) (1969) 424–438.
 (19) D. W. G. Jr., J. A. Segawa, S. P. Ahlfors, F. H. Lin, Lexical influences on speech perception: A granger causality analysis of meg and eeg source estimates, Neuroimage 43 (3) (2008) 614–623.
 (20) D. Olivier, G. Isabelle, S. Sandrine, R. Sebastien, D. Colin, S. Christoph, D. Antoine, V. S. Pedro, Identifying neural drivers with functional mri: An electrophysiological validation, PLoS Biology 6 (12) (2008) 2683–2697.
 (21) T. Van Kerkoerle, M. W. Self, B. Dagnino, M. A. Gariel-Mathis, J. Poort, C. Van Der Togt, P. R. Roelfsema, Alpha and gamma oscillations characterize feedback and feedforward processing in monkey visual cortex, Proceedings of the National Academy of Sciences 111 (40) (2014) 14332–14341.
 (22) A. Bastos, J. Vezoli, C. A. Bosman, J. M. Schoffelen, R. Oostenveld, J. R. Dowdall, P. Deweerd, H. Kennedy, P. Fries, Visual areas exert feedforward and feedback influences through distinct frequency channels, Neuron 85 (2) (2015) 390–401.
 (23) A. Brovelli, M. Ding, A. Ledberg, Y. Chen, R. Nakamura, S. L. Bressler, Beta oscillations in a large-scale sensorimotor cortical network: directional influences revealed by granger causality, Proceedings of the National Academy of Sciences 101 (26) (2004) 9849–9854.
 (24) X. Wang, Y. Chen, M. Ding, Estimating granger causality after stimulus onset: A cautionary note, Neuroimage 41 (3) (2008) 767–776.
 (25) M. Dhamala, G. Rangarajan, M. Ding, Analyzing information flow in brain networks with nonparametric granger causality, Neuroimage 41 (2) (2008) 354–362.
 (26) M. Ploner, J. M. Schoffelen, A. Schnitzler, J. Gross, Functional integration within the human pain system as revealed by granger causality, Human Brain Mapping 30 (12) (2010) 4025–4032.
 (27) E. Florin, J. Gross, J. Pfeifer, G. R. Fink, L. Timmermann, The effect of filtering on granger causality based multivariate causality measures, Neuroimage 50 (2) (2010) 577–588.
 (28) A. Roebroeck, E. Formisano, R. Goebel, Mapping directed influence over the brain using granger causality and fmri, Neuroimage 25 (1) (2005) 230–242.
 (29) D. Sridharan, D. J. Levitin, V. Menon, A critical role for the right fronto-insular cortex in switching between central-executive and default-mode networks, Proceedings of the National Academy of Sciences 105 (34) (2008) 12569–12574.
 (30) V. Menon, L. Q. Uddin, Saliency, switching, attention and control: A network model of insula function, Brain Structure and Function 214 (5-6) (2010) 655–667.
 (31) S. Ryali, K. Supekar, T. Chen, V. Menon, Multivariate dynamical systems models for estimating causal interactions in fmri, Neuroimage 54 (2) (2011) 807–823.
 (32) G. Deshpande, S. Laconte, G. A. James, S. Peltier, X. Hu, Multivariate granger causality analysis of fmri data, Human Brain Mapping 30 (4) (2010) 1361–1373.
 (33) S. Guo, C. Ladroue, J. Feng, Granger Causality: Theory and Applications, 2010.
 (34) S. L. Bressler, A. K. Seth, Wiener–granger causality: A well established methodology, Neuroimage 58 (2) (2011) 323–329.
 (35) X. Li, G. Marrelec, R. F. Hess, H. Benali, A nonlinear identification method to study effective connectivity in functional mri, Medical Image Analysis 14 (1) (2010) 30–38.
 (36) J. Rissanen, Modeling by shortest data description, Automatica 14 (5) (1978) 465–471.
 (37) M. Li, P. Vitányi, et al., An introduction to Kolmogorov complexity and its applications., Vol. 3, Springer, 2008.
 (38) C. E. Shannon, A mathematical theory of communication, Bell system technical journal 27 (3) (1948) 379–423.
 (39) J. Rissanen, A universal prior for integers and estimation by minimum description length, Annals of Statistics 11 (2) (1983) 416–431.
 (40) J. Rissanen, Universal coding, information, prediction, and estimation, IEEE Transactions on Information theory 30 (4) (1984) 629–636.
 (41) J. Rissanen, Stochastic complexity and modeling, Annals of Statistics 14 (3) (1986) 1080–1100.
 (42) J. Rissanen, Stochastic complexity, Journal of the Royal Statistical Society 49 (3) (1987) 223–239.
 (43) J. J. Rissanen, Fisher information and stochastic complexity, IEEE transactions on information theory 42 (1) (1996) 40–47.
 (44) J. Geweke, Measurement of linear dependence and feedback between multiple time series, Publications of the American Statistical Association 77 (378) (1982) 304–313.
 (45) M. Ding, Y. Chen, S. L. Bressler, Granger causality: Basic theory and application to neuroscience, Quantitative Biology (2006) 826–831.
 (46) S. Guo, J. Wu, M. Ding, J. Feng, Uncovering interactions in the frequency domain, Plos Computational Biology 4 (5) (2008) e1000087.
 (47) A. K. Seth, A matlab toolbox for granger causal connectivity analysis, Journal of neuroscience methods 186 (2) (2010) 262–273.
 (48) L. Barnett, A. K. Seth, The mvgc multivariate granger causality toolbox: A new approach to grangercausal inference, Journal of Neuroscience Methods 223 (2014) 50–68.
 (49) P. A. Stokes, P. L. Purdon, A study of problems encountered in granger causality analysis from a neuroscience perspective, Proc Natl Acad Sci U S A 114 (34) (2017) E7063.
 (50) H. Akaike, A new look at the statistical model identification, in: Selected Papers of Hirotugu Akaike, Springer, 1974, pp. 215–222.
 (51) G. Schwarz, et al., Estimating the dimension of a model, The Annals of Statistics 6 (2) (1978) 461–464.
 (52) G. Casella, R. L. Berger, Statistical Inference, 2nd Edition, Duxbury Resource Center, Pacific Grove, CA, USA, 2001.
 (53) X. Y. Miao, X. Wu, R. Li, K. W. Chen, L. Yao, Altered connectivity pattern of hubs in default-mode network with Alzheimer's disease: A Granger causality modeling approach, PLoS One 6 (10) (2011) e25546.
 (54) G. Deshpande, P. Santhanam, X. P. Hu, Instantaneous and causal connectivity in resting state brain networks derived from functional mri data, NeuroImage 54 (2) 1043–1052.
 (55) C. L. Zou, C. Ladroue, S. X. Guo, J. F. Feng, Identifying interactions in the time and frequency domains in local and global networks - a Granger causality approach, BMC Bioinformatics 11 (337) (2010) 1–17.
 (56) M. Kaminski, Multichannel Data Analysis in Biomedical Research., 2nd Edition, Duxbury Resource Center, Pacific Grove, CA, USA, 2007.
 (57) X. F. Li, D. Coyle, L. Maguire, T. M. McGinnity, H. Benali, A Model Selection Method for Nonlinear System Identification Based fMRI Effective Connectivity Analysis., IEEE Transactions on Medical Imaging 30 (7) (2011) 1365–1380.
 (58) P. G. Bryant, O. I. CorderoBrana, Model selection using the minimum description length principle, The American Statistician 54 (4) (2000) 257–268.
 (59) P. D. Grünwald, I. J. Myung, M. A. Pitt, Advances in Minimum Description Length: Theory and Applications, Neural Information Processing series, The MIT Press, Cambridge, Massachusetts, 2005.
 (60) M. Hansen, B. Yu, Bridging AIC and BIC: An MDL Model Selection Criterion., in: Worskhop on Detection, Estimation, Classification, and Imaging, Santa Fe, NM, USA, 1999, pp. 24–26.
 (61) M. H. Hansen, B. Yu, Model selection and the principle of minimum description length, Publications of the American Statistical Association 96 (454) (2001) 746–774.
 (62) A. N. Kolmogorov, Three approaches to the quantitative definition of information, Problems of information transmission 1 (1) (1965) 1–7.

 (63) A. Kolmogorov, Logical basis for information theory and probability theory, IEEE Transactions on Information Theory 14 (5) (1968) 662–664.
 (64) C. S. Wallace, D. M. Boulton, An information measure for classification, Computer Journal 11 (2) (1968) 185–194.
 (65) J. Rissanen, Information and Complexity in Statistical Modeling, Information Science and Statistics, Springer Verlag, New York, NY, USA, 2005.
 (66) T. M. Cover, J. A. Thomas, Elements of Information Theory, New York: Wiley, 1991.
 (67) P. D. Grünwald, The minimum description length principle, MIT press, 2007.
 (68) J. Rissanen, Stochastic complexity in statistical inquiry, World Scientific, 1989.