I Introduction
Today’s scientific world produces a vastly growing and technologydriven abundance of data across all research fields from observations of natural processes to economic data science2011 . To test or generate hypotheses on interdependencies between processes underlying the data, statistical measures of association are needed. Recently, Reshef et al. Reshef2011 put forward two key demands such a measure should fulfill in the bivariate case: (1) generality, i.e., the measure should not be restricted to certain types of associations like linear measures, and (2) equitability
, which means that the measure should reflect a certain heuristic notion of coupling strength, i.e., it should give similar scores to equally noisy dependencies. The latter is especially important for comparisons and ranking of the strength of dependencies. In this article we generalize this idea to multivariate data as needed to reconstruct interaction networks in the fields of neuroscience, genetics, climate, ecology and many more. For the multivariate case we propose to add two more basic properties: (3)
causality, which means that the measure should give a nonzero value only to the dependency between lagged components of a multivariate process that are not independent conditional on the remaining process. (4) coupling strength autonomy, implying that also for dependent components we seek for a causal notion of coupling strength that is well interpretable, in that it is uniquely determined by the interaction of the two components alone and in a way autonomous of their interaction with the remaining process. To understand this, consider a simple example: Suppose we have two interacting processes and and a third process , that drives both of them. Then a bivariate measure of coupling strength between and will be influenced by the common input of , while our demand is, that the measure should be autonomous of the interactions of and with . In an experimental setting this corresponds to keeping fixed and solely measuring the impact of a change in on averaged over all realizations of . This property can be regarded as one ingredient of a multivariate extension of the equitability property. Last, we also demand that the measure should be defined in a way that is practically computable, in that the estimation does not, e.g., require somewhat arbitrary truncations like in the case of transfer entropy Schreiber2000 . Due to these properties our approach can be used to reconstruct interaction networks where not only the links are causal, but are also meaningfully weighted and have the attribute of a coupling delay. This serves as an important feature in inferring physical mechanisms from interpreting interaction networks.The first requirement, generality, is fulfilled by any information theoretic measure like mutual information (MI) and conditional mutual information (CMI) Cover2006 . These measures also fulfill the axioms for dependency measures proposed in Schweizer1981 . Additionally to generality, the authors in Reshef2011 demonstrate that their algorithmically motivated maximal information coefficient fulfills the property of equitability. However, apart from issues with statistical power Gorfine2012 , a crucial drawback of their measure is, that it is not clear how to extend it to the multivariate case. There are few works considering a concept of coupling strength in the multivariate context of causality. In Jachan2009a ; Schelter2009 this problem is approached in the linear framework of partial directed coherence and in Chen2004 ; Marinazzo2008 using the less restricted, yet still modelbased, concept of Granger causality, all sharing the problem that the model might be misspecified. Transfer entropy (TE) Schreiber2000 is the informationtheoretic analogue of Granger causality Barnett2009 and the issue of arbitrary truncations has been addressed in Faes2011 and in our previous article Runge2012prl . Still the problem with TE is that it is not lagspecific which can lead to false interpretations like in the case of feedbacks ay2008information and, as we will demonstrate analytically and numerically in this article, it is not uniquely determined by the interaction of the two components alone and depends on misleading effects of, e.g., autodependency and the interaction with other processes. In essence, it does not fulfill the proposed property of coupling strength autonomy. In Janzing2012 another informationtheoretic approach, based on a different set of postulates, is discussed.
Our approach to a measure of a causal coupling strength is based on the fundamental concept of source entropy Shannon1963 and for the special case of bivariate ordinal pattern time series the momentary information transfer (MIT) has been introduced recently in Pompe2011 . In this article we utilize the concept of graphical models to mathematically formalize and generalize MIT to the multivariate case. We demonstrate that MIT is practically computable and fulfills the properties of generality, causality and coupling strength autonomy, while the more complex property of equitability will only partially be addressed here.
The determination of a causal coupling strength in our approach is a twostep process. In the first step the graphical model is estimated as detailed in Runge2012prl which determines the existence or absence of a link and thus of a causality between lagged components of the multivariate process. The second step – the main topic of the present paper – is the estimation of MIT as a meaningful weight for every existing link in the graph.
The article is organized as follows. In Sect. II we define and review TE and the decomposed transfer entropy introduced in Runge2012prl . In Sect. III we introduce the important concept of graphical models and in Sect. IV we define MIT and related measures. All of these measures are compared analytically (Sect. V), leading to the coupling strength autonomy theorem (Sect. VI), and numerically (Sect. VII). Finally, we discuss limitations (Sect. VIII) and provide an application to climatological data that shows the potential of our approach (Sect. IX). The appendices provide proofs and further discussions.
Ii Transfer Entropy and the curse of dimensionality
Before introducing MIT, we will discuss the wellknown TE and its shortcomings. We will focus on multivariate time series generated by discretetime stochastic processes and use the following notation: Given a stationary multivariate discretetime stochastic process , we denote its uni or multivariate subprocesses
and the random variables at time
as . Their pasts are defined as and . For convenience, we will often treat , , , and as sets of random variables, so that, e.g., can be considered a subset of . Now the TE [see Fig. 1(a)](1) 
is the reduction in uncertainty about when learning the past of , if the rest of the past of , given by , is already known (where “” denotes the subtraction of a set). Note that, because of the assumed stationarity, is independent of . TE measures the aggregated influence of
at all past lags and is not lagspecific. The definition of TE leads to the problem that infinitedimensional densities have to be estimated, which is commonly called the “curse of dimensionality”. In the usual naive estimation of TE the infinite vectors are simply truncated at some
leading to(2) 
where (correspondingly for ) and has to be chosen at least as large as the maximal coupling delay between and , which can lead to very large dimensions. In our numerical experiments we will demonstrate that the choice of a truncation lag , which affects the estimation dimension via (where is the number of processes), has a strong influence on the value of TE and affects the reliability of causal inference. This is a huge disadvantage because the coupling delay should not have an influence on the measured coupling strength.
In Runge2012prl the problem of high dimensionality is overcome by utilizing the concept of graphical models that will be introduced in the next section. In this framework a decomposed transfer entropy (DTE) is derived that enables an estimation using finite vectors
(3) 
for a certain finite set [see Fig. 1(b)] and with chosen as the smallest for which the estimated remainder is smaller than some given threshold. Another approach to find a truncation is described in Faes2011 . While thereby the somewhat arbitrary truncation lag is avoided and the estimation dimension is drastically reduced, it can still be quite high (in the still rather simple model example of Runge2012prl the maximum dimension was 24).
The summands in Eq. (3) can be seen as the contributions of different lags to TE, but should not be interpreted as lagspecific causal contributions because they can be nonzero also for lags for which there is no link in the graph. Finally, apart from the issue of high dimensionality and lagspecific causality, we will demonstrate in Sect. V that TE or DTE also do not fulfill the proposed coupling strength autonomy property. In the next section we introduce the important concept of graphical models from which we derive MIT and related measures.
Iii Graphical Models and Causality
In the graphical model approach lauritzen1996graphical ; Dahlhaus2000 ; Eichler2011 the conditional independence properties of a multivariate process are visualized in a graph, in our case a time series graph. This graph thus encodes the lagspecific causality with respect to the observed process. As depicted in Figs. 1 and 2(b), each node in that graph represents a single random variable, i.e., a subprocess, at a certain time . Nodes and are connected by a directed link “” pointing forward in time if and only if and
(4) 
i.e., if they are not independent conditionally on the past of the whole process, which implies a lagspecific causality with respect to . If we say that the link “” represents a coupling at lag , while for it represents an autodependency at lag . Nodes and are connected by an undirected contemporaneous link (visualized by a line) Eichler2011 if and only if
(5) 
where also the contemporaneous present is included in the condition. In the case of a multivariate autoregressive process as defined later in Eq. (40), this definition corresponds to nonzero entries in the inverse covariance matrix of the innovations . Note that stationarity implies that “” whenever “” for any .
Like TE, the CMIs given by Eq. (4) and (5) involve infinitedimensional vectors and can thus not be directly computed, but only involving truncations. As shown in Sect. VII, this measure therefore suffers from the problem of high dimensionality and also theoretically does not fulfill the coupling strength autonomy property as analyzed in Sect. V.
On the other hand, one can exploit the Markov property and use the finite set of parents defined as
(6) 
of [blue box in Fig. 2(b)] which separate from the past of the whole process . The parents of all subprocesses in together with the contemporaneous links comprise the time series graph. In Runge2012prl an algorithm for the estimation of these time series graphs by iteratively inferring the parents is introduced. In the Supplementary Material of Runge2012prl we also describe a suitable shuffle test and a detailed numerical study on the detection and false positive rates of the algorithm. The Markov properties hold for models sufficing the very general condition (S) in Eichler2011 .
The determination of a causal coupling strength now is a twostep procedure. In the first step the time series graph is estimated as detailed in Runge2012prl which determines the existence or absence of a link and thus of a causality between lagged components of . The second step is the determination of a meaningful weight for every existing link in the graph. The MIT introduced in the next section is intended to serve this aim by attributing a well interpretable coupling strength solely to the inferred links of the time series graph.
Iv Momentary information transfer and source entropy
The parents of a subprocess at a certain time are key to understand the underlying concept of source entropy. Each univariate subprocess of a stationary multivariate discretetime stochastic process will at each time yield a realization . The entropy of measures the uncertainty about before its observation, and it will in general be reduced if a realization of the parents is known. But for a nondeterministic process, and most real data will at least contain some random noise, there will always be some “surprise” left when observing . This surprise gives us information and the expected information is called the source entropy of . Now the MIT between at some lagged time in the past and at time is the CMI that measures the part of source entropy of that is shared with the source entropy of :
(7) 
This approach of “isolating source entropies” is sketched in a Venn diagram in Fig. 2(a). The attribute momentary Pompe2011
is used because MIT measures the information of the “moment”
in that is transferred to . This “momentariness” is closely related to the property of coupling strength autonomy as we will show in the next sections. Similarly to the definition of contemporaneous links in Eq. (5), we can also define a contemporaneous MIT(8) 
where denotes the contemporaneous neighbors given by
(9) 
and correspondingly for and their parents. Due to Markov properties the contemporaneous MIT is equivalent to the formula defining contemporaneous links Eq. (5). This is, however, not the case for the lagged MIT. Like any (C)MI, MIT is sensitive to any kind of statistical association and therefore guarantees the property of generality. Because MIT uses the parents as conditions, it also fulfills the property of lagspecific causality in that it is nonzero only for lagged processes that are not independent conditional on .
As related measures, we can also choose either one of the parents as a condition, which – dropping the attribute “momentary” – leads to the information transfers ITY and ITX
(10)  
(11) 
ITY isolates only the source entropy of . Like MIT it is nonzero only for dependent nodes (and therefore fulfills the properties of generality and causality) and used in the algorithm to estimate the time series graph Runge2012prl . ITX measures the part of source entropy in that reaches on any path and is, thus, not a causal measure, yet in many situations we might only be interested in the effect of on , no matter how this influence is mediated. For these three CMIs are related by the inequality
(12) 
which holds under the “no sidepath”constraint as specified in Sect. VI. The proof is given in the appendix. The very definition of MIT, ITY and ITX already leads to a lowdimensional estimation problem without arbitrary truncation parameters. Further, the underlying theory of time series graphs allows for an analytical evaluation of the properties of these measures as we will demonstrate in the following section. See ^{1}^{1}1A Pythonscript to estimate the time series graph, MIT and related measures can be obtained from the website http://tocsy.pikpotsdam.de/tigramite.php. for software to compute the time series graph, MIT and related measures.
To clarify, each of the CMIs introduced in the preceding sections are intended to measure a different aspect of the coupling between and . In the following analytical analysis of simple models we will discuss the interpretability of the different measures.
V Analytical comparison
To motivate our choice of a measure of coupling strength and to clarify the important coupling strength autonomy property, we discuss an analytically tractable model of a multivariate Gaussian process:
(13) 
with independent Gaussian white noise processes
with variances
. The corresponding time series graph is depicted in Fig. 1 and the parents are and . Generally, the conditional entropy of a dimensional Gaussian process conditional on a (possibly multivariate) process is given by(14) 
where is the determinant of the covariance matrix of . In our case is univariate and thus . The variances and covariances needed to evaluate the determinants and detailed derivations for the following formulas are given in the appendix.
First, we analyze TE given by Eq. (1). TE can be written as the difference of conditional entropies
(15) 
where the latter entropy, conditioned on the whole infinite past, is actually the source entropy of and can be much easier computed by exploiting the Markov property
(16) 
which yields, using Eq. (14),
(17) 
The source entropy of is therefore given by the entropy of the innovation term . In the first entropy term, on the other hand, the infinite vector cannot be treated that easily and we have to evaluate the determinants of infinite dimensional matrices in
(18) 
However, for the special case of , i.e., no input processes apart from the autodependency in , the quotient of these matrices can be simplified to the quotient of infinite Toeplitz matrices. As shown in the appendix, we can then apply Szegö’s theorem szegoe ; boettcher2006 and get
(19) 
Another tractable case is for which the blocks of the covariance matrix become diagonal and
(20) 
Thus, in the first case the value of TE for our model depends on the autodependency coefficient and in the second case on the coupling coefficient and variance of . But why should a measure of coupling strength between and depend on internal dynamics of and, even more so, on the interaction of with another process ? While it can be informationtheoretically explained, it seems rather unintuitive for a measure of coupling strength between and .
Next, we compute the CMI that defines links in a time series graph. Writing Eq. (4) for as a difference of conditional entropies, the second term is again the source entropy as given by Eq. (17) and in this case also the first entropy can be simplified using the Markov property
(21) 
to arrive at a finite covariance matrix from which a lengthy computation yields
(22) 
Again, also this measure of coupling strength depends on the coefficients belonging to other coupling and autodependency links.
We now turn to the measures that solely use the parents as conditions which has the analytical and numerical advantage of low dimensional computations. The resulting expressions for the CMI with no conditions, i.e., the mutual information (MI), and for either one of the parents as a condition for are
(23)  
(24)  
(25) 
Thus MI depends on the coefficients and variances of the input processes, while ITX and ITY still depend at least on the coefficient and variance of the process that is not conditioned on. Contrary to TE and LINK though, neither of the three measures depends on the interaction with . In our model the inputs to and , i.e., the autodependency with and the external input from , are independent which makes the formulas much simpler.
Finally, the MIT for is
(26) 
which solely depends on the model coefficients that govern the source entropies, i.e., the variances , and the coupling coefficient .
This equation can be proven to hold for arbitrary multivariate linear autoregressive processes under the “no sidepath”constraint specified in the next section. More generally, for a class of additive models MIT depends only on the coupling coefficient and the source variances of and as will be proven in the coupling strength autonomy theorem in the next section.
But can a coupling strength always be associated with only one coupling coefficient ? In the following – still linear – example model visualized in Fig. 3(a) this is not the case:
(27) 
where the influence of on has two paths: One via the direct coupling link “” and one via the path “” such that we can rewrite
(28) 
from which we see, that the coupling cannot be unambiguously related to one coefficient. Here, MIT at is
(29) 
and depends not only on , but also on the coefficient of the link “”, and on the variance of . In this case it might be more appropriate to “leave open” both paths and exclude from the conditions which – only in this case – reduces the modified MIT to the MI
(30) 
Here the sum is the covariance along both paths, which can also vanish for , and seems like a more appropriate representation of the coupling between and .
Another example where one cannot unambiguously relate the coupling strength to one coefficient is for a nonlinear dependency between and [Fig. 3(b)]:
(31) 
If we express explicitly in terms of the source variance of and the parent of
(32) 
we note that due to the term the effect of is not additively separable from the source process . In the Venn diagram of Fig. 2(a) this “mixing” of entropies implies that the parts of the entropies and that overlap with are not distinguishable anymore, which could be visualized by the red and light gray shadings bleeding into one another. Therefore the coupling should be considered as emanating from rather than alone [visualized by a thick arrow in Fig. 3(b)]. For this nonlinear model we have not found an analytical expression for MIT, but the more general case of this model is studied numerically in the appendix.
These two examples point to constraints under which full coupling strength autonomy can be reached. In the next section we will formalize these constraints to general conditions in a theorem of coupling strength autonomy.
Vi Coupling strength autonomy theorem and modifications of MIT
Let , be two subprocesses of some multivariate stationary discretetime process sufficing condition (S) in Eichler2011 with time series graph as defined in Sect. III and coupling link “” for . The following derivations also hold for more than one link at lags between and . As before, we denote their parents and . For the link “” we define the following conditions:

Additivity means that the dependence of on its source process and parents and of on its source process , and the remaining parents is additive, i.e., they can be written as
(33) (34) for possibly multivariate random variables and , univariate i.i.d. random variables and with arbitrary, not necessarily identical distributions, and arbitrary functions .

Linearity in f: The dependence of on is linear, i.e., with real .

“No sidepath”constraint, i.e., in the time series graph the node is separated from given (for a formal definition of paths and separation see Eichler2011 ). Since due to condition (S) in Eichler2011 separation implies conditional independence
(35)
Theorem (Coupling Strength Autonomy). MIT defined in Eq. (IV) for the coupling link “” for of a multivariate stationary discretetime process sufficing condition (S) in Eichler2011 has the following dependency properties:

If all three conditions (1)(3) hold, then MIT can be expressed as an MI of the source processes:
(36) Since and
are assumed to be independent, the probability density of their sum is given by their convolution. The MIT thus depends solely on
and the joint and marginal distributions of and the convolution of with . 
If only conditions (1) and (2) hold, i.e., there exists a sidepath between and some nodes in , then MIT depends additionally on the distributions of at least the “sidepathparents” in and their functional dependence on :
(37) This relation can be further simplified if is additive in some parents.

If only the additivity condition (1) holds, i.e., is nonlinear and mixes with the parents then MIT depends additionally on , the distributions of variables in as well as and their functional dependencies on :
(38) This relation can be further simplified if some parents in are independent of .
For a contemporaneous link “” the contemporaneous MIT defined in Eq. (IV) under the condition (1) is:
(39) 
A contemporaneous link cannot have sidepaths. For MIT measures the autodependency strength. The proofs are given in the appendix.
We now discuss some remarks on the theorem and possible modifications of MIT:

For the special case of multivariate linear autoregressive processes of order brockwell2009time defined by
(40) with the coupling coefficient at lag corresponding to the connectivity matrix entry , and with no sidepaths, Eq. (36) leads to
(41) generalizing the MIT for our analytical model in Eq. (26). For an autodependency at lag with coefficient and no sidepaths the MIT is , independent of the source variance .

The form Eq. (41) is reminiscent of the ShannonHartley theorem in communication theory Cover2006 . There the coupling strength corresponds to the communication channel capacity which is given by the maximum MI over all possible input sources: . The ShannonHartley theorem for Gaussian channels then reads
(42) with bandwidth and signaltonoise ratio , which in Eq. (41) corresponds to . The difference to our measure of coupling strength is that we cannot manipulate the input sources and thus cannot measure the channel capacity alone. We also expressed the various other CMIs occuring above in this form, where the quotient can be interpreted as a signaltonoise ratio. For example, in Eq. (25) is the signal strength and is the noise strength.

For sidepaths, i.e., under the conditions (1) and (2) only, the example of MIT and the modified MIT for the case of our model example Eq. (V) point to the suggestion, that it might be more appropriate to “leave open” all paths from to by excluding those parents of that are depending on , i.e.,
(43) but additionally including the parents of these sidepath parents. In this way the couplings via the direct link “” and the path “” (the symbol “” denotes that the link from to the sidepath parents can either be directed or contemporaneous) are isolated from the effects of their parents. The modified MIT we call MITS where “S” stands for “sidepath”:
(44) 
For nonlinear dependencies one could modify MIT to the CMI between and the joint vector leading to MITN where “N” stands for “nonlinear”:
(45)
These modifications will be studied in a separate paper.
The theorem implies that under the conditions (1)(3) the MIT is independent of other coefficients belonging to other links. If this holds for all coupling strengths of all links in the model, then the MITs are independent in a functional sense. Note, however, that all coupling strengths of links emanating from the same process will depend on the source variance of . Thus, MIT somewhat disentangles the coupling structure, which is exactly the coupling strength autonomy that makes MIT well interpretable as a measure that solely depends on the “coupling mechanism” between at lag and , autonomous of other processes. One such possible misleading input “filtered out” by MIT is autocorrelation, or, more generally, autodependency as will be shown in the numerical experiments and the application to climatological data. In the next section we investigate the coupling strength autonomy property numerically.
Vii Numerical Comparison
In the following we compare MI, TE, MIT and related measures numerically to investigate the properties of generality and coupling strength autonomy for a general class of nonlinear discretetime stochastic multivariate processes:
(46) 
with independent Gaussian white noise processes with all variances . The corresponding time series graph is depicted in Fig. 2(b). We estimate the various coupling measures for fixed and and vary the input coefficients
and functional dependencies of inputs
Here we depict results for linear such that the multivariate process suffices all three conditions, a nonlinear dependency type is discussed in the appendix. The ensemble then consists of all combinations of input coefficients and functional forms, each combination run with 120 trials. The CMIs are estimated using a nearestneighbor (NN) estimator Kraskov2004a ; FrenzelPompe2007 with parameter (small values of lead to a lower estimation bias but higher variance Kraskov2004a ; FrenzelPompe2007 ).
In the top panel of Fig. 4(a) we plot the ensemble average for fixed for the following measures with : MI (gray with dotted line), ITY according to Eq. (10) (green with dashdotted line), ITX according to Eq. (11) (blue with dashed line) and MIT according to Eq. (IV) (red with solid line). The parents are shown in Fig. 2(b).
MIT is largely invariant to changes of the remaining coefficients and and approximately attains the analytical value for zero input coefficients [given by Eq. (26) for and ]: . This implies that the MIT of the coupling link is autonomous of the MITs corresponding to the input links for and for