Causality is a concept that expresses the joint behavior in time of a group of components in a system. In general, it denotes the effect of one component to itself and others in the system during a time period. Consider a network of nodes, each producing a signal in time. These processes can behave independently, or there might be an underlying connection, by nature, between them. Inferring this structure is of great interest in many applications. In 
, for instance, neurons are taken as components while the time series of produced spikes is used to derive the underlying structure. Dynamical models are also a well-known tool to understand functionals of expressed neurons. Additionally, in social networks, there is an increasing interest to estimate influences among users , while further applications exist in biology , economics , and many other fields.
defined the notion of causality between two time series by using a linear autoregressive model and comparing the estimation errors for two scenarios: when history of the second node is accounted for and when it is not. With this definition, however, we can poorly estimate models which operate non-linearly. Directed information was first introduced to address the flow of information in a communication set-up, and suggested by Massey as a measure of causality since it is not limited to linear models. There exist other methods which may qualify for different applications. Several of these definitions are compared in , where directed information is argued as a robust measure for causality. There are also symmetric measures like correlation or mutual information, but they can only represent a mutual relationship between nodes and not a directed one.
The underlying causal structure of a network of processes can be properly visualized by a directed graph. In particular, in a Directed Information Graph (DIG) –introduced simultaneously by Amblard and Michel  and Quinn et al. – the existence of an edge is determined by the value of the directed information between two nodes considering the history of the rest of the network. There are different approaches to tackle the problem of detecting and estimating such type of graphs. Directed information can be estimated based on prior assumptions on the processes’ structure, such as Markov properties, and empirically calculating probabilities [12, 11]. On the other hand, Jiao et al.  propose a universal estimator of directed information which is not restricted to any Markov assumption. Nonetheless, in the core of their technique, they consider a context tree weighting algorithm with different depths, which intuitively resembles a learning algorithm for estimating the order of a Markov structure. Other assumptions used in the study of DIGs, that constrain the structure of the underlying graph, are tree structures  or a limit on the nodes’ degree .
The estimation performance on the detection of edges on a DIG is crucial since it allows to characterize, for instance, the optimum test for detection, or the minimum number of samples needed to reliably obtain the underlying model, i.e., the sample complexity. In , the authors derive a bound on the sample complexity using total variation when the directed information between two nodes is empirically estimated. Following that work, Kontoyiannis et al.  investigate the performance of a test for causality between two nodes, and they show that the convergence rate of the empirical directed information can be improved if it is calculated conditioned on the true relationship between the nodes. In other words, the underlying structure of the true model has an effect on the detection performance of the whole graph. Motivated by this result, in this paper, we study a hypothesis test over a complete graph (not just a link between two nodes) when the directed information is empirically estimated, and we provide interesting insights into the problem. Moreover, we show that for every existing edge in the true graph, the estimation converges with , while if there is no edge in the true model, convergence is of the order of .
The rest of the paper is organized as follows. In Section II, notations and definitions are introduced. In particular, the directed information is reviewed and the definition of an edge in a DIG is presented. The main results of our work are then shown in Section III. First, the performance of a hypothesis test for a single edge is studied, where we analyze the asymptotic behavior of estimators based on the knowledge about the true edges. Then, we demonstrate how the detection of the whole graph relies on the test for each edge. Finally, in the last section, the paper is concluded.
Assume a network with nodes representing processes . The observation of the -th process in the discrete time interval to
is described by the random variable, which at each time takes values on the discrete alphabet . With a little abuse of notation and represent the observations of the process Y at instance and in the interval to , respectively.
The metric used to describe the causality relationship of these processes is the directed information, as suggested previously, since it can describe more general structures without further assumptions (such as linearity). Directed information is mainly used in information theory to characterize channels with feedback and it is defined based on causally conditioned probabilities.
The probability distribution of
The probability distribution ofcausally conditioned on is defined as
The entropy rate of the process Y causally conditioned on X is then defined as:
Consequently, the directed information rate of X to Y causally conditioned on Z is expressed as below:
Pairwise directed information does not unequivocally determine the one-step causal influence among nodes in a network. Instead, the history of the other remaining nodes should also be considered. Similarly as introduced in [12, 1], an edge from to exists in a directed information graph iff
where . Having observed the output of every process, the edges of the graph can be estimated which results in a weighted directed graph. However, when only the existence of a directed edge is investigated the performance of the detection can be improved. This is presented in Section III.
There exist several methods to estimate information theoretic values which most of them intrinsically deal with counting possible events to estimate distributions. One such method is the empirical distribution, which we define as follows.
By plugging in the empirical distribution we can derive estimators for information theoretic quantities such as the entropy , where we use the notation to distinguish the empirical estimator, i.e.,
A causal influence in the network implies that the past of a group of nodes affects the future of some other group or themselves. This motivates us to focus on a network of joint Markov processes in this paper, since it characterizes a state dependent operation for nodes, although we may put further assumptions to make calculations more tractable. For simplicity, we assume a three-node network (i.e., ) and the processes to be , and Z in the rest of the work, since the extension of the results for is straightforward.
is a jointly stationary Markov process of order .
Let us define the transition probability matrix with elements
Then, the next assumption prevents further complexities in the steps of the proof of our main result.
All transition probabilities are positive, i.e., .
This condition provides ergodicity for the joint Markov process and results in the joint empirical distribution asymptotically converging to the stationary distribution111The stationary distribution is denoted either as or as in the sequel., i.e.,
In general, the directed information rate cannot be expressed with the stationary random variables , and , since a good estimator requires unlimited samples for perfect estimation. To see this,
where we use the Markov property in the first equation, and the inequality holds since the mutual information is non-negative. Thus, with a limited sampling interval, an upper bound would be derived. The next assumption makes (5) hold with equality.
For processes Y and Z , the Markov chain
, the Markov chain
Note that the above assumption should hold for every other two pairs of processes if we are interested in studying the whole graph and not only a single edge.
Iii Hypothesis test for Directed Information Graphs
Consider a graph , where the edge from node to node is denoted by ; we say that if the node causally influences the node , otherwise, . A hypothesis test to identify the graph is performed on the adjacency matrix , whose elements are the s, and the performance of the test is studied through its false alarm and detection probabilities
where is the estimation of (properly defined later), and is the hypothesis model to test. In Theorem 1 below, both an upper bound on and a lower bound on are derived.
For a directed information graph with adjacency matrix of size , if Assumptions – hold, the performance of the test for the hypothesis is bounded as:
using the plug-in estimation of samples with . The function is the regularized gamma function, and with denoting the number of directed edges in the hypothesis graph, and . Finally, is the threshold value used to decide the existence of an edge, and its order is .
The proof of Theorem 1 consists of two steps. First, the asymptotic behavior of the test for a single edge is derived in Section III-A. Afterwards, the hypothesis test over the whole graph is studied based on the tests for each single edge. It can be seen later on that, by Remark 1 and Corollary 1, the performance of testing the graph remains as good as a test for causality of a single edge.
Note that by increasing while remaining of order , gets arbitrarily close to one, which results in the probability of detection converging to one as the probability of false alarm tends to zero.
Iii-a Asymptotic Behavior of a Single Edge Estimation
In general, every possible probability transition matrix can be parametrized with , where (see Table I
). The vectoris formed by concatenating the elements of in a row-wise manner excluding the last (linearly dependent) column. However, if the transition probability could be factorized due to a Markov property among its variables, the matrix might thus be addressed with a lower dimension parameter.
To see this, let us concentrate in our 3-node network . If , or equivalently , then by Assumption 3, the transition probability can be factorized as follows,
Here the transition matrix is parametrized by where has two components: and , and . The dimensions of the sets are shown in Table I; note that . The vectors and are also formed by concatenating the elements of their respective matrices as in the case of . More details are found in the proof of Theorem 2 in Appendix A.
Now consider the Neyman-Pearson criteria to test the hypothesis within .
The log-likelihood is defined as
be the most likely choice of transition matrix with general and null hypothesis, respectively, i.e.,
As a result, the test for causality boils down to check the difference between likelihoods, i.e., the log-likelihood ratio:
which is the Neyman-Pearson criteria for testing within . Then, in the following theorem, is shown to converge to a distribution of finite degree. The proof follows from standard results in [2, Th. 6.1].
Note that the asymptotic result from Theorem 2 depends only on the dimensions of the sets and not in the particular pair of nodes involved. Furthermore, the result also holds for a network with more than three nodes by properly defining the dimensions of the sets.
Knowledge about the absence of edges other than in the network results in converging to a distribution of higher degree since (III-A) could be further factorized. To see this, assume and consider that a knowledge about the links (for example, the whole adjacency matrix ) was given. Then, let the transition probability be factorized as much as possible, so it can be parametrized by which has lower or equal dimension than . Take
Intuitively, by following similar steps as in the proof of Theorem 2, we obtain that behaves as a random variable, where . Since the cumulative distribution function of the
. Since the cumulative distribution function of theis a decreasing function with respect to the degree then,
for sufficiently large and any . The lower bound in (13) allows us to ignore the knowledge about other nodes.
Consider now the estimation of the directed information defined as plugging in the empirical distribution (instead of the true distribution) into , i.e.,
Then, the following lemma states that , is proportional to with an factor.
, which is the plug-in estimator of the directed information.
The proof follows from standard definitions and noting that the KL-divergence is positive and minimized by zero. See Appendix B for the complete proof. ∎
Now, let us define the decision rule for checking the existence of an edge in the graph as follows:
where is of order . Then for any knowledge about states of edges in the true graph, as long as we have:
where the inequality is due to Remark 3.
From Theorem 2 and Lemma 1, it is inferred that when in the true adjacency matrix , then the empirical estimation of the directed information converges to zero with a distribution at a rate . The asymptotic behavior of is different if the edge is present, i.e., , which is addressed in the following theorem.
The empirical distribution can be decomposed in two parts, where the first one vanishes at a rate faster than and the second part converges at a rate . The condition keeps the second part positive so it determines the asymptotic convergence of . Refer to Appendix C for further details. ∎
Knowledge about the state of other edges in the true graph model does not affect the asymptotic behavior presented in Theorem 3, given that the condition makes the convergence of the estimator independent of all other nodes. This can be seen by following the steps of the proof, where we only use the fact that if the true edge exists then and (III-A) does not hold.
Iii-B Hypothesis Test over an Entire Graph
The performance of testing a hypothesis for a graph is studied by means of the false alarm and detection probabilities defined in (6) and (7), respectively. The results may be considered as an extension of the hypothesis test over a single edge in the graph.
First, let the false alarm probability be upper-bounded as
If , there exist nodes and such that . Hence,
On the other hand, the complement of the detection probability may be upper-bounded using the union bound:
Since and it is of order , then as and noting that
we have that,
This concludes the proof of Theorem 1.
Iii-C Numerical Results
The bounds derived in Theorem 1 state that the detection probability can be desirably close to one while the false alarm probability can remain near zero with a proper threshold test. In Fig. 1, these bounds are depicted with respect to different values of for a network with nodes. The joint process is assumed to be a Markov process of order , and the random variables take values on a binary alphabet ().
It can be observed in the figure that, for fixed , improves as decreases, i.e., when the graph becomes sparser. Furthermore, by a proper choice of , we can reach to optimal performance of the hypothesis test, i.e., and .
Iv Summary and Remarks
In this paper, we investigated the performance of a hypothesis test for detecting the underlying directed graph of a network of stochastic processes, which represents the causal relationship among nodes, by empirically calculating the directed information as the measure. We showed that the convergence rate of the directed information estimator relies on the existence or not of the link in the real structure. We further showed that with a proper adjustment of the threshold test for single edges, the overall hypothesis test is asymptotically optimal.
This work may be expanded by considering a detailed analysis on the sample complexity of the hypothesis test. Moreover, we assumed in this work that the estimator has access to samples from the whole network while in practice this might not be the case (see e.g. ).
Appendix A Proof of Theorem 2
For any right stochastic matrixof dimensions , let the matrix denote the first linearly independent columns of , as depicted in Fig. 2.
Without loss of generality, consider to be the set of integers which simplifies the indexing of elements in the alphabet. Let denote , and excluding . Next define the vector
which is associated with an element of (the sub-matrix of the transition probability matrix ).
Every can be addressed via the pair where and indicate the row and column of that element, respectively. Also, let , which denotes the index of that element when vectorizing . Any possible transition matrix can then be indexed with a vector
as , where has dimension (see Table I) and is constructed by concatenation of rows in .
Suppose now that or equivalently, by definition (2), . Then
Thus, the transition matrix is determined by the elements of two other matrices and given by (A). Define the vectors
which are associated with an element in and , such that in and in . Then
where the pairs of row and column indices for each element in and are then and , respectively.
determine , are vectors of length and , and are constructed by concatenating the rows of and , respectively. There exists then the mapping such that component-wise:
Consider the matrix of size such that for every element:
In other words, every row of the matrix is a derivative of an element of with respect to all elements of followed by the derivatives with respect to .