Decisions to set policies and conduct interventions ought to be supported by evidence and scenario analysis. Supporting decision making in this way, however, is difficult since it is usually not possible to conduct randomized controlled trials over policy decisions such as a country’s investment in sanitation or primary education and see the effect on indicators such as number of patents produced or tuberculosis mortality rate. Understanding the causal structure of the many different variables, indicators, and possible interventions relevant to such decisions is challenging because of their intricate interdependencies and the small cardinality of noisy samples coupled with a large number of variables. One approach for developing such understanding is through painstaking theorization and validation of small sets of variables over many years 
Existing approaches for the problem of structure learning from time series observations include the Granger Causality (GC) algorithm . This has been recently formalized in terms of directed information graphs [3, 4] as a Bayesian network structure recovery problem on time series. The GC approach has good statistical properties because it conditions on all other variables to isolate the pair in question . However, with finite samples and very large number of variables, the statistical power of the algorithm significantly reduces due to large conditioning sets. Inspired by the Peter and Clark (PC) algorithm for causal discovery for non-time series data, recently proposed variants perform tests that condition on a reduced subset, beginning with a complete graph and pruning it with pairwise tests ; this approach yields many false positives while also having scaling issues.
In this paper, we propose a new algorithm, MMPC-p, that is scalable and has provably strong p-value control to prevent false discoveries using techniques from [7, 8]. This proposed algorithm is inspired by the MMHC algorithm for causal structure learning in non-time series observational settings . It begins with an empty graph, adds edges to form candidate parent sets, and subsequently prunes them in a two-phase approach. We show it to be sound, complete and equip it with false discovery rate (FDR) control under assumptions we describe in the sequel.
The proposed MMPC-p algorithm relies on some form of independency testing on pairs of random processes. Due to autocorrelations across time, we cannot use conditional independence tests directly. We consider two testers based on directed information, an information-theoretic measure of predictive information flow between processes for linear Gaussian models. The first is a naïve directed information test that ignores correlations across time but requires fewer samples. The second computes directed information in a more principled way but requires a greater number of samples.
We conduct a detailed comparison study of GC, PC, and MMPC-p on generated data and find MMPC-p to have better scaling and error control empirically. From our synthetic experiments, we find that MMPC-p has higher statistical power for sparse large graphs than the alternatives: GC and PC. We also apply MMPC-p to a real-world data set of global development indicators from 186 countries over more than fifty years and compare the learned causal relationships to the validated relationships in the International Futures (IFs) program . There are systematic differences in the two sets of relationships that we detail later, but the ones found by the proposed algorithm have some validity from the policy expert perspective.
The main contributions of this work are: (1) an MMPC-p algorithm for time series data inspired by the MMHC algorithm, (2) a method to control false discoveries with our approach under weak assumptions on Type II error, (3) exhaustive experiments comparing the performance of MMPC-p with the modified PC and modified GC algorithms (we show that MMPC-p performs well for large sparse graphs in terms of both omission and commission errors), and (4) a case study on a global development dataset with input from subject matter experts.
Ii Related Work
Most recent work in causal structure learning has focused on issues related to undersampling. In [10, 11], causal time scale structures are learned from subsampled measurement time scale graphs and data. A variant of this was studied in , where the authors address the issue of causal structure learning from temporally aggregated time series. We consider the learning problem at measurement time scales only. Recent work  has drawn attention to the need for evaluating algorithms on the measurement time scale problem, which is used by some of the algorithms that deal with under sampling. Regression-based methods have also been used for estimating the causal impact , which is quantified as the counterfactual response in a synthetic control setting. In contrast, here we focus on the structure learning aspect of the problem. In , the PC algorithm is extended for time series under the assumption that the measurement and time scales were approximately equal. Another variant called modified PC has been presented in . We actually compare our results to this variant in the empirical section. In , the authors present techniques for estimating multivariate Granger causality from time series data, both unconditional and conditional, in the time and frequency domains. In contrast,  explores combining graphical modeling with Granger causality to address climate change problems. These papers use the Granger causal algorithm that conditions on all the variables but the pair of variables in questions. We actually compare to a variant of these approaches as described in . More recently, the importance of FDR control is being emphasized in causal structure learning problems. An approach for Bayesian network structure learning with FDR control was presented in . In , the authors present p-value estimates for high-dimensional Granger causal inference under assumptions of sub-Gaussianity on the coefficients. PC-p  is an extension of PC which computes edge-specific p-values and controls the FDR across edges. In contrast to these algorithms, our MMPC-p is for time series, works under general assumptions and is inspired by the MMHC algorithm. We also formally prove FDR control guarantees and back it up with results in our empirical section.
Iii Formal Problem Definition and Preliminaries
Notation. Consider random processes over time slots such that the th random process sequence is denoted by:
Further, let denote the sequence up to time . Let denote the th random process from time to . Consider a subset of random processes
. Then, the random variables of all the processes infrom time to are denoted . denotes the random variables belonging to the set of processes at time . Let and denote the quantities analogous to the single random process case as described above.
We will primarily consider random processes that take values in a finite alphabet. This is to simplify the presentation, avoiding all measure-theoretic issues. The experiments, however, are performed with respect to real-valued random processes. We make the following assumptions.
Assumption 1 (Strict Causality).
The random processes follow the dynamics:
The dynamics are order-1 Markov. This supposes that there are no instantaneous interactions in the system conditioned on the past, i.e. the system is strictly causal.
Here, the notation subsumes the recursive causal conditioning over time in (1).
Assumption 2 (Causal Sufficiency).
There are no hidden confounders and all variables are measured.
Assumption 3 (Positivity).
The joint distribution satisfies:
The joint distribution satisfies:for all possible realizations in the domain.
Causally conditioned directed information from random process to conditioned on the random processes in the set is given by:
Here, represents the standard conditional mutual information measure in information theory.
In other words, it is the time average of the mutual information between process until time and process at time given the past of processes in until time . It is related to Granger causality, signifying the reduction in prediction loss that process until gives over and above the processes in until time . The notion is exact for prediction under log loss. However,  presents arguments as to why the log loss is the correct metric for measuring value of extra side information in prediction as only this measure satisfies a data-processing axiom.
Directed Information (DI) graph is a DAG associated with the random processes is defined as follows: a directed edge iff .
This is a graph where every node is a random process. We interchangeably use and when talking about nodes in the graph . Let be the set of directed parents of node in the DI graph . Let be the set of children of .
Let be the set of directed parents according to the DI graph. Then, if the positivity condition holds for all random processes over time and if the system is strictly causal, then almost surely:
When the system of random processes satisfies the positivity constraint and satisfies strict causality:
If , then , for all sets .
Iv Algorithm: MMPC-p
Inspired by the MMHC algorithm for observational causal discovery  with i.i.d. data, we introduce an adaptation called the MMPC-p algorithm (max-min parents) for Granger causality. The MMPC-p algorithm uses a DI Tester as an oracle instead of a Conditional Independence (CI) Tester. We will prove an upper bound on the p-values of the edges obtained and show that p-value control is possible in this case under some weak assumptions.
DI Testing Oracle
: This DI testing function outputs the probability (or p-value) of the eventfor any given the dataset. We will first assume this oracle that outputs p-value to specify our MMPC-p algorithm.
Let us assume we have a measure of association
Define the functions max-min association and argmax-min association as follows:
We now describe the MMPC-p algorithm presented in Algorithm 1. It consists of two phases: the first phase picks candidate parents while the second prunes the list of candidate parents picked in the first phase.
If , for used in Algorithm 1.
The above says that Type II errors are small. Related to the faithfulness assumption, it means there are no very weak dependencies in the system. Similar assumptions have been made for p-value control for causal inference with i.i.d. data .
Type II error less than (Assumption 5) implies that after Phase I of MMPC-p.
We present a proof by contradiction. Suppose , then . This implies that by Assumption 5. Suppose is not included in and Phase I of MMPC-p completes. This implies that when looking at , there exists a subset such that . This implies that for that subset yielding a contradiction. Therefore, node will be included in at the end of Phase I of MMPC-p. ∎
Lemma 2 ().
Consider CI testers and the following null and alternative hypothesis
Assuming that the th CI oracle outputs independent of all other oracles, we can bound the p-value (5) as .
For all the edges that finally remain after Phase II of MMPC-p the
After completion of Phase I, we wish to test whether the edge is present by conducting independence tests. We construct a hypothesis test with the following null and alternative:
Since all parents are in , testing for in (6) is equivalent to testing in (5). Similarly, testing for in (6) is equivalent to testing in (5). Hence, the hypothesis test defined in (6) is equivalent to the hypothesis test defined in (5). Hence, our Algorithm 1, keeps track of the p-value by bounding . ∎
FDR Control:We define given by:
where is the number of edges retained at the end of Phase II of MMPC-p. The value satisfies:
Given a target false positive rate , deleting directed edges whose ensures consistent FDR control provided in Algorithm 1.
The MMPC-p algorithm with a perfect DI oracle is sound and complete.
A perfect DI oracle means that if and if . With this strong assumption and Lemma 1, for any node , after the first phase of MMPC-p.
Next, we show that if is not in then after the second phase. The reason is that one of the subsets of after the first phase has to equal (as shown in the previous paragraph). Suppose, , then we know that . Therefore, . This means that for any , at Line 1 when if after Phase I. This would cause to be discarded in the second phase. ∎
This algorithm is much simpler than the one it is inspired by: MMHC. The definition of DI and the role of time in its computation simplify the algorithm and its proof. Furthermore, we do not have problems of “descendants” staying after the two phases of the algorithm (there is a second part of the MMHC algorithm in the original paper where pairs were only considered if and , that is not required here). However, the algorithm still retains the robustness of MMHC.
V DI Testers for Linear Models and Gaussian Processes
Let the scalar variable follow a memory- autoregressive linear model with i.i.d. Gaussian noise given by where and is independent across time. Generalizing to a set of random variables with an underlying Granger causal graph (the DI graph) in the sense of Section III. Now given the DI graph i.e., for variable there is a set (that does not depend on ) such that where ’s are the coupling coefficients. Let denote the number of time points sampled for every variable . Let the number of i.i.d. copies of these time series is . Every variables essentially is observed times, across time for each i.i.d. sample. For jointly Gaussian autocorrelated time series processes, DI can be computed by . . Here, is the asymptotic prediction error of given the past of the process , i.e. and the past of process
Test 1: We follow the standard approach used in Granger causal studies . If is constant, we create matrix consisting of rows by taking all consecutive pairs from every time series stacking them vertically. Now, we stack these matrices vertically again to create an matrix . Let refer to the first column and let
refer to the second column. We solve the following two approximations through ordinary least squares regression:
Let be the mean squared error for the first least-squares approximation and be the mean squared error for the second approximation. Then follows a distribution with degree of freedom and the p-value corresponding to the null hypothesis corresponds to the p-value of the null hypothesis (when the process is stationary and jointly Gaussian). This is, therefore, a DI testing oracle and we call it Tester 1.
Test 2: The issue with Tester 1 is that it does regression with highly autocorrelated samples of the same time series stacked vertically. This is a good practice when the number of i.i.d. copies is small. However, when is comparable to , autocorrelation amongst a specific process would decrease the performance of the tester. Instead of regression through stacking as in the previous case, we compute the asymptotic prediction error as follows.
1) We do two separate regressions for each pair of time points with i.i.d. samples: one using variable as a covariate to predict and another without and only with set .
2) Now consider the residues obtained after the regressions as for the first regression. Similarly, let the residues for the second regression be .
3) Denote to be the covariance matrix whose entries are indexed by . is the covariance between and averaged over the i.i.d samples. Similarly, let is the covariance between matrix calculated from the variables. Now, since all variables are jointly Gaussian, the residues are also jointly Gaussian. Let be the covariance sub-matrix involving points with time index until . Therefore, we compute the asymptotic prediction error given by the expression : Similar expressions hold for . This is motivated by the fact that for jointly Gaussian variables , the squared prediction error of given the other is , where is the bottom right sub-matrix of . 4) Under the null hypothesis, is distributed with with degree of freedom. We call this Tester 2.
Vi Comparative Study
We perform a comparative study similar to reference . We compare the results of MMPC-p to modified GC [18, 5, 2] and modified PC [19, 5]. For MMPC-p and PC we use Tester 1 and Tester 2; for modified GC we use only Tester 1. We fix an value for all the testers. For the sake of brevity, we refer to modified GC and modified PC as GC and PC, respectively in the remainder of this paper.
Synthetic Datasets: We generate synthetic datasets as described in . For a given density and number of nodes , we generate 50 datasets consisting of directed graphs of nodes that contain at least one -cycle, with coefficients of the AR(1) model in
(before normalizing by the largest eigenvalue), such that the matrix has a density. This method of constructing AR(1) models will generate matrices with very small eigenvalues that is fixed by adding a scaled identity to the AR(1) model (essentially adding feedback loops ). We do this 50 times for each of and for densities . For each of the datasets we generate 1000 samples, in the form of time series with samples each, such that .
Metrics: We consider omission error rate: false negative edges normalized by the total number of edges and commission error rate: false positive edges normalized by the total number of non-edges.
Discussion: The leftmost plots in Figure 1, Figure 2, Figure 3, and Figure 4 indicate that for large and sparse graphs both the omission and commission errors are well controlled for MMPC-p. The rightmost plots suggest that when the density is higher, commission errors for MMPC-p are still well controlled but the omission error increases.
We conduct analysis for 50 variables separately in Figure 5. The results indicate that PC has high commission error with Tester 1, and GC cannot run because of the conditioning set being large. However, both commission and omission errors for MMPC-p with Tester 1 are lower than PC and GC.
Vii Global Development Case Study
We consider a dataset of over 4,000 random processes, most of which begin in 1960, across a wide range of development issue areas. The source is largely international organizations like the World Bank, the Food and Agriculture Organization of the United Nations, the UNESCO Institute for Statistics, and the International Monetary Fund and has been standardized (all series structured identically with detailed metadata) in the IFs platform, a free, open-source long-term global integrated assessment model. For each process in the dataset, we have 186 time series samples, each of them corresponding to a country. The length varies between 50 and 60 time steps.
We compare the results of MMPC-p with the causal connections used in the IFs model. The connections in IFs are the result of a deductive approach, i.e. domain knowledge based on academic literature and conceptually sound statistics. In Table I, we show the parents obtained for four series (selected arbitrarily): AGCropProductionFAO, GDPCurDol, LaborAgricultureTotMale, and Population.
|Series||Parent (max p-value)||Parent (IFs)|
|AGCropProductionFAO||AGCroptoFoodFAO (1e-9), Market for PC sales (2.22e-6)||Land Crop, Change in Precipitation, Annual Temperature Change, Land Equipped for Irrigation, Labor in Agriculture, Arable Land|
|GDP Current Dollars||GDP (1e-9)||Labor|
|LaborAgricultureTotal%Male||Revenue Contribution (3.1e-5)||Value added Agriculture|
|Population||Internet Subscribers (1e-9), Cooking Oil, Fuel and Coal (1e-9)||Birth, Death, Migration|
For the most part, the parents identified by MMPC-p do not match the causal drivers in IFs. They are a mixed bag: some are semantically similar to the IFs parents, such as AGCroptoFoodFAO and GDP; we have validated them to be semantically similar with domain experts. Others like Market for PC sales, Internet Subscribers, and Cooking Oil are spurious.
One of the main reasons for the mismatch between MMPC-p and the IFs model is that many variables used in the model do not have a direct corresponding data series. For example, one of the two direct drivers of crop production is yield, measured as tons per hectare. But the variable for yield used in the IFs model is initialized using data series for crop production and crop land (the quotient being yield). So, MMPC-p is unable to identify yield as a direct parent of crop production. Another reason for the mismatch is that the dataset used by MMPC-p contains many series that are aggregated for use in the IFs model, and are not directly causally connected to other variables. For example, calories per capita is an important development indicator, and a direct driver of hunger, but is initialized in the IFs model through the sum of ten series for calories per capita from different food sources. Finally, a technical reason for the mismatch could be that many of these relations are non-linear in nature. Other testers that do not require linearity could be used with more available data and perhaps yield results more similar to IFs.
In this paper, we have proposed a new algorithm for learning the Granger causal structure of observational time series and endowed it with strong FDR control. Named MMPC-p, it is inspired by the hill-climbing MMHC approach for causal structure learning in non-time series observations and inherits its scalability to large numbers of random processes. We conduct a comprehensive comparison to GC and PC with two different DI testers on large sparse graphs, finding that the proposed algorithm has better FDR control and scalability than the competing algorithms. We have also taken the first steps to using the algorithm in practice for a global development use case as an alternative to years-long modeling efforts. Our results are observed to be semantically similar for some variables when compared to the existing ground-truth. There is still room for improvement in better aligning with international studies practice; in fact, one piece of future work is to use the human-validated relationships not only as a comparison point for validating algorithm outputs, but as input for an improved algorithm that is a hybrid of deduction and data-driven inference.
-  B. B. Hughes and E. E. Hildebrand, Exploring and Shaping International Futures. New York: Paradigm, 2006.
-  L. Barnett and A. K. Seth, “The MVGC multivariate Granger causality toolbox: A new approach to Granger-causal inference,” Journal of Neuroscience Methods, vol. 223, pp. 50–68, Feb. 2014.
-  C. J. Quinn, N. Kiyavash, and T. P. Coleman, “Directed information graphs,” IEEE Transactions on Information Theory, vol. 61, no. 12, pp. 6887–6909, Dec. 2015.
-  M. Eichler, “Graphical modelling of multivariate time series,” Probability Theory and Related Fields, vol. 153, no. 1-2, pp. 233–268, Jun. 2012.
-  J. W. Cook, D. Danks, and S. M. Plis, “Learning dynamic structure from undersampled data,” in Proceedings of the UAI Causality Workshop, Sydney, Australia, Aug. 2017, p. 7.
-  E. V. Strobl, P. L. Spirtes, and S. Visweswaran, “Estimating and controlling the false discovery rate for the PC algorithm using edge-specific p-values,” arXiv preprint arXiv:1607.03975, May 2017.
-  A. P. Armen and I. Tsamardinos, “A unified approach to estimation and control of the false discovery rate in Bayesian network skeleton identification,” in
-  I. Tsamardinos, L. E. Brown, and C. F. Aliferis, “The max-min hill-climbing Bayesian network structure learning algorithm,” Machine Learning, vol. 65, no. 1, pp. 31–78, Oct. 2006.
-  A. Hyttinen, S. Plis, M. Järvisalo, F. Eberhardt, and D. Danks, “A constraint optimization approach to causal discovery from subsampled time series data,” International Journal of Approximate Reasoning, vol. 90, pp. 208–225, Nov. 2017.
-  S. Plis, D. Danks, C. Freeman, and V. Calhoun, “Rate-agnostic (causal) structure learning,” in Advances in Neural Information Processing Systems 28, Montréal, Canada, Dec. 2015, pp. 3303–3311.
M. Gong, K. Zhang, B. Schölkopf, C. Glymour, and D. Tao, “Causal discovery
from temporally aggregated time series,” in
Proceedings of the Conference on Uncertainty in Artificial Intelligence, Sydney, Australia, Aug. 2017, p. 269.
-  K. H. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S. L. Scott, “Inferring causal impact using Bayesian structural time-series models,” Annals of Applied Statistics, vol. 9, no. 1, pp. 247–274, Mar. 2015.
-  A. C. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe, “Spatial-temporal causal modeling for climate change attribution,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, Jun.–Jul. 2009, pp. 587–596.
-  A. Chaudhry, P. Xu, and Q. Gu, “Uncertainty assessment and false discovery rate control in high-dimensional Granger causal inference,” in Proceedings of the International Conference on Machine Learning, Sydney, Australia, Aug. 2017, pp. 684–693.
-  J. Jiao, T. A. Courtade, K. Venkat, and T. Weissman, “Justification of logarithmic loss via the benefit of side information,” IEEE Transactions on Information Theory, vol. 61, no. 10, pp. 5357–5365, Oct. 2015.
-  P.-O. Amblard and O. J. J. Michel, “Relating Granger causality to directed information theory for networks of stochastic processes,” arXiv preprint arXiv:0911.2873, Nov. 2011.
-  C. W. J. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica, vol. 37, no. 3, pp. 424–438, Aug. 1969.
-  P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction and Search. Cambridge, MA: MIT Press, 2000.