1 Introduction
Causal inference from time series is a fundamental problem in data science, with applications ranging from economics, machine monitoring, and biology to climate research. It is also a problem that, to date, has not found a general resolution.
While Granger causality (Wiener, 1956; Granger, 1969, 1980) (see definition in supplementary 7.1) has been the standard approach to causal analysis of time series data for half a century, several issues caused by violations of its assumptions (causal sufficiency, no instantaneous effects) have been described in the literature; see, e.g., Peters et al. (2017) and references therein. Several approaches addressing these problems have been proposed during the last decades (Hung et al., 2014; Guo et al., 2008). Nevertheless, it is fair to say that causal inference in time series remains challenging, despite the fact that the time order of variables renders it an easier problem than the typical ‘causal discovery problem’ of inferring the causal DAG among variables without any prior knowledge of causal directions (Pearl, 2009; Spirtes et al., 1993). The discovery of the causal graph from data is largely based on the graphical criterion of d-separation, which formalizes the set of conditional independences to be expected based on the causal Markov condition and causal faithfulness (Spirtes et al., 1993) (see definition in supplementary 7.1). One can show that Granger causality can be derived from d-separation (see, e.g., Theorem 10.7 in (Peters et al., 2017)). Several authors have shown how to derive d-separation-based causal conclusions in time series beyond Granger’s work. Entner & Hoyer (2010) and Malinsky & Spirtes (2018), for instance, are inspired by the FCI algorithm (Spirtes et al., 1993) and the work of Eichler (2007), and do not assume causal sufficiency for causal discovery of the full graph in time series (for an extended review see (Runge, 2018; Runge et al., 2019a)). However, these methods, due to the hardness of their goal and the few assumptions they impose, are limited to reporting only Partial Ancestral Graphs (PAGs).
Even for PAGs, the authors report that their implementation is not complete (Malinsky & Spirtes, 2018), meaning that not all such graphs will be identified. In (Runge et al., 2019b) the PCMCI method is proposed as an extension of PC, and, although lower rates of false positives are reported compared to classical Granger causality (suppl. 7.1), the method still relies on the assumption of causal sufficiency. All the aforementioned methods target full causal graph discovery. A method that focuses on the narrower problem that we tackle here, that of causal feature selection for a given target, is seqICP (Pfister et al., 2019). Although its goal is the same as that of our method, seqICP relies on a quasi-interventional scenario with different background conditions, while our method is based on observational data. We give an extensive comparison of the methods in subsection 5.4.
In the present work, we study the problem of causal feature selection in time series, by which we mean the detection of direct and indirect causes of a given target. We construct conditions which, subject to appropriate connectivity assumptions, we prove to be sufficient for direct and indirect causes and necessary for the identification of directly causal time series of a target, even in the presence of latent variables. In contrast to approaches inspired by conditional-independence-based algorithms from causal discovery (like PC and FCI (Spirtes et al., 1993)), our method directly constructs the right conditioning sets of variables, without searching over a large set of possible combinations. It does so with a preprocessing step that identifies the nodes of the time series that enter the previous time step of the target node, thus avoiding the statistical issues of multiple hypothesis testing.
We provide experimental results with simulated data, examining scenarios with varying numbers of observed and hidden time series, densities of edges, noise levels, multiple time lags, and sample sizes. We show that our method leads to essentially no false positives and relatively low false negative rates, even in confounded environments, thus outperforming Granger causality, as well as seqICP and PCMCI. We also obtain meaningful results on real data. We refer to our method as SyPI, as it performs a Systematic Path Isolation approach for causal feature selection in time series.
2 Theory and Methods
We are given observations from a target time series whose causes we wish to find, and observations from a multivariate time series of potential causes (candidate time series). Moreover, we allow for an unobserved multivariate time series, which may act as a common cause of the observed ones. The system is not assumed to be causally sufficient; hence we allow for unobserved variables. We introduce the following terminology to describe the causal relations:
Terminology / Notation:

“full time graph”: the infinite DAG having and as nodes.

“summary graph”: the directed graph with nodes containing an arrow from to whenever there is an arrow from to for some time lag (Peters et al., 2017).

“” for means a directed path that does not include any intermediate observed nodes in the full time graph (confounded or unconfounded).

“” for in the full time graph means a directed path from to .

“confounding path”: A confounding path between and in the full time graph is a path of the form , consisting of two directed paths and a common cause of and .

“confounded path”: an arbitrary path between two nodes and in the full time graph which coexists with a confounding path between and .

“sg-unconfounded” (summary-graph-unconfounded) causal path: a causal path in the full time graph that does not appear as a confounded path in the summary graph.

“pb-unconfounded” (past-blocked-unconfounded) causal path: a causal path between two nodes and in the full time graph for which all confounding paths are blocked by or .

“lag”:
is a lag for the ordered pair of a time series
and the target if there exists a collider-free path    that does not contain a link of this form with arbitrary, for any , and such that no node in this path belongs to . See explanatory figure 8 in the supplementary.
“single-lag dependencies”: We say that a set of time series () have “single-lag dependencies” if all the have only one lag for each pair . Otherwise we refer to “multiple-lag dependencies”.
Assumptions:

Causal Markov condition in the full time graph (def. 1.2.2 in Pearl (2009); see definition in suppl. 7.1)

Causal Faithfulness in the full time graph (see definition in suppl. 7.1)

No backward arrows in time

Stationary full time graph: the full time graph is invariant under a joint time shift of all variables

The full time graph is acyclic.

The target time series is a sink node in the summary graph; it does not affect any other variables in the graph.

There is an arrow . Note that such arrows need not exist; we then call the corresponding time series memoryless.

There are no arrows for .

Every variable that affects directly (no intermediate observed nodes in the path in the summary graph) should be memoryless (), and should have single-lag dependencies with in the full time graph (see def. 2). (Note that this assumption is required only for the completeness of the algorithm against direct false negatives (Theorem 2). For the sufficient conditions alone, its violation does not spoil Theorem 1. The existence of a latent variable with memory affecting the target time series directly, or of a latent variable affecting the target directly with multiple lags, renders impossible the existence of a conditioning set that could d-separate the future of the target variable from the past of any other observed variable.)
Intuition for proposed conditions in Theorems 1 and 2:
The idea is to isolate the path   in the full time graph and extract triplets as in (Mastakouri et al., 2019). This way we can exploit the fact that if there is a confounding path between and , then will be a collider that unblocks the path between and when we condition on it. In this path “ ” means or , and (if observed), in addition to any other intermediate variable in the path  , must . Mastakouri et al. (2019) proposed sufficient conditions for causal feature selection in a DAG (no time series) where a cause of a potential cause was known or could be assumed due to a time-ordered pair of variables.
Here the goal is to propose necessary and sufficient conditions that differentiate between being a common cause and   being an (in)direct edge to in the full time graph.
Figure 1 visualizes why time series raise an additional challenge for identifying sg-unconfounded causal relations. While the influence of on is unconfounded in the summary graph, the influence is confounded in the full time graph due to its own past; for example, and are confounded by . Therefore we need to condition on to remove past dependencies. If no other time series were present, that would be sufficient. However, in the presence of other time series affecting the target , becomes a collider that unblocks dependencies. If, for example, we want to examine as a candidate cause, we first need to condition on , the past of . Next, we need to condition on one node from each time series that enters (which is a collider), to avoid all the dependencies that might be created by conditioning on it. It is enough to condition only on these nodes for the following reason: if a node has a lag-dependency with , then there is an (un)directed path from to . If this path is a confounding one, then conditioning on is not necessary, but also not harmful, because the future of this time series in the full graph is still independent of . This independence is enforced by the fact that is a collider, because of the stationarity of the graph, and this collider is by construction not in the conditioning set. If is connected with via a directed link (as in fig. 1), then conditioning on is necessary to block the parallel path created by its future values . Based on this idea of isolating the path of interest, we build the conditioning set as described in Theorem 1 and its converse, Theorem 2, where we prove the necessity and sufficiency of our conditions.
Theorem 1.
[Sufficient conditions for identifying a direct or indirect cause of ] Assuming 22, let be the minimum lag (see 2) between and . Further, let . Then, for every time series we define a conditioning set .
If
(1) 
and
(2) 
are true, then
and the path between the two nodes is pbunconfounded.
We can think of as the set that contains only one node from each time series, namely the node that enters due to a directed or confounded path (if it exists, this node is the one at ).
Proof.
(Proof by contradiction)
We need to show that if , or if the path is pb-confounded, then at least one of conditions 1 and 2 is violated.
First assume that there is no directed path between and : . Then, there is a confounding path without any colliders (colliders cannot exist in the path by the definition of the lag, 2). In that case we will show that either condition 1 or condition 2 is violated. If all the existing confounding paths contain an observed confounder (there can be only one confounder, since in this case there are no colliders in the path), then condition 1 is violated, because we condition on , which d-separates and . If in all the existing confounding paths the confounder node , but some observed non-collider node is in the path and this node belongs to , then condition 1 is violated, because we condition on , which d-separates and . If there is at least one confounding path whose confounder node does not belong to , and no other observed (non-collider or descendant-of-collider) node in the path belongs to , then condition 2 is violated, for the following reasons: Let’s name . We know the existence of the path , due to assumption 2.

If and have in common, then is a collider. Therefore, adding it to the conditioning set would unblock the path between and .

If and have in common, that means lies on . In this case is not in the path from to , and hence adding it to the conditioning set could not d-separate and .
In both cases, condition 2 is violated. ∎
Remark.
Theorem 2.
[Necessary conditions for a direct cause of ] (almost converse of Theorem 1)
Proof.
(Proof by contradiction)
Assume that the direct path exists and is unconfounded. Then condition 1 holds. Now assume that condition 2 does not hold.
This would mean that the set does not d-separate and . Note that a path is said to be d-separated by a set of nodes if and only if it contains a chain or a fork whose middle node is in , or a collider such that neither the middle node nor any of its descendants is in .
Hence, a violation of condition 2 would imply that (a) there is some middle node or descendant of a collider in and no non-collider node in this path belongs to this set, or (b) there is a collider-free path between and that does not contain any node in .

There is some middle node or descendant of a collider in and no non-collider node in this path belongs to this set:
(a1:) If there is at least one path     where is a middle node of a collider and none of the non-collider nodes in the path belongs to : such a path could be formed only if, in addition to , some directly caused . Then  . (Due to our assumption of single-lag dependencies (see 2), a path of the form   could not exist.) Then, due to the stationarity of the graph, the node will enter . If this is hidden (), then due to assumption 2 this time series will be memoryless (). Therefore, the collider in the conditioning set will not unblock any path between and that could contain . If is observed (), then due to assumption 2 the path will be  . However, this path is always blocked by , due to the rule we use to construct . That means a non-collider node in the conditioning set will necessarily be in the path , which contradicts the original statement. (a2:) If there is at least one path    where is a middle node of a collider and none of the non-collider nodes in the path belongs to : this could only mean that there is a confounder between the target and . However, this contradicts that is an “sg-unconfounded” direct causal path.
(a3:) If there is at least one path    where with is a middle node of a collider and no non-collider node in the path belongs to : in this case, because . By construction of , all the observed nodes in that enter belong to . That means that enters . Hence, in the path there will necessarily be a non-collider node which belongs to the conditioning set. This contradicts the original statement “and no non-collider node in the path belongs to ”.
(a4:) If a descendant of a collider in the path     belongs to the conditioning set and no non-collider node in the path belongs to it: due to the single-lag dependencies assumption, , since otherwise there would be multiple-lag effects from to . That means that, regardless of whether is hidden or not, the in the collider path will enter . If , then, because enters , . In the first case only , and in the latter case also , is a non-collider variable in the path that belongs to the conditioning set, which contradicts the statement of (a4). If the collider , then, as explained in (a3), at least one non-collider variable in the path will belong to the conditioning set, which contradicts statement (a4). Finally, if and are hidden: if , then the node is necessarily in the path as a pass-through node, which contradicts statement (a4); if , then the single-lag dependencies assumption is violated.

There is a collider-free path between and that does not contain any node in :
Such a path would imply the existence of a hidden confounder between and , or the existence of a direct edge from to . The former cannot exist because we know that is an sg-unconfounded direct cause of . The latter would imply that there are multiple lags of direct dependency between and , which contradicts our assumption of single-lag dependencies.
Therefore, we have shown that whenever is an sg-unconfounded causal path, the conditions are necessary. ∎
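To make the role of the two conditions concrete, the following is a minimal sketch for the linear case, assuming the minimum lags have already been estimated (e.g., with the lasso step described in section 2). Function and variable names are ours, not the authors' implementation, and the form of the conditioning sets paraphrases the construction described above (the target's previous value plus the single node of every other candidate series that enters it):

```python
import numpy as np
from math import erf, log, sqrt

def parcorr_indep(a, b, Z, alpha):
    """True if a and b are independent given the columns of Z, under a
    partial-correlation (Fisher z) test. Linear-Gaussian sketch only."""
    if Z.shape[1]:
        a = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
        b = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    r = np.clip(np.corrcoef(a, b)[0, 1], -0.9999, 0.9999)
    z = 0.5 * log((1 + r) / (1 - r)) * sqrt(max(len(a) - Z.shape[1] - 3, 1))
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p >= alpha

def sypi_conditions(X, y, lags, max_lag=5, alpha=0.001):
    """Hedged sketch of the two conditions, given precomputed minimum
    lags (dict: series index -> lag). Not the authors' code."""
    T = X.shape[0]
    m = max_lag + 2                       # offset so all lagged slices align
    yt, ylag = y[m:], y[m - 1:T - 1]
    causes = set()
    for i, w in lags.items():
        # conditioning set: y_{t-1} plus, for every other candidate j with
        # detected lag w_j, the node x^j_{t-1-w_j} that enters y_{t-1}
        S = [ylag] + [X[m - 1 - wj:T - 1 - wj, j]
                      for j, wj in lags.items() if j != i]
        Z = np.column_stack(S)
        xi = X[m - w:T - w, i]            # candidate node x^i_{t-w}
        xi_prev = X[m - w - 1:T - w - 1, i]  # its past x^i_{t-w-1}
        cond1 = not parcorr_indep(yt, xi, Z, alpha)          # dependent
        cond2 = parcorr_indep(yt, xi_prev,
                              np.column_stack([Z, xi]), alpha)  # independent
        if cond1 and cond2:
            causes.add(i)
    return causes
```

On a toy system where only the first candidate drives the target, the sketch keeps the true cause and discards the independent series.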
Since it is unclear how to identify the lag in 2, we introduce the following lemmas for the detection of the minimum lag that we require in the theorems. We provide the proofs of the lemmas in suppl. 7.3.
Lemma 1.
If the paths between and are directed then the minimum lag as defined in 2 coincides with the minimum nonnegative integer for which . The only case where is when there is a confounding path between and that contains a node from a third time series with memory. In this case .
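Lemma 1 suggests a practical detector: regress the target on lagged copies of the candidate and return the smallest lag whose coefficient survives. A hedged sketch of such a lasso-based minimum-lag step follows; the regularization and threshold values are illustrative (the experiments below report stable results for thresholds around 0.1 to 0.15), and the coordinate-descent solver is our stand-in, not the authors' code:

```python
import numpy as np

def lasso_min_lag(x, y, max_lag=5, lam=0.1, thresh=0.12, n_iter=200):
    """Regress y_t on x_{t-1},...,x_{t-max_lag} with an L1 penalty and
    return the smallest lag whose coefficient exceeds `thresh`,
    or None if no lag survives."""
    T = len(y)
    # column for lag w holds x_{t-w}, aligned with y_t for t >= max_lag
    D = np.column_stack([x[max_lag - w:T - w] for w in range(1, max_lag + 1)])
    yy = y[max_lag:]
    D = (D - D.mean(0)) / D.std(0)
    yy = yy - yy.mean()
    n, p = D.shape
    beta = np.zeros(p)
    col_sq = (D ** 2).sum(0)
    for _ in range(n_iter):               # cyclic coordinate descent
        for j in range(p):
            r = yy - D @ beta + D[:, j] * beta[j]   # partial residual
            rho = D[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam * n, 0.0) / col_sq[j]
    hits = [w + 1 for w in range(p) if abs(beta[w]) > thresh]
    return min(hits) if hits else None
```

For a series that affects the target only at lag 2, the detector returns 2 and zeroes out the remaining lags.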
Lemma 2.
Using the condition in Lemma 1 via lasso regression and the two conditions in Theorems 1 and 2, we build an algorithm to identify direct and indirect causes in time series. The input is a 2D array (candidate time series) and a vector (target), and the output is a set with the indices of the time series that were identified as causes. Python code is provided in the supplementary. The complexity of our algorithm is for candidate time series, assuming constant execution time for the conditional independence test.
3 Experiments
3.1 Simulated experiments: time series construction
To test our method, we build simulated time series with various hidden variables, always respecting the aforementioned assumptions. We sampled 100 random graphs for the following tuples of hyperparameters: (# samples, # hidden variables, # observed variables, density of edges between candidate time series, density of edges between time series and target series, noise variance). We then calculate the false positive (FPR) and false negative rates (FNR) over these 100 random graphs. The possible values for each hyperparameter in the tuple are the following: # samples , # hidden variables , # observed variables , Bernoulli() existence of an edge between candidate time series, Bernoulli() existence of an edge between a candidate time series and the target series, and noise variance. During the construction of the time series, every time step is calculated as the weighted sum of the previous step of all the incoming time series, including the previous step of the current time series. The weights of the adjacency matrix between the time series are selected from a uniform distribution in the range if they have not been set to zero (we thus prevent too-weak edges, which would result in almost-unfaithful distributions that render the problem of detecting causes impossible). The two conditional independence tests are calculated with partial correlation, since our simulations are linear, but there is no restriction for nonlinear systems (see extension in 5.2). For the “lag” calculation step of our method, we use lasso in a bivariate form between each node in the summary graph and (for the nonlinear case, lasso could be replaced with a nonlinear regressor). We performed an exploratory search across different values of the regularization parameter and the threshold on the coefficients of this step. We found that for regularization and almost any threshold in the region 0.1 to 0.15 for the returned coefficients of lasso, the results are stable. So we fixed these two parameters once before running the experiments, without readjusting them for the different types of graphs.
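The construction just described can be sketched as follows. The edge probability, weight range, and noise level below are illustrative stand-ins for the exact values used in the paper, and the spectral rescaling is our addition to keep the sampled system stationary:

```python
import numpy as np

def simulate_var(n_series, T, edge_prob=0.3, noise_std=0.2, seed=0):
    """Each series at time t is a weighted sum of the previous step of
    its incoming series (including its own past) plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    A = (rng.random((n_series, n_series)) < edge_prob).astype(float)
    np.fill_diagonal(A, 1.0)          # each series depends on its own past
    # draw nonzero weights away from zero to avoid near-unfaithful systems
    W = A * rng.uniform(0.3, 0.6, A.shape) * rng.choice([-1.0, 1.0], A.shape)
    rho = np.abs(np.linalg.eigvals(W)).max()
    if rho >= 1:                       # rescale for a stationary process
        W *= 0.95 / rho
    X = np.zeros((T, n_series))
    for t in range(1, T):
        X[t] = W @ X[t - 1] + noise_std * rng.standard_normal(n_series)
    return X, W
```

A hidden-confounder scenario is then obtained by simply dropping one or more columns of the returned array before running the method.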
For all the above experiments we simulated the time series with a unique direct lag of 1. Although our theory is complete against false negatives only for single-lag dependencies, we wanted to test the performance of our method even in the presence of multiple lags. Therefore we examined the performance for 4 and 5 observed, 1 additional hidden, and 1 target time series, for 2, 3 and 4 coexisting direct lag effects. We decide on the existence of each lag by sampling from a Bernoulli distribution with . We now compare our algorithm to Lasso-Granger (Arnold et al., 2007) for 2 hidden and 3, 4 and 5 observed time series. Our algorithm operates with two thresholds for the values of the two tests: one (threshold1) for rejecting independence in the first condition, and a second (threshold2) for accepting dependence in the second condition. Lasso-Granger (Arnold et al., 2007) operates with one hyperparameter: the regularization parameter . To ensure a fair comparison, we tuned the parameter for Lasso-Granger (not our method) so as to allow it at least the same FNR as our method, for the same type of graphs. We did not base the comparison on matching FPR because Lasso-Granger generates many false positives in the presence of hidden confounders, and this would not change the ordering of the ROC curves (note that we optimize Lasso-Granger for each FNR). For all the aforementioned experiments apart from the comparison of the two methods, we used threshold1 and threshold2. We produced ROC curves for the two methods as follows: for Lasso-Granger, we varied the parameter across . For our method (SyPI), we varied only threshold1 and threshold2, keeping their ratio equal to , using values in . Note that in our simulations we do not constrain our graphs to be acyclic in the summary graph.
Finally, for the last part of our simulated experiments, we compare our method against seqICP (Pfister et al., 2019) and against PCMCI (Runge et al., 2019b). We performed ten experiments with 20 random graphs each, with 2 to 6 observed and 1 to 2 hidden series, for sample size 2000 and medium density. For our method we kept the same thresholds as defined above: threshold1 and threshold2.
3.2 Experiments on real data
Finally, we examine the performance of our method on real data, where we have no guarantee that our assumptions hold. We use the official recorded prices of dairy products in Europe (EU, ) (data provided in the suppl. as well). The target of our analysis is ’Butter’. According to the manufacturing process of dairy products as described in Soliman & Mashhour (2011), we know that the primary raw material for butter manufacturing is ’Raw Milk’, and also that butter is not used as an ingredient for the other dairy products in the list. Therefore, we can hypothesize that the direct cause of butter prices is the price of raw milk, and that the other nodes in the graph (other cheese, WMP, SMP, Whey Powder) do not cause butter’s price. We examine three countries, two of which provide data for ’Raw Milk’ (Germany ’DE’ (8 time series) and Ireland ’IE’ (6 time series)) and one where these values are not provided (United Kingdom ’UK’ (4 time series)). This last dataset was selected on purpose, as it constitutes a realistic scenario with a hidden confounder; in this case our method must not identify any cause. As we have extremely low sample sizes (<180), identifying dependencies is particularly hard. For that reason we set a 0 threshold on our lag detector and threshold1 at for accepting dependence in the first condition. We leave everything else unchanged from the simulation experiments.
4 Results
4.1 FPR and FNR for various densities and graph size
First, we wanted to examine the performance of our method for various densities of edges among the candidate series, and between the candidates and the target time series (see 3.1). In fig. 2 we present results for a medium noise level (20%) and for sample sizes 500, 1000, 2000 and 3000. The FPR are very small () for sample size , and the results are similar for larger or smaller noise levels (see suppl.). Here, we present results for 1, 4 and 8 observed time series, 1 additional hidden, and 1 target, to show how the graph size affects the rates.
In red in each cell we present the percentage of the FNR that corresponds to the direct causes that were missed, since our method is complete for direct causes only. Since our claims refer to complete conditions for unconfounded direct causes, we also count the confounded direct causes as false positives. Overall, we see that our algorithm performs with almost zero FPR independent of the noise, the density, or the size of the graphs. FNRs are low for the direct causes, starting from 16% for small, sparse graphs and not exceeding 45% for very large, dense graphs.
The results for different number of observed time series and noises are presented in the supplementary 7.5.
4.2 FPR and FNR for varying # of hidden nodes
In fig. 3 we present the behaviour of our algorithm on moderately dense graphs, for sample size 2000, 20% noise variance, and different numbers of hidden variables. We can see that the FPR is close to zero, independent of the number of hidden variables. Although the total FNR increases with the number of time series, the percentage that corresponds to direct causes ranges only from 30 to 40%. Results are similar for different densities of edges (see suppl. 7.5.2).
4.3 FPR and FNR for “multiplelag dependencies”
For Bernoulli probabilities of existence of an edge between the time series () and between the time series and the target (), we calculate FPR and FNR for different numbers of lags that can exist between the time series. We fix the number of samples at 2000 and the noise variance at 20%, for 2, 3 and 4 lags. We examine the above combination for moderately large graphs with 1 hidden, 1 target, and 5 observed time series. As depicted in fig. 4, our method performs very well in terms of FPR, independent of the number of coexisting lags between the time series. As our method is complete only for single-lag dependencies, we do not expect very low FNR. However, we see that the FNR that refers to direct causes only does not exceed 45%.
4.4 Comparison against Lasso-Granger causality
We compare the performance of our method against the commonly used Lasso-Granger method. We examine the performance of the two methods on relatively dense graphs, with 2 hidden, 1 target, and 3, 4 or 5 observed time series. In fig. 6 we see that even in such confounded graphs our method always performs with almost zero FPR, while Lasso-Granger reaches up to 16%, for similar or even larger FNR.
In fig. 5 we present the ROC curves for the performance of SyPI and Lasso-Granger on the same graphs. Since our method operates with two conditions and two p-values, we did not manage to find logical pairs of thresholds that further increase the FPR. We see that at all operating points our method outperforms Lasso-Granger.
4.5 Comparison against seqICP and PCMCI
As can be seen in figure 7, our method SyPI outperforms both methods for all types of full time graphs, yielding FPR and FNR between and . SeqICP yielded up to FPR and around FNR for almost all the graphs. This result is not surprising, as with hidden confounders seqICP will detect only a subset of the ancestors AN(Y), and in addition it assumes that interventions exist in the input dataset. PCMCI yielded up to FPR and oscillated around FNR. In terms of running time, SyPI was the fastest, followed by PCMCI; seqICP was rather slow for more than 5 time series.
4.6 Experiments on real data: Dairy product prices
We applied our algorithm to the dairy-product price datasets for ’DE’ (8 time series), ’IE’ (6 time series) and ’UK’ (6 time series). The sample sizes are just 178, 113 and 168, respectively. The data are depicted in figs. 9a, 9b, 9c in the suppl. Our algorithm successfully identifies ’Raw Milk’ as the direct cause of ’Butter’ in the ’IE’ dataset, correctly rejecting all the other 4 nodes ( TPR, TNR). In the ’DE’ dataset ’Raw Milk’ is correctly identified and there is only one false positive (’Edam’); all the other 6 nodes are successfully rejected ( TPR and TNR). Even more importantly, in the ’UK’ dataset, where there are no measurements of ’Raw Milk’ prices (hidden confounder), our algorithm does not identify any cause ( TNR).
5 Discussion
5.1 Efficient conditioning set in terms of minimum asymptotic variance
In contrast to other approaches (like PC and FCI), our method does not search over a large set of possible combinations to identify the right conditioning sets. Instead, for each potential cause it directly constructs its ‘separating set’ for the nodes and (condition 2), from a preprocessing step that identifies () the nodes of the time series that enter the previous node of . The resulting conditioning set therefore contains covariates that enter the outcome node , and not the potential cause . Adjustment sets that include parents of the potential-cause node are considered inefficient in terms of the asymptotic variance of the total effect, as the parents of that node can be strongly correlated with it (Henckel et al., 2019). On the other hand, adding nodes that explain variance in the outcome node (precision variables) can be beneficial. According to Theorem 3.1 of Henckel et al. (2019), our conditioning set has a smaller asymptotic variance compared to a set that would include incoming nodes to or . Therefore, our choice of conditioning set also contributes to a reasonable signal-to-noise ratio for the dependences under consideration. This could strengthen the statistical outcome of the conditional independence test.
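This efficiency argument can be illustrated with a small Monte-Carlo sketch on a toy (non-time-series) model: adjusting for a precision variable that enters the outcome yields a smaller standard error for the estimated effect than adjusting for a parent of the potential cause. The model and coefficients below are illustrative, not taken from the paper:

```python
import numpy as np

def effect_se(adjust, n=500, reps=300, seed=0):
    """Monte-Carlo standard error of the OLS estimate of the x -> y
    effect under different adjustment sets. 'c' is a parent of the
    potential cause x; 'p' is a precision variable entering y."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        c = rng.standard_normal(n)
        p = rng.standard_normal(n)
        x = 0.9 * c + 0.4 * rng.standard_normal(n)
        y = 1.0 * x + 0.9 * p + 0.4 * rng.standard_normal(n)
        sel = {'none': [x], 'c': [x, c], 'p': [x, p]}[adjust]
        D = np.column_stack(sel + [np.ones(n)])
        beta = np.linalg.lstsq(D, y, rcond=None)[0]
        estimates.append(beta[0])       # coefficient on x
    return float(np.std(estimates))
```

Adjusting for the outcome-side precision variable shrinks the estimator's variance, while adjusting for the parent of the cause inflates it, in line with Henckel et al. (2019).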
5.2 Linear and nonlinear systems
In theory and in practice, our proposed algorithm can be used for both linear and nonlinear relationships between the time series. For the linear case, a partial correlation test is sufficient to examine the conditional dependencies, while in the nonlinear case KCI (Zhang et al., 2012), KCIPT (Doran et al., 2014) or FCIT (Chalupka et al., 2018) could be used.
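For intuition, a crude residual-based check in the spirit of such nonlinear tests (not KCI itself, which is more principled) can be sketched as follows: regress both variables nonlinearly on the conditioning variable and correlate the residuals. Bandwidth and regularization choices are illustrative:

```python
import numpy as np

def krr_residuals(z, target, lam=1e-3):
    """Residuals of `target` after RBF kernel ridge regression on z."""
    z = z.reshape(-1, 1)
    d2 = (z - z.T) ** 2
    med = np.median(d2[d2 > 0])        # median heuristic bandwidth
    K = np.exp(-d2 / med)
    n = len(z)
    fit = K @ np.linalg.solve(K + lam * n * np.eye(n), target)
    return target - fit

def nonlinear_ci(a, b, z):
    """Crude score for a ⊥ b | z: absolute correlation of the nonlinear
    regression residuals (values near zero suggest independence)."""
    ra, rb = krr_residuals(z, a), krr_residuals(z, b)
    return abs(np.corrcoef(ra, rb)[0, 1])
```

On a quadratic common-cause example (two variables both driven by the square of a third), the two variables are strongly correlated marginally but the residual score drops close to zero once the common cause is regressed out.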
5.3 Multiplelag effects
Although our algorithm performs equally well in terms of FPR in simulations with “multiple-lag dependencies”, our theory is necessary only for “single-lag dependencies” (see 2). We could allow for “multiple-lag dependencies” if we were willing to condition on larger sets of nodes, which we do not find acceptable for statistical reasons. Currently we require at most one node from each observed time series for the conditioning set. In a naive approach, coexisting time lags would require nodes from each time series in the conditioning set, and the theory becomes cumbersome. As future work (necessary conditions for multiple lags are out of the scope of this paper), we show in suppl. 7.6 that in the multi-lag bivariate case alone (one candidate, one target), where hidden confounders are memoryless, it is still possible to have sufficient and necessary conditions, subject to some extensions of the theory.
5.4 Comparison with related work
Pfister et al. (2019) have a goal similar to ours, that of causal feature selection; however, their method seqICP relies on a quasi-interventional scenario with different background conditions. They require sufficient interventions in the dataset, which should affect only the input and not the target. SeqICP predicts the ancestors of the target Y (AN(Y)). In the presence of hidden confounders, the authors report that under further assumptions they will detect a subset of AN(Y), only if the dataset contains sufficient interventions on the predictors. Given our assumptions, we prove that our method SyPI will detect all the parents of Y (not only a subset), even in the presence of latent confounders, without requiring interventions in the dataset. Our method’s complexity is , while seqICP is for sparse graphs. Our main concern with this method is the hard requirement for interventions in the dataset, something that would require randomized controlled trials (RCTs); where these are possible, they of course render any causal inference method less relevant, as they can directly lead to causal conclusions. Our method aims at causal inference based solely on observations. Runge et al. (2019b) proposed PCMCI, which focuses on the problem of full-graph causal discovery. Obviously our aim of causal feature selection is narrower, but still non-trivial. Nevertheless, their algorithm is easily adaptable to detect causes of a given target. Furthermore, Runge et al. (2019b) assume causal sufficiency for this method, which we find particularly hard to meet in real datasets. As shown in fig. 7, our method outperforms both seqICP and PCMCI. Malinsky & Spirtes (2018) proposed SVAR-FCI, which focuses on full-graph discovery in time series data. As we mention above, their aim is broader, while we focus on the detection of causes of a given target. SVAR-FCI returns PAGs, which is justified given the hard task and the limited assumptions the authors make to restrict the graph.
On the other hand, our method returns the exact edges from causes to target. SVAR-FCI is computationally intensive, performing exhaustive conditional independence tests over all lags and conditioning sets. SyPI instead calculates both the lag and the conditioning set in advance for each conditional independence test, significantly reducing the number of tests. Moreover, SVAR-FCI, as its authors mention, is not complete even for PAGs, whereas SyPI is proven complete against false negatives under our assumptions. Finally, as we describe in detail in Section 5.1, our conditioning set is efficient in terms of minimum asymptotic variance; this does not hold for SVAR-FCI.
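At the core of all the methods compared here is a lag-specific conditional independence test. The following minimal sketch (our own illustration, not code from any of the cited papers; the linear-Gaussian partial-correlation test stands in for whatever CI test is used in practice, and the function name is hypothetical) shows such a test with a conditioning set Z that is fixed in advance, rather than searched over exhaustively:

```python
import numpy as np
from scipy import stats

def partial_corr_ci_test(x, y, Z, alpha=0.01):
    """Test X _||_ Y | Z via partial correlation (linear-Gaussian assumption).

    Z is an (n, k) array of conditioning variables (e.g., precomputed lagged
    values); k = 0 reduces to an ordinary correlation test.
    Returns (independent?, p-value).
    """
    n = len(x)
    if Z.shape[1] > 0:
        # Residualize x and y on the conditioning set via least squares.
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    # Fisher z-transform to obtain an approximate p-value.
    z_stat = np.sqrt(max(n - Z.shape[1] - 3, 1)) * np.arctanh(r)
    p = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    return p > alpha, p
```

With the conditioning set and lag known beforehand, one such test per candidate suffices, which is what removes the combinatorial search over conditioning sets that makes SVAR-FCI expensive.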
5.5 Technical assumptions of SyPI
Although our technical assumptions are numerous, we consider them neither extreme, given the hardness of the hidden-confounding problem, nor hard to meet. Entner & Hoyer (2010) and Malinsky & Spirtes (2018) do not need these assumptions, as they exhaustively perform conditional independence tests over all lags and time series. Parts of Assumption 2 ensure that the relevant time series carry a dependency from their previous time step. Assumption 2 also tackles the serious problem that auto-lag hidden confounders create. This problem is also stressed by Malinsky & Spirtes (2018): “auto-lag confounders can be particularly problematic since they induce infinite-lag associations”; by not making assumptions about it, they are limited to finding only PAGs.
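To make the auto-lag confounding issue concrete, here is a small simulation (our own illustration, not an experiment from the paper): a hidden AR(1) process Z drives both observed series, so X and Y exhibit cross-correlations at long lags even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 20000
a = 0.9                        # autoregressive coefficient of the hidden confounder
z = np.zeros(T)
for t in range(1, T):
    z[t] = a * z[t - 1] + rng.normal()

# Both observed series are driven only by the hidden confounder z.
x = z + rng.normal(size=T)
y = z + rng.normal(size=T)

# Because z has memory, the spurious association decays only geometrically
# in the lag, producing the "infinite-lag associations" quoted above.
for k in (1, 5, 10):
    r = np.corrcoef(x[:-k], y[k:])[0, 1]
    print(f"corr(X_t, Y_t+{k}) = {r:.2f}")
```

Without an assumption handling this case, every lagged pair remains dependent, which is precisely why methods that stay agnostic about auto-lag confounders can only narrow the graph down to a PAG.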
6 Conclusion
Causal feature selection in time series is a fundamental problem in several fields, ranging from biology and economics to climate research. In these fields, the causes of a time series of interest (e.g., revenue, temperature) often need to be identified from a pool of candidate time series, while latent variables cannot be excluded. Here we target this problem by constructing conditions that we prove to be necessary for direct causes in single-lag dependency graphs, even in the presence of latent variables, and sufficient for direct and indirect causes in multi-lag dependency graphs. To the best of our knowledge, our method SyPI is the first sound and complete algorithm (subject to appropriate graphical assumptions) for direct causal feature selection in time series that does not assume causal sufficiency, requiring only an acyclic summary graph, thus overcoming the shortcomings of Granger causality. Our simulation results show that, for a range of graph sizes and densities, SyPI outperforms Lasso-Granger. The FPR is essentially zero for sufficiently large sample sizes, and the FNR for direct causes stays low even for large, dense graphs. Moreover, in experiments on real data, where most of our assumptions could potentially be violated and where the sample sizes are particularly small for the graph size, SyPI performed with almost perfect true positive and true negative rates.
References
 Arnold et al. (2007) Arnold, A., Liu, Y., and Abe, N. Temporal causal modeling with graphical Granger methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 66–75, 2007.
 Chalupka et al. (2018) Chalupka, K., Perona, P., and Eberhardt, F. Fast conditional independence test for vector variables with large sample sizes. arXiv:1804.02747, 2018.

 Doran et al. (2014) Doran, G., Muandet, K., Zhang, K., and Schölkopf, B. A permutation-based kernel conditional independence test. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 132–141, 2014.
 Eichler (2007) Eichler, M. Causal inference from time series: What can be learned from Granger causality. In Proceedings of the 13th International Congress of Logic, Methodology and Philosophy of Science, pp. 1–12. King’s College Publications London, 2007.
 Entner & Hoyer (2010) Entner, D. and Hoyer, P. O. On causal discovery from time series data using FCI. Probabilistic graphical models, pp. 121–128, 2010.
 EU. European Union prices of dairy products. https://ec.europa.eu/info/foodfarmingfisheries/farming/factsandfigures/markets/prices/pricemonitoringsector/.
 Granger (1969) Granger, C. W. J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37:424–438, 1969.
 Granger (1980) Granger, C. W. J. Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control, 2:329–352, 1980.
 Guo et al. (2008) Guo, S., Seth, A. K., Kendrick, K. M., Zhou, C., and Feng, J. Partial Granger causality: eliminating exogenous inputs and latent variables. Journal of Neuroscience Methods, 172(1):79–93, 2008.
 Henckel et al. (2019) Henckel, L., Perković, E., and Maathuis, M. H. Graphical criteria for efficient total effect estimation via adjustment in causal linear models. arXiv, 2019.
 Hung et al. (2014) Hung, Y.-C., Tseng, N.-F., and Balakrishnan, N. Trimmed Granger causality between two groups of time series. Electron. J. Statist., 8(2):1940–1972, 2014.

 Malinsky & Spirtes (2018) Malinsky, D. and Spirtes, P. Causal structure learning from multivariate time series in settings with unmeasured confounding. In Proceedings of 2018 ACM SIGKDD Workshop on Causal Discovery, volume 92 of Proceedings of Machine Learning Research, pp. 23–47, 2018.
 Mastakouri et al. (2019) Mastakouri, A., Schölkopf, B., and Janzing, D. Selecting causal brain features with a single conditional independence test per feature. In Advances in Neural Information Processing Systems 32, 2019.
 Pearl (2009) Pearl, J. Causality. Cambridge University Press, 2nd edition, 2009.
 Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.
 Pfister et al. (2019) Pfister, N., Bühlmann, P., and Peters, J. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264–1276, 2019.
 Runge (2018) Runge, J. Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7):075310, 2018.
 Runge et al. (2019a) Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Muñoz-Marí, J., van Nes, E. H., Peters, J., Quax, R., Reichstein, M., Scheffer, M., Schölkopf, B., Spirtes, P., Sugihara, G., Sun, J., Zhang, K., and Zscheischler, J. Inferring causation from time series in earth system sciences. Nature Communications, 10:2553, 2019.
 Runge et al. (2019b) Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., and Sejdinovic, D. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019b.
 Soliman & Mashhour (2011) Soliman, I. and Mashhour, A. Dairy marketing system performance in Egypt. January 2011.
 Spirtes et al. (1993) Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. 1993.
 Wiener (1956) Wiener, N. The theory of prediction. In Modern Mathematics for the Engineer, volume 8. 1956.
 Zhang et al. (2012) Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernelbased conditional independence test and application in causal discovery. UAI, 2012.