Necessary and sufficient conditions for causal feature selection in time series with latent common causes

05/18/2020
by   Atalanti A. Mastakouri, et al.

We study the identification of direct and indirect causes on time series and provide necessary and sufficient conditions in the presence of latent variables. Our theoretical results and estimation algorithms require two conditional independence tests for each observed candidate time series to determine whether or not it is a cause of an observed target time series. We provide experimental results in simulations, where the ground truth is known, as well as in real data. Our results show that our method leads to essentially no false positives and relatively low false negative rates, even in confounded environments with non-unique lag effects, outperforming the common method of Granger causality.

1 Introduction

Causal inference from time series is a fundamental problem in data science, with applications ranging from economics and machine monitoring to biology and climate research. It is also a problem that has not yet found a general resolution.

While Granger causality (Wiener, 1956; Granger, 1969, 1980) (see definition in suppl. 7.1) has been the standard approach to causal analysis of time series data for half a century, several issues caused by violations of its assumptions (causal sufficiency, no instantaneous effects) have been described in the literature; see, e.g., Peters et al. (2017) and references therein. Several approaches addressing these problems have been proposed during the last decades (Hung et al., 2014; Guo et al., 2008). Nevertheless, it is fair to say that causal inference in time series is still challenging, despite the fact that the time order of variables renders it an easier problem than the typical ‘causal discovery problem’ of inferring the causal DAG among variables without any prior knowledge of causal directions (Pearl, 2009; Spirtes et al., 1993). The discovery of the causal graph from data is largely based on the graphical criterion of d-separation, which formalizes the set of conditional independences to be expected based on the causal Markov condition and causal faithfulness (Spirtes et al., 1993) (see definition in suppl. 7.1). One can show that Granger causality can be derived from d-separation (see, e.g., Theorem 10.7 in Peters et al., 2017). Several authors showed how to derive d-separation based causal conclusions in time series beyond Granger’s work. Entner & Hoyer (2010) and Malinsky & Spirtes (2018), for instance, are inspired by the FCI algorithm (Spirtes et al., 1993) and the work of Eichler (2007), and perform causal discovery of the full graph in time series without assuming causal sufficiency (for an extended review see Runge, 2018; Runge et al., 2019a). However, due to the hardness of their goal and the few assumptions they impose, these methods are limited to reporting only Partial Ancestral Graphs (PAGs).
Even for PAGs, the authors report that their implementation is not complete (Malinsky & Spirtes, 2018), meaning it will not identify all of them. In Runge et al. (2019b) the PCMCI method is proposed as an extension of PC and, although lower rates of false positives are reported compared to classical Granger causality (suppl. 7.1), the method still relies on the assumption of causal sufficiency. All the aforementioned methods focus on discovery of the full causal graph. A method that focuses on the narrower problem that we tackle here, causal feature selection for a given target, is seqICP (Pfister et al., 2019). Although its goal is the same as that of our method, seqICP relies on a quasi-interventional scenario with different background conditions, while our method is based on observational data. We give an extensive comparison of the methods in subsection 5.4.

In the present work, we study the problem of causal feature selection in time series. By this we mean the detection of direct and indirect causes of a given target. We construct conditions which, subject to appropriate connectivity assumptions, we prove to be sufficient for direct and indirect causes and necessary for the identification of direct causal time series of a target, even in the presence of latent variables. In contrast to approaches inspired by conditional independence based algorithms from causal discovery (like PC and FCI (Spirtes et al., 1993)), our method directly constructs the right conditioning sets of variables, without searching over a large set of possible combinations. It does so with a pre-processing step that identifies the nodes of the time series that enter the previous time step of the target node, thus avoiding statistical issues of multiple hypothesis testing.

We provide experimental results with simulated data, examining scenarios with various numbers of observed and hidden time series, densities of edges, noise levels, multiple time lags and sample sizes. We show that our method leads to essentially no false positives and relatively low false negative rates, even in confounded environments, thus outperforming Granger causality, as well as seqICP and PCMCI. We also obtain meaningful results on real data. We refer to our method as SyPI, as it performs a Systematic Path Isolation approach for causal feature selection in time series.

2 Theory and Methods

We are given observations from a target time series whose causes we wish to find, and observations from a multivariate time series of potential causes (candidate time series). Moreover, we allow an unobserved multivariate time series , which may act as common cause of the observed ones. The system consisting of and is not assumed to be causally sufficient, hence we allow for unobserved variables . We introduce the following terminology to describe the causal relations between :

Terminology-Notation:

  1. “full time graph”: the infinite DAG having and as nodes.

  2. “summary graph” is the directed graph with nodes containing an arrow from to for whenever there is an arrow from to for . (Peters et al., 2017)

  3. ” for means a directed path that does not include any intermediate observed nodes in the full time graph (confounded or unconfounded).

  4. ” for in the full time graph means a directed path from to .

  5. “confounding path”: A confounding path between and in the full time graph is a path of the form , consisting of two directed paths and a common cause of and .

  6. “confounded path”: an arbitrary path between two nodes and in the full time graph which co-exists with a confounding path between and .

  7. “sg-unconfounded” (summary-graph-unconfounded) causal path: A causal path in the full time graph that does not appear as a confounded path in the summary graph

  8. “pb-unconfounded” (past-blocked-unconfounded) causal path: A causal path between two nodes and in the full time graph for which all confounding paths are blocked by or .

  9. “lag”:

    is a lag for the ordered pair of a time series

    and the target if there exists a collider-free path - - - that does not contain a link of this form with arbitrary, for any and any node in this path does not belong to . See explanatory figure 8 in supplementary.

  10. “single-lag dependencies”: We say that a set of time series () have “single-lag dependencies” if all the have only one lag for each pair . Otherwise we refer to “multiple-lag dependencies”.

Having introduced the necessary terminology, we assume that the graph satisfies the following assumptions. Note that the first four are standard assumptions of time series analysis and causal discovery, while the remaining ones impose restrictions on the connectivity of the graph.

Assumptions:

  1. Causal Markov condition in the full time graph (def. 1.2.2 in Pearl, 2009; see definition in suppl. 7.1)

  2. Causal Faithfulness in the full time graph (see definition in suppl. 7.1)

  3. No backward arrows in time

  4. Stationary full time graph: the full time graph is invariant under a joint time shift of all variables

  5. The full time graph is acyclic.

  6. The target time series is a sink node in the summary graph; it does not affect any other variables in the graph.

  7. There is an arrow . Note that arrows need not exist; we then call memoryless.

  8. There are no arrows for .

  9. Every variable that affects directly (no intermediate observed nodes in the path in the summary graph) should be memoryless (), and should have single-lag dependencies with in the full time graph (see the definition of single-lag dependencies above). (Note that this assumption is required only for the completeness of the algorithm against direct false negatives (Theorem 2). For the sufficient conditions alone, its violation does not spoil Theorem 1. The existence of a latent variable with memory affecting the target time series directly, or of a latent variable directly affecting the target with multiple lags, renders impossible the existence of a conditioning set that could d-separate the future of the target variable and the past of any other observed variable.)

Below, we present two theorems for the detection of causes in the full time graph. Theorem 1 provides sufficient conditions for identifying direct and indirect causes. Theorem 2 provides necessary conditions for identifying all the direct causes of a target time series.

Intuition for proposed conditions in Theorems 1 and 2:

The idea is to isolate the path - - in the full time graph, and extract triplets as in (Mastakouri et al., 2019). This way we can exploit the fact that if there is a confounding path between and , then will be a collider that will unblock the path between and when we condition on it. In this path “- -” means or and (if observed) in addition to any other intermediate variable in the path - - must . Mastakouri et al. (2019) proposed sufficient conditions for causal feature selection in a DAG (no time-series) where a cause of a potential cause was known or could be assumed due to time-ordered pair of variables.

Here the goal is to propose necessary and sufficient conditions that will differentiate between being a common cause or - - being a (in)direct edge to in the full time graph.

Figure 1: Visualization of a simple full time graph of two observed, one potentially hidden and one target time series. The summary graph is presented to emphasize that the notions of “pb-confounded” and “sg-confounded” are different and to point out the challenge of identifying sg-unconfounded causal relations in time series, where the past of each time series introduces dependencies that are not obvious in the summary graph.

Figure 1 visualizes why time series pose an additional challenge for identifying sg-unconfounded causal relations. While the influence of on is unconfounded in the summary graph, the influence is confounded in the full time graph due to its own past; for example and are confounded by . Therefore we need to condition on to remove past dependencies. If no other time series were present, that would be sufficient. However, in the presence of other time series affecting the target , becomes a collider that unblocks dependencies. If for example we want to examine as a candidate cause, we first need to condition on , the past of the . Next, we need to condition on one node from each time series that enters (which is a collider) to avoid all the dependencies that might be created by conditioning on it. It is enough to condition only on these nodes for the following reason: If a node has a lag-dependency with , then there is an (un)directed path from to . If this path is a confounding one, then conditioning on is not necessary, but also not harmful, because the future of this time series in the full graph is still independent of . This independence is forced by the fact that the is a collider because of the stationarity of graphs and this collider is by construction not in the conditioning set. If is connected with via a directed link (as in fig. 1), then conditioning on is necessary to block the parallel path created by its future values . Based on this idea of isolating the path of interest, we build the conditioning set as described in Theorem 1 and its converse Theorem 2, where we prove the necessity and sufficiency of our conditions.
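The path-isolation idea can be checked on a toy graph. The sketch below is an illustration only, not part of the method: it implements the standard d-separation test via the moralized ancestral graph and applies it to a five-step window of two hypothetical full time graphs, one in which a series X causes the target Y with lag 1, and one in which a memoryless hidden series U drives both. The node names (X3, Y4, U2), the window length and the helper names are assumptions made for this example; with a single candidate there are no other series to add to the conditioning set.

```python
from itertools import combinations


def d_separated(edges, x, y, given):
    """True if x and y are d-separated by `given` in the DAG `edges` (parent, child)."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
        parents.setdefault(p, set())
    # 1. Keep only x, y, the conditioning nodes and all their ancestors.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: connect each node to its parents and marry co-parents.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(ps, 2):
            adj[a].add(b)
            adj[b].add(a)
    # 3. Remove the conditioning nodes and test whether y is reachable from x.
    blocked, seen, stack = set(given), {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for nb in adj[node] - blocked - seen:
            seen.add(nb)
            stack.append(nb)
    return True


def window_graph(steps, confounded):
    """Hypothetical full time graph on a finite window (illustrative only)."""
    edges = []
    for s in range(steps - 1):
        edges.append((f"X{s}", f"X{s + 1}"))      # memory of the candidate
        edges.append((f"Y{s}", f"Y{s + 1}"))      # memory of the target
        if confounded:
            edges.append((f"U{s}", f"X{s + 1}"))  # hidden, memoryless common cause
            if s + 2 < steps:
                edges.append((f"U{s}", f"Y{s + 2}"))
        else:
            edges.append((f"X{s}", f"Y{s + 1}"))  # direct lag-1 cause
    return edges


for label, confounded in (("direct", False), ("confounded", True)):
    g = window_graph(5, confounded)
    dependent = not d_separated(g, "X3", "Y4", ["Y3"])           # first condition
    independent = d_separated(g, "X2", "Y4", ["X3", "Y3"])       # second condition
    print(label, dependent, independent)
```

In the direct graph both checks succeed. In the confounded graph the first check still finds a dependence, but the second fails, because conditioning on X3 opens the collider X2 → X3 ← U2 and connects X2 to Y4 through U2.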

Theorem 1.

[Sufficient conditions for identifying a direct or indirect cause of ] Assuming 2-2, let be the minimum lag (see 2) between and . Further, let . Then, for every time series we define a conditioning set .

If

(1)

and

(2)

are true, then

and the path between the two nodes is pb-unconfounded.

We can think of as the set that contains only one node from each time series and this node is the one that enters the node due to a directed or confounded path (if exists then the node is the one at ).

Proof.

(Proof by contradiction)
We need to show that if or if the path is pb-confounded then at least one of the conditions 1 and 2 is violated.

First assume that there is no directed path between and : . Then, there is a confounding path without any colliders. (Colliders cannot exist in the path by the definition of the lag 2.) In that case we will show that either condition 1 or 2 is violated. If all the existing confounding paths contain an observed confounder (there can be only one confounder since in this case there are no colliders in the path), then condition 1 is violated, because we condition on which d-separates and . If in all the existing confounding paths the confounder node but some observed non-collider node is in the path and this node belongs to , then condition 1 is violated, because we condition on which d-separates and . If there is at least one confounding path and its confounder node does not belong to and no other observed (non-collider or descendant of collider) node which is in the path belongs to then condition 2 is violated for the following reasons: Let’s name . We know the existence of the path , due to assumption 2.

  • If and have in common, then is a collider. Therefore, adding in the conditioning set would unblock the path between and .

  • If and have in common, that means lies on . In this case is not in the path from to and hence adding to the conditioning set could not d-separate and .

In both cases condition 2 is violated.

Now, assume that there is a directed path but it is “pb-confounded” (there also exists a parallel confounding path ). Then, if and have in common, then condition 2 is violated due to 2. If and have in common, then condition 2 is violated due to 2.

In all the above cases we show that if conditions 1 and 2 hold, then is a “pb-unconfounded” direct or indirect cause of . ∎

Remark.

Theorem 1 conditions hold for any lag as defined in 2; not only for the minimum lag. The reason why we refer to the minimum lag in 1 is to have conditions closer to its converse Theorem 2.

Theorem 2.

[Necessary conditions for a direct cause of ] (almost converse of Theorem 1)

Let the assumptions and the definitions of Theorem 1 hold, with the additional assumption that here we consider only “single-lag dependencies” (see 2).

If is a direct, “sg-unconfounded” cause of (), then conditions 1 and 2 of Theorem 1 hold.

Proof.

(Proof by contradiction)
Assume that the direct path exists and it is unconfounded. Then, condition 1 is true. Now assume that condition 2 does not hold. This would mean that the set does not d-separate and . Note that a path is said to be d-separated by a set of nodes in if and only if contains a chain or a fork such that the middle node is in , or if contains a collider such that neither the middle node nor any of its descendants are in the . Hence, a violation of condition 2 would imply that (a) there is some middle node or descendant of a collider in and no non-collider node in this path belongs to this set, or (b) that there is a collider-free path between and that does not contain any node in .

  • There is some middle node or descendant of a collider in and no non-collider node in this path belongs to this set:
    (a1:) If there is at least one path - - - - where is a middle node of a collider and none of the non-collider nodes in the path belongs to
    : Such a path could be formed only if in addition to some directly caused . Then - -. (Due to our assumption for single-lag dependencies (see 2) a path of the form - - could not exist). Then, due to stationarity of graphs the node will enter . If this is hidden (), then due to assumption 2 this time series will be memoryless (). Therefore, the collider in the conditioning set will not unblock any path between and that could contain . If is observed () then due to assumption 2 the path will be - -. However, this path is always blocked by due to the rule we use to construct . That means a non-collider node in the conditioning set will necessarily be in the path , which contradicts the original statement.

    (a2:) If there is at least one path - -- - where is a middle node of a collider and none of the non-collider nodes in the path belongs to : This could only mean that there is a confounder between the target and . However this contradicts that is an “sg-unconfounded” direct causal path.

    (a3:) If there is at least one path - -- - where with is a middle node of a collider and no non-collider node in the path belongs to : In this case, because . By construction of all the observed nodes in that enter the node belong in . That means that enters the node . Hence, in the path will necessarily be a non-collider node which belongs to the conditioning set. This contradicts the original statement “and no non-collider node in the path belongs to ”.

    (a4:) If a descendant of a collider in the path - - - - belongs to the conditioning set and no non-collider node in the path belongs to it: Due to the single-lag dependencies assumption, otherwise there are multiple-lag effects from to . That means that, independent of being hidden or not, the in the collider path will enter the node . If then because enter the node , . In the first case only and in the latter case also are a non-collider variable in the path that belongs to the conditioning set, which contradicts the statement of (a4). If the collider , as explained in (a3) at least one non-collider variable in the path will belong to the conditioning set, which contradicts the statement (a4). Finally, if and are hidden, if then the node is necessarily in the path as a pass-through node, which contradicts the statement (a4). If then the single-lag dependencies assumption is violated.

  • There is a collider-free path between and that does not contain any node in :
    Such a path would imply the existence of a hidden confounder between and or the existence of a direct edge from to . The former cannot exist because we know that is an sg-unconfounded direct cause of . The latter would imply that there are multiple lags of direct dependency between and which contradicts our assumption for single-lag dependencies.

Therefore we showed that whenever is an sg-unconfounded causal path, the conditions are necessary. ∎

Since it is unclear how to identify the lag as defined in the terminology above, we introduce the following lemmas for the detection of the minimum lag that we require in the theorems. We provide the proofs of the lemmas in suppl. 7.3.

Lemma 1.

If the paths between and are directed then the minimum lag as defined in 2 coincides with the minimum non-negative integer for which . The only case where is when there is a confounding path between and that contains a node from a third time series with memory. In this case .

Lemma 2.

Theorems 1 and 2 are valid if the minimum lag as defined in 2 is replaced with from lemma 1.

Using the condition in Lemma 1 via lasso regression and the two conditions in Theorems 1 and 2, we build an algorithm to identify direct and indirect causes on time series. The input is a 2D array (candidate time series) and a vector (target), and the output is a set with the indices of the time series that were identified as causes. Python code is provided in the supplementary. The complexity of our algorithm is linear in the number of candidate time series (two conditional independence tests per candidate), assuming constant execution time for the conditional independence test.
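The lag-detection step can be sketched as a bivariate lasso of the target on lagged copies of one candidate series. This is a minimal sketch and not the code from the supplementary: the coordinate-descent solver is written out in plain Python so that it has no dependencies, and the names `lasso_cd` and `min_lag`, the scanned lags 1 to `max_lag`, the regularization `alpha` and the coefficient threshold are all illustrative assumptions.

```python
import random


def lasso_cd(columns, y, alpha, n_iter=200):
    """Minimise (1/2n)*||y - sum_k b_k*col_k||^2 + alpha*sum_k |b_k| by coordinate descent."""
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)
    norms = [sum(v * v for v in col) / n for col in columns]
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            if norms[j] == 0:
                continue
            # Covariance of column j with the residual that excludes its own contribution.
            rho = sum(c * (r + beta[j] * c) for c, r in zip(col, resid)) / n
            new = max(abs(rho) - alpha, 0.0) / norms[j]   # soft-thresholding
            new = new if rho > 0 else -new
            delta = new - beta[j]
            if delta:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta


def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def min_lag(x, y, max_lag=5, alpha=0.05, coef_threshold=0.1):
    """Smallest lag in 1..max_lag whose lasso coefficient exceeds the threshold, else None."""
    n = len(y)
    target = centered(y[max_lag:])
    # Column for lag w holds x at time t - w, aligned with y at time t.
    columns = [centered(x[max_lag - w:n - w]) for w in range(1, max_lag + 1)]
    beta = lasso_cd(columns, target, alpha)
    for w, b in enumerate(beta, start=1):
        if abs(b) > coef_threshold:
            return w
    return None


# Toy check: y follows x with a delay of two steps.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(300)]
y = [0.0, 0.0] + [0.8 * x[t - 2] + rng.gauss(0, 0.1) for t in range(2, 300)]
print(min_lag(x, y))
```

In practice a library solver could replace `lasso_cd`; the hand-written version is used here only to keep the example self-contained.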

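Input: 2D array of candidate time series, target vector, threshold1, threshold2.

Output: causes_of_R

  n ← number of candidate time series; causes_of_R ← empty set
  for i = 1 to n do
      build the conditioning set of Theorem 1 for candidate i (using the lags from Lemma 1)
      pvalue1 ← p-value of the conditional independence test in condition (1)
      if pvalue1 < threshold1 then
          pvalue2 ← p-value of the conditional independence test in condition (2)
          if pvalue2 > threshold2 then
              add i to causes_of_R
          end if
      end if
  end for

Algorithm 1 SyPI Algorithm for Theorems 1 and 2.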
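The loop of Algorithm 1 can be sketched in Python as follows. This is a hedged sketch, not the supplementary code: it assumes the minimum lags are already known, it takes the conditional independence test as a caller-supplied function `ci_pvalue(a, b, cond)`, and it builds the conditioning set according to one reading of the prose above (the previous step of the target, plus the node of every other candidate series that enters that previous step). The function name, argument names and index arithmetic are assumptions.

```python
def sypi_select(X, y, lags, ci_pvalue, threshold1, threshold2):
    """Return indices of candidate series that pass both conditions.

    X: list of candidate series (lists of equal length), y: target series,
    lags[i]: assumed minimum lag between candidate i and the target,
    ci_pvalue(a, b, cond): p-value of a test of a independent of b given cond.
    """
    n = len(y)
    start = max(lags) + 1                 # first target index with all needed past values
    target = y[start:]                    # target at time s
    target_prev = y[start - 1:n - 1]      # target at time s - 1
    causes = []
    for i, w_i in enumerate(lags):
        x_now = X[i][start - w_i:n - w_i]              # candidate at time s - w_i
        x_prev = X[i][start - w_i - 1:n - w_i - 1]     # candidate one step earlier
        # One node per other series: the node entering the previous target step.
        others = [X[j][start - 1 - w_j:n - 1 - w_j]
                  for j, w_j in enumerate(lags) if j != i]
        # First condition: dependence must be detected (small p-value).
        p1 = ci_pvalue(x_now, target, others + [target_prev])
        if p1 < threshold1:
            # Second condition: independence must be accepted (large p-value).
            p2 = ci_pvalue(x_prev, target, others + [x_now, target_prev])
            if p2 > threshold2:
                causes.append(i)
    return causes
```

Passing the test as a function keeps the sketch agnostic about the test itself, so a partial-correlation test can be used for linear data and a non-linear test otherwise.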

3 Experiments

3.1 Simulated experiments: time series construction

To test our method, we build simulated time series with various hidden variables, always respecting the aforementioned assumptions. We sampled 100 random graphs for each of the following tuples of hyperparameters: (# samples, # hidden variables, # observed variables, density of edges between candidate time series, density of edges between candidate time series and target series, noise variance). We then calculate the false positive rate (FPR) and false negative rate (FNR) over these 100 random graphs. The possible values for each hyperparameter in the tuple are the following: # samples , # hidden variables , # observed variables , Bernoulli() existence of edge between candidate time series , Bernoulli() existence of edge between candidate time series and target series and noise variance . During the construction of the time series, every time step is calculated as the weighted sum of the previous step of all the incoming time series, including the previous step of the current time series. The weights of the adjacency matrix between the time series are selected from a uniform distribution in the range if they have not been set to zero (we thus prevent overly weak edges, which would result in nearly unfaithful distributions that render the problem of detecting causes impossible).
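The construction described above can be sketched as a lag-1 linear process. This is a minimal sketch under stated assumptions: additive Gaussian noise at every step, the target stored as the last series with no outgoing edges, and placeholder values for the edge probabilities and the weight range, which are not taken from the experiments. The function names are hypothetical, stability of the process is not checked, and the special treatment of hidden series is left out.

```python
import random


def random_weights(n_series, p_cand, p_target, low, high, rng):
    """w[j][i] is the weight of the edge from series j at t-1 to series i at t.

    The last index is the target; it has no outgoing edges to other series.
    """
    target = n_series - 1
    w = [[0.0] * n_series for _ in range(n_series)]
    for j in range(n_series):
        for i in range(n_series):
            if j == i:
                w[j][i] = rng.uniform(low, high)           # arrow from own previous step
            elif j != target:
                p = p_target if i == target else p_cand    # Bernoulli edge existence
                if rng.random() < p:
                    w[j][i] = rng.uniform(low, high)
    return w


def simulate(n_steps, w, noise_std, seed=0):
    """Each step is the weighted sum of the previous step of all incoming series plus noise."""
    rng = random.Random(seed)
    d = len(w)
    rows = [[rng.gauss(0, noise_std) for _ in range(d)]]
    for _ in range(n_steps - 1):
        prev = rows[-1]
        rows.append([
            sum(w[j][i] * prev[j] for j in range(d)) + rng.gauss(0, noise_std)
            for i in range(d)
        ])
    return rows


rng = random.Random(1)
weights = random_weights(4, 0.5, 0.5, 0.1, 0.3, rng)   # placeholder values
data = simulate(500, weights, noise_std=0.2, seed=3)
print(len(data), len(data[0]))
```

Dropping a column from `data` before running the selection step is one way to treat that series as hidden.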

The two conditional independence tests are calculated with partial correlation, since our simulations are linear, but there is no restriction for non-linear systems (see extension in 5.2). For the “lag” calculation step of our method, we use lasso in a bivariate form between each node in in the summary graph and (for the non-linear case lasso could be replaced with a non-linear regressor). We did some exploratory search across different values for the regularization parameter and the threshold on the coefficients of this step. We found out that for regularization and mostly any threshold in the region 0.1 to 0.15 for the returned coefficients of lasso, the results are stable. So we fixed these two parameters once before running the experiments, without re-adjusting them for the different types of graphs.
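A partial-correlation test of the kind mentioned above can be sketched as follows. The text does not state which test statistic is used, so the Fisher z-transform with a normal approximation is an assumption, as are the function names. The recursive partial-correlation formula is chosen to keep the example dependency-free; it is only practical for small conditioning sets.

```python
import math
import random


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def partial_corr(x, y, cond):
    """Partial correlation of x and y given the series in `cond` (recursive formula)."""
    if not cond:
        return pearson(x, y)
    z, rest = cond[0], cond[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def ci_pvalue(x, y, cond):
    """Two-sided p-value for H0: zero partial correlation (linear Gaussian setting)."""
    n, k = len(x), len(cond)
    r = max(min(partial_corr(x, y, cond), 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - k - 3)      # Fisher z statistic
    return math.erfc(abs(z) / math.sqrt(2))       # normal approximation


# Toy check: x and y share the common cause z and nothing else.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(500)]
x = [v + rng.gauss(0, 1) for v in z]
y = [v + rng.gauss(0, 1) for v in z]
print(ci_pvalue(x, y, []), ci_pvalue(x, y, [z]))
```

For non-linear systems this function would be swapped for a non-linear conditional independence test with the same interface.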

For all the above experiments we simulated the time series with a unique direct lag of 1. Although our theory is complete against false negatives only for single-lag dependencies, we wanted to test the performance of our method even in the presence of multiple lags. Therefore we examined the performance for 4 and 5 observed, 1 additional hidden and 1 target time series, for 2, 3 and 4 co-existing direct lag effects. We decide on the existence of each lag by sampling from a Bernoulli distribution with .

We now compare our algorithm to Lasso-Granger (Arnold et al., 2007) for 2 hidden and 3, 4 and 5 observed time series. Our algorithm operates with two thresholds for the p-values of the two tests, one (threshold1) for rejecting independence in the first condition, and a second (threshold2) for accepting independence in the second condition. Lasso-Granger (Arnold et al., 2007) operates with one hyper-parameter: the regularization parameter . To ensure a fair comparison, we tuned the parameter for Lasso-Granger (not our method) so as to allow it at least the same FNR as our method, for the same type of graphs. We did not do the comparison based on matching FPR because Lasso-Granger generates many FPs in the presence of hidden confounders, and this would not change the ordering of the ROC curves (note that we optimize Lasso-Granger for each FNR). For all the aforementioned experiments apart from the comparison of the two methods, we used threshold1 and threshold2. We produced ROC curves for the two methods as follows: for Lasso-Granger, we varied the parameter across . For our method (SyPI), we varied only threshold1 and threshold2, keeping their ratio equal to , using values in . Note that in our simulations we do not constrain our graphs to be acyclic in the summary graph.

Finally, for the last part of our simulated experiments we compare our method against seqICP (Pfister et al., 2019) and against PCMCI (Runge et al., 2019b). We performed ten experiments with 20 random graphs each, with 2 to 6 observed and 1 to 2 hidden series, for sample size 2000 and medium density. For our method we kept the same thresholds, as we defined above: threshold1 and threshold2.

3.2 Experiments on real data

Finally, we examine the performance of our method on real data, where we have no guarantee that our assumptions hold true. We use the official recorded prices of dairy products in Europe (EU, ) (data provided in the suppl. as well). The target of our analysis is ’Butter’. According to the manufacturing process of dairy products as described in Soliman & Mashhour (2011), we know that the raw material for butter manufacturing is ’Raw Milk’ and also that butter is not used as an ingredient for the other dairy products in the list. Therefore, we can hypothesize that the direct cause of Butter prices is the price of Raw Milk, and that the other nodes in the graph (other cheese, WMP, SMP, Whey Powder) are not causing Butter’s price. We examine three countries, two of which provide data for “Raw Milk” (Germany ’DE’ (8 time series) and Ireland ’IE’ (6 time series)) and one where these values are not provided (United Kingdom ’UK’ (4 time series)). This last dataset was selected on purpose, as it is a realistic scenario of a hidden confounder. In that case our method should not identify any cause. As we have extremely low sample sizes (<180), identifying dependencies is particularly hard. For that reason we set a threshold of 0 on our lag detector and threshold1 at for accepting dependence in the first condition. We leave everything else unchanged from the simulation experiments.

4 Results

4.1 FPR and FNR for various densities and graph size

First, we wanted to examine the performance of our method for various densities of edges among the candidate series, and between the candidates and the target time series (see 3.1). In fig. 2 we present results for a medium noise level (20%) and for sample sizes 500, 1000, 2000 and 3000. The FPR are very small () for sample size and the results are similar for larger or smaller noise levels (see suppl.). Here, we present results for 1, 4 and 8 observed time series, 1 additional hidden and 1 target, to show how the graph size affects the rates.

Figure 2: FPR and FNR for varying numbers of observed, 1 additional hidden and 1 target series, for varying sample size (columns) and sparsity of edges among the candidate causes (x-axis) and between the candidate causes and the target (y-axis). The total FNR (for indirect and direct causes) is depicted by the gray scale, where black means and white means . The FNR that refers to the direct causes (for which our method is proven to be complete) is written in red in the middle of each cell. (a) 1 observed time series. The FPR are practically zero, and the total FNR 20% for dense graphs. Notice that the FNR of the direct causes is always low starting from just 16% for dense up to 26% for sparse graphs. (b) 4 observed time series. As we can see, for sample sizes the FPR remain practically zero, and the FNR for direct causes 22% for sparse and 45% for dense graphs. (c) 8 observed time series. For sample sizes the FPR still remain practically zero. The FNR of the direct causes is just 31% for sparse graphs and up to 38% for large and very dense ones.

In red in each cell we present the percentage of the FNR that corresponds to the direct causes that were missed, since our method is complete for direct causes only. Since our claims refer to complete conditions for unconfounded direct causes, we also count the confounded direct causes as false positives. Overall, we see that our algorithm performs with almost zero FPR independent of the noise, the density or the size of the graphs. FNR are low for the direct causes, starting from 16% for small and sparse graphs and not exceeding 45% for very large and dense graphs.
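For one simulated graph, the two rates can be computed from the set of true causes and the set of selected series. The sketch below is a minimal illustration; the function name is hypothetical, and how the rates are aggregated over the 100 random graphs is not specified here, so a plain per-graph computation is assumed.

```python
def fpr_fnr(true_causes, selected, n_candidates):
    """False positive and false negative rates for a single graph."""
    true_causes, selected = set(true_causes), set(selected)
    false_pos = len(selected - true_causes)       # selected but not a cause
    false_neg = len(true_causes - selected)       # cause but not selected
    negatives = n_candidates - len(true_causes)   # candidates that are not causes
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / len(true_causes) if true_causes else 0.0
    return fpr, fnr


# Five candidates, series 0 and 1 are true causes, the method returned 1 and 2.
print(fpr_fnr({0, 1}, {1, 2}, 5))
```

Following the convention stated above, a confounded direct cause would be left out of `true_causes`, so that selecting it is counted as a false positive.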

The results for different number of observed time series and noises are presented in the supplementary 7.5.

4.2 FPR and FNR for varying # of hidden nodes

In fig. 3 we present the behaviour of our algorithm in moderately dense graphs, for 2000 sample size, 20% noise variance and different number of hidden variables. We can see that the FPR is close to zero, independent of the number of hidden variables. Although the total FNR increases with the number of time series, the percentage that corresponds to direct causes ranges just from 30 to 40%. Results are similar for different densities of edges (see suppl. 7.5.2).

Figure 3: FPR and FNR for varying numbers of hidden and observed series, noise variance and sample size 2000, for moderate edge density. FPR is very low (max 1.2% for high noise) for any number of hidden series. Notice that although the total FNR increases with the graph size, the FNR for the direct causes (dashed lines), for which our method is complete, does not exceed 40%.

4.3 FPR and FNR for “multiple-lag dependencies”

For Bernoulli probability of existence of an edge between the time series () and between the time series and the target (), we calculate FPR and FNR for different numbers of lags that can exist between the time series.

Figure 4: FPR and FNR for different numbers of coexisting lags. Notice that the FPR is very low, as expected by Theorem 1. Since our method is complete only for single-lag dependencies, the FNR increases both for direct causes (dashed lines) and for indirect causes.

We fix the number of samples at 2000, and the noise variance at 20%, for 2, 3 and 4 lags. We examine the above combination for moderately large graphs with 1 hidden, 1 target and 5 observed time series. As depicted in fig. 4, our method seems to perform very well in terms of FPR, independent of the number of co-existing lags between the time series. As our method is complete only for single-lag dependencies, we don’t expect very low FNR. However, we see that the FNR that refer to direct causes only does not exceed 45%.

4.4 Comparison against Lasso-Granger causality

We compare the performance of our method against the commonly used Lasso-Granger method. We examine the performance of the two methods for relatively dense graphs, with 2 hidden, 1 target and 3, 4 or 5 observed time series. In fig. 6 we see that even in such confounded graphs our method always performs with almost zero FPR, while Lasso-Granger reaches an FPR of up to 16%, for similar or even larger FNR.

Figure 5: Yellow: ROC curve of Lasso-Granger for different values of the regularisation parameter. Red: ROC curve of our method for different values of threshold1 and threshold2 with a fixed ratio of 1. The ROC curves were calculated over 100 random graphs, for different densities of edges (three columns) and a moderate number of observed series with two additional hidden ones. Our method’s ROC curve is always above that of Lasso-Granger.

In fig. 5 we present the ROC curves for the performance of SyPI and Lasso-Granger on the same graphs. Since our method functions with two conditions and two p-values, we did not manage to find sensible pairs of thresholds that increase the FPR further. We see that at all operating points our method outperforms Lasso-Granger.
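The interplay of the two thresholds can be illustrated with a small sketch. It assumes, as a labelled assumption rather than a statement of the exact implementation, that the first condition is passed when independence is rejected (p-value below threshold1) and the second when independence is not rejected (p-value above threshold2). The function names and the p-values are hypothetical.

```python
def select_causes(p_values, threshold1, threshold2):
    """Keep the candidates that pass both conditions.

    p_values maps a candidate name to (p1, p2): p1 is the p-value of the
    dependence condition, p2 that of the independence condition.
    ASSUMPTION: condition 1 passes when p1 < threshold1 and condition 2
    passes when p2 > threshold2.
    """
    return sorted(
        name
        for name, (p1, p2) in p_values.items()
        if p1 < threshold1 and p2 > threshold2
    )


def sweep(p_values, thresholds, ratio=1.0):
    """Selection for each threshold1, with threshold2 = ratio * threshold1."""
    return {t: select_causes(p_values, t, ratio * t) for t in thresholds}


if __name__ == "__main__":
    # Hypothetical p-values for three candidate series.
    p_values = {"X1": (0.001, 0.60), "X2": (0.20, 0.70), "X3": (0.002, 0.03)}
    for t, selected in sweep(p_values, [0.01, 0.05, 0.3, 0.65]).items():
        print(t, selected)
```

Under this reading, raising a shared threshold loosens the first condition but tightens the second, so the selected set does not simply grow. This may be one reason why operating points with a high FPR are hard to reach.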

Figure 6: Comparison of our method against Lasso-Granger, for sample size 2000, 2 hidden variables, 20% noise variance, and varying numbers of observed time series and sparsity of edges. We tuned the regulariser of Lasso-Granger to achieve an FNR similar to that of SyPI on similar graphs. Nevertheless, SyPI still performs with lower or equal FNR and with a stable, almost zero FPR. In contrast, Lasso-Granger reaches up to 16% FPR. Not tuning Lasso-Granger led to even larger FPR.

4.5 Comparison against seqICP and PCMCI

As can be seen in figure 7, our method SyPI outperforms both methods for all types of full time graphs, yielding FPR and FNR between and . SeqICP yielded up to FPR and around FNR for almost all the graphs. This result is not surprising, as with hidden confounders seqICP will detect only a subset of the ancestors AN(Y), and in addition, it assumes that interventions exist in the input dataset. PCMCI yielded up to FPR and oscillated around FNR. In terms of running time, SyPI was the fastest, followed by PCMCI; seqICP was rather slow for more than 5 time series.

Figure 7: Comparison of SyPI against seqICP and PCMCI, for the same full time graphs. False positive and false negative rates are reported over 20 random graphs of similar type (# observed, # hidden time series) for each of the 10 types. Our method SyPI outperforms both methods, with FPR and FNR . SeqICP yielded FPR and FNR. This is not surprising, as with hidden confounders seqICP will detect only a subset of the ancestors of Y. PCMCI yielded FPR with FNR.

4.6 Experiments on real data: Dairy product prices

We applied our algorithm to the dairy-product price datasets for ’DE’ (8 time series), ’IE’ (6 time series) and ’UK’ (6 time series). The sample sizes are just 178, 113 and 168, respectively. Data are depicted in fig. 9a, 9b, 9c in the suppl. Our algorithm successfully identifies ’Raw Milk’ as the direct cause of ’Butter’ in the ’IE’ dataset, correctly rejecting all the other 4 nodes ( TPR, TNR). In the ’DE’ dataset ’Raw Milk’ is correctly identified and there is only one false positive (’Edam’); all the other 6 nodes are successfully rejected ( TPR and TNR). Even more importantly, in the ’UK’ dataset, where no measurements of ’Raw Milk’ prices are provided (hidden confounder), our algorithm does not identify any cause ( TNR).

5 Discussion

5.1 Efficient conditioning set in terms of minimum asymptotic variance

In contrast to other approaches (like PC and FCI), our method does not search over a large set of possible combinations to identify the right conditioning sets. Instead, for each potential cause it directly constructs its ‘separating set’ for the nodes and (condition 2), from a pre-processing step that identifies () the nodes of the time series that enter the previous node of . The resulting conditioning set therefore contains covariates that enter the outcome node , and not the potential cause . Adjustment sets that include parents of the potential cause node are considered inefficient in terms of asymptotic variance of the total effect, as the parents of that node can be strongly correlated with it (Henckel et al., 2019). On the other hand, adding nodes that explain variance in the outcome node (precision variables) can be beneficial. According to Theorem 3.1 of Henckel et al. (2019), our conditioning set has a smaller asymptotic variance compared to a set that would include incoming nodes to or . Therefore, our choice of conditioning set also contributes to a reasonable signal-to-noise ratio for the dependences under consideration. This could strengthen the statistical outcome of the conditional independence test.
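The efficiency argument can be illustrated with a small simulation. The sketch below uses a toy linear model that is not taken from our experiments: a variable `p` is a parent of the cause `x`, a variable `w` enters only the outcome `y`, and the true effect of `x` on `y` is 1. All names, coefficients and sample sizes are hypothetical. It compares the empirical variance of the estimated effect when adjusting for `p` and when adjusting for `w`.

```python
import random
import statistics


def ols_coef_x(x, z, y):
    """Coefficient of x in the OLS regression of y on (x, z) with intercept."""
    n = len(x)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((c - mz) ** 2 for c in z)
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    szy = sum((c - mz) * (b - my) for c, b in zip(z, y))
    # Closed-form solution of the 2x2 normal equations.
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


def simulate(n_samples=200, n_reps=200, seed=0):
    """Empirical variance of the effect estimate for two adjustment sets."""
    rng = random.Random(seed)
    est_parent, est_precision = [], []
    for _ in range(n_reps):
        p = [rng.gauss(0, 1) for _ in range(n_samples)]  # parent of the cause
        w = [rng.gauss(0, 1) for _ in range(n_samples)]  # enters the outcome only
        x = [2 * pi + rng.gauss(0, 1) for pi in p]
        y = [xi + 2 * wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]
        est_parent.append(ols_coef_x(x, p, y))     # adjust for parent of cause
        est_precision.append(ols_coef_x(x, w, y))  # adjust for precision variable
    return statistics.variance(est_parent), statistics.variance(est_precision)


if __name__ == "__main__":
    var_parent, var_precision = simulate()
    print("variance when adjusting for the parent of the cause:", var_parent)
    print("variance when adjusting for the precision variable:", var_precision)
```

Both adjustment sets are valid in this toy model because nothing confounds `x` and `y`, so the two estimators differ only in their variance.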

5.2 Linear and non-linear systems

In theory and in practice, our proposed algorithm can be used for both linear and non-linear relationships between the time series. For the linear case a partial correlation test is sufficient to examine the conditional dependencies, while in the non-linear case KCI (Zhang et al., 2012), KCIPT (Doran et al., 2014) or FCIT (Chalupka et al., 2018) could be used.
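For the linear case, a partial correlation test with a single conditioning variable can be sketched as follows. This is a minimal standard-library illustration using the textbook first-order partial correlation formula and a Fisher z approximation for the p-value; the function names are hypothetical, and conditioning on several variables would require a regression-based variant.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr_test(x, y, z):
    """Partial correlation of x and y given one variable z, with p-value."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    r = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    r = max(-0.999999, min(0.999999, r))  # keep atanh finite
    # Fisher z-transform: with one conditioning variable the statistic
    # sqrt(n - 1 - 3) * atanh(r) is approximately standard normal under
    # conditional independence.
    stat = math.sqrt(len(x) - 4) * abs(math.atanh(r))
    p_value = math.erfc(stat / math.sqrt(2))  # two-sided normal p-value
    return r, p_value


if __name__ == "__main__":
    rng = random.Random(0)
    z = [rng.gauss(0, 1) for _ in range(2000)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y_indep = [zi + rng.gauss(0, 1) for zi in z]  # related to x only through z
    y_dep = [xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]  # also depends on x
    print("x vs y_indep given z:", partial_corr_test(x, y_indep, z))
    print("x vs y_dep given z:", partial_corr_test(x, y_dep, z))
```

In the non-linear case the same role would be played by a kernel-based test such as those cited above.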

5.3 Multiple-lag effects

Although our algorithm performs equally well in terms of FPR in simulations with “multiple-lag dependencies”, our conditions are proven necessary only for “single-lag dependencies” (see 2). We could allow for “multiple-lag dependencies” if we were willing to condition on larger sets of nodes, which we do not find acceptable for statistical reasons. Currently we require at most one node from each observed time series for the conditioning set. In a naive approach, coexisting time lags would require additional nodes from each time series in the conditioning set, but the theory becomes cumbersome. Necessary conditions for multiple lags are out of the scope of this paper and are left for future work. Nevertheless, in suppl. 7.6 we show that in the multi-lag bivariate case (one candidate, one target), where hidden confounders are memoryless, it is still possible to have sufficient and necessary conditions, subject to some extensions of the theory.

5.4 Comparison with related work

Pfister et al. (2019) share our goal of causal feature selection; however, their method seqICP relies on a quasi-interventional scenario with different background conditions. They require sufficient interventions in the dataset, which should affect only the input and not the target. SeqICP predicts the ancestors of the target Y, AN(Y). In the presence of hidden confounders, the authors report that under further assumptions they will detect a subset of AN(Y), and only if the dataset contains sufficient interventions on the predictors. Given our assumptions, we prove that our method SyPI will detect all the parents of Y (not only a subset), even in the presence of latent confounders, without requiring interventions in the dataset. Our method’s complexity is , while seqICP is for sparse graphs. Our main concern with seqICP is the hard requirement for interventions in the dataset, which would require randomized controlled trials (RCTs); where these are possible, they render any causal inference method less relevant, as they can directly lead to causal conclusions. Our method aims at causal inference based solely on observations.

Runge et al. (2019b) proposed PCMCI, which focuses on the problem of full graph causal discovery. Our aim of causal feature selection is narrower, but still non-trivial, and their algorithm is easily adaptable to detect causes of a given target. However, PCMCI assumes causal sufficiency, which we find particularly hard to meet in real datasets. As shown in Fig. 7, our method outperforms both seqICP and PCMCI.

Malinsky & Spirtes (2018) proposed SVAR-FCI, which also focuses on full graph discovery in time series data. As mentioned above, their aim is broader, while we focus on detecting causes of a given target. SVAR-FCI returns PAGs, which is justified given the hard task and the limited assumptions the authors make to restrict the graph. Our method, on the other hand, returns the exact edges from causes to target. SVAR-FCI is computationally intensive, with exhaustive conditional independence tests for all lags and conditioning sets. SyPI calculates both in advance for each conditional independence test, significantly reducing testing. Also, SVAR-FCI, as its authors mention, is not complete even for PAGs, while SyPI is proven to be complete against false negatives given our assumptions. Finally, as we describe in detail in 5.1, our conditioning set is efficient in terms of minimum asymptotic variance; this does not hold for SVAR-FCI.

5.5 Technical assumptions of SyPI

Although our technical assumptions are many, we consider them neither extreme, given the hardness of the problem of hidden confounding, nor hard to meet. Entner & Hoyer (2010) and Malinsky & Spirtes (2018) do not need these assumptions, as they exhaustively perform conditional independence tests for all lags and time series. Assumptions 2 and 2 assure that are time series with dependency from their previous time step. Our assumption 2 tackles the serious problem that auto-lag hidden confounders create. This problem is also stressed by Malinsky & Spirtes (2018): “auto-lag confounders can be particularly problematic since they induce infinite-lag associations”; by not making assumptions about it, they are limited to finding only PAGs.

6 Conclusion

Causal feature selection in time series is a fundamental problem in several fields, ranging from biology and economics to climate research. In these fields, the causes of a time series of interest (e.g. revenue, temperature) often need to be identified from a pool of candidate time series, while latent variables cannot be excluded. Here we target this problem by constructing conditions which we prove to be necessary for direct causes in single-lag dependency graphs, even in the presence of latent variables, and sufficient for direct and indirect causes in multi-lag dependency graphs. To the best of our knowledge, our method SyPI is the first complete and sound algorithm (subject to appropriate graphical assumptions) for direct causal feature selection in time series that does not assume causal sufficiency or an acyclic summary graph, thus overcoming the shortcomings of Granger causality. Our simulation results show that for a range of graph sizes and densities, SyPI outperforms Lasso-Granger. FPR are essentially zero for sample sizes , and the FNR for direct causes never exceeds at dense large graphs. Moreover, in experiments on real data, where most of our assumptions could potentially be violated and where the sample sizes are particularly small for the graph size, SyPI performed with almost true positive and negative rate.

References

  • Arnold et al. (2007) Arnold, A., Liu, Y., and Abe, N. Temporal causal modeling with Graphical Granger Methods. pp. 66–75, 2007.
  • Chalupka et al. (2018) Chalupka, K., Perona, P., and Eberhardt, F. Fast conditional independence test for vector variables with large sample sizes. ArXiv, abs/1804.02747, 2018.
  • Doran et al. (2014) Doran, G., Muandet, K., Zhang, K., and Schölkopf, B. A permutation-based kernel conditional independence test. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 132–141, 2014.
  • Eichler (2007) Eichler, M. Causal inference from time series: What can be learned from Granger causality. In Proceedings of the 13th International Congress of Logic, Methodology and Philosophy of Science, pp. 1–12. King’s College Publications London, 2007.
  • Entner & Hoyer (2010) Entner, D. and Hoyer, P. O. On causal discovery from time series data using FCI. Probabilistic graphical models, pp. 121–128, 2010.
  • (6) EU. European union prices of dairy products. https://ec.europa.eu/info/food-farming-fisheries/farming/facts-and-figures/markets/prices/price-monitoring-sector/.
  • Granger (1969) Granger, C. W. J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37:424–438, 1969.
  • Granger (1980) Granger, C. W. J. Testing for causality, a personal viewpoint., volume 2. 1980.
  • Guo et al. (2008) Guo, S., Seth, A. K., Kendrick, K. M., Zhou, C., and Feng, J. Partial Granger causality: Eliminating exogenous inputs and latent variables. Journal of Neuroscience Methods, 172(1):79–93, 2008.
  • Henckel et al. (2019) Henckel, L., Perković, E., and Maathuis, M. H. Graphical criteria for efficient total effect estimation via adjustment in causal linear models. arXiv, 2019.
  • Hung et al. (2014) Hung, Y.-C., Tseng, N.-F., and Balakrishnan, N. Trimmed Granger causality between two groups of time series. Electron. J. Statist., 8(2):1940–1972, 2014.
  • Malinsky & Spirtes (2018) Malinsky, D. and Spirtes, P. Causal structure learning from multivariate time series in settings with unmeasured confounding. In Proceedings of 2018 ACM SIGKDD Workshop on Causal Discovery, volume 92 of Proceedings of Machine Learning Research, pp. 23–47, 2018.
  • Mastakouri et al. (2019) Mastakouri, A., Schölkopf, B., and Janzing, D. Selecting causal brain features with a single conditional independence test per feature. In Advances in Neural Information Processing Systems 32, 2019.
  • Pearl (2009) Pearl, J. Causality. Cambridge University Press, 2nd edition, 2009.
  • Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference - Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.
  • Pfister et al. (2019) Pfister, N., Bühlmann, P., and Peters, J. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264–1276, 2019.
  • Runge (2018) Runge, J. Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7):075310, 2018.
  • Runge et al. (2019a) Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Muñoz-Marí, J., van Nes, E. H., Peters, J., Quax, R., Reichstein, M., Scheffer, M., Schölkopf, B., Spirtes, P., Sugihara, G., Sun, J., Zhang, K., and Zscheischler, J. Inferring causation from time series in earth system sciences. Nature Communications, pp. 2553, 2019a.
  • Runge et al. (2019b) Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., and Sejdinovic, D. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019b.
  • Soliman & Mashhour (2011) Soliman, I. and Mashhour, A. Dairy marketing system performance in Egypt. 01 2011.
  • Spirtes et al. (1993) Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. 1993.
  • Wiener (1956) Wiener, N. The theory of prediction, Modern mathematics for the engineer, volume 8. 1956.
  • Zhang et al. (2012) Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. UAI, 2012.