I Introduction
Improving our understanding of cause-effect relationships when analyzing real-world complex systems, e.g., the economy, climate, and neuroscience, helps in addressing questions such as: which factors affect our health, the economy, and the climate, and which actions need to be taken to mitigate adverse effects and enhance beneficial ones? One of the main challenges in inferring cause-effect relations in such systems is the nonlinear dependency among variables, which often leads to inaccurate causal discovery. To mitigate this issue, we propose to use deep recurrent networks in combination with counterfactual analysis to infer nonlinear causal relations in multivariate time series. We apply the concept of Granger causality (GC) [36] using probabilistic forecasting with deep autoregressive (DeepAR) networks [6]. GC makes use of the following two properties: (i) temporal precedence: a cause is followed by its effect, and (ii) physical influence: changing the cause alters the effect. It estimates the extent to which a variable x_j helps in the prediction of another variable x_i. The impact of the inclusion of x_j in the model is estimated by comparing the prediction of x_i with and without x_j. Due to its practical implementation, GC has been intensively used in various fields, e.g., neuroscience, climatology, and finance, for discovering causal graphs in time series [7, 8, 5, 25].
A major shortcoming of Granger causality, however, is its assumption of linear dependency in the observational data. Deep networks excel at learning complex and nonlinear relations. In this paper, we use DeepAR networks [6] to model nonlinear relations in multivariate time series. DeepAR characterizes conditioning on observed variables and extracts hidden features from the time series, which helps in dealing with spurious associations that might occur due to confounders such as periodic cycles and trends. However, deep networks cannot handle missing input or out-of-distribution data. To deal with this issue, we propose to use the theoretically well-established Knockoffs framework [35] to generate intervention variables. The Knockoffs framework was originally developed as a variable selection tool with a controllable false discovery rate in a prediction model. Knockoff samples are statistically null variables, i.e., independent of the model output, and can be swapped with the original variables without changing the underlying distribution of the data. The in-distribution nature of the knockoffs is important when they are used to generate counterfactuals for causal discovery, because trained deep networks expect test data to lie within the same distribution as the training data. Moreover, knockoff samples control the false discovery rate in causal inference, as they are constructed to be as decorrelated from the candidate variable as possible.
The causal hierarchy operates in three layers: (i) association, (ii) intervention, and (iii) counterfactual [40]. We use counterfactuals, which address questions such as: what if a certain thing had been done differently? Or: is it a certain action that causes the change? A counterfactual outcome describes a cause-effect scenario such that if event A had not occurred, event B would not have happened. Counterfactuals have been used for feature selection, time series explanation, and model interpretability [21, 19]. We generate a counterfactual output by substituting one of the observed variables at a time with its knockoff samples and estimate its causal impact by measuring its effect on the forecast error, assuming that the multivariate time series used are stationary and that there exist no hidden confounders. We compare the causal inference performance of the proposed method with the widely used vector autoregressive Granger causality (VAR-GC) and PCMCI [18] using synthetic as well as real river discharge datasets. We show that our method outperforms these methods in estimating nonlinear causal links. To highlight the advantage of using knockoffs-based counterfactuals, we carry out a comparison using different intervention time series, i.e., distribution mean, out-of-distribution samples, and knockoffs.

The remainder of this paper is organized as follows. Section II covers the related work. The methodological background is presented in Section III. In Section IV, we describe the proposed causal inference method in detail. Experimental results are presented and discussed in Section V. The work is concluded in Section VI.
II Related Work
Several methods in the literature have addressed the challenge of causal inference in nonlinear multivariate time series. The method of [14] uses structured multilayer perceptrons and recurrent neural networks combined with sparsity-inducing penalties on the weights. It monitors the weights assigned to each time series when predicting the target time series. The authors of [32] also proposed a similar solution for causal discovery using deep learning, where the architecture of the neural network is partially learned from data to infer causality among variables. They chose to interpret the weights of a neural network under some constraints. The method trains as many neural network models as there are variables in the dataset, where one variable is the target and the rest of the variables are inputs to predict the target. Analyzing the weights of the network reveals whether there is a relationship between a variable and the target: if every path between a variable and the target has a weight of 0, the target does not have a causal link with that particular variable. However, it is challenging to extract any clear structure of the data from deep networks that can be used for interpretability [39]. Attention-based convolutional neural networks have also been used for causal discovery, where the internal parameters of the networks are utilized to find the time delay in the causal relationship between variables [15]. The PCMCI [18] method estimates causality in multivariate time series using linear and nonlinear conditional independence tests. PCMCI works in two steps: first, it determines the causal parents of each variable, and then it applies a conditional independence test to estimate causality between variables. PCMCI has been implemented in [16] to estimate causal networks in biosphere-atmosphere interaction. PCMCI has also been used along with dimensionality reduction to infer causality in ecological time series data [17], where conditional independence tests are applied on the reduced feature space for causal inference.
Cause-effect variational autoencoders (CEVAE) [33] estimate nonlinear causal links in the presence of hidden confounders by training two separate networks, with and without intervention. This method, however, estimates the causal link between two variables only. An extended version of CEVAE for time series has been developed by integrating domain-specific knowledge for inferring nonlinear causality in ecological time series [34]. Additive models have also been used to estimate nonlinear interactions in time series, where the past of each series may have an additive nonlinear effect that decouples across time series [26]; however, they may overlook important nonlinear relations between variables and therefore may fail to properly identify Granger causal links [14]. Counterfactuals are used by [19] to estimate feature importance. They substitute features to obtain counterfactuals and estimate feature importance by measuring their impact on the output. Unlike our approach, this method does not estimate causal relations in time series and does not use in-distribution knockoff samples to reduce the false discovery rate. Knockoffs have been used to generate counterfactuals to explain the decisions of deep image classifiers [29]. To deal with spurious causal links that may result from periodic patterns and trends in time series, the methods in [37, 38] use time-frequency causality analysis. However, these methods rely on the spectral representation of linear VAR-GC and hence may not be able to deal with nonlinear causal relations.

III Methodological Background
In this section, we briefly introduce the necessary background on DeepAR [6], the fundamental concepts of the Knockoffs framework [22], and VAR-GC.
III-A DeepAR
DeepAR [6] is a powerful method mainly used for multivariate, nonlinear, nonstationary time series forecasting. It is based on autoregressive recurrent neural networks, which learn a global model from the history of all related time series in the dataset. DeepAR builds on previous work on deep learning for time series data [20] and applies a similar LSTM-based recurrent neural network architecture to the probabilistic forecasting problem. A few key attributes of DeepAR compared to classical approaches and other global forecasting methods are: it extracts hidden features in the form of seasonal patterns and trends in the data, and it characterizes conditioning on observed variables and extracted features during inference. Moreover, in this probabilistic forecasting method, less hand-crafted feature engineering is needed to capture complex, group-dependent behavior.
The training and forecast processes of DeepAR are shown in Fig. 1. Let X = (x_1, ..., x_N) be the N-variate time series, where each time series x_i is a realization of length T of a real-valued discrete stochastic process. At each time step t, the inputs to the network are the observed target value at the previous time step, z_{t-1}, as well as the previous network output, h_{t-1}. The network output h_t is then used to compute the parameters of the likelihood l(z_t | θ(h_t)), which is used for training the model parameters. For prediction, the past of the time series is fed in for t < t_0, where t_0 represents the first value in the forecast horizon; then, in the forecast horizon, for t >= t_0 a sample is drawn from the likelihood and fed back for the next point until the end of the prediction range, generating one sample trace. Repeating this prediction process yields many traces representing the joint predicted distribution.
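The forecast loop described above can be sketched in a few lines. Below, a simple linear recurrence stands in for the trained LSTM, and all coefficients are invented for illustration; the point is the mechanism of sampling from the likelihood and feeding the sample back:

```python
import numpy as np

rng = np.random.default_rng(1)

def network_step(prev_value, hidden):
    """Stand-in for the trained RNN cell: maps the previous target value and
    hidden state to a new hidden state and Gaussian likelihood parameters.
    A real DeepAR model uses an LSTM here; this linear map is illustrative."""
    hidden = 0.9 * hidden + 0.1 * prev_value
    mu = 0.8 * prev_value + 0.2 * hidden
    sigma = 0.1
    return mu, sigma, hidden

def sample_traces(history, horizon, n_traces):
    """Draw Monte Carlo sample paths over the forecast horizon."""
    traces = np.empty((n_traces, horizon))
    for k in range(n_traces):
        hidden = 0.0
        # Condition on the observed history first (t < t0).
        for value in history:
            _, _, hidden = network_step(value, hidden)
        prev = history[-1]
        # In the horizon (t >= t0), sample from the likelihood and feed back.
        for t in range(horizon):
            mu, sigma, hidden = network_step(prev, hidden)
            prev = rng.normal(mu, sigma)
            traces[k, t] = prev
    return traces

# 100 traces over a 14-step horizon approximate the joint predictive distribution.
traces = sample_traces(history=[1.0, 0.9, 1.1], horizon=14, n_traces=100)
print(traces.shape)
```

Quantiles of `traces` along the first axis then give probabilistic forecast intervals.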
III-B Knockoffs
The Knockoffs framework was developed by Barber and Candès in 2015 as a tool for estimating feature importance using conditional independence testing with a controllable false discovery rate [23]. Given a set of observed variables X = (X_1, ..., X_p), a target variable y, and a predictive model, the knockoffs of the observed variables, defined as X~ = (X~_1, ..., X~_p), are constructed to be in-distribution null variables that are as decorrelated as possible from the original data. They can be swapped with the original variables without changing the underlying distribution of the data. This is accomplished by ensuring that the correlation structure among the generated knockoffs is the same as the correlation structure among the original variables [23]. To be null variables, knockoffs must not contain any information about the target variable y. These conditions can be formally written as follows.

Exchangeability: (X, X~)_swap(S) is equal in distribution to (X, X~), where (X, X~)_swap(S) is obtained by swapping the features X_j (j in S) with their knockoffs X~_j.

Null variables: X~ is independent of y given X, which is guaranteed if X~ is generated without looking at the target variable y.
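For approximately Gaussian data, both conditions can be met in closed form. The sketch below uses the standard second-order, equicorrelated model-X construction with an empirical mean and covariance; it is a generic construction, not necessarily the exact sampler used in this work:

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_knockoffs(X):
    """Second-order (equicorrelated) model-X knockoffs for Gaussian-like data.
    Samples X~ | X from the Gaussian conditional that preserves the joint
    covariance structure required by the exchangeability condition."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    # Equicorrelated choice: s just below min(2 * lambda_min, smallest variance).
    lam_min = np.linalg.eigvalsh(Sigma).min()
    s = min(2.0 * lam_min, np.diag(Sigma).min()) * 0.999
    D = s * np.eye(p)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    cond_mean = mu + (X - mu) @ (np.eye(p) - Sigma_inv_D)
    cond_cov = 2.0 * D - D @ Sigma_inv_D
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))
    return cond_mean + rng.normal(size=(n, p)) @ L.T

X = rng.multivariate_normal(
    [0, 0, 0], [[1, .5, .2], [.5, 1, .3], [.2, .3, 1]], size=2000)
Xk = gaussian_knockoffs(X)
# Knockoffs match the marginal covariance of X (up to sampling error),
# while being only weakly correlated with the original columns.
print(np.allclose(np.cov(Xk, rowvar=False), np.cov(X, rowvar=False), atol=0.15))
```

Note that `Xk` is generated without ever seeing a target variable, so the null-variable condition holds by construction.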
III-C VAR Granger Causality
GC is based on the idea that a cause precedes its effect and can help in its prediction. Here we briefly describe how a VAR model is used to infer GC. Let X = (x_1, ..., x_N) be the time series of N variables, where each time series is a realization of length T of a real-valued discrete stationary stochastic process. These time series can be represented by a VAR model of order p of the form

x_t = Σ_{τ=1}^{p} A_τ x_{t-τ} + ε_t, (1)

where the residuals ε_t form a white-noise stationary process with covariance matrix Σ_ε, and the model parameters at time lags τ = 1, ..., p comprise the N x N matrices A_τ. Let Σ_ii denote the variance of the residual associated with x_i using the model in (1), and let Σ_ii^(-j) denote the variance of this residual after omitting the j-th row and column in each A_τ. The VAR-GC of x_j on x_i conditioned on all other variables is defined by [27] as

GC_{j -> i} = ln( Σ_ii^(-j) / Σ_ii ). (2)
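The log variance ratio in (2) can be estimated from two least-squares fits, one including and one excluding the past of the candidate cause. A self-contained sketch on a toy three-variable system with made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 3-variable system: x0 drives x2; x1 is only part of the conditioning set.
n, p = 800, 3
X = np.zeros((n, p))
X[0] = rng.normal(size=p)
for t in range(1, n):
    noise = 0.1 * rng.normal(size=p)
    X[t, 0] = 0.5 * X[t - 1, 0] + noise[0]
    X[t, 1] = 0.5 * X[t - 1, 1] + noise[1]
    X[t, 2] = 0.4 * X[t - 1, 2] + 0.7 * X[t - 1, 0] + noise[2]

def var_gc(X, cause, effect, lag=1):
    """VAR-GC of `cause` on `effect`, conditioned on all other variables:
    log ratio of residual variances of the restricted vs. full model."""
    target = X[lag:, effect]
    full = X[:-lag]                              # past of all variables
    restricted = np.delete(full, cause, axis=1)  # drop the candidate cause
    def resid_var(Z):
        A = np.column_stack([np.ones(len(target)), Z])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.var(target - A @ beta)
    return np.log(resid_var(restricted) / resid_var(full))

# The true causal link x0 -> x2 scores much higher than the absent x1 -> x2.
print(var_gc(X, cause=0, effect=2) > var_gc(X, cause=1, effect=2))  # True
```

For simplicity, the sketch uses a single lag; the order-p case stacks p lagged copies of the regressor matrix.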
IV Causal Effect Estimation using DeepAR and Knockoffs Counterfactuals
Let X = (x_1, ..., x_N) be the N-variate time series, where each time series x_i is a realization of length T of a real-valued discrete stochastic process. Throughout our study, we make use of the following two assumptions.

Stationarity: X is a stationary stochastic process.

Causal Sufficiency: The set of observed variables X includes all of the common causes of pairs in X.
We train and optimize DeepAR by providing the multivariate time series as input. For each realization of the time series x_i, we calculate the forecast error as the mean absolute percentage error (MAPE) using the trained model:

MAPE_i = (100/T) Σ_{t=1}^{T} |z_t - ẑ_t| / |z_t|, (3)

where z_t is the actual value of time series x_i at time step t, and ẑ_t is the predicted value, defined as the mean of the predicted probability distribution at time step t in the forecast horizon. To determine the cause-effect relationship of x_j on x_i, we intervene on the variable x_j by replacing it with its knockoff x~_j and observe the influence on the prediction of the target variable x_i by comparing the obtained counterfactual output with the actual output. In analogy to the definition of the VAR-GC of x_j on x_i in (2), we define our metric for estimating the nonlinear Granger causality of x_j on x_i conditioned on all other variables, the causal significance score (CSS), as

CSS_{j -> i} = ln( MAPE_i^I / MAPE_i ), (4)

where MAPE_i and MAPE_i^I denote the mean absolute percentage error when predicting x_i before and after the intervention on x_j, respectively. In this work, we investigate multiple ways to generate counterfactuals, i.e., distribution mean, out-of-distribution, and knockoff samples, as described below. However, we mainly emphasize the use of knockoffs to generate counterfactuals for causal discovery in time series.

Distribution mean: Here we replace each value of the time series x_j with its mean.

Out-of-distribution: In this type of intervention, the candidate variable x_j, which belongs to the data distribution D, is replaced with a variable u from another data distribution such that u does not belong to D. Moreover, u is selected to be as uncorrelated with the original variable as possible.
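Equations (3) and (4), together with a simple significance test on the distribution of CSS values over realizations, can be sketched as follows. The z-style two-sided test and its threshold are illustrative stand-ins, since the exact significance test is not specified here:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, as in Eq. (3)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def css(y_true, pred_before, pred_after):
    """Causal significance score, cf. Eq. (4): log ratio of the forecast
    error after vs. before intervening on the candidate cause."""
    return np.log(mape(y_true, pred_after) / mape(y_true, pred_before))

def granger_causes(css_samples, threshold=2.0):
    """Reject H0 (no Granger causality) when the mean CSS over realizations
    is far from zero, measured in standard errors of the mean."""
    s = np.asarray(css_samples)
    se = s.std(ddof=1) / np.sqrt(len(s))
    return abs(s.mean()) / se > threshold

# A forecast that degrades after intervention yields a positive CSS.
y = [10.0, 12.0, 11.0, 13.0]
print(css(y, pred_before=[10.1, 11.9, 11.2, 12.8],
          pred_after=[11.5, 10.0, 12.5, 11.0]) > 0)  # True

# Deterministic toy CSS distributions over 50 realizations:
null_css = np.linspace(-0.1, 0.1, 50)    # centered at zero -> keep H0
causal_css = np.linspace(0.4, 0.6, 50)   # mean clearly above zero -> accept H1
print(granger_causes(null_css), granger_causes(causal_css))  # False True
```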
We calculate the CSS using (4) by incorporating MAPE_i and MAPE_i^I for each realization of the stochastic process before and after the intervention on x_j, j != i. Since we have multiple realizations of each time series, we obtain a distribution of CSS values for each pair of time series x_i and x_j. The null hypothesis H_0: x_j does not Granger-cause x_i is accepted if the mean of the distribution of CSS is close to zero. If the mean of the distribution is significantly different from zero, the alternative hypothesis H_1: x_j Granger-causes x_i is accepted. The causal relationship for each pair of time series is estimated iteratively, and ultimately the full causal graph of the multivariate time series is extracted.

V Experiments
We conducted experiments on synthetic as well as real datasets. Experiments are performed on multiple realizations of the time series. For synthetic data, we set the length of each realization of the time series to 3000. The prediction length T is set to 14. We trained our model for 150 epochs. The number of layers and the number of cells in each layer of the network are set to 4 and 40, respectively, while the dropout rate is –. We estimate causal links among time series by hypothesis testing on the distribution of CSS, as explained in Section IV. To quantify the performance of our method, we calculate two metrics, the false positive rate (FPR) and the F-score, where
FPR = FP / (FP + TN), (5)

F-score = 2TP / (2TP + FP + FN). (6)
Here, TP is the number of causal links that exist in the causal graph and are also detected by the method; TN is the number of causal links that neither exist in the causal graph nor are detected by the method; FP is the number of causal links that do not exist in the causal graph but are detected by the method; and FN is the number of causal links that exist in the causal graph but are not detected by the method.
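Given the ground-truth and predicted causal graphs as binary adjacency matrices, (5) and (6) can be computed entry-wise over the directed links:

```python
import numpy as np

def evaluate_graph(true_adj, pred_adj):
    """FPR and F-score of a predicted causal graph against the ground truth,
    counting directed links entry-wise (Eqs. (5) and (6))."""
    t = np.asarray(true_adj, dtype=bool)
    p = np.asarray(pred_adj, dtype=bool)
    tp = np.sum(t & p)
    fp = np.sum(~t & p)
    fn = np.sum(t & ~p)
    tn = np.sum(~t & ~p)
    fpr = fp / (fp + tn)
    fscore = 2 * tp / (2 * tp + fp + fn)
    return fpr, fscore

# Ground truth has links 0->1 and 1->2; the prediction adds a spurious 2->0.
true_adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
pred_adj = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
fpr, fscore = evaluate_graph(true_adj, pred_adj)
print(round(fpr, 3), round(fscore, 3))  # 0.143 0.8
```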
V-A Synthetic data
To test the proposed method on synthetic data, we generate four time series x_1, ..., x_4 with seasonal patterns and nonlinear relationships among them as follows. This synthetic model simulates a test model of climate-ecosystem interactions as suggested in [16], where x_1 denotes global radiation R, x_2 represents air temperature T, while x_3 and x_4 denote gross primary production (GPP) and ecosystem respiration, respectively.
(7) 
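A simplified simulation sketch in the spirit of this test model; the exact equations and parameter values of (7) follow [16], so the couplings, lags, and noise levels below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative stand-ins: x1 ~ radiation carries the seasonal cycle,
# x2 ~ temperature follows x1, and x3 ~ GPP depends nonlinearly on x1 and x2.
n = 3000
t = np.arange(n)
season = np.abs(np.sin(2 * np.pi * t / 150))  # seasonal driver, parameter 150
x1 = season + 0.30 * rng.normal(size=n)
x2 = np.zeros(n)
x3 = np.zeros(n)
c = 0.4  # nonlinear coupling strength, varied across simulations
for k in range(1, n):
    x2[k] = 0.5 * x2[k - 1] + 0.8 * x1[k - 1] + 0.35 * rng.normal()
    x3[k] = 0.5 * x3[k - 1] + c * x1[k - 1] * np.exp(-x2[k - 1] ** 2) \
            + 0.25 * rng.normal()
print(x1.shape, x2.shape, x3.shape)
```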
The coupling coefficients and time lags are denoted c_1, ..., c_5 and τ_1, ..., τ_5, respectively. The noise terms η, termed intrinsic or dynamical noise, represent uncorrelated, normally distributed noise. We set η as Gaussian with zero mean and variances [0.30, 0.35, 0.25]. The values of the coupling coefficients c_1 to c_4 and c_5 were set to [0.95, 0.80, 0.50, 0.75] and [0.2, 0.4, ..., 1.0], respectively. The lags were given only integer values in the range [0, 10], and the reference temperature of the test data model used in [16] is set to 10. The seasonal part of x_1 is a sinusoid whose frequency parameter takes the value 150 and whose time index varies from 0 to 3000. We vary c_5, the coefficient of the nonlinear relationship in the test model, in every simulation, while the rest of the parameters are kept intact. We are interested in the consistency and causal estimation capability of our method in comparison with PCMCI and VAR-GC as c_5 is increased.

V-B Real data
As a real-world application of our method, we carry out a causal analysis of the average daily discharges of rivers in the upper Danube basin, provided by the Bavarian Environmental Agency (https://www.gkd.bayern.de). We use measurements from three years (2017-2019) from the Iller at Kempten (x_1), the Danube at Dillingen (x_2), and the Isar at Lenggries (x_3), as considered by [24]. While the Iller flows into the Danube upstream of Dillingen, with the water from Kempten reaching Dillingen within approximately a day, the Isar reaches the Danube downstream of Dillingen. Based on this scenario, we expect a contemporaneous link x_1 -> x_2 and no direct causal links between the pairs (x_1, x_3) and (x_2, x_3). Since all variables may be confounded by rainfall or other weather conditions, this choice allows testing the ability of the methods to detect and distinguish directed and bidirected links.
V-C Results
For the synthetic data model, we demonstrate the results in Fig. 2, which shows the F-score values for PCMCI, VAR-GC, and DeepAR with all three types of substitution methods, denoted DeepAR-Knockoffs, DeepAR-OutDist, and DeepAR-Mean. The figure illustrates the performance of these methods in response to an increase in the coefficient of nonlinearity between the variables. VAR-GC assumes linearity and hence yields a lower F-score, as the dataset contains nonlinear relations among the variables of the synthetic data model. PCMCI, however, achieves better results compared to VAR-GC because it uses a nonlinear method, Gaussian process distance correlation (GPDC), for testing conditional independence. DeepAR-Knockoffs yields better results in terms of F-score compared to VAR-GC and PCMCI. DeepAR-OutDist suffers from a high false discovery rate because it uses out-of-distribution variables as substitutes and hence yields a lower F-score throughout the experiment. DeepAR-Mean performs better than DeepAR-OutDist but falls behind DeepAR-Knockoffs. The plot in Fig. 3 shows the FPR in response to an increase in the nonlinearity coefficient. It can be seen that VAR-GC yields a high FPR due to its assumption of linearity. Besides that, DeepAR-OutDist also produces a high number of false positives, because out-of-distribution samples introduce bias in the model and hence result in the detection of false causal links. PCMCI, DeepAR-Mean, and DeepAR-Knockoffs, however, have better false discovery rates.
For the real data, the expected contemporaneous link is correctly detected by DeepAR-Knockoffs, VAR-GC, and PCMCI, as shown in Table I. Our method does not report any spurious causal link here, whereas VAR-GC and PCMCI wrongly find one. PCMCI and VAR-GC also detected causal links between the remaining pairs, which may be because of the weather acting as a confounder. DeepAR-Knockoffs does not report a causal link between these pairs in any direction, as the method is better able to handle confounders such as seasonal patterns. Our method wrongly finds one causal link that VAR-GC and PCMCI do not detect. Due to extreme events in the river discharge data caused by heavy rainfall, the assumption of stationarity is expected to be violated, which may explain these false links. In comparison, DeepAR-Knockoffs correctly discovers the expected links in the river discharge data, except for a single false link.
Causal links    Expected    VAR-GC    PCMCI    DeepAR-Knockoffs
–               Yes         Yes       Yes      Yes
–               No          Yes       Yes      No
–               No          Yes       Yes      No
–               No          Yes       No       No
–               No          Yes       Yes      No
–               No          No        No       Yes
VI Conclusion
We proposed a novel method for inferring cause-effect relations in nonlinear multivariate time series. Since deep networks cannot handle out-of-distribution interventions, we proposed to use probabilistic forecasting with DeepAR in combination with knockoffs-based counterfactuals for estimating nonlinear Granger causality. We applied our method to synthetic as well as real river discharge datasets. Our results confirm that using knockoff samples to generate counterfactuals yields better results in comparison to using the mean and out-of-distribution samples as interventions. The results also indicate that the proposed method outperforms the VAR-GC and PCMCI methods in estimating nonlinear relations and dealing with confounders. It should be noted, however, that the better performance of the proposed method comes with the higher computational load associated with using deep networks, especially for large multivariate time series.
VII Acknowledgment
This work is funded by the Carl Zeiss Foundation within the scope of the program line Breakthroughs: Exploring Intelligent Systems for Digitization — explore the basics, use applications and the DFG grant SH 1682/11.
References
 [1] Pearl, Judea. Causal diagrams for empirical research. Biometrika 82.4 (1995): 669-688.
 [2] Pearl, Judea. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
 [3] Lauritzen, Steffen L. Causal inference from graphical models. Complex Stochastic Systems (2001): 63-107.
 [4] Dawid, A. Philip. Influence diagrams for causal modelling and inference. International Statistical Review 70.2 (2002): 161-189.
 [5] Granger, Clive W.J. Testing for causality: a personal viewpoint. Journal of Economic Dynamics and Control 2 (1980): 329-352.
 [6] Salinas, David, et al. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36.3 (2020): 1181-1191.

 [7] Goebel, Rainer, et al. Investigating directed cortical interactions in time-resolved fMRI data using vector autoregressive modeling and Granger causality mapping. Magnetic Resonance Imaging 21.10 (2003): 1251-1261.
 [8] Brovelli, Andrea, et al. Beta oscillations in a large-scale sensorimotor cortical network: directional influences revealed by Granger causality. Proceedings of the National Academy of Sciences 101.26 (2004): 9849-9854.
 [9] Lütkepohl, Helmut. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.
 [10] Lozano, Aurelie C., Naoki Abe, Yan Liu, and Saharon Rosset. Grouped graphical Granger modeling methods for temporal causal modeling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 577-586, 2009.
 [11] Teräsvirta, Timo, Dag Tjøstheim, and Clive William John Granger. Modelling Nonlinear Economic Time Series. Oxford University Press, 2010.
 [12] Tong, Howell. Nonlinear time series analysis. (2011): 955-958.
 [13] Bose, Eliezer, Marilyn Hravnak, and Susan M. Sereika. Vector autoregressive (VAR) models and Granger causality in time series analysis in nursing research: dynamic changes among vital signs prior to cardiorespiratory instability events as an example. Nursing Research 66.1 (2017): 12.
 [14] Tank, A., Covert, I., Foti, N., Shojaie, A., and Fox, E. Neural Granger causality for nonlinear time series. Stat 1050 (2018): 16.

 [15] Nauta, Meike, Doina Bucur, and Christin Seifert. Causal discovery with attention-based convolutional neural networks. Machine Learning and Knowledge Extraction 1.1 (2019): 312-340.
 [16] Krich, Christopher, Jakob Runge, Diego G. Miralles, Mirco Migliavacca, Oscar Perez-Priego, Tarek El-Madany, Arnaud Carrara, and Miguel D. Mahecha. Estimating causal networks in biosphere-atmosphere interaction with the PCMCI approach. Biogeosciences 17.4 (2020): 1033-1061.
 [17] Krich, Christopher, Mirco Migliavacca, Diego G. Miralles, Guido Kraemer, Tarek S. El-Madany, Markus Reichstein, Jakob Runge, and Miguel D. Mahecha. Functional convergence of biosphere-atmosphere interactions in response to meteorological conditions. Biogeosciences 18.7 (2021): 2379-2404.
 [18] Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., and Sejdinovic, D. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances 5.11 (2019): eaau4996.
 [19] Tonekaboni, Sana, Shalmali Joshi, David Duvenaud, and Anna Goldenberg. Explaining time series by counterfactuals. (2019).
 [20] Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013).
 [21] Delaney, Eoin, Derek Greene, and Mark T. Keane. Instance-based counterfactual explanations for time series classification. arXiv preprint arXiv:2009.13211 (2020).
 [22] Romano, Yaniv, Matteo Sesia, and Emmanuel Candès. Deep knockoffs. Journal of the American Statistical Association 115.532 (2020): 1861-1872.
 [23] Barber, Rina Foygel, and Emmanuel J. Candès. Controlling the false discovery rate via knockoffs. Annals of Statistics 43.5 (2015): 2055-2085.
 [24] Gerhardus, Andreas, and Jakob Runge. High-recall causal discovery for autocorrelated time series with latent confounders. arXiv preprint arXiv:2007.01884 (2020).
 [25] Granger, Clive W.J. Some recent developments in a concept of causality. Journal of Econometrics 39.1-2 (1988): 199-211.
 [26] Vantas, Konstantinos, Epaminondas Sidiropoulos, and Chris Evangelides. Estimating rainfall erosivity from daily precipitation using generalized additive models. Environmental Sciences Proceedings 2.1. Multidisciplinary Digital Publishing Institute, 2020.
 [27] Geweke, John. Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association 77.378 (1982): 304-313.
 [28] Lawrence, Andrew R., Marcus Kaiser, Rui Sampaio, and Maksim Sipos. Data generating process to evaluate causal discovery techniques for time series data. arXiv preprint arXiv:2104.08043 (2021).
 [29] Popescu, Oana-Iuliana, Maha Shadaydeh, and Joachim Denzler. Counterfactual generation with knockoffs. arXiv preprint arXiv:2102.00951 (2021).

 [30] Gimenez, Jaime Roquero, Amirata Ghorbani, and James Zou. Knockoffs for the mass: new feature importance statistics with false discovery guarantees. The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019.
 [31] Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. In Advances in Neural Information Processing Systems, Vol. 31 (eds. Bengio, S. et al.), pp. 9472-9483. Curran Associates, 2018.
 [32] Lachapelle, S., Brouillard, P., Deleu, T., and Lacoste-Julien, S. In Proc. Eighth International Conference on Learning Representations (ICLR), 2020.
 [33] Louizos, C., Shalit, U., Mooij, J., Sontag, D., Zemel, R., and Welling, M. Causal effect inference with deep latent-variable models. arXiv preprint arXiv:1705.08821 (2017).

 [34] Trifunov, Violeta Teodora, Maha Shadaydeh, Jakob Runge, Veronika Eyring, Markus Reichstein, and Joachim Denzler. Nonlinear causal link estimation under hidden confounding with an application to time series anomaly detection. In German Conference on Pattern Recognition, pp. 261-273. Springer, Cham, 2019.
 [35] Barber, R.F., Candès, E.J., and Samworth, R.J. Robust inference with knockoffs. The Annals of Statistics 48.3 (2020): 1409-1431.
 [36] Granger, C.W.J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37 (1969): 424-438.
 [37] Shadaydeh, Maha, Yanira Guanche Garcia, Miguel D. Mahecha, Markus Reichstein, and Joachim Denzler. Causality analysis of ecological time series: a time-frequency approach. In 8th International Workshop on Climate Informatics: CI 2018, pp. 111-114, 2018.
 [38] Shadaydeh, Maha, Joachim Denzler, Yanira Guanche García, and Miguel Mahecha. Time-frequency causal inference uncovers anomalous events in environmental systems. In German Conference on Pattern Recognition, pp. 499-512. Springer, Cham, 2019.
 [39] Luo, Y., Peng, J., and Ma, J. When causal inference meets deep learning. Nature Machine Intelligence 2.8 (2020): 426-427.
 [40] Pearl, J., and Mackenzie, D. The Book of Why. Penguin Books, 2019.