# Telling cause from effect in deterministic linear dynamical systems

Inferring a cause from its effect using observed time series data is a major challenge in natural and social sciences. Assuming the effect is generated by the cause trough a linear system, we propose a new approach based on the hypothesis that nature chooses the "cause" and the "mechanism that generates the effect from the cause" independent of each other. We therefore postulate that the power spectrum of the time series being the cause is uncorrelated with the square of the transfer function of the linear filter generating the effect. While most causal discovery methods for time series mainly rely on the noise, our method relies on asymmetries of the power spectral density properties that can be exploited even in the context of deterministic systems. We describe mathematical assumptions in a deterministic model under which the causal direction is identifiable with this approach. We also discuss the method's performance under the additive noise model and its relationship to Granger causality. Experiments show encouraging results on synthetic as well as real-world data. Overall, this suggests that the postulate of Independence of Cause and Mechanism is a promising principle for causal inference on empirical time series.

Comments

There are no comments yet.

## Authors

• 4 publications
• 37 publications
• 2 publications
• 11 publications
• ### Causal Inference Using Linear Time-Varying Filters with Additive Noise

Causal inference using the restricted structural causal model framework ...
12/23/2020 ∙ by Kang Du, et al. ∙ 0

read it

• ### Causal Inference on Time Series using Structural Equation Models

Causal inference uses observations to infer the causal structure of the ...
07/21/2012 ∙ by Jonas Peters, et al. ∙ 0

read it

• ### Error Asymmetry in Causal and Anticausal Regression

It is generally difficult to make any statements about the expected pred...
10/11/2016 ∙ by Patrick Blöbaum, et al. ∙ 0

read it

• ### A Critical View of the Structural Causal Model

In the univariate case, we show that by comparing the individual complex...
02/23/2020 ∙ by Tomer Galanti, et al. ∙ 16

read it

• ### Detecting and Explaining Causes From Text For a Time Series Event

Explaining underlying causes or effects about events is a challenging bu...
07/27/2017 ∙ by Dongyeop Kang, et al. ∙ 0

read it

• ### A simple test for causality in complex systems

We provide a new solution to the long-standing problem of inferring caus...
05/04/2020 ∙ by Kristian Agasøster Haaga, et al. ∙ 0

read it

• ### Causal Inference in Nonverbal Dyadic Communication with Relevant Interval Selection and Granger Causality

Human nonverbal emotional communication in dyadic dialogs is a process o...
10/29/2018 ∙ by Lea Müller, et al. ∙ 1

read it

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A major challenge in the study of complex natural systems is to infer the causal relationships between elementary characteristics of these systems. This provides key information to understand the underlying mechanisms at play and possibly allows to intervene on them to influence the overall behavior of the system. While causal knowledge is traditionally built by performing experiments, boiling down to modifying a carefully selected parameter of the system and analyzing the resulting changes, many natural systems do not allow such interventions without tremendous cost or complexity. For example, it is very difficult to influence the activity of a specific brain region without influencing other properties of the neural system [15]. Causal inference methods have been developed to avoid such intervention and infer the causal relationships from observational data only [23, 18]. To be able to build such knowledge without interventions, these approaches have to rely on key assumptions pertaining to the mechanisms generating the observed data.

The framework of causality in [23, 18]

has originally addressed this question by modelling observations as i.i.d. random variables. However, observed data from complex natural system are often not i.i.d. and time dependent information reflects key aspects of those systems. Most causal inference methods for time series, including the most widely used Granger causality

[8], assume the data is generated from a stochastic model through a structural equation linking past values to future ones through an i.i.d. additive noise term, the “innovation of the process” [8, 19]

. While these methods can successfully estimate the causal relationships when empirical data is generated according to the model assumptions, the results can be misleading when the model is misspecified. In particular, this is the case when unknown time lags are introduced in the measured time series.

In this paper, we introduce a new approach to inferring causal directions in time series, the Spectral Independence Criterion (SIC). The idea behind SIC, as well as several new approaches to causal inference [7, 11, 10, 26], is to rely on the ‘philosophical’ principle that the cause and the mechanism that generates the cause from the effect are chosen independently by Nature. Thus, these two objects should not contain any information about each other [12, 14, 21]. Here, we refer to this abstract principle as the postulate of Independence of Cause and Mechanism (ICM). The above mentioned methods relying on ICM refer to different domains and rely on quite different formalizations of the concept of “independence”. SIC formalizes the ICM postulate in the context where both cause and effect are stationary time series and the cause generates the effect trough a linear time invariant filter. The SIC postulate assumes that the frequency spectrum of the cause does not correlate with the transfer function of the filter. This assumption is justified by its connection to the Trace Method [10] and by a generative model of the system. Under this postulate, we prove that SIC can tell the causal direction of the system from its anti-causal counterpart. Moreover, we elaborate on the connection between this novel framework and linear Granger causality, showing they are exploiting fundamentally different information from the observed data. In addition, superiority to Granger causality is shown analytically in the context of time series measurements perturbed by an unknown time lag. We perform extensive experimental comparisons, both on simulated and real datasets. In particular, we show that our approach outperforms Granger causality to estimate the direction of causation between to structures of rat hippocampus using Local Field Potential (LFP) recordings.

Overall, the proposed method offers a new approach to causal inference for time series data with identifiability results, and shows unprecedented robustness to measurement delays. The promising empirical results suggest the SIC postulate is a reasonable assumption for empirical data, and that it should be further exploited to develop novel causal inference techniques.

## 2 Spectral Independence Criterion (SIC)

### 2.1 Notations and model description

We refer to a sequence of real or complex numbers as a deterministic time series

. Its discrete Fourier transform is defined by

 ˆa(ν)=∑t∈Zatexp(−i2πνt),ν∈[−1/2,1/2]=:I

The energy of the deterministic time series is the squared norm:

. For ease of notation we will also use the Z-transform of

a

 ˜a(z)=∑t∈Zatz−t,z∈C

such that .

We assume that the causal mechanism is given by a (deterministic) Linear Time Invariant (LTI) filter. That is, the causal mechanism is formalized by the convolution

 y={yt}={∑τ∈Zxt−τhτ}=% x∗h, (1)

where h denotes the impulse response, x the input time series and y the output. We will assume that the filter satisfies the Bounded Input Bounded Output (BIBO) stability property [20], which boils down to the condition . Under this assumption, the Fourier transform is well defined and we call it the transfer function of the system.

We assume that the input time series x is a sample drawn from a stochastic process, . For a given index , represents the random variable at index . We use or simply X to represent the complete stochastic process. We use

to indicate the random vector corresponding to the restriction of the time series to the integer interval

. We use to indicate the random vector corresponding to the restriction of the time series to the integer interval . Assuming X is a zero mean stationary process (in this paper, stationary will always stand for weakly or wide-sense stationary [5]), we will denote by the autocovariance function of the process and assume it is absolutely summable. Then, we can define its Power Spectral Density (PSD) . Under these assumptions, the power of the process is finite and , such that belongs to . Moreover, we recall the following basic properties for our model:

###### Proposition 1.

Assume the weakly stationary input X is filtered by the BIBO linear system of impulse response h to provide the output Y. Then , and Y is weakly stationary with summable autocovariance such that

 Syy(ν)=|ˆh(ν)|2Sxx(ν),ν∈I (2)
###### Proof.

Results from elementary properties of the Fourier transform and Proposition 3.1.2. in [5]. ∎

If such a linear filtering relationship exists for X as input and Y as output, but not in the opposite way, we can use this information to infer that X is causing Y and not the other way round. If there are such impulse responses exist for both directions, say and , their Fourier transforms are related by

 ˆhX→Y=1ˆhY→X,

and we have to resort to a more refined criterion for the causal inference. We will assume this situation in the remaining of the paper.

### 2.2 Definition of SIC

Assume we are given the two processes and . Moreover, we assume that exactly one of the following two alternatives is true: (1) X causes Y or (2) Y causes X. We assume that there are no unobserved common causes of X and Y. Our causal inference problem thus reduces to a binary decision. In the spirit of ICM, we assume that in case (1), X and h should not contain information about each other and our Spectral Indpendance Criterion (SIC) assumes that the input power does not correlate with the amplifying factor, that is,

 ⟨Sxx|ˆh|2⟩=⟨Sxx⟩⟨|ˆh|2⟩, (3)

where denotes the average over the unit frequency interval . Note that the left hand side of (eq. 3) is the average intensity of the output signal over all frequencies. Hence, SIC states that the average output intensity is the same as amplifying all frequencies by the average amplifying factor. To motivate why we call (eq. 3) an independence condition we note that the difference between the left and the right hand side can be written as a covariance:

 ⟨Sxx⋅|ˆh|2⟩−⟨Sxx⟩⟨|ˆh|2⟩=Cov(Sxx,|ˆh|2).

were we consider and as functions of the random variable uniformly distributed on . As a consequence statistical independence between those random variables implies that (eq. 3) is satisfied.

Note that the criterion (eq. 3) can be rephrased in terms of the power spectra of X and Y alone using (eq. 2), which are closer to observable quantities than :

###### Postulate 1 (Spectral Independence Criterion).

If Y is generated from X by a linear deterministic translation invariant system then we have:

 ⟨Syy⟩=⟨Sxx⟩⟨Syy/Sxx⟩. (4)

### 2.3 Quantifying violation of SIC

This motivates us to define a measure of dependence between the input PSD on one hand and transfer function of the mechanism on the other hand. To asses to what degree such a relation holds we introduce a scale invariant expression , that we call the spectral dependency ratio (SDR) from X to Y:

 ρX→Y:=⟨Syy⟩⟨Sxx⟩⟨Syy/Sxx⟩ (5)

Here, the value means independence, which becomes more obvious by rewriting (eq. 5) as

 ρX→Y=Cov[Sxx,|ˆh|2]⟨Sxx⟩⟨Syy/Sxx⟩+1.

Finally, we note that can be written in terms of total power and energy:

 ρX→Y=P(Y)P(X)||h||22

We then define by exchanging the roles of X and Y:

 ρY→X:=⟨Sxx⟩⟨Syy⟩⟨Sxx/Syy⟩ (6)

### 2.4 Identifiability results

In order to identify the true causal direction from SIC, it is necessary to show that and take characteristic values that are informative about this inference problem. The following first result shows explicitly how dependence measures in both directions are related:

###### Proposition 2.

(Forward-backward inequality) For a given linear filter with input PSD , output PSD and a non-constant modulus transfer function we have

 ρX→Y.ρY→X<1. (7)

Moreover, if ,

then

 ρX→Y.ρY→X≤⎡⎢⎣1+α∫I(|ˆh(ν)|2−∥h% ∥22∥h∥22)2dν⎤⎥⎦−1<1. (8)

Proof of this proposition is given in supplementary material. Note that corresponds to the mean value of the transfer function due to Parseval’s theorem. According to equation eq. 8, the less constant is, the more the product of the independence measures will be inferior to . Assuming the SIC postulate is satisfied in the forward direction such that , it follows naturally that . The two causal directions can thus be distinguished well whenever the transfer function deviates significantly from its mean value such that is bounded away from 1. We then infer the causal direction to be the one with the largest value.

To further support that SDR values can be used empirically for causal inference, we need the SIC postulate to be approximately satisfied (see (eq. 4)) in systems generated according to the ICM principle. We now describe a model where h is generated by some random process, independently of X. To this end, assume we start with a Finite Impulse Response (FIR) h, that is, for all , for some . Then h is given by real numbers such that

 hi=bii=0,…,m−1.

We then apply an orthogonal transformation U, randomly drawn from the orthogonal group according to the ‘uniform distribution’ on , that is, the Haar measure. In this way, we generate a new impulse response function

 h′i:=(Ub)ii=0,…,m−1. (9)

Since orthogonal transformations preserve the Euclidean norm by definition, they preserve the energy of the filter. Our procedure thus chooses a random filter among the set of filters having the same support of length and the same energy. We now show that for large

the resulting filter will approximately satisfy SIC with high probability:

###### Theorem 1.

(concentration of measure for FIR filters) For some fixed , let be the dependence measure obtained for in (eq. 9). If is chosen from the Haar measure on , then for any given

 |ρUX→Y−1|≤2εP(X)maxνSxx(ν).

with probability where is a positive global constant independent of , , X and Y.

Proof of this theorem is provided in supplementary material. This result provides a justification for using SIC provided that the dimension of the vector of filter coefficients is large enough. The relevance of will be investigated in practice in the experimental section.

### 2.5 Relation to the Trace Condition

We now describe the relation between SIC and a causal inference tool called Trace Method [10]. Let and be -dimensional variables, related by the linear structural equation

 Y=AX+E,

where is an structure matrix and is a -dimensional noise variable independent of . [10] postulate the following independence condition between the covariance matrix of input distribution and :

###### Postulate 2 (Trace Condition).
 τm(AΣXAT)=τn(ΣX)τn(ATA), (10)

approximately, where denotes the renormalized trace .

The postulate can be justified by random matrix theory with large

when and are independently chosen according to priors satisfying appropriate symmetry assumptions [10]. In the association between SIC and trace method we only consider square matrices and therefore .

To quantify the violation of (eq. 10) we introduce the following quantity:

###### Defnition 1 (Tracial Dependency Ratio (TDR)).

The tracial dependency ratio is given by

 rX→Y:=τn(AΣXAT)τn(ΣX)τn(ATA). (11)

We thus can see that the tracial ratio plays a role analog to our spectral dependency ratio in the finite dimensional case. We can actually show that SIC can be viewed as a limit case of the Trace Condition by defining the following truncated system.

###### Defnition 2.

To any given infinite dimensional linear system , the truncated system of order is defined by zeroing the input and the output values for integers such that :

 X′N=X−N:N−1↦Y′N=(h∗X′N)−N:N−1,

Note that in this definition for each , the vectors are inherently different. The mapping defined in this way is linear and can be written as with , such that the trace method can be applied to it. We then have the following result showing that SIC can be obtained from the Trace Condition as an appropriate limit:

###### Theorem 2.

Let represent the tracial ratio for the truncated systems of order for a given linear system with SDR . Then

 limN→∞rX′N→Y′N=ρX→Y

The proof, together with two necessary lemmas is available in supplementary material.

## 3 SIC for vector autoregressive models

SIC and Granger causality rely on completely different assumptions but both apply to linear time series models. In this section, we study the classical Vector Autoregressive (VAR) model used in Granger causality from the SIC perspective to better understand the relation.

### 3.1 VAR model

We assume the observed time series are generated by a VAR model such that Granger causes .

 Xt = ∑kakXt−k+ϵt (12) Yt = ∑kbkYt−k+∑kckXt−k+ξt (13)

Both noise terms and in this expression are i.i.d normal noises.

### 3.2 Applying SIC to VAR models

We want to rewrite this expression such that Y is obtained from X by a deterministic linear time invariant filter. We observe that the VAR model can be cast as linear time invariant filter if we neglect the additive noise . Indeed, then the mechanism is the following ARX (AutoRegressive with eXogenous input) model [13].

 Yt=∑kbkYt−k+∑kckXt−k (14)

Using basic properties of the Z-transform, we can derive the following analytic expressions of the input PSD :

 Sxx(ν)=|ˆn(ν)|2=|˜n(exp(2πiν))|2,

with

 ˜n(z)=11−∑kakz−k.

Moreover, the transfer function corresponding to the mechanism in equation eq. 14 is

 ˜m(z)=∑kckz−k1−∑kbkz−k

As a consequence, testing SIC on the VAR model in the forward direction amounts (when neglecting the filtered noise ), to test independence between

 |ˆh(ν)|2=|˜m(exp(2πiν))|2 (15)

and

 Sxx(ν)=|~n(exp(2πiν)|2, (16)

which are parametrized by the coefficients and respectively. We conjecture that a concentration of measure result similar to Theorem 1 holds stating that independent choice of the coefficients from an appropriate symmetric distribution typically yields small correlations between (eq. 15) and (eq. 16). This will be tested empirically in the Experiments section. Additionally, the robustness of our approach to noise in the VAR model will be addressed extensively in a longer version of this manuscript.

### 3.3 Comparison of SIC and Granger causality

The bivariate VAR model above is the typical model where Granger causality works. To recall the idea of the latter, note that it infers that there is an influence from X to Y whenever predicting Y from its past is improved by accounting for the past of X. Rephrasing this in terms of conditional independences, X is inferred to cause Y whenever is not conditionally independent of , given . Within the context of the above linear model, knowing

reduces the variance of

, given because then the noise is the only remaining source of uncertainty. Without knowing , we have additional uncertainty due to the contribution of .

SIC, on the other hand, does not rely on detecting whether X helps in improving the prediction of Y. As demonstrated above, SIC applied to a bivariate VAR model boils down to quantifying independence between two linear filters defined by set of coefficients, the filter generating the input with transfer function and the filter of the mechanism with transfer function . This is a completely different concept. One can easily imagine that the coefficients and can be hand-designed such that the functions (eq. 15) and (eq. 16) are correlated. This would spoil SIC, but leave Granger unaffected. On the other hand, the following subsection describes a scenario where Granger fails but SIC still works.

### 3.4 Sensitivity to Time Lag

Consider two time series and where

is a white noise and

 ∀t∈Z,\indentYt=cYt−1+Xt−1,

for a given . It can be easily seen that this type of input and output can be simulated using an IIR filter with and in (eq. 17) and the rest of the coefficients are zero (please refer to the definition of coefficients in section section 4.1). The infinite DAG for this causal structure can be seen in fig. 1.

Now if there would be a measurement delay of length for Y, the observed values will be a new time series, say , where . Although the ground truth is independent of , Granger causality only infers the correct causal structure if (where there is a lag in measurement of X, but not Y). However SIC always infers the correct direction (except when and the time structure is spoiled). This is because the PSD of the white noise X is constant and depends only on the total power, i.e,

 Sxx(ν)=Var(Xt)=P{X},

for all . and obviously, this constant remains the same for the lagged time series. Thus, SIC correctly identifies the causal structure (except when in which case the dependence to time is completely spoiled).

## 4 Experiments

In this section we study our causal inference algorithm using synthetic experiments and apply it to several real world data sets.

### 4.1 Synthetic Data: ARMA filters and processes

We designed synthetic experiments to assess the validity of the SIC approach. The data generating process is as follows. The LTI system modeling the mechanism is chosen among the family of ARMA(,) filters with parameters defined by input-output difference equation:

 yn=1a0(FO∑i=0bixn−i+BO∑j=1ajyn−j). (17)

For these filters is known as the feedforward order and is the feedback order. ’s and ’s are known as feedback and feedforward coefficients respectively. Note that when , the system is called and autoregressive filter. Alternatively, corresponds to the family of Finite Impulse Response (FIR) or Moving Average filters. Whenever , the filter has Infinite Impulse Response (IIR). The input of the causal model will be chosen among the family of ARMA(,) processes, which are generated by filtering an i.i.d noise input with an ARMA(,) filter. We thus chose two filters and , with parameters and respectively. To simulate a cause effect pair X,Y, we generated the cause X by applying

to a normally distributed i.i.d noise. Then, we generated

Y by applying to X. The feedforward and feedback orders of both systems and were chosen identical in all experiments.

In each trial all the elements of vectors a, , b and except the first ones (i.e.

which were fixed to one) were sampled from an isotropic multidimensional Gaussian distribution with variance

. Coefficients are sampled using rejection sampling such that only BIBO-stable filters are kept.

We simulated sequences of length . The PSD of X and Y were estimated using Welch’s method [24]. We repeated this experiment times. Figure fig. 2 shows an example of the distribution of and and of their difference using .

The SDR for the correct direction of is concentrated around one, while in the wrong direction the estimator stays less inferior to one for most of its probability mass (in this example ). This results in a positive difference between SDR for most of the probability mass. Accordingly, our inference algorithm based on the sign of this difference algorithm 1 will select the correct direction in most of the cases.

Based on this inference algorithm, we test the effect of the filter orders on the performance of the method, where we evaluate the performance of each setting of and over 1000 trials. We varied the orders between and and compared the performance of the cases , and . Considering that the experiments are independent and based on the assumption that our method is successful with probability where

has a binomial distribution, we calculated confidence intervals using Wilson’s score interval

[25] where (and therefore ). The performance increases rapidly with filter order, as can be seen in the plots of  fig. 3. Moreover, the feedforward filter coefficients seem the most beneficial to the approach, since their absence leads to the worst performance ( fig. 3 red line).

### 4.2 Real World Examples

We tried our method over several examples of real data where the ground truth about the causal structure of the data is known a priori and the data is labeled in a way that the ground truth is . In the first two examples we plotted the difference of SDR in both directions as a function of the window length used in Welch method which can be seen in   fig. 4.

#### 4.2.1 Gas Furnace [4]

This dataset consists in 296 time points, with X the gas rate consumed by a gas furnace and Y the produced rate of .  fig. 4 shows against the window length, which was ranging from to points. As illustrated, the difference is always positive and our method is able to correctly infer the right causal direction independent of window length. TiMiNO and Granger causality correctly identified the ground truth in this case as well [19].

#### 4.2.2 Old Faithful Geyser [2]

: X contains the duration of an eruption and Y is the time interval to the next eruption of the Old Faithful geyser. Figure fig. 4 represents the difference in SDRs as a function of window length with the same configuration as the gas furnace experiment. Again the correct causal direction is inferred by our method independently from the window length as illustrated in  fig. 4. In this case TiMiNO correctly identifies the cause from effect but neither linear nor non-linear Granger causality infer the correct causal direction [19].

#### 4.2.3 LFP recordings of the Rat Hippocampus

It is known that contrary to neocortex where connectivity between areas is bidirectional, monosynaptic connections between several regions of the hippocampus are mostly unidirectional [1]. An important example of such connectivity is between the CA3 and CA1 subfields [1]. Despite this anatomical fact, a study of causality based on Local Field Potential (LFP) recordings of CA1 and CA3 of the hippocampus of the rat during sleep reports that Granger causality infers strong bidirectional relations between the two areas [3]. [3] explains the possible reasons of such result as feedback loops involving cortex and medial septum, and diffuse connections going from CA1 to CA3.

To do a comparison with Granger causality, we applied our framework to recordings from those regions using a publicly available dataset [16, 17]. LFP’s were recorded using a shank probe having channels downsampled to Hz. Shanks were attributed by experimentalists to the CA1 and CA3 areas (leaving channels for each area). For more information on the details of gathered data please refer to [17]. We used the data for rat “vvp01” during a period of sleep and a period of active behaviour in a linear environment. We applied linear Granger causality using an implementation from the statsmodel Python library222Statsmodels: Statistical library for Python. More details on null hypothesis for Granger causality can be found on the website.. We considered a forced decision scheme for Granger causality (to make it comparable to our method), were we select the correct Granger causal direction as the one having the lowest

-value for the null hypothesis of absence of causal influence. Following the usual methodology of causality analysis

[3, 6] we divided the duration of ten minutes into intervals of two seconds () to reduce the effect of nonstationarity in data analysis, and performed SIC causal inference on each interval for each electrode pair. We took two different approaches to report assess the performance of methods: one, based on a majority vote over all intervals for each channel pair, and two, by assessing the average performance based on individual time intervals. The results are plotted as histograms in   fig. 5 and they show that SIC clearly outperforms Granger causality on this dataset. The confidence intervals are once again based on Wilson score but obviously this time the in dependancy assumption between the trials is not well justified, specially for pooling all the results.

#### 4.2.4 Characterizing the Echo

The echo effect of a room over a sound generated in the room can be well estimated by a convolution of the real signal with a function known as room impulse response function. In this experiment we used an open source database of room Impulse Response Function (IRF) available at the Open AIR library. We chose the IRFs for Elevden Hall, Elevden, Suffolk, England and Hamilton Mausoleum, Hamilton, Scotland. We convolved these signals with seconds segments of two classical music pieces: the first movement of Vivaldi’s Winter Concerto consisting of data points, and the Lacrimosa of Mozart’s Requiem, consisting of points, both ‘.wav’ files with the rate of Hz. Regardless of the segment the SDR in forward direction is considerably larger than the SDR in the backward direction as can be seen in fig. 6.

In another experiment we used a computer to play the musical pieces above in an academic Lecture Hall (labelled as “Hall” in plots) and in an office room (labelled as “Room” in plots) and recorded the echoed version in the environment. In a series of different tests, we split the data into pieces, and we ignored the last piece so that all the pieces would have an equal length. In each test we averaged the performance of our causal inference method over all the segments and plotted this performance against the size of the window length in Welch method. The window size was varied between and half of the length of the music segment length (which is dependent on the number of segments). The results can be found in  fig. 7 and show a very good performance of the approach for large window lengths .

## 5 Conclusion

We have introduced a causal discovery method for time series based on the SIC postulate, assuming a LTI relationship for a given pair of time series X and Y, such that either or . Theoretical justifications are provided for this postulate to lead to identifiability. Interestingly, the method provides and extension of the recently proposed Trace Method approach to the time series setting. Encouraging experimental results have been also presented on real world and synthetic data. Specially this method proved to be more effective than linear Granger causality on LFP recordings from CA1 and CA3 hippocampal areas of rat’s brain, assuming a ground truth causal direction from CA3 to CA1 based on anatomy. We suggest that this method can provide a new perspective for causal inference in time series based on assumptions fundamentally different from Granger causality. We will address the existence of confounders, establish a statistical significance test (for example using a procedure inspired by [26]), and extend this method to multivariate time series in future work.

## References

• [1] P. Andersen, R. Morris, D. Amaral, T. Bliss, and J. O’Keefe. The hippocampus book. Oxford University Press, 2006.
• [2] A Azzalini and AW Bowman. A look at some data on the old faithful geyser. Applied Statistics, pages 357–365, 1990.
• [3] LA Baccala, K Sameshima, G Ballester, AC Do Valle, and C Timo-Iaria. Studying the interaction between brain structures via directed coherence and granger causality. Applied Signal Processing, 5(1):40, 1998.
• [4] G EP Box, G M Jenkins, and G C Reinsel. Time series analysis: forecasting and control. John Wiley & Sons, 2013.
• [5] Peter J Brockwell and Richard A Davis. Time series: theory and methods. Springer Science & Business Media, 2009.
• [6] Alex J Cadotte, Thomas B DeMarse, Thomas H Mareci, Mansi B Parekh, Sachin S Talathi, Dong-Uk Hwang, William L Ditto, Mingzhou Ding, and Paul R Carney. Granger causality relationships between local field potentials in an animal model of temporal lobe epilepsy. Journal of neuroscience methods, 189(1):121–129, 2010.
• [7] P Daniusis, D Janzing, J Mooij, J Zscheischler, B Steudel, K Zhang, and B Schölkopf. Inferring deterministic causal relations.

P. Grünwald and P. Spirtes (Editors). Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010). AUAI Press. ISBN 978-0-9749039-6-5.

• [8] CWJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
• [9] R M Gray. Toeplitz and circulant matrices: A review. now publishers inc, 2006.
• [10] D Janzing, PO Hoyer, and B Schölkopf. Telling cause from effect based on high-dimensional observations. In Johannes Fürnkranz and Thorsten Joachims, editors,

Proceedings of the 27th International Conference on Machine Learning (ICML-10)

, pages 479–486, Haifa, Israel, June 2010. Omnipress.
• [11] D Janzing, J Mooij, K Zhang, J Lemeire, J Zscheischler, P Daniušis, B Steudel, and B Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31, 2012.
• [12] D Janzing and B Schölkopf. Causal inference using the algorithmic Markov condition. Information Theory, IEEE Transactions on, 56(10):5168–5194, 2010.
• [13] K J Keesman. System identification: an introduction. Springer Science & Business Media, 2011.
• [14] J. Lemeire and D. Janzing. Replacing causal faithfulness with algorithmic independence of conditionals. Minds and Machines, pages 1–23, 7 2012.
• [15] N. K. Logothetis, M. Augath, Y. Murayama, A. Rauch, F. Sultan, J. Goense, A. Oeltermann, and H. Merkle. The effects of electrical microstimulation on cortical signal propagation. Nature Neuroscience, 13:1283–1291, 2010.
• [16] K Mizuseki, A Sirota, E Pastalkova, and G Buzsáki. Theta oscillations provide temporal windows for local circuit computation in the entorhinal-hippocampal loop. Neuron, 64(2):267–280, 2009.
• [17] K Mizuseki, A Sirota, E Pastalkova, K Diba, and G Buzsáki. Multiple single unit recordings from different rat hippocampal and entorhinal regions while the animals were performing multiple behavioral tasks. crcns.org, 2006.
• [18] J Pearl. Causality: models, reasoning and inference, volume 29. Cambridge Univ Press, 2000.
• [19] J Peters, D Janzing, and B Schölkopf. Causal inference on time series using restricted structural equation models. In Advances in Neural Information Processing Systems, pages 154–162, 2013.
• [20] J G Proakis. Digital signal processing: principles algorithms and applications. Pearson Education India, 2001.
• [21] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1255–1262, New York, NY, USA, 2012. ACM.
• [22] D. Serre. Matrices: Theory and Applications. Graduate Texts in Mathematics. Springer, 2010.
• [23] P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction, and search (Lecture notes in statistics). Springer-Verlag, New York, NY, 1993.
• [24] P D Welch. The use of fast fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on audio and electroacoustics, 15(2):70–73, 1967.
• [25] E B Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212, 1927.
• [26] J Zscheischler, D Janzing, and K Zhang.

Testing whether linear equations are causal: A free probability theory approach.

In UAI’11, pages 839–846, 2011.

## Supplementary Material

We have prepared an appendix to address the proofs for Proposition 2, theorems 2 and 1 which we provide in two different sections. For this purpose we will use a few extra notations which we define here. We will also use and interchangeably for the normalized trace of a square matrix of order .

## 6 Proof of Proposition 2

###### Lemma 1.

For positive, non-constant, such that , we have

 ∫If(x)2dx.∫I1f(x)2dx>1
###### Proof.

Using Cauchy-Schwartz inequality for the scalar product

 ⟨f(x),1f(x)⟩=∫If(x).1f(x)dx=1.

Inequality is strict since and are not collinear (otherwise would be constant). ∎

###### Lemma 2.

Let be positive, non-constant, such that and .

Assume ,

then

 ∫If(x)dx.∫I1f(x)dx≥1+α∫I(f(x)−1)2dx
###### Proof.

We denote . Then and

 ∫If(x)dx.∫I1f(x)dx−1=∫I−s(x)1+s(x)dx

For , we have

 −x1+x≥x2−x3−x. (18)

The function on the l.h.s. of (eq. 18) is convex because its second order derivative is positive and using its tangent in , we get

 ∫If(x)dx.∫I1f(x)dx−1≥∫Is(x)2(1−s(x))dx

Since ,

 ∫If(x)dx.∫I1f(x)dx−1≥α∫Is(x)2dx

###### Proof of Proposition 2.

By using the definition of Spectral Dependency Ratios and Lemma 1 we get

 ρX→YρY→X=1⟨|^h|2⟩⟨1/|^h|2⟩<1

Moreover, applying Lemma 2 to we get inequality eq. 8. ∎

## 7 Proof of Theorem 1

To prove this theorem we rely on a theorem from [10] and a corollary that we derive from it.

###### Theorem 3 (concentration of measure for finite dimensional linear relationships).

[10] Suppose is a given covariance matrix and suppose is also a given matrix. Then if one generates

by uniformly choosing an orthogonal matrix

from then together with , satisfies trace condition in probability when tends to infinity. More precisely for a given there exist , being a constant where

 |τm(AΣXA⊤)−τn(ΣX)τm(AA⊤)|=|τm(AUΣU⊤A⊤)−τn(Σ)τm(AA⊤)|≤2ε∥Σ∥∥AA⊤∥

holds with probability .

In the above theorem (and the rest of the document), applied to a matrix will refer to the operator norm. The following corollary is a direct consequence of the previous theorem:

###### Corollary 1.

Suppose is a given covariance matrix and suppose is also a given matrix. Then if one generates by uniformly choosing an orthogonal matrix from then together with , satisfies trace condition in probability when tends to infinity More precisely for a given there exist , being a constant where

 |τm(AUΣA⊤U)−τn(ΣX)τm(AA⊤)|= |τm(AUΣU⊤A⊤)−τn(Σ)τm(AA⊤)|≤2ε∥Σ∥∥AA⊤∥

holds with probability .

To prove the main theorem we will also need two lemmas that are stated below.

###### Lemma 3.

[22] For a given Hermitian matrix and any principal submatrix of , , their spectral radius satisfies

 ρs(H)≥ρs(H′).
###### Lemma 4.

[9] Let be a bounded function and suppose is its Fourier series coefficients, i.e.

 tk=∫12−12f(ν)ei2πkνdν,    t∈Z.

Consider Toeplitz matrices defined as

 [Tn]ij=ti−j    i,j∈{0,...,n−1}

with eigenvalues

. Then if are absolutely summable we get:

 minx∈[−12,12)f(x)≤τn,i≤maxx∈[−12,12)f(x)
###### Proof of Theorem 1.

Without loss of generality and for the sake of simplicity we only consider the positive indices of the time series and we take the filter to be causal; other cases can be treated in a similar way. Then the following relation holds between input and output of the filter:

 ∀i,    0≤i≤N−1    Yi=m−1∑j=0bjXi−j

Formulated in terms of matrices the above relation can be represented as

 ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣Y0Y1⋮YN−2YN−1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦=B⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣X−m+1X−m+2⋮XN−2XN−1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,

where is a matrix as follows:

 ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣bm−1bm−2⋯b00⋯000bm−1⋯b1b0⋯00⋱00⋯bm−1⋯b1b0000⋯0bm−1⋯b1b0⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦

We define to be the covariance matrices as follows:

 ∀i    0≤i≤N−1    0≤j,k≤m−1    [ΣiX]jk= Cov(Xi+j,Xi+k)

Since the time series that we are dealing with are weakly stationary it is obvious that is independent of . If we take to be the covariance matrices for and respectively, then we have

 ΣY0:N−1=BΣX−m+1:N−1B⊤

Also define to be the covariance matrix of the output for FIR with . Also assume the spectrum of the output for this filter is . One can write diagonal elements of and based on the above equation as follows:

 [ΣY0:N−1]ii=b⊤ΣiX% b,    [ΣUY0:N−1]ii=b⊤UΣiXU⊤b

and therefore the normalized traces of and can be written as

 τN(ΣY0:N−1)=1Nb⊤N−1∑i=0ΣiXb, τN(ΣUY0:N−1)=1Nb⊤UN−1∑i=0ΣiXU⊤b

Define . Taking in corollary corollary 1 for a randomly selected we get

 |1Nb⊤UΣU⊤b−1Nτm(Σ)⟨b,b⟩|≤2ε∥Σ∥√⟨b,b⟩

and therefore

 |τN(ΣUY0:N−1)−1Nτm(Σ)∥b∥22|≤2ε∥Σ∥∥b∥22 (19)

with probability . On the other hand the elements of diagonals of ’s are . Therefore:

 1Nτm(Σ)=mNCX(0)mN=P(X)

Since ’s are principal submatrices of therefore by corollary lemma 3

 ∥Σ∥=ρ(Σ)=∥1NN−1∑i=0ΣiX∥≤1NN−1∑i=0∥ΣiX∥≤ρ(ΣX0:N−1).

Because ’s are absolutely summable we apply lemma lemma 4 and we get

 ρ(ΣX0:N−1)≤maxνSxx(ν),

such that inequality eq. 19 can be rewritten

 |τN(ΣUY0:N−1)P(X)∥b% ∥22−1|≤2εP(X)∥Σ∥

which completes the proof. ∎

## 8 Proof of Theorem 2

In this section we give a proof that the TDR (see eq. eq. 11) asymptotically approaches the SDR (see eq. eq. 5). We first state and prove two lemmas that are used to derive this result. As before suppose and are given input and output of an LTI filter that are related through the impulse response function . According to the definition of the truncated linear systems (see definition definition 2) of order for the linear system above we get the following matrix relationship:

 ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣Y′−NY′−N+1⋮Y′N−2Y′N−1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣h0h−1⋯h−2N+1h1h0⋯h−2N+2⋮h2N−2h2N−3⋯h−1h2N−1h2N−2⋯h0⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣X−NX−N+1⋮XN−2XN−1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦.

If we name the vector on the left as , the matrix as and the right vector as then the associated TDR yields:

 rxN→yN=τN(ΣyN)τN(ΣxN)τ2N(HNHNT) (20)

Define Now we show that converges to the energy of the impulse response.

###### Lemma 5.

Assume , then

 limN→+∞TN=∥h∥22
###### Proof.

First lets simplify the expression for :

 TN:=τ2N(HNHN⊤)=12N∑i,j[HN]2ij=2N−1∑k=−2N+1|hk|22N−|k|2N= −1∑k=−2N+1|hk|22N−|k|2N+2N−1∑k=0|hk|22N−|k|2N. (21)

It is easy to see that is an increasing sequence of . Moreover it is bounded by

 ∞∑−∞|hk|2