# A Sample Path Measure of Causal Influence

We present a sample path dependent measure of causal influence between two time series. The proposed measure is a random variable whose expected sum is the directed information. A realization of the proposed measure may be used to identify the specific patterns in the data that yield a greater flow of information from one process to another, even in stationary processes. We demonstrate how sequential prediction theory may be leveraged to obtain accurate estimates of the causal measure at each point in time and introduce a notion of regret for assessing the performance of estimators of the measure. We prove a finite sample bound on this regret that is determined by the regret of the sequential predictors used in obtaining the estimate. We estimate the causal measure for a simulated collection of binary Markov processes using a Bayesian updating approach. Finally, given that the measure is a function of time, we demonstrate how estimators of the causal measure may be extended to effectively capture causality in time-varying scenarios.


## I Introduction

In 1969, Granger [granger1969investigating] built upon the ideas of Wiener by proposing an approach to identify causal relationships between time series. While his original treatment was applied only to linear regression models, his underlying perspective that a time series $Y$ is "causing" $X$ if we can better predict $X$ given all available information than given all information excluding $Y$ is still utilized throughout causality research. More modern information theoretic interpretations of this principle include directed information (DI) [marko1973bidirectional, massey1990causality] and transfer entropy (TE) [schreiber2000measuring], which is equivalent to Granger causality (GC) for Gaussian autoregressive processes [barnett2009granger]. Both of these quantities measure the reduction in uncertainty (i.e. conditional entropy) of the future of $X$ that is obtained by including the past of $Y$ in the available information in an appropriate sense. Interestingly, both quantities are determined by taking expectations over all sequences, and thus are dependent solely on a system's underlying distribution and not a given realization of the collection of processes.

These quantities may be adjusted to incorporate a notion of locality through use of self-information. For a given realization $x$ of a random variable $X$, the self-information is given by $-\log f_X(x)$ and represents the amount of surprise associated with that realization. By replacing entropy with self-information, and conditional entropy with its conditional form, local versions of DI, TE, and their conditional extensions may be obtained (see Table 1 in [lizier2014jidt]). While the local extensions of DI and TE are indeed dependent on realizations, they may take on negative values. Such a scenario occurs when knowledge of the past of $Y$ makes the observation of the present of $X$ less likely to have occurred, i.e. when the conditional self-information exceeds its unconditional counterpart. While an interesting concept, it is not clear how to interpret negative values in the context of assessing the presence/absence of a causal link.

As such, estimating the causal structure (i.e. directed graph) of a collection of processes typically involves estimating an averaged measure such as DI [quinn2015directed, amblard2011directed]. In a time varying scenario, however, it would be necessary to replace an expectation over time with some sort of windowing technique as in [oselio2017dynamic]. As a result, estimates of this form may be able to capture changes in the underlying system model, but do not reflect the varying levels of causal influence that occur within windows of time for which there is stationarity (see Example II.1). Here we build on the causal inference perspective presented in [kim2014dynamic] and propose a causal measure that captures changes in time without requiring a windowing approach. Furthermore, we introduce a framework for estimating the causal measure using sequential prediction and derive a finite sample bound on the accuracy of such estimators.

## II Sample Path Measure of Causal Influence

Suppose we observe the stochastic processes $X$, $Y$, and $Z$, characterized by the joint probability mass function (pmf) $f_{X^n, Y^n, Z^n}$. Although this work applies more generally, for the purpose of exposition we only consider discrete probability measures. We begin by considering the scenario where, having observed $(x^{i-1}, y^{i-1}, z^{i-1})$, we wish to determine the causal influence that $Y$ has on the next observation $X_i$. In such a scenario, we consider the following restricted (denoted $(r)$) and complete (denoted $(c)$) conditional distributions:

$$f^{(r)}_{X_i}(x_i) \triangleq f_{X_i \mid X^{i-1}, Z^{i-1}}\left(x_i \mid x^{i-1}, z^{i-1}\right) \qquad (1)$$

$$f^{(c)}_{X_i}(x_i) \triangleq f_{X_i \mid X^{i-1}, Y^{i-1}, Z^{i-1}}\left(x_i \mid x^{i-1}, y^{i-1}, z^{i-1}\right) \qquad (2)$$

Using these distributions, at each time $i$ we define the sample path measure of causality from $Y$ to $X$ in the presence of side information $Z$ for given realizations $(x^{i-1}, y^{i-1}, z^{i-1})$ as:

$$C_{Y \to X}\left(x^{i-1}, y^{i-1}, z^{i-1}\right) = D\left(f^{(c)}_{X_i} \,\big\|\, f^{(r)}_{X_i}\right) \qquad (3)$$

For ease of notation, we may represent the causal measure at time $i$ simply as $C_{Y \to X}(i)$.

The key observation that must be made is that $f^{(r)}_{X_i}$ and $f^{(c)}_{X_i}$ are determined by the realizations of $X^{i-1}$, $Y^{i-1}$, and $Z^{i-1}$. As a result, the causal measure is itself a random variable. In this regard, our causal measure is different from previous measures of causality wherein the causal influence is determined by the model, and not the sample path. To ensure this point is made clear, we will present an example.

###### Example II.1.

Suppose $Y_i \sim \mathrm{Bern}(0.2)$ iid for all $i$ and:

$$X_i \sim \begin{cases} \mathrm{Bern}(0.9), & Y_{i-1} = 1 \\ \mathrm{Bern}(0.5), & Y_{i-1} = 0 \end{cases}$$

Intuitively, we would expect that in some sense $Y$ is "causing" $X$ to a greater extent when $Y_{i-1}$ is one than when it is zero. In order to formalize this, we first find the probability of $X_i = 1$ when only $X^{i-1}$ is known (i.e. the restricted distribution):

$$\begin{aligned} P\left(X_i = 1 \mid X^{i-1} = x^{i-1}\right) &= P(X_i = 1) \\ &= \sum_{y_{i-1} \in \{0,1\}} P\left(X_i = 1 \mid Y_{i-1} = y_{i-1}\right) P\left(Y_{i-1} = y_{i-1}\right) \\ &= (0.5)(0.8) + (0.9)(0.2) \\ &= 0.58. \end{aligned}$$

We can fully characterize the complete and restricted pmfs using these probabilities, i.e. $f^{(r)}_{X_i} = \mathrm{Bern}(0.58)$, $f^{(c)}_{X_i} = \mathrm{Bern}(0.9)$ if $y_{i-1} = 1$, and $f^{(c)}_{X_i} = \mathrm{Bern}(0.5)$ if $y_{i-1} = 0$. We can now compute the causal measure, which takes on one of two values (in bits) determined by the observation $y_{i-1}$:

$$C_{Y \to X}(i) = \begin{cases} 0.363, & y_{i-1} = 1 \\ 0.019, & y_{i-1} = 0 \end{cases}$$

Thus, we see that our measure captures how, even in a stationary Markov chain, different patterns in the observed data may give rise to different levels of causal influence. By contrast, we note that because the process is stationary, the directed information rate and transfer entropy are both given simply by the expectation $(0.8)(0.019) + (0.2)(0.363) \approx 0.088$ bits.
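The two values of the causal measure in this example can be verified with a short computation. The following sketch (ours, not from the paper) evaluates the binary KL divergence in bits:

```python
import math

def kl_bern(p, q):
    """KL divergence D(Bern(p) || Bern(q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

# Restricted distribution: marginal P(X_i = 1) = (0.5)(0.8) + (0.9)(0.2)
p_r = 0.58

c_when_y1 = kl_bern(0.9, p_r)  # causal measure when y_{i-1} = 1
c_when_y0 = kl_bern(0.5, p_r)  # causal measure when y_{i-1} = 0
print(round(c_when_y1, 3), round(c_when_y0, 3))  # 0.363 0.019
```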

The above example gives rise to two key observations. First, even stationary Markov processes exhibit dynamic causal behaviors that are not captured when taking an outer expectation. Second, by averaging over all possible histories, TE and DI are minimally affected by patterns that occur with low probability, even if those patterns induce a high level of causal influence.

We now discuss some key properties of the proposed causal measure. First, we note the crucially important property of non-negativity, which follows directly from the non-negativity of KL divergence. Next, we characterize our measure as being "semi-local." We note that GC, DI (rate), TE, etc. are all expectations, determined entirely by the underlying probabilistic model of the observed data. In this regard, these measures are not local representations of the observed data. On the other end of the spectrum, local data-dependent versions of these measures may be obtained by substituting the self-information for entropy, but these local versions may be negative when unlikely sequences occur. Our measure is "semi-local" in the sense that at any given time, the measure is determined by the observations from the past, but guarantees non-negativity by taking an expectation over the future.

## III Estimation of the Causal Measure

An estimate of the causal measure can be obtained by simply estimating the complete and restricted distributions and then computing the KL divergence between the two at each time. Such an estimator allows us to leverage results from the field of sequential prediction. The sequential prediction problem formulation we consider is as follows: for each round $i = 1, \dots, n$, having observed some history $x^{i-1}$, a learner selects a probability assignment $\hat f_i \in \mathcal{P}(\mathcal{X})$, where $\mathcal{P}(\mathcal{X})$ is the space of probability distributions over the alphabet $\mathcal{X}$. Once $\hat f_i$ is chosen, $x_i$ is revealed and a loss $l(\hat f_i, x_i)$ is incurred by the learner, where the loss function $l$ is chosen to be the self-information loss given by $l(\hat f_i, x_i) = -\log \hat f_i(x_i)$.
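As a concrete illustration of this protocol (ours, not taken from the paper), the sketch below runs an add-one (Laplace) smoothed learner, a standard sequential probability assignment, under the self-information loss:

```python
import math

def laplace_predictor(history, alphabet=(0, 1)):
    """Add-one smoothed probability assignment based on symbol counts."""
    total = len(history) + len(alphabet)
    return {s: (history.count(s) + 1) / total for s in alphabet}

def self_information_loss(f_hat, x):
    """Self-information loss l(f, x) = -log f(x), in bits."""
    return -math.log2(f_hat[x])

sequence = [0, 1, 1, 0, 1, 1, 1, 0]
cumulative_loss = 0.0
for i, x_i in enumerate(sequence):
    f_hat = laplace_predictor(sequence[:i])  # chosen before x_i is revealed
    cumulative_loss += self_information_loss(f_hat, x_i)
```

On the first round the assignment is uniform, so the loss is exactly one bit regardless of the revealed symbol.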

The performance of sequential predictors may be assessed using a notion of regret with respect to a reference class of probability distributions $\tilde{\mathcal{P}}$. For a given round $i$ and reference distribution $\tilde f_i$, the learner's regret is:

$$r\left(\hat f_i, \tilde f_i, x_i\right) = l\left(\hat f_i, x_i\right) - l\left(\tilde f_i, x_i\right) \qquad (4)$$

In many cases the performance of sequential predictors will be measured by the worst case regret, given by:

$$\begin{aligned} R_n\left(\tilde{\mathcal{P}}^n\right) &= \sup_{x^n \in \mathcal{X}^n} \left\{ \sum_{i=1}^n l\left(\hat f_i, x_i\right) - \inf_{\tilde f \in \tilde{\mathcal{P}}^n} \sum_{i=1}^n l\left(\tilde f_i, x_i\right) \right\} \qquad (5) \\ &\triangleq \sup_{x^n \in \mathcal{X}^n} \sum_{i=1}^n r\left(\hat f_i, f^*_i, x_i\right) \qquad (6) \end{aligned}$$

where $f^*_i$ is defined as the distribution from the reference class with the smallest cumulative loss up to time $n$, i.e. the $\tilde f$ for which $\sum_{i=1}^n \log \tilde f_i(x_i)$ is largest. We also define $f^{*n}$ to be the cumulative loss minimizing joint distribution, noting that the reference class of joint distributions $\tilde{\mathcal{P}}^n$ is not necessarily equal to the $n$-fold product of a per-round class (i.e. $\tilde{\mathcal{P}}^n \neq \tilde{\mathcal{P}} \times \cdots \times \tilde{\mathcal{P}}$ in general), as oftentimes a constraint on the selection of the best reference distribution is imposed in order to establish bounds. In the absence of any restrictions, the reference distributions may be selected at each time such that $\tilde f_i(x_i) = 1$, resulting in zero cumulative loss for any sequence $x^n$. Thus, bounds on regret often assume stationarity by enforcing $\tilde f_1 = \tilde f_2 = \dots = \tilde f_n$, or assume that $\tilde f_i = \tilde f_{i+1}$ for all but some small number of indices $i$. For various learning algorithms (i.e. strategies for selecting $\hat f_i$ given $x^{i-1}$) and reference classes $\tilde{\mathcal{P}}^n$, these bounds on the worst case regret are given as a function of the sequence length $n$:

$$R_n\left(\tilde{\mathcal{P}}^n\right) \le M(n) \qquad (7)$$

It follows naturally that an estimator for our causal measure can be constructed by building two sequential predictors: the restricted predictor $\hat f^{(r)}_{X_i}$, computed at each round using $(x^{i-1}, z^{i-1})$, and the complete predictor $\hat f^{(c)}_{X_i}$, computed at each round using $(x^{i-1}, y^{i-1}, z^{i-1})$. It then follows that each of these predictors has an associated worst case regret, given by $M^{(r)}(n)$ and $M^{(c)}(n)$, where $\tilde{\mathcal{P}}^{(r)n}$ and $\tilde{\mathcal{P}}^{(c)n}$ represent the restricted and complete reference classes. Using these sequential predictors, we define our estimated causal influence from $Y$ to $X$ at time $i$ as:

$$\hat C_{Y \to X}(i) = D\left(\hat f^{(c)}_{X_i} \,\big\|\, \hat f^{(r)}_{X_i}\right) \qquad (8)$$

It should be noted that when averaged over time, this estimator becomes a universal estimator of the directed information rate for certain predictors and classes of signals [jiao2013universal].
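A minimal instantiation of this estimator for a pair of binary sequences might look as follows. The first-order Markov contexts and Laplace-smoothed predictors are illustrative choices of ours, not the predictors used in the paper:

```python
import math
from collections import defaultdict

def kl_bits(p, q):
    """KL divergence between two pmfs given as aligned lists, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

class ContextPredictor:
    """Laplace-smoothed sequential predictor of a binary x_i
    given a hashable context (e.g. the previous samples)."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])
    def predict(self, context):
        c = self.counts[context]
        total = c[0] + c[1] + 2
        return [(c[0] + 1) / total, (c[1] + 1) / total]
    def update(self, context, x):
        self.counts[context][x] += 1

def estimate_causal_measure(xs, ys):
    """At each i, the KL divergence between the complete (past x and y)
    and restricted (past x only) one-step predictive distributions."""
    restricted, complete = ContextPredictor(), ContextPredictor()
    estimates = []
    for i in range(1, len(xs)):
        ctx_r, ctx_c = xs[i - 1], (xs[i - 1], ys[i - 1])
        estimates.append(kl_bits(complete.predict(ctx_c),
                                 restricted.predict(ctx_r)))
        restricted.update(ctx_r, xs[i])
        complete.update(ctx_c, xs[i])
    return estimates
```

By construction every estimate is a KL divergence between valid distributions, so the non-negativity of the measure is preserved by the estimator.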

To assess the performance of an estimate of the causal measure, we define a notion of causality regret:

$$CR(n) \triangleq \sum_{i=1}^n \left| \hat C_{Y \to X}(i) - C^*_{Y \to X}(i) \right| \qquad (9)$$

where we define:

$$C^*_{Y \to X}(i) = D\left(f^{(c)*}_{X_i} \,\big\|\, f^{(r)*}_{X_i}\right) \qquad (10)$$

with $f^{(c)*}_{X_i}$ and $f^{(r)*}_{X_i}$ defined as the loss minimizing distributions from the complete and restricted reference classes. We note that with this notion of causality regret, the estimated causal measure is being compared against the best estimate of the causal measure from within a reference class. As such, we limit our consideration to the scenario in which the reference classes are sufficiently representative of the true sequences to produce a desirable $C^*_{Y \to X}(i)$ (i.e. $C^*_{Y \to X}(i) \approx C_{Y \to X}(i)$ for all $i$).

We now present the necessary preliminaries for proving a finite sample bound on the causality regret for the special case when $\mathcal{X}$ is a discrete space. We begin by introducing two assumptions.

###### Assumption 1.

For sequential predictors $\hat f^{(c)}_{X_i}$ and $\hat f^{(r)}_{X_i}$ and observations $(x^n, y^n, z^n)$, we assume that there exists some $L < \infty$ for which the collection of observations is such that:

$$\sup_{x \in \mathcal{X}} \left| \log \frac{\hat f^{(c)}_{X_i}(x)}{\hat f^{(r)}_{X_i}(x)} \right| \le L, \quad i = 1, \dots, n \qquad (11)$$

Noting that $L$ is lower bounded by $\hat C_{Y \to X}(i)$, Assumption 1 implies (given its role in Theorem 1) that larger levels of causal influence take longer to estimate accurately.

###### Assumption 2.

For loss minimizing distributions $f^{(c)*}_{X_i}$ and $f^{(r)*}_{X_i}$, restricted sequential predictor $\hat f^{(r)}_{X_i}$, and observations $(x^n, y^n, z^n)$:

$$\sum_{i=1}^n \left| \mathbb{E}_{f^{(c)*}_{X_i}}\!\left[ r\left(\hat f^{(r)}_{X_i}, f^{(r)*}_{X_i}, X\right) \right] \right| \le M^{(r)}(n) \qquad (12)$$

While it is understood that the expected regret is in general bounded by the worst case regret, Assumption 2 requires that the reference classes are sufficiently rich that the expected regret is not too large in absolute value. This is necessary in bounding the causality regret because, unlike the regret defined by (5), the causality regret (9) increases when the estimated distributions outperform the regret minimizing distributions.

We now show that the cumulative KL divergence from the best reference distribution to the predicted distribution is less than the predictor’s worst-case regret.

###### Lemma 1.

For a sequential predictor $\hat f_i$ with worst case regret $M(n)$, a collection of observations $x^n$, and any distribution $f = (f_1, \dots, f_n)$ from the reference class $\tilde{\mathcal{P}}^n$:

$$\sum_{i=1}^n D\left(f_i \,\big\|\, \hat f_i\right) \le M(n) \qquad (13)$$
###### Proof.
$$\begin{aligned} \sum_{i=1}^n D\left(f_i \,\big\|\, \hat f_i\right) &= \sum_{i=1}^n \sum_{x \in \mathcal{X}} f_i(x) \log \frac{f_i(x)}{\hat f_i(x)} \\ &\le \sum_{i=1}^n \left[ \sup_{x \in \mathcal{X}} \log \frac{f_i(x)}{\hat f_i(x)} \right] \sum_{x \in \mathcal{X}} f_i(x) \\ &= \sum_{i=1}^n \sup_{x \in \mathcal{X}} r\left(\hat f_i, f_i, x\right) \\ &\le \sup_{x^n \in \mathcal{X}^n} \sum_{i=1}^n r\left(\hat f_i, f_i, x_i\right) \\ &\le \sup_{x^n \in \mathcal{X}^n} \sup_{\tilde f \in \tilde{\mathcal{P}}^n} \sum_{i=1}^n r\left(\hat f_i, \tilde f_i, x_i\right) \\ &\le M(n) \end{aligned}$$

∎

Next, we bound the cumulative difference in expectation of a bounded function between the best reference distribution and sequential predictor.

###### Lemma 2.

For a sequential predictor $\hat f_i$ with worst case regret $M(n)$, a collection of observations $x^n$, cumulative loss minimizing distribution $f^{*n}$, and bounded functions $g_i$ with $\sup_{x \in \mathcal{X}} |g_i(x)| \le K$:

$$\sum_{i=1}^n \left| \mathbb{E}_{f^*_i}[g_i(X)] - \mathbb{E}_{\hat f_i}[g_i(X)] \right| \le \frac{|\mathcal{X}| K}{\sqrt{2}} \sqrt{n \cdot M(n)} \qquad (14)$$
###### Proof.
$$\begin{aligned} \sum_{i=1}^n \left| \mathbb{E}_{f^*_i}[g_i(X)] - \mathbb{E}_{\hat f_i}[g_i(X)] \right| &= \sum_{i=1}^n \left| \sum_{x \in \mathcal{X}} \left[ f^*_i(x) - \hat f_i(x) \right] g_i(x) \right| \\ &\le \sum_{i=1}^n \sum_{x \in \mathcal{X}} \left| f^*_i(x) - \hat f_i(x) \right| \left| g_i(x) \right| \qquad (15) \\ &\le \sum_{i=1}^n \sum_{x \in \mathcal{X}} K \sqrt{\tfrac{1}{2} D\left(f^*_i \,\big\|\, \hat f_i\right)} \qquad (16) \\ &= \frac{|\mathcal{X}| K}{\sqrt{2}} \sum_{i=1}^n \sqrt{D\left(f^*_i \,\big\|\, \hat f_i\right)} \end{aligned}$$

where (15) uses the triangle inequality and (16) uses Pinsker's inequality and the boundedness of $g_i$. Focusing on the sum, we define $\vec v \in \mathbb{R}^n$ such that $v_i = \sqrt{D(f^*_i \| \hat f_i)}$ for $i = 1, \dots, n$:

$$\begin{aligned} \sum_{i=1}^n \sqrt{D\left(f^*_i \,\big\|\, \hat f_i\right)} &= \left\| \vec v \right\|_1 \\ &\le \sqrt{n} \left\| \vec v \right\|_2 \qquad (17) \\ &= \sqrt{n} \left( \sum_{i=1}^n D\left(f^*_i \,\big\|\, \hat f_i\right) \right)^{1/2} \\ &\le \sqrt{n \cdot M(n)} \qquad (18) \end{aligned}$$

where (17) uses Hölder's inequality and (18) uses Lemma 1 and the assumption that $f^{*n}$ belongs to the reference class $\tilde{\mathcal{P}}^n$. ∎
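The two inequalities doing the work in this proof, Pinsker's inequality in (16) and the norm inequality in (17), are easy to sanity-check numerically. An illustrative check of ours (with KL in nats, as Pinsker's inequality requires):

```python
import math

def kl_nats(p, q):
    """KL divergence between two pmfs given as aligned lists, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Pinsker: the L1 distance is at most sqrt(2 * KL)
p, q = [0.9, 0.1], [0.58, 0.42]
l1 = sum(abs(a - b) for a, b in zip(p, q))
assert l1 <= math.sqrt(2 * kl_nats(p, q))

# Norm inequality (17): ||v||_1 <= sqrt(n) * ||v||_2
v = [0.3, 1.2, 0.05, 0.7]
norm1 = sum(abs(t) for t in v)
norm2 = math.sqrt(sum(t * t for t in v))
assert norm1 <= math.sqrt(len(v)) * norm2
```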

Finally, we can utilize the assumptions and lemmas to bound the cumulative causality regret:

###### Theorem 1.

Let the worst case regret for the predictors $\hat f^{(c)}_{X_i}$ and $\hat f^{(r)}_{X_i}$ be bounded by $M^{(c)}(n)$ and $M^{(r)}(n)$, respectively. Then, for any collection of observations $(x^n, y^n, z^n)$ such that Assumption 1 holds with bound $L$, we have:

$$\sum_{i=1}^n \left| \hat C_{Y \to X}(i) - C^*_{Y \to X}(i) \right| \le M^{(c)}(n) + M^{(r)}(n) + \frac{|\mathcal{X}| L}{\sqrt{2}} \sqrt{n \cdot M^{(c)}(n)} \qquad (19)$$
###### Proof.

We begin by defining the functions:

$$\hat g_i(X) \triangleq \log \frac{\hat f^{(c)}_{X_i}(X)}{\hat f^{(r)}_{X_i}(X)}, \qquad g^*_i(X) \triangleq \log \frac{f^{(c)*}_{X_i}(X)}{f^{(r)*}_{X_i}(X)}.$$

Using the definition of the causal measure and KL-divergence:

$$\begin{aligned} \sum_{i=1}^n &\left| \hat C_{Y \to X}(i) - C^*_{Y \to X}(i) \right| - \sum_{i=1}^n \left| \mathbb{E}_{f^{(c)*}_{X_i}}[\hat g_i(X)] - \mathbb{E}_{\hat f^{(c)}_{X_i}}[\hat g_i(X)] \right| \qquad (20) \\ &\le \sum_{i=1}^n \left| \left| \mathbb{E}_{f^{(c)*}_{X_i}}[g^*_i(X)] - \mathbb{E}_{\hat f^{(c)}_{X_i}}[\hat g_i(X)] \right| - \left| \mathbb{E}_{f^{(c)*}_{X_i}}[\hat g_i(X)] - \mathbb{E}_{\hat f^{(c)}_{X_i}}[\hat g_i(X)] \right| \right| \qquad (21) \\ &\le \sum_{i=1}^n \left| \mathbb{E}_{f^{(c)*}_{X_i}}[g^*_i(X)] - \mathbb{E}_{\hat f^{(c)}_{X_i}}[\hat g_i(X)] - \mathbb{E}_{f^{(c)*}_{X_i}}[\hat g_i(X)] + \mathbb{E}_{\hat f^{(c)}_{X_i}}[\hat g_i(X)] \right| \qquad (22) \\ &= \sum_{i=1}^n \left| \mathbb{E}_{f^{(c)*}_{X_i}}\left[ g^*_i(X) - \hat g_i(X) \right] \right| \\ &= \sum_{i=1}^n \left| \mathbb{E}_{f^{(c)*}_{X_i}}\!\left[ \log \frac{f^{(c)*}_{X_i}(X)}{\hat f^{(c)}_{X_i}(X)} - \log \frac{f^{(r)*}_{X_i}(X)}{\hat f^{(r)}_{X_i}(X)} \right] \right| \\ &\le \sum_{i=1}^n \left| D\left(f^{(c)*}_{X_i} \,\big\|\, \hat f^{(c)}_{X_i}\right) \right| + \left| \mathbb{E}_{f^{(c)*}_{X_i}}\!\left[ \log \frac{f^{(r)*}_{X_i}(X)}{\hat f^{(r)}_{X_i}(X)} \right] \right| \qquad (23) \\ &\le M^{(c)}(n) + M^{(r)}(n) \qquad (24) \end{aligned}$$

where (21) follows from the properties of absolute value, (22) follows from the reverse triangle inequality, (23) follows from the triangle inequality, and (24) follows from non-negativity of the KL-divergence, Lemma 1, and Assumption 2. Moving the second term of (20) to the other side of the inequality yields:

$$\begin{aligned} \sum_{i=1}^n \left| \hat C_{Y \to X}(i) - C^*_{Y \to X}(i) \right| &\le M^{(c)}(n) + M^{(r)}(n) + \sum_{i=1}^n \left| \mathbb{E}_{f^{(c)*}_{X_i}}[\hat g_i(X)] - \mathbb{E}_{\hat f^{(c)}_{X_i}}[\hat g_i(X)] \right| \\ &\le M^{(c)}(n) + M^{(r)}(n) + \frac{|\mathcal{X}| L}{\sqrt{2}} \sqrt{n \cdot M^{(c)}(n)} \qquad (25) \end{aligned}$$

where (25) follows from Assumption 1 (boundedness of $\hat g_i$) and Lemma 2. This concludes the proof. ∎

## IV Simulations

We begin by demonstrating the estimation of the proposed causal measure on a pair of jointly Markov binary processes $X$ and $Y$ that undergo a change point in the underlying parameters. Using the superscript $j$ to index the parameter regime, we use a logistic model to represent the probabilities that $X_i$ and $Y_i$ are equal to one given the complete history:

$$p^{(c)}_{X_i}(x_{i-1}, y_{i-1}) \triangleq \frac{e^{\theta^j_x + \theta^j_{xx} x_{i-1} + \theta^j_{yx} y_{i-1}}}{1 + e^{\theta^j_x + \theta^j_{xx} x_{i-1} + \theta^j_{yx} y_{i-1}}} \qquad p^{(c)}_{Y_i}(x_{i-1}, y_{i-1}) \triangleq \frac{e^{\theta^j_y + \theta^j_{yy} y_{i-1} + \theta^j_{xy} x_{i-1}}}{1 + e^{\theta^j_y + \theta^j_{yy} y_{i-1} + \theta^j_{xy} x_{i-1}}}$$
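These complete conditionals are plain logistic functions of the previous samples. A sketch of ours, with placeholder parameter values rather than those used in the experiments:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def p_c(prev_self, prev_other, th_0, th_self, th_cross):
    """Logistic complete conditional, e.g. P(X_i = 1 | x_{i-1}, y_{i-1})
    with th_0 = theta_x, th_self = theta_xx, th_cross = theta_yx."""
    return sigmoid(th_0 + th_self * prev_self + th_cross * prev_other)

# Placeholder parameters, for illustration only
theta = dict(th_0=-0.5, th_self=0.25, th_cross=2.0)
p = p_c(prev_self=1, prev_other=1, **theta)
```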

To compute the true causal measure, we additionally need the restricted probabilities. It is important to note that joint Markovicity does not imply that the processes are individually Markov. As such, the restricted probability that $X_i$ is equal to one given the restricted history is defined using a recursively updated distribution over the hidden $Y_{i-1}$:

$$p^{(r)}_{X_i}(x^{i-1}) \triangleq p^{(h)}_{Y_{i-1}} \cdot p^{(c)}_{X_i}(x_{i-1}, 1) + \bar p^{(h)}_{Y_{i-1}} \cdot p^{(c)}_{X_i}(x_{i-1}, 0)$$

$$p^{(h)}_{Y_i} \triangleq p^{(h)}_{Y_{i-1}} \cdot p^{(c)}_{Y_i}(x_{i-1}, 1) + \bar p^{(h)}_{Y_{i-1}} \cdot p^{(c)}_{Y_i}(x_{i-1}, 0)$$

where $p^{(h)}_{Y_i}$ is the probability of the hidden $Y_i$ being one and $\bar p^{(h)}_{Y_i} \triangleq 1 - p^{(h)}_{Y_i}$. We can define $p^{(r)}_{Y_i}$ similarly using a recursively updated distribution over the hidden $X_{i-1}$.
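One step of this recursion can be written directly from the two displays above; the complete conditionals are passed in as functions (a sketch, with names of our own choosing):

```python
def restricted_step(p_h_prev, x_prev, p_c_x, p_c_y):
    """Given p^(h)_{Y_{i-1}} and x_{i-1}, return the restricted
    probability p^(r)_{X_i} and the updated hidden-state probability
    p^(h)_{Y_i}. Here p_c_x(x_prev, y_prev) and p_c_y(x_prev, y_prev)
    are the complete conditionals for X_i = 1 and Y_i = 1."""
    q = 1.0 - p_h_prev  # bar p^(h)_{Y_{i-1}}
    p_r_x = p_h_prev * p_c_x(x_prev, 1) + q * p_c_x(x_prev, 0)
    p_h = p_h_prev * p_c_y(x_prev, 1) + q * p_c_y(x_prev, 0)
    return p_r_x, p_h
```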

To estimate the causal measure (in both directions) we estimate both the complete and restricted probabilities using a Bayesian updating scheme over a discretized parameter space with a uniform prior. To accommodate the parameter change point, we incorporate the shrinking to the prior technique [sancetta2012universality] into the updating procedure. For space considerations we omit further details of the experiments and provide detailed code on GitHub.
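A bare-bones version of such an updating scheme is sketched below; the grid, the uniform prior, and the shrinkage weight are illustrative choices of ours, not the settings used in the experiments:

```python
def bayes_step(weights, grid, x, shrink=0.01):
    """One round of Bayesian updating of a posterior over a discretized
    grid of Bernoulli parameters, followed by shrinking toward the
    uniform prior so the posterior can re-adapt after a change point."""
    n = len(weights)
    posterior = [w * (p if x == 1 else 1 - p) for w, p in zip(weights, grid)]
    z = sum(posterior)
    posterior = [w / z for w in posterior]
    return [(1 - shrink) * w + shrink / n for w in posterior]

def predictive(weights, grid):
    """Posterior predictive probability that the next sample is one."""
    return sum(w * p for w, p in zip(weights, grid))

grid = [i / 10 for i in range(1, 10)]    # candidate parameters 0.1, ..., 0.9
weights = [1 / len(grid)] * len(grid)    # uniform prior
for x in [1, 1, 1, 0, 1, 1]:
    weights = bayes_step(weights, grid, x)
```

The predictive probabilities produced this way can be plugged directly into the KL divergence of (8) to form the causal measure estimate.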

Figure 1 shows the true and estimated causal measures. We can see that the spikes in causal influence are captured by the estimate. This example illustrates that the proposed causal measure is not immune to the difficulties of change point scenarios, in that it takes roughly 100 samples after the change point for the estimator to adapt. However, a key point is that once the estimator does adapt, the temporal resolution is much better than what could be expected from windowing techniques, as the infrequent spikes seen in the rightmost column of Figure 1 are localized to a single time point.

## V Discussion

We have presented a non-negative measure of local causal influence that captures the time-varying nature of causal relationships that is inherent to both stationary and non-stationary settings. Furthermore, we have shown that under mild assumptions, the finite sample performance of an estimator of the measure can be determined as a function of the regret of the sequential predictors used to implement the estimator.

It is important to note that the proposed causal measure does not solve the problem of estimating causal influence in time-varying settings, but rather it provides a perspective on causal influence that is naturally extended to any setting for which there are effective sequential prediction algorithms. By conditioning on the observed past, we avoid the need to choose a window length (to approximate an expectation) when estimating DI and TE in a time-varying setting.

Further research includes calculating the causal regret for specific estimators and carefully characterizing the circumstances under which the assumptions hold. Additionally, estimation of the measure on real data would enable moving past simply identifying the direction of information flow between real-world processes to identifying particular patterns for which the causal influence is greatest.