The need to identify and quantify causal relationships between two or more time series from purely observational data is ubiquitous across academic disciplines. Recent examples include economists seeking to understand the directional interdependencies between foreign stock indices [junior2015dependency, dimpfl2014impact] and neuroscientists seeking to identify directed networks that explain neural spiking activity [kim2011granger, quinn2011estimating].
A major step towards assessing causal influences between time series was taken in 1969, when Granger [granger1969investigating] built upon the ideas of Wiener [wiener1956theory] and proposed the following perspective: We say that a time series $X$ is “causing” $Y$ if we can better predict $Y$ given the past of all information than given the past of all information excluding $X$. While Granger’s original treatment was applied only to linear Gaussian regression models, the underlying perspective is still utilized throughout causality research. More modern information theoretic interpretations of Granger’s perspective include directed information (DI) [marko1973bidirectional, massey1990causality] and transfer entropy (TE) [schreiber2000measuring], which is equivalent to Granger causality (GC) for Gaussian autoregressive processes [barnett2009granger]. Justification for use of the DI for characterizing directional dependencies between processes was given in [quinn2011equivalence], where it was shown that, under mild assumptions, the DI graph is equivalent to the so-called minimal generative model graph. It was further shown in [etesami2014directed] that the DI graph can be viewed as a generalization of linear dynamical graphs. As a result, the directional dependencies encoded by DI are well equipped to identify the presence or absence of a causal link under Granger’s perspective in the general non-linear and non-Gaussian settings.
Interestingly, both GC and DI are determined entirely by the underlying probabilistic model (i.e. joint distribution) of the random processes in question. Once the model is determined, these methods provide no ability to distinguish between varying levels of causal influence that may be associated with specific realizations of those processes. As a result, GC and DI are only well suited to answer causal questions that are concerned with average influences between processes. Examples of this style of question include “Does dieting affect body weight?” and “Does the Dow Jones stock index influence the Hang Seng stock index?”. Symbolically, we represent this question as Q1: “Does $X^{i-1}$ cause $Y_i$?”, where the superscript represents the collection of samples up to time $i-1$ and capital letters are used to represent random variables and processes.
A natural next question to ask is how the aforementioned measures may be adapted to be sample path dependent. In particular, one might pose the question Q2: “Did $x^{i-1}$ cause $y_i$?”, where the lowercase letters now represent specific realizations of the processes $X$ and $Y$. Examples of these questions would be “Did eating salad cause me to lose weight?” and “Did the dip in the price of the Dow Jones cause the spike in the price of the Hang Seng?”. One information theoretic approach to answering Q2 is the substitution of self-information for entropy wherever entropy appears in the definition of DI [lizier2014measuring]. The issue is that the resulting “local” extension of DI may take on negative values, and it is unclear how these values should be interpreted with regard to the presence or absence of a causal link. As a result, causal measures that use the self-information have not seen widespread adoption. While this issue may appear to be a result of a particular methodology, it is in fact a fundamental challenge with Q2. To see this, one must only consider how to answer Q2 in scenarios where there exists some other $\tilde{x}^{i-1}$ for which $y_i$ has non-zero probability, or where $x^{i-1}$ results in $y_i$ occurring with low probability.
While it is clear that Q1 lacks the resolution to identify specific points on a sample path for which a large causal influence is elicited, Q2 gives rise to ambiguities that cannot be addressed in a consistent manner from purely observational data. This observation motivates our proposed question of study, Q3: “Does $x^{i-1}$ cause $Y_i$?”. In other words, we seek to identify the causal effect that particular values of $X$ have on the distribution of the subsequent sample of $Y$. Examples of this include “Which diets are most informative about weight loss outcomes?” and “When does the Dow Jones have the greatest effect on the Hang Seng?”. To answer this question, we build on the work of [kim2014dynamic] and [schamberg2018sample] in the development of a sample path dependent measure of causal influence.
Such a measure will necessarily capture dynamic changes in causal influence between processes. Causal influences may vary with time in two ways. First, when the joint distribution of the collection of processes is non-stationary, there will be variations in time with respect to their causal interactions. Second, stationary processes may exhibit time-varying causal phenomena when certain realizations of a process have a greater level of influence than others (see Section III-A1). The latter cannot be captured by GC and DI, which are determined entirely by the joint distribution and thus will only change when the distribution changes. Furthermore, since estimating GC and DI requires taking a time-average, capturing dynamic changes resulting from non-stationarities necessitates approximating an expectation using a sliding window. The sample path dependent measure, on the other hand, captures both types of temporal dynamics: estimates of the sample path measure can be obtained for any processes for which we have reliable sequential prediction algorithms.
In developing techniques for estimating the proposed measure, we have identified a challenge in estimating information theoretic measures of causal influence that has been commonly overlooked in the literature. While it is well understood that a collection of jointly Markov processes does not necessarily exhibit Markovicity for subsets of processes, the implications of this on information theoretic causal measures are not well studied. An analogous statement with regard to finite order autoregressive processes and the biasing effect this has on estimates of GC was studied in [stokes2017study], but this work has yet to be adopted in the information theory community. It comes as no surprise that the issues with GC estimators identified in [stokes2017study] may be extended to DI estimators. Thus, a characterization of when estimators of DI are unbiased and a means of addressing the bias when it arises are lacking. As such, we address both of these unmet needs in Section IV-A in an effort to establish an understanding of when one can expect to obtain unbiased estimates of information theoretic causal measures.
The contributions of this paper may be summarized as follows:
A methodology for assessing causal influences between time series in a sample-path specific and time-varying manner, by answering the question “Does $x^{i-1}$ cause $Y_i$?”. This is particularly relevant when there are infrequent events that exhibit large causal influences, which would be “averaged out” by any causal measure (e.g. GC and DI) that takes an average over all sample paths.
A framework using sequential prediction for estimating the dynamic causal measure with associated upper bounds on the worst case “causality regret”.
A characterization of when unbiased estimates of DI can be obtained using concepts from causal graphical models and a novel methodology for bounding the DI when unbiased estimates cannot be obtained.
Demonstration of the causal measure’s value through application to simulated and real data.
The remainder of this paper is organized as follows: following a brief overview of notation, Section II provides a technical summary of related work. In Section III we define the measure, present key properties, and provide justification for the measure through several examples. Section IV provides a framework for estimating the measure. Section V demonstrates the measure on simulated and real data. Finally, Section VI contains a discussion of the results and opportunities for future work.
I-A Notation and Definitions
Let $X$, $Y$, and $Z$ denote discrete finite-alphabet random processes, unless otherwise specified. We denote processes at a given time point with a subscript and denote the space of values they may take with calligraphic letters, i.e. . Without loss of generality, let . A temporal range of a process is denoted by a subscript and superscript, i.e. , and we define . Realizations of processes are given by lowercase letters. Probability mass functions (pmfs) are equivalently referred to as “distributions” and are denoted by . These distributions are characterized by a subscript, which is often omitted when context allows. For example gives the distribution of a single time point of , gives the joint distribution of and , and gives the conditional distribution of at a single time conditioned on the past of . Lastly, we define the causally conditional distribution with lag as:
Note that the standard interpretation of the causal conditioning (as in [kramer1998directed]) is recovered by letting .
We will briefly review some information theoretic quantities that are used frequently throughout the paper. The entropy is given by:
where it is implied that the sum is over all and the logarithm is base two (as are all logarithms throughout). The conditional entropy is given by:
The causally conditional entropy is given by substituting the causally conditional distribution for the conditional distribution:
For any of the above defined variants of entropy, the corresponding entropy rates are given by:
It should be noted that the entropy rates may not exist for all processes.
The conditional mutual information is given by:
with the (unconditional) mutual information being obtained by removing everywhere it appears in the above equation. Finally, the relative entropy or KL-divergence between two distributions and is given by:
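The quantities reviewed in this subsection can be evaluated directly for small pmfs. The following is a minimal sketch under the standard textbook definitions with base-two logarithms; the function names and the dictionary-based pmf representation are illustrative assumptions, not notation from the paper.

```python
from math import log2

def entropy(p):
    """Entropy -sum_x p(x) log2 p(x) of a pmf given as {outcome: prob}."""
    return -sum(v * log2(v) for v in p.values() if v > 0)

def kl_divergence(p, q):
    """Relative entropy D(p || q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return float("inf")  # p puts mass where q has none
        total += px * log2(px / qx)
    return total

def conditional_mutual_information(pxyz):
    """I(X; Y | Z) from a joint pmf given as {(x, y, z): prob}."""
    pz, pxz, pyz = {}, {}, {}
    for (x, y, z), v in pxyz.items():
        pz[z] = pz.get(z, 0.0) + v
        pxz[(x, z)] = pxz.get((x, z), 0.0) + v
        pyz[(y, z)] = pyz.get((y, z), 0.0) + v
    # p(x,y|z) / (p(x|z) p(y|z)) = p(x,y,z) p(z) / (p(x,z) p(y,z))
    return sum(
        v * log2(v * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
        for (x, y, z), v in pxyz.items() if v > 0
    )
```

For instance, a fair coin has one bit of entropy, and two perfectly correlated fair bits with a constant third variable share one bit of conditional mutual information.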
II Related Work
In discussions of causal inference, it is important to differentiate between the deterministic and stochastic settings as well as the interventional and observational settings. With regard to the former, we limit consideration strictly to the stochastic setting. This restriction is necessary when utilizing Granger’s perspective, as comparing qualities of prediction in stochastic settings is our main interest. With regard to the latter, we focus on the observational setting, wherein the potential causes (i.e. $x^{i-1}$ in Q3) may not be controlled or perturbed, in contrast to the causal intervention calculus pioneered by Pearl [pearl2009causality].
We now provide a brief summary of three key concepts in the measurement of causal influence across time series, namely Granger causality (GC) [granger1969investigating], directed information (DI) [marko1973bidirectional, massey1990causality], and causal strength (CS) [janzing2013quantifying]. While some key points will be presented here, a comprehensive summary of the relationships between GC and DI may be found in [amblard2012relation].
II-A Granger Causality
While Granger’s perspective on causality underlies most modern studies in causality between time series, his original treatment was limited to linear Gaussian AR models [granger1969investigating]. For clarity, we will here present the case with scalar time series. Formally, define the three real-valued random processes . As in Granger’s original treatment, we let represent all the information in the universe in order to avoid the effects of confounding. Next, define two models of :
where are the model parameters and and . We see that the class of models given by (6) is a subset of the models given by (5) where the next does not depend on past . Thus, a non-negative measure of the extent of causal influence of on may be defined by:
The limitations of Granger causality extend considerably beyond the restriction to linear models (see [stokes2017study] for a comprehensive summary). Of particular interest is the fact that if a VAR process is of finite order, subsets of the process will in general be infinite order. While it is possible to redefine the model in (6) to be infinite order, this creates obvious challenges in attempting to estimate Granger causality. Considering this issue is not addressed by the subsequent existing methods, we will revisit this issue in Section IV-A.
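The nested-model comparison behind Granger causality can be sketched numerically. The example below is a simplified illustration and not the paper's estimator: it assumes order-1 models, zero-mean scalar series, no side information, and no intercept, and it reports the base-two log ratio of the restricted to the complete residual variance. All function names and the simulated data are hypothetical.

```python
import math
import random

def _resid_var_1(y, a):
    """Residual variance of the least-squares fit y ~ c * a (no intercept)."""
    c = sum(ai * yi for ai, yi in zip(a, y)) / sum(ai * ai for ai in a)
    return sum((yi - c * ai) ** 2 for ai, yi in zip(a, y)) / len(y)

def _resid_var_2(y, a, b):
    """Residual variance of the least-squares fit y ~ ca * a + cb * b (no intercept)."""
    saa = sum(ai * ai for ai in a)
    sbb = sum(bi * bi for bi in b)
    sab = sum(ai * bi for ai, bi in zip(a, b))
    say = sum(ai * yi for ai, yi in zip(a, y))
    sby = sum(bi * yi for bi, yi in zip(b, y))
    det = saa * sbb - sab * sab  # determinant of the 2x2 normal equations
    ca = (say * sbb - sby * sab) / det
    cb = (sby * saa - say * sab) / det
    return sum((yi - ca * ai - cb * bi) ** 2 for ai, bi, yi in zip(a, b, y)) / len(y)

def granger_causality(cause, effect):
    """Order-1 GC in bits: log2 of restricted over complete residual variance."""
    target, own_past, cause_past = effect[1:], effect[:-1], cause[:-1]
    restricted = _resid_var_1(target, own_past)              # past of effect only
    complete = _resid_var_2(target, own_past, cause_past)    # adds past of cause
    return math.log2(restricted / complete)

def simulate_pair(n, seed=0):
    """Toy data: x is white noise; y depends on its own past and the past of x."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [0.0]
    for i in range(1, n):
        y.append(0.5 * y[i - 1] + 0.8 * x[i - 1] + rng.gauss(0.0, 1.0))
    return x, y
```

On data from `simulate_pair`, `granger_causality(x, y)` should be clearly positive while `granger_causality(y, x)` should be close to zero, since the restricted model is nested in the complete one and x does not depend on y.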
II-B Directed Information
The concept of directed information was first introduced under the name transinformation by Marko in 1973 [marko1973bidirectional] in the context of bidirectional communication theory. It was later revisited in 1990 by Massey [massey1990causality] who defined the directed information from a sequence to as:
Unless otherwise specified, we assume that there are no instantaneous causations, i.e. that and are conditionally independent given the past and . Should one want to allow for instantaneous causations, the proposed methods may be trivially extended to accommodate them. As such, we will primarily consider the reverse DI [jiao2013universal]:
noting that under the assumption of no instantaneous causation, .
Given that the DI is given by a sum over time, one may be interested in how causal relationships are exhibited on average. This can be accomplished through use of the directed information rate [kramer1998directed], given by:
Lastly, we note that in order to avoid confounding, we may include the side information to get the following definitions of causally conditioned DI and causally conditioned DI rate:
II-C Causal Strength
In [janzing2013quantifying], Janzing et al. propose an axiomatic measure of causal strength (CS) based on a set of postulates that they propose should be satisfied by a causal measure. Furthermore, they present numerous examples to illustrate where Granger causality and directed information do not give results consistent with intuition. While this measure was proposed to measure influences in general causal graphs, it has a clear interpretation in the context of measuring causal influences between two time series. In particular, for measuring the CS from to , begin by considering the generalization of the two models utilized by GC in (5) and (6) to arbitrary probability distributions and . Next, note that the second distribution has the following factorization when summing over all possible pasts of :
The first term in the sum may be viewed as measuring the direct effects of the pasts of , , and , on the distribution of . The second term, however, is in some sense measuring the indirect effects of the pasts of and on in that they affect the distribution of through their effect on the distribution of . Thus, the key idea behind CS is the introduction of the “post-cutting” distribution, where the conditional distribution found in the second term is replaced with a marginal distribution (see Section 4.1 of [janzing2013quantifying] for a formal definition). As a result, the (time series) CS from to with side information is given by:
where the expectation is taken with respect to and the post-cutting distribution is defined as:
The post-cutting distribution is designed to ensure that the extent to which has a causal effect on depends only upon and other direct causes of (see P2 in [janzing2013quantifying]). In the context of measuring causal influences between time series, this can be seen as correcting for scenarios in which may be very well predicted by its own past while not being caused by its own past. This scenario arises in models like the one depicted in the center of Figure 1. In such a scenario, it is possible to have despite the fact that is, in some sense, the sole cause of . The details of this example are made clear in Section III-A2.
By presenting an axiomatic framework for measuring causal influences, Janzing et al. provide a robust justification for CS. With that said, we note that like GC and DI, CS is determined solely by the underlying probabilistic model. As such, it may be the preferred technique for addressing Q1, but it does not represent how different realizations may give rise to different levels of causal influence.
II-D Self-Information Measures
All of the aforementioned techniques involve taking an expectation over the histories of the time series in question, and are thus well suited to address Q1. In order to address Q2, a notion of locality may be introduced through use of self-information. For a given realization of a random variable , the self-information is given by and represents the amount of surprise associated with that realization. By replacing entropy with self-information, and its conditional form , a local version of DI and its conditional extension may be obtained (see Table 1 in [lizier2014jidt] for other so-called “local measures”). As an example, we note that for a given pair of realizations and , a “directed information density” (using the language of [gray2011entropy]) may be given by:
While this indeed creates a sample path measure of causality whose expectation is DI, it is clear it may take on negative values. Such a scenario occurs when the knowledge that makes the observation of less likely to have occurred. While self-information measures are a good candidate for beginning to address Q2 given their dependence upon realizations, the potential for negative values creates difficulty in trying to obtain an easily interpretable answer in all cases.
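A small numerical illustration of this sign issue follows. It is a sketch under assumed toy distributions: the two pmfs stand in for the conditional distribution of the next sample with and without the candidate cause in the conditioning, and the function name is hypothetical.

```python
from math import log2

def local_information(p_complete, p_restricted, y):
    """Log-likelihood ratio (in bits) for one observed outcome y."""
    return log2(p_complete[y] / p_restricted[y])

# Assumed toy pmfs over a binary outcome.
p_complete = {0: 0.1, 1: 0.9}    # knowing the cause makes y = 1 likely
p_restricted = {0: 0.5, 1: 0.5}  # without the cause, y is a fair coin

likely = local_information(p_complete, p_restricted, 1)    # positive
unlikely = local_information(p_complete, p_restricted, 0)  # negative
# Averaging over outcomes under the complete pmf gives a KL divergence.
average = sum(p_complete[y] * local_information(p_complete, p_restricted, y)
              for y in p_complete)
```

The local value is negative exactly when the observed outcome was made less likely by the added conditioning, even though the average over outcomes is non-negative.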
II-E Time-Varying Causal Measures
A popular extension of GC style causal measures is application to time-varying scenarios [sheikhattar2018extracting, oselio2017dynamic]. In order to adapt existing methods to these types of scenarios, it is necessary to evaluate them over stretches of time for which there is stationarity. As such, estimation in this scenario necessitates some sort of sliding window technique in order to approximate an expectation, giving rise to a trade-off between sensitivity to dynamic changes and accuracy. Despite being concerned with time-varying causal influences, these approaches are still ultimately attempts to answer Q1 in that the quantity being estimated is determined solely by the underlying joint distribution. The temporal variability that is measured by these approaches is a result only of potential non-stationarities. This is fundamentally different from the question we are asking, which is concerned with the dynamic causal influences that are associated with a particular realization of a process that may or may not be stationary.
III A Sample Path Measure of Causal Influence
We begin by considering the scenario where, having observed , we wish to determine the causal influence that has on the next observation of . Define the restricted (denoted ) and complete (denoted ) histories as:
The current time samples of side information from the histories (i.e. and ) are intentionally omitted, as we assume that there is no instantaneous coupling. We next define the restricted and complete conditional distributions as:
Using these distributions, the sample path causal measure from to in the presence of side information at time is defined by:
For ease of notation, we may refer to the causal measure at time simply as .
The proposed causal measure has an interesting relationship to the directed information. To illustrate this, consider the conditional mutual information term that appears in the sum in (10), along with two equivalent representations:
These equivalent definitions of directed information yield two interpretations. While (20) considers the reduction in uncertainty obtained by conditioning on , (21) considers the change in the distribution resulting from the added conditioning as measured by a log-likelihood ratio. When we wish to condition on a realization , these representations are no longer equivalent:
The representation given by (23) is chosen to be the sample path causal measure and is indeed equivalent to the proposed measure in (19). This choice is made clear by noting two properties of (22). First, we note that (22) may be negative. Second, for particular realizations of and , we may have that conditioning on drastically shifts the distribution of while only mildly affecting the conditional entropy, yielding a value of nearly zero for a scenario when there is a clear causal influence. We note that the difference between definitions of DI that is induced by conditioning on a realization is acknowledged in [jiao2013universal], where four unique estimators of DI are proposed based on these various equivalent definitions of DI. While these estimators converge to the same result in the estimation of DI, the different perspectives yield different results for the question we are addressing and thus their implications must be considered.
As a result of the added conditioning, the proposed measure is a random variable that takes on a value for each possible history and may be related to the directed information as follows:
In the absence of instantaneous influences, the sum of the expectation over sample paths of the proposed causal measure is the directed information:
See Appendix D for a proof of the proposition.
A second key property of the proposed measure is non-negativity (for any history), which follows directly from the properties of the KL-divergence. Furthermore, the measure will take a value of zero if and only if the complete and restricted distributions are equivalent for a given history. As such, the proposed causal measure may take on a large value when the additional conditioning on $x^{i-1}$ introduces a large amount of uncertainty into the distribution of $Y_i$. In such a scenario, we would expect $x^{i-1}$ to have a significant causal influence on $Y_i$ even though it is not causing $Y_i$ to take on a specific value. It is this type of scenario that makes Q2 so difficult to answer in a consistent manner, despite having a clear interpretation in terms of Q3.
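To make the contrast between an entropy-difference view and the KL-divergence view concrete, the following sketch evaluates both for hand-picked distributions given a fixed history. The function names and the toy pmfs (lists of probabilities over a binary outcome) are illustrative assumptions.

```python
from math import log2

def entropy(p):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(v * log2(v) for v in p if v > 0)

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q is positive wherever p is."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def entropy_reduction(complete, restricted):
    """Change in uncertainty from the added conditioning; may be negative."""
    return entropy(restricted) - entropy(complete)

def sample_path_measure(complete, restricted):
    """KL divergence from the restricted to the complete conditional pmf."""
    return kl_divergence(complete, restricted)

# Case 1: conditioning flips the distribution but leaves its entropy unchanged.
shifted = ([0.1, 0.9], [0.9, 0.1])
# Case 2: conditioning makes a nearly certain outcome completely uncertain.
spread = ([0.5, 0.5], [0.99, 0.01])
```

In the first case the entropy difference is zero while the KL form is large; in the second the entropy difference is negative while the KL form remains positive, which matches the two properties used above to motivate the choice of representation.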
III-A Justification for Measurement of Sample Path Influences
We now present a series of examples that illustrate the value of a sample path causal measure. Graphical representations of the three examples can be seen in Figure 1.
III-A1 IID Influences
Let iid for and:
for . Intuitively, the extent to which influences will vary for different values provided that . In order to compute the causal measure , we first need to find the restricted distribution of given only its own past:
Noting that when and when , the causal measure is given by:
Thus, we see that as ,
By contrast, the DI rate is given by taking the expectation of over possible values . Defining , we get:
As a result, it is clear that the sample paths that occur with lower probability will give rise to a greater causal measure than those that occur with higher probability; however, as a result of their lesser probability, these infrequent, highly influential events will have little influence in the computation of the DI rate.
We further note that while it is tempting to invoke “conditioning reduces entropy” to conclude that represents a reduction in uncertainty that is obtained by including the past in the prediction of , this is not the case. To make this clear, assign values and in (25) and again let approach zero. In such a scenario, we find that:
As such, it is clear that by additionally conditioning on , there is a considerable increase in uncertainty. Thus, while it is certainly true that , there are scenarios in which a particular realization of may cause uncertainty in . Revisiting Q2, it is not clear how to answer the extent to which the event causes any particular outcome , because all possible outcomes are equally likely. On the other hand, if we consider Q3, it is quite clear that the event has significant influence on and that this is reflected by the proposed measure.
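The behavior described in this example can be checked numerically. The sketch below assumes a binary model in which the cause is iid Bernoulli(p) and the next sample of the effect is Bernoulli with a parameter selected by the previous value of the cause; the names `p`, `eps0`, and `eps1` are illustrative assumptions. Under this model the effect's own past carries no information about the previous cause, so the restricted distribution is the mixture of the two conditionals.

```python
from math import log2

def bern_kl(a, b):
    """D(Bern(a) || Bern(b)) in bits; assumes 0 < b < 1."""
    total = 0.0
    for pa, pb in ((a, b), (1.0 - a, 1.0 - b)):
        if pa > 0:
            total += pa * log2(pa / pb)
    return total

def iid_example(p, eps0, eps1):
    """Return (measure given x=0, measure given x=1, expectation over x)."""
    # Restricted distribution: mixture over the unobserved previous cause.
    mix = (1.0 - p) * eps0 + p * eps1
    c0 = bern_kl(eps0, mix)  # frequent value of the cause
    c1 = bern_kl(eps1, mix)  # rare value of the cause
    di_rate = (1.0 - p) * c0 + p * c1
    return c0, c1, di_rate

c0, c1, di_rate = iid_example(p=0.01, eps0=0.05, eps1=0.5)
```

With these assumed values the rare event yields a measure above one bit, while the average over values of the cause stays small because the rare event carries little probability. The rare event also makes the next sample a fair coin, so here a large causal measure coincides with an increase in uncertainty.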
III-A2 Perturbed Cross Copying
We next consider a scenario where two processes repeatedly swap values. This example was originally posed in [ay2008information] and modified to include noise in [janzing2013quantifying]. Formally, the processes may be defined as:
where for all and is the XOR operator. We again consider the limiting case where is taken to approach zero. As is shown in [janzing2013quantifying], the DI rate approaches zero as . This results from the fact that for very small , on average contains virtually no information about that is not contained in .
Janzing et al. [janzing2013quantifying] note that because and are independent given , should, in some sense, be fully responsible for the information that is known about . As a result, for this example their proposed causal strength measures the average reduction in uncertainty obtained by conditioning on versus conditioning on nothing at all, i.e. as (under the assumption that and are initiated by fair coin tosses).
Next, we consider our proposed sample path measure. First, we note that the complete distribution of depends only upon and the restricted distribution depends only upon . Explicitly, we get the following distributions:
As a result, we see that for a given complete history we get:
Thus, we see that as , if and otherwise.
A comparison of the three measures makes clear that each provides a slightly different perspective. DI rate is loyal to Granger’s perspective in that it captures how, as , contains less and less information about that is not already known. As a result is strictly decreasing for decreasing . Causal strength, on the other hand, is loyal to the causal Markov condition in the sense that it restricts consideration to only the immediate parents of the node in question (see P2 in Section 2 of [janzing2013quantifying]). As such, decreasing yields a smaller level of uncertainty in conditioned on , and therefore the causal strength is strictly increasing for decreasing . The proposed measure lies somewhere in between the two in that it simultaneously captures the decrease and increase in effect of on as shrinks. Deciding which perspective is “correct” is a philosophical question that must be answered on a problem-by-problem basis. In any case, the proposed measure provides an interesting perspective that, to our knowledge, has not been considered in the literature.
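The limiting behavior of the sample path measure in this example can be reproduced with a short calculation. The sketch assumes binary processes in which each process copies the other's previous value and flips it independently with probability `eps`; the variable names are illustrative. Under that assumption the complete distribution of the next effect sample depends only on the cause's previous value, and the restricted distribution depends only on the effect's value two steps back, which has passed through two independent flips.

```python
from math import log2

def bern_kl(a, b):
    """D(Bern(a) || Bern(b)) in bits; assumes 0 < b < 1."""
    total = 0.0
    for pa, pb in ((a, b), (1.0 - a, 1.0 - b)):
        if pa > 0:
            total += pa * log2(pa / pb)
    return total

def cross_copy_measure(x_prev, y_prev2, eps):
    """Sample path measure for cross copying with flip probability eps."""
    # Complete: P(next effect = 1 | previous cause value).
    complete = 1.0 - eps if x_prev == 1 else eps
    # Two independent flips cancel with probability (1 - eps)^2 + eps^2.
    agree = (1.0 - eps) ** 2 + eps ** 2
    # Restricted: P(next effect = 1 | effect value two steps back).
    restricted = agree if y_prev2 == 1 else 1.0 - agree
    return bern_kl(complete, restricted)

matched = cross_copy_measure(x_prev=1, y_prev2=1, eps=1e-3)     # no flip occurred
mismatched = cross_copy_measure(x_prev=1, y_prev2=0, eps=1e-3)  # a flip occurred
```

When the cause agrees with what the effect's own past already predicts, the measure is near zero; when a rare flip has occurred, the measure is large and grows without bound as `eps` shrinks.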
III-A3 Horse Betting
Consider the problem of horse race gambling with side information as presented in Section III-A of [permuter2011interpretations] (with minor adjustments to notation). At each time the gambler bets all of their wealth based on the past winners and side information . As a result, the gambler’s wealth at time , denoted , is a function of the winning horses and side information up to that time. Lastly, the amount of money that is won for betting on the winning horse is given by the odds, and the portion of wealth bet on each horse is given by with . Thus, the evolution of the wealth can be described recursively as:
Finally, the expected growth rate of the wealth is defined as .
It is shown in [permuter2011interpretations] that the betting strategy that maximizes the expected growth rate is given by distributing bets according to the conditional distribution of given all available information:
Similarly, we can define a restricted betting strategy where the side information is not available (and optimal strategy ). The wealth that is obtained under that strategy is then given by:
Letting and represent the wealth resulting from using the optimal strategies, it is further shown in [permuter2011interpretations] that the increase in growth rate resulting from including side information in the betting strategy is given by:
It should be noted that the result in (26) holds for any choice of odds . Thus, we proceed by making the mild assumption that the odds chosen by the racetrack are such that, for any past sequence of winners , the gambler optimally betting without side information is expected to lose money on round :
for some . We define the above equation as the conditional expected growth rate for race (without side information). As a consequence, this implies a negative expected growth rate for the gambler’s wealth without side information:
where the initial wealth is assumed, without loss of generality, to be .
It follows that a gambler with access to side information ought to gamble only if their expected growth rate is greater than zero. Applying this condition to (26), a gambler with side info can expect to win money if:
Thus, when equipped with the DI, a gambler will decide either to visit the racetrack and bet on every race or to stay at home. It turns out, however, that the gambler may be doing themselves a disservice by staying home any time that (28) does not hold. To see this, suppose that before race the gambler has witnessed winners and side information , and wishes to gamble if they expect to make money on the current race. Such a scenario occurs when the conditional expected growth rate for round is positive:
IV Estimating the Causal Measure
An estimate of the causal measure can be obtained by simply estimating the complete and restricted distributions and then computing the KL divergence between the two at each time. Such an estimator allows us to leverage results from the field of sequential prediction [merhav1998universal]. The sequential prediction problem formulation we consider is as follows: for each round , having observed some history , a learner selects a probability assignment , where is the space of probability distributions over . Once is chosen, is revealed and a loss is incurred by the learner, where the loss function is chosen to be the self-information loss given by .
The performance of sequential predictors may be assessed using a notion of regret with respect to a reference class of probability distributions . For a given round and reference distribution , the learner’s regret is:
In many cases the performance of sequential predictors will be measured by the worst case regret, given by:
where is defined as the distribution from the reference class with the smallest cumulative loss up to time , i.e. the for which is largest. We also define to be the cumulative loss minimizing joint distribution, noting that the reference class of joint distributions is not necessarily equal to (i.e. ), as often times there may be a constraint on the selection of the best reference distribution that is imposed in order to establish bounds. In the absence of any restrictions, the reference distributions may be selected at each time such that , resulting in zero cumulative loss for any sequence . Thus, sequential prediction problems impose restrictions on the reference distributions with which to compare predictor performance [merhav1998universal]. For example, one may assume stationarity by enforcing or assume that for all but some small number of indices. For various learning algorithms (i.e. strategies for selecting given ) and reference classes , these bounds on the worst case regret are defined as a function of the sequence length :
It follows naturally that an estimator for our causal measure can be constructed by building two sequential predictors: the restricted predictor, computed at each round using the restricted history, and the complete predictor, computed at each round using the complete history. It then follows that each of these predictors will have an associated worst case regret, given by and , where and represent the restricted and complete reference classes. Using these sequential predictors, we define our estimated causal influence from to at time as:
It should be noted that when averaged over time, this estimator becomes a universal estimator of the directed information rate for certain predictors and classes of signals [jiao2013universal].
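The two-predictor construction can be sketched for binary sequences as follows. This is a minimal illustration under assumptions that are not taken from the paper: both predictors are add-1/2 count-based estimators, the restricted context is the previous effect sample, the complete context additionally includes the previous cause sample, and there is no side information. The class and function names are hypothetical.

```python
from collections import defaultdict
from math import log2
import random

class ContextPredictor:
    """Add-1/2 sequential predictor of a binary symbol given a context."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])

    def predict(self, context):
        c0, c1 = self.counts[context]
        p1 = (c1 + 0.5) / (c0 + c1 + 1.0)  # never exactly 0 or 1
        return (1.0 - p1, p1)

    def update(self, context, symbol):
        self.counts[context][symbol] += 1

def kl_bits(p, q):
    """D(p || q) in bits for pmfs given as tuples."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def estimate_causal_measure(x, y):
    """Estimated measure from x to y at each time, using order-1 contexts."""
    restricted, complete = ContextPredictor(), ContextPredictor()
    estimates = []
    for i in range(1, len(y)):
        ctx_r = (y[i - 1],)            # restricted history: past of y only
        ctx_c = (y[i - 1], x[i - 1])   # complete history: past of y and x
        # Predict before seeing y[i], then update both predictors.
        estimates.append(kl_bits(complete.predict(ctx_c), restricted.predict(ctx_r)))
        restricted.update(ctx_r, y[i])
        complete.update(ctx_c, y[i])
    return estimates

def simulate_noisy_copy(n, flip, seed=0):
    """Toy data: y copies the previous x and flips it with probability flip."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    y = [0] + [x[i - 1] ^ (rng.random() < flip) for i in range(1, n)]
    return x, y
```

Because the add-1/2 predictors never assign zero probability, every estimate is finite and non-negative. On data from `simulate_noisy_copy` the estimates should settle well above zero, whereas for a `y` generated independently of `x` they should shrink toward zero as counts accumulate.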
To assess the performance of an estimate of the causal measure, we define a notion of causality regret:
where we define:
with and defined as the loss minimizing distributions from the complete and restricted reference classes. We note that with this notion of causal regret, the estimated causal measure is being compared against the best estimate of the causal measure from within a reference class. As such, we limit our consideration to the scenario in which the reference classes are sufficiently representative of the true sequences to produce a desirable (i.e. for all ).
We now present the necessary assumptions for proving a finite sample bound on the estimates of causality regret.
For sequential predictors and and observations , we assume that and are absolutely continuous with respect to each other, i.e.:
Clearly, the above assumption will be satisfied for any sequential prediction algorithm that does not assign zero probability to any outcomes.
For loss minimizing distributions and , restricted sequential predictor , and observations :
While it is understood that the expected regret is in general bounded by the worst case regret, Assumption 2 requires that the reference classes are sufficiently rich that the expected regret is not too large in absolute value. This is necessary in bounding the causality regret because unlike the regret defined by (31), increases when the estimated distributions outperform the regret minimizing distributions.
Let the worst case regret for the predictors and be bounded by and , respectively. Then, for any collection of observations