 # A causation coefficient and taxonomy of correlation/causation relationships

This paper introduces a causation coefficient which is defined in terms of probabilistic causal models. This coefficient is suggested as the natural causal analogue of the Pearson correlation coefficient and permits comparing causation and correlation to each other in a simple, yet rigorous manner. Together, these coefficients provide a natural way to classify the possible correlation/causation relationships that can occur in practice and examples of each relationship are provided. In addition, the typical relationship between correlation and causation is analyzed to provide insight into why correlation and causation are often conflated. Finally, example calculations of the causation coefficient are shown on a real data set.


## Introduction

The maxim, “Correlation is not causation”, is an important warning to analysts, but provides very little information about what causation is and how it relates to correlation. This has prompted other attempts at summarizing the relationship. For example, Tufte tufte2006 suggests either, “Observed covariation is necessary but not sufficient for causality”, which is demonstrably false or, “Correlation is not causation but it is a hint”, which is correct, but still underspecified. In what sense is correlation a ‘hint’ to causation?

Correlation is well understood and precisely defined. Generally speaking, correlation is any statistical relationship involving dependence, i.e. the random variables are not independent. More specifically, correlation can refer to a descriptive statistic that summarizes the nature of the dependence. Such statistics do not provide all of the information available in the joint probability distribution, but can provide a valuable summary that is easier to reason about. Among the most popular, and often referred to as just “the correlation coefficient” weisstein-correlation, is the Pearson correlation coefficient, which is a measure of the linear correlation between variables.

Causality is an intuitive idea that is difficult to make precise. The key contribution of this paper is the introduction of a “causation coefficient”, which is suggested as the natural causal analogue of the Pearson correlation coefficient. The causation coefficient permits comparing correlation and causation to each other in a manner that is both rigorous and consistent with common intuition.

The rest of this paper is outlined as follows: The statistical/causal distinction is discussed to provide background. The existing probabilistic causal model approach to causality is briefly summarized. The causation coefficient is defined in terms of probabilistic causal models and some of the properties of the coefficient are discussed to support the claim that it is the natural causal analogue of the Pearson correlation coefficient.

The definition of the causation coefficient permits the following new analyses to be conducted: A taxonomy of the possible relationships between correlation and causation is introduced, with example models. The typical relationship between correlation and causation is analyzed to provide insight into why correlation and causation are often conflated. Finally, example calculations of the causation coefficient are shown on a real data set.

## Statistical/causal distinction

Causality is difficult to formalize. Causality is implicit in the structure of ordinary language brown1983 and the words ‘causality’ and ‘causal’ are often used to refer to a number of disparate concepts. In particular, much confusion stems from conflating three distinct tasks in causal inference heckman2005:

1. Definitions of counterfactuals

2. Identification of causal models from population distributions

3. Selection of causal models given real data

Counterfactuals, as defined in philosophy, are hypothetical or potential outcomes – statements about possible alternatives to the actual situation lewis1973. A classic example is, “If Nixon had pressed the button, there would have been a nuclear holocaust” fine1975, a statement which seems intuitively correct, but difficult to formally model and impossible to empirically verify. Defining causation in terms of counterfactuals originates with Hume in defining a cause to be, “where, if the first object had not been, the second never had existed” hume1748. Indeed, in a world where there had been a nuclear war during the Nixon administration, it would be quite reasonable to claim that launching nuclear missiles was a cause. A key difficulty in making this notion of causality precise is that it requires precise models of counterfactuals and therefore precise assumptions that can be unobservable and untestable even in principle.

For example, consider the possible results of treating a patient in a clinical setting. In the notation of the Rubin causal model holland1986, a particular patient or unit, $u$, can be potentially exposed to either treatment, $t$, or control, $c$. (“Units” are the basic objects, or primitives, of study in an investigation in the Rubin causal model approach. Examples of units include individual human subjects, households, or plots of land.) The treatment effect, $Y_t(u) - Y_c(u)$, is the difference between the outcomes when the patient is exposed to treatment and when the same patient is exposed to the control. (This is also referred to as “causal effect” in the literature. In this paper, “treatment effect” is used to avoid confusion with the related but distinct definition of causal effect in the probabilistic causal model approach.) Determining treatment effect on a unit is usually the ultimate goal of causal inference. It is also impossible to observe – the same patient cannot be both treated and not treated – a problem which Holland names the Fundamental Problem of Causal Inference.

This is not to suggest that causal inference is impossible, merely that additional assumptions must be made if causal conclusions are to be reached. A well known assumption that makes causal inference possible is randomization. Assuming that units are randomly assigned to treatment or control groups, it is possible to estimate the average treatment effect, $E[Y_t] - E[Y_c]$. Note that randomization is an assumption external to the data; it is not possible to determine, from the data alone, that it was obtained from a randomized controlled trial. Another example of an assumption that permits causal inference is unit homogeneity, which can be thought of as “laboratory conditions”. If different units are carefully prepared, it may be reasonable to assume that they are equivalent in all relevant aspects, i.e. $Y_t(u_1) = Y_t(u_2)$ and $Y_c(u_1) = Y_c(u_2)$. For example, it is often assumed that any two samples of a given chemical element are effectively identical. In these cases, treatment effect can be calculated directly as $Y_t(u_1) - Y_c(u_2)$.

A closely related concept is ceteris paribus, roughly, “other things held constant”, which is a mainstay of economic analysis heckman2005. For example, increasing the price of some good will cause demand to fall, assuming that no other relevant factors change at the same time. This is not to suggest that no other factors will change in a real economy; ceteris paribus simply isolates the effect of one particular change to make it more amenable to analysis.

In practice, the first causal inference task, defining counterfactuals, requires having a scientific theory. For example, classical mechanics describes the possible states of idealized physical systems and can provide an account of manipulation. The theory can predict what would happen to the trajectory of some object if an external force were to be applied, whether or not such a force was actually applied in the real world. Scientific theories are usually parameterized; one example of a parameter is standard gravity, $g_n$, the acceleration of an object due to gravity near the surface of the earth nist2008.

The second causal inference task, identification from population distributions, is a problem of uniquely determining a causal model or some property of a causal model from hypothetical population data. In other words, the problem is to find unique mappings from population distributions or other population measures to causal parameters. This can be thought of as the problem of determining which scientific theory is correct, given data without sampling error. A well-designed experiment to determine $g_n$ will, in the limit of infinite samples, yield the exact value for the parameter.

The third task, selection of causal models given real data, is the problem of inference in practice. Any real experiment can only provide an analyst with a finite-sample distribution subject to random error. This problem lies in the domain of estimation theory and hypothesis testing.

In addition to the standard population/sample distinction, this paper follows Pearl’s conventions in referring to the statistical/causal distinction pearl2009. (This distinction has been referred to by many different names, including: descriptive/etiological, descriptive/generative, associational/causal, empirical/theoretical, observational/experimental, observational/interventional.) A statistical concept is a concept that is definable in terms of a joint probability distribution of observed variables. Variance is an example of a statistical parameter; the statement that the joint distribution is multivariate normal is an example of a statistical assumption. A causal concept is a nonstatistical concept that is definable in terms of a causal model. Randomization is an example of a causal, not statistical, assumption because it is impossible to determine from a joint probability distribution that a variable was randomly assigned.

This distinction draws a sharp line between statistical and causal analysis, which can be thought of as the difference between analyzing uncertain, yet static conditions versus changing conditions pearl2001. Estimating and updating the likelihood of events based on observed evidence are statistical tasks, given that experimental conditions remain the same. Causal analysis aims to infer the likelihood of events under changing conditions, brought about by external interventions such as treatments or policy changes. Much like how statistical inference is performed with respect to assumptions formalized in a statistical model, rigorous causal inference requires formal causal models.

## Probabilistic causal models

Probabilistic causal models (also referred to as “graphical causal models” and “structural causal models” in the literature) are an approach to causality characterized by nonparametric models associated with a type of directed acyclic graph (DAG) called a causal diagram. The concept of using graphs to model probabilistic and causal relationships originates with Wright’s path analysis wright1934. The modern, nonparametric version appears to have been first proposed by Pearl and Verma pearl1991 and has been the subject of considerable research since then pearl1995 shpitser2008 bareinboim2014-transportability. This section is meant to provide a high-level summary of probabilistic causal models, sufficient to explain the proposed causation coefficient.

### Causal models

The philosophy of probabilistic causal models is that of Laplacian quasi-determinism – a complete description of the state of a system is sufficient to exactly determine how the system will evolve laplace1814. In this view, randomness is a statement of an analyst’s ignorance, not inherent to the system itself. (This excludes quantum-mechanical systems from analysis. Arguably, the ‘intrinsic randomness’ of these systems is why they are so often considered counterintuitive.)

A causal model, $M$, consists of a set of equations where each child-parent family is represented by a deterministic function:

$$
X_i = f_i(pa_i, \epsilon_i), \quad i = 1, \ldots, n
$$

The variables $X_i$ are the endogenous variables, determined by factors in the model. $pa_i$ denotes the parents of $X_i$, which can be thought of as the direct, known causes of $X_i$. The variables $\epsilon_i$ are the exogenous variables; they can appropriately be considered ‘background’ variables or ‘error terms’ and correspond to variables that are determined by factors outside of the model pearl2009. Since the exogenous variables model those factors that cannot be directly accounted for, they are treated as random variables. Regardless of the distribution of the exogenous variables, or the functional form of the equations, a probability distribution, $P(\epsilon)$, over the exogenous variables induces a probability distribution, $P(x_1, \ldots, x_n)$, over the endogenous variables pearl2009. The resulting model is called a probabilistic causal model.

Each causal model induces a causal diagram, $G$, where each $X_i$ corresponds to a vertex and each parent-child relationship between $pa_i$ and $X_i$ corresponds to a directed edge from parent to child. In this paper, it is assumed that all models are recursive, i.e. all models induce an acyclic causal diagram.

This paper adopts the convention that all of the exogenous variables are mutually independent. If all of the endogenous variables are observable – denoted in the causal diagram by solid nodes – then the model is called Markovian. The joint probability function, $P(x_1, \ldots, x_n)$, in a Markovian model is said to be Markov compatible with the causal diagram $G$ in that it respects the Markov condition: each variable is independent of all its non-descendants given its parents in the graph bareinboim2012-local. Dependence between two observable variables that have no observable ancestor can be introduced by adding a latent endogenous variable, denoted in a causal diagram by an open node. Such a model is called semi-Markovian.

Each causal diagram can be thought of as denoting a set of causal models. Most of this paper considers the following set of models with endogenous variables $X$, $Y$ and $Z$:

$$
\begin{aligned}
Z &= f_Z(\epsilon_Z) \\
X &= f_X(Z, \epsilon_X) \\
Y &= f_Y(X, Z, \epsilon_Y)
\end{aligned}
$$

### Causal effect

Without any additional context, this characterization of probabilistic causal models appears merely to be a way to generate Bayesian networks. However, the functional, quasi-deterministic approach also specifies how the probability distribution of the observable variables changes in response to an external intervention. The simplest external intervention is where a single variable $X_i$ is forced to take some fixed value $x_i$, ‘setting’ or ‘holding constant’ $X_i$. Such an atomic intervention corresponds to replacing the equation $X_i = f_i(pa_i, \epsilon_i)$ with the constant $x_i$, generating a new model. This can be extended to sets of variables.

Causal effect

pearl1995 Given two disjoint sets of variables, $X$ and $Y$, the causal effect of $X$ on $Y$, denoted either as $P(y \mid \hat{x})$ or $P(y \mid do(x))$, is a function from $X$ to the space of probability distributions on $Y$. For each realization $x$ of $X$, $P(y \mid \hat{x})$ gives the probability of $Y = y$ induced by deleting from the model all equations corresponding to variables in $X$ and substituting $X = x$ in the remaining equations.

Crucially, causal effect, $P(y \mid \hat{x})$, is fundamentally different than conditioning or observation, $P(y \mid x)$. The latter is a function of the joint probability distribution of the original model, $M$. The former is a function of the distribution of the submodel, $M_x$, that results from the effect of action $do(X = x)$ on $M$. Intuitively, this can be thought of as ‘cutting’ all of the incoming edges to $X$ and replacing the random variable with the constant $x$.

Figure 2: An intervention $do(X = x)$ on causal model $M$ produces submodel $M_x$.
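The difference between conditioning and intervening can be sketched in a few lines of Python. This is an illustration only, not part of the original analysis: it assumes a hypothetical three-variable Bernoulli model ($Z = \epsilon_Z$, $X = Z$, $Y = Z \wedge \epsilon_Y$ with fair coins), and the function names `joint` and `prob_y1` are made up for the example. The intervention is implemented literally as described above, by skipping the equation for $X$ and substituting a constant.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

def joint(do_x=None):
    """Exact distribution over (z, x, y) for the hypothetical model
    Z = eps_Z, X = Z, Y = Z and eps_Y, with eps_Z and eps_Y fair coins.
    If do_x is given, the equation for X is replaced by that constant."""
    dist = {}
    for eps_z, eps_y in product((0, 1), repeat=2):
        z = eps_z
        x = z if do_x is None else do_x   # the intervention cuts the edge Z -> X
        y = z & eps_y
        key = (z, x, y)
        dist[key] = dist.get(key, Fraction(0)) + HALF * HALF
    return dist

def prob_y1(dist, given_x=None):
    """P(Y=1), optionally conditioned on X = given_x."""
    rows = [(k, p) for k, p in dist.items() if given_x is None or k[1] == given_x]
    total = sum(p for _, p in rows)
    return sum(p for (z, x, y), p in rows if y == 1) / total

observed = prob_y1(joint(), given_x=1)   # conditioning: P(Y=1 | X=1)
forced = prob_y1(joint(do_x=1))          # intervening:  P(Y=1 | do(X=1))
print(observed, forced)
```

In this toy model, observing $X = 1$ reveals that $Z = 1$ and therefore changes the probability of $Y$, while forcing $X = 1$ leaves $Y$ untouched because $X$ does not appear in the equation for $Y$.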

It is possible for DAGs to be observationally equivalent, i.e. Markov compatible with the same set of joint probability distributions pearl1991. Two observationally equivalent DAGs cannot be distinguished without performing interventions or drawing on additional causal information. For example pearl2009, in a causal diagram modeling relationships between the season, rain, sprinkler settings and whether the ground is wet, it would be reasonable to accept a model where the season causally affects the sprinkler settings, but not vice versa. While indistinguishable from observation alone, the two models imply different results due to intervention; it would be implausible that changing the settings on a sprinkler would cause the season of the year to change as a result.

### Identification of causal effect

The problem of whether a causal query can be uniquely answered is referred to as causal identifiability. An unbiased estimate of $P(y \mid \hat{x}_i)$ can always be calculated from observational (preintervention) probabilities in Markovian models by conditioning on the parents of $X_i$ and averaging the result, weighted by the probabilities of $pa_i$. This operation is called “adjusting for $PA_i$” or “adjustment for direct causes” pearl2009. More formally, the observational probability distribution and causal diagram of a Markovian model identify the effect of the intervention $do(X_i = x_i)$ on $Y$, which is given by:

$$
P(y \mid \hat{x}_i) = \sum_{pa_i} P(y \mid x_i, pa_i) \, P(pa_i)
$$
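The adjustment formula can be sketched as follows. The numbers are hypothetical and chosen only for illustration: a Markovian model with diagram $Z \to X$, $Z \to Y$, $X \to Y$ is specified by its factors, the observational joint is built from them, and `adjusted` (a name invented for this sketch) then uses only that joint to recover the causal effect.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical Markovian model Z -> X, Z -> Y, X -> Y, specified by its factors.
p_z1 = F(1, 2)                                        # P(Z=1)
p_x1_given_z = {0: F(1, 4), 1: F(3, 4)}               # P(X=1 | z)
p_y1_given_xz = {(0, 0): F(1, 10), (1, 0): F(3, 10),
                 (0, 1): F(6, 10), (1, 1): F(8, 10)}  # P(Y=1 | x, z)

def bern(p1, value):
    """Probability that a Bernoulli(p1) variable equals value."""
    return p1 if value == 1 else 1 - p1

# Observational joint P(x, y, z): the only input the adjustment formula uses.
joint = {(x, y, z): bern(p_z1, z) * bern(p_x1_given_z[z], x)
                    * bern(p_y1_given_xz[(x, z)], y)
         for x, y, z in product((0, 1), repeat=3)}

def marginal(**fixed):
    """Sum the joint over every entry matching the fixed coordinates."""
    idx = {"x": 0, "y": 1, "z": 2}
    return sum(p for key, p in joint.items()
               if all(key[idx[name]] == v for name, v in fixed.items()))

def adjusted(y, x):
    """P(y | do(x)) = sum_z P(y | x, z) P(z), adjusting for Z, the parent of X."""
    return sum(marginal(x=x, y=y, z=z) / marginal(x=x, z=z) * marginal(z=z)
               for z in (0, 1))

print(adjusted(1, 1), marginal(x=1, y=1) / marginal(x=1))
```

The two printed values differ: the first is the adjusted (interventional) probability and the second is the plain conditional probability, which is inflated here because $Z$ raises both $X$ and $Y$.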

Semi-Markovian models do not always permit identification. A simple example is when a single latent variable is a parent of every observable variable. Informally, it is not possible to determine if observed covariation is indicative of a causal effect between two variables, or whether their common, unobservable parent brings about the correlation. For example shpitser2008, consider two models, $M^1$ and $M^2$, where both models have observable variables $X, Y$, latent $U$, and $X = U$. In $M^1$, $Y = X \oplus U$; in $M^2$, $Y = 0$. These models are compatible with the same causal diagrams and have identical observational probability distributions, but different causal effects, $P(y \mid \hat{x})$. Since the causal effect cannot be uniquely calculated from the available information, it is not identifiable. However, many semi-Markovian models still permit estimation of certain causal effects. Complete methods are described in shpitser2008.

## The causation coefficient

The Pearson product-moment correlation coefficient, $\rho_{X,Y}$, is a standard measure of correlation between random variables. This is commonly described as a measure of how well the relationship between $X$ and $Y$ can be modeled by a linear relationship, with $\rho_{X,Y} = -1$ or $+1$ being a perfect negative or positive linear relationship and $\rho_{X,Y} = 0$ representing no linear relationship at all. The population correlation coefficient is defined as a normalized covariance weisstein-correlation:

$$
\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sqrt{\operatorname{Var}[X]\operatorname{Var}[Y]}} = \frac{E[XY] - E[X]E[Y]}{\sqrt{\left(E[X^2] - E[X]^2\right)\left(E[Y^2] - E[Y]^2\right)}}
$$

For discrete random variables, this is a function of the joint probability mass function (for continuous random variables that admit a probability density function, the summations are replaced with integrals):

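$$
\rho_{X,Y} = \frac{\sum_x \sum_y xy\,P(x,y) - \sum_x x\,P(x) \sum_y y\,P(y)}{\sqrt{\left(\sum_x x^2 P(x) - \left(\sum_x x\,P(x)\right)^2\right)\left(\sum_y y^2 P(y) - \left(\sum_y y\,P(y)\right)^2\right)}}
$$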

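The causation coefficient relies on the observation that the correlation coefficient can, by the law of total probability, be rewritten as a function of the conditional distribution, $P(y \mid x)$, and marginal distribution, $P(x)$, instead of in terms of the joint distribution:

$$
\rho_{X,Y} = \frac{\sum_x \sum_y xy\,P(y \mid x)P(x) - \sum_x x\,P(x) \sum_x \sum_y y\,P(y \mid x)P(x)}{\sqrt{\operatorname{Var}[X]\left(\sum_x \sum_y y^2 P(y \mid x)P(x) - \left(\sum_x \sum_y y\,P(y \mid x)P(x)\right)^2\right)}}
$$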

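Syntactically, the causation coefficient, $\gamma_{X \to Y}$, is defined by replacing $P(y \mid x)$ with $P(y \mid \hat{x})$ and $P(x)$ with $\hat{P}(x)$. As a convenience, the following terms are also defined: $\operatorname{Var}[\hat{X}] = \sum_x x^2 \hat{P}(x) - \left(\sum_x x\,\hat{P}(x)\right)^2$ and $\operatorname{Var}[Y_{\hat{X}}] = \sum_x \sum_y y^2 P(y \mid \hat{x})\hat{P}(x) - \left(\sum_x \sum_y y\,P(y \mid \hat{x})\hat{P}(x)\right)^2$. The full definition of $\gamma_{X \to Y}$ is then:

$$
\gamma_{X \to Y} = \frac{\sum_x \sum_y xy\,P(y \mid \hat{x})\hat{P}(x) - \sum_x x\,\hat{P}(x) \sum_x \sum_y y\,P(y \mid \hat{x})\hat{P}(x)}{\sqrt{\operatorname{Var}[\hat{X}]\operatorname{Var}[Y_{\hat{X}}]}}
$$

where $P(y \mid \hat{x})$ is the causal effect of $X$ on $Y$ and $\hat{P}(x)$ is the distribution of interventions.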
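As a rough computational sketch of this definition (the function names and the numbers are assumptions made for illustration), both coefficients can be computed by one routine: the causation coefficient is the correlation formula applied to the table $P(y \mid \hat{x})\hat{P}(x)$ in place of the joint $P(x, y)$.

```python
from math import sqrt

def pearson_rho(joint):
    """Population Pearson coefficient from a joint pmf {(x, y): P(x, y)}."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    ex2 = sum(x * x * p for (x, y), p in joint.items())
    ey2 = sum(y * y * p for (x, y), p in joint.items())
    return (exy - ex * ey) / sqrt((ex2 - ex ** 2) * (ey2 - ey ** 2))

def causation_gamma(effect, p_hat):
    """Causation coefficient for discrete X and Y.

    effect: {x: {y: P(y | do(x))}}, the causal effect of X on Y
    p_hat:  {x: P_hat(x)}, the distribution of interventions
    """
    # Replacing P(y|x) by P(y|do(x)) and P(x) by P_hat(x) yields a joint-like
    # table to which the correlation formula is applied unchanged.
    table = {(x, y): effect[x][y] * p_hat[x] for x in p_hat for y in effect[x]}
    return pearson_rho(table)

# Hypothetical causal effect, for illustration only.
effect = {0: {0: 0.65, 1: 0.35}, 1: {0: 0.45, 1: 0.55}}
print(causation_gamma(effect, {0: 0.5, 1: 0.5}))
```

Passing a different `p_hat` reweights the interventions without changing the causal effect itself, which is the role the distribution of interventions plays in the definition.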

### Distribution of interventions

In the discrete case, the distribution of interventions can be thought of as a set of weights for averaging the possible causal effects. It also has an interpretation in the context of observational and experimental studies. As an example, consider a scenario where patients decide for themselves whether or not to take a drug ($X$), and observe whether or not they recover ($Y$). The population joint probability distribution, $P(x, y)$, provides all of the information available from an idealized version of this observational study. For intuition, it may be helpful to imagine $P(x, y)$ as being calculated from millions of samples to the point where random sampling error has ceased to be a relevant consideration.

The simplest way to model this is with Bernoulli (binary) random variables for $X$ and $Y$, with $0$ representing no treatment or failure to recover and $1$ representing treatment or recovery. The probability of patients deciding for themselves whether or not to take the drug, in this observational study, is the marginal probability $P(x)$. In clinical terms, $P(X = 0)$ and $P(X = 1)$ are the relative sizes of the cohorts.

However, even in an idealized observational study, $P(x, y)$ would not provide definitive information on whether treatment actually improves patient outcomes. Hypothetically, the drug could cause unpleasant side effects in the patients that would have received the greatest benefit, leading those patients to choose not to take the drug. An idealized randomized controlled trial would permit an analyst to directly measure $P(y \mid \hat{x})$, as randomization explicitly cuts out confounding. However, randomized controlled trials are often impractical to run (e.g. too expensive or unethical).

The relative sizes of the cohorts in an observational study may be different than the relative sizes of the treatment and control groups in a corresponding randomized controlled trial – this is the use of the distribution of interventions, $\hat{P}(x)$. Experiments are often designed to have equal group sizes as this typically provides maximum statistical power, but this is by no means universal. Also, it is not uncommon for patients to drop out or otherwise be disqualified from studies, so the cohorts will often be unequal in practice.

The natural causation coefficient is the causation coefficient where $\hat{P}(x)$ is equal to the pre-intervention marginal distribution, $P(x)$. This corresponds to an experimental trial where the treatment groups are scaled to be proportional to the relative sizes seen in the observational study.

The maximum entropy causation coefficient is the causation coefficient where $\hat{P}(x)$ is a maximum entropy probability distribution. For random variables with bounded support, this is the uniform distribution and corresponds to equal treatment group sizes.

Other distributions of interventions are possible, to reweight the effects of certain interventions relative to others in the computation of the causation coefficient. The distribution used should then be stated explicitly alongside the coefficient. For example, a certain drug may be known to be helpful in certain small doses, but worse than no treatment at all in larger doses, in which case both the natural and maximum entropy coefficients could be misleading. In such cases, a distribution of interventions corresponding to current best practices may be more informative.

### Independence and invariance

The definition of independence of random variables $X$ and $Y$ is: $P(x, y) = P(x)P(y)$ or, equivalently: $P(y \mid x) = P(y)$. In other words, observing $X$ provides no information about $Y$ (and vice versa). The causal equivalent is invariance of $Y$ to $X$: $P(y \mid \hat{x}) = P(y)$; that is to say, no possible intervention on $X$ can affect $Y$ bareinboim2012-local. Unlike independence, invariance is not symmetric. The term mutually invariant is suggested to refer to when both $Y$ is invariant to $X$ and $X$ is invariant to $Y$.

For Bernoulli random variables, $X$ and $Y$ are uncorrelated ($\rho_{X,Y} = 0$) if and only if they are independent. The analogous condition holds for the causation coefficient. For Bernoulli distributed $X$ and $Y$, $\gamma_{X \to Y} = 0$ if and only if $Y$ is invariant to $X$ (see appendix for proof). However, both the correlation and causation coefficients have difficulty capturing nonlinear relationships between variables. (Also worth noting is that neither coefficient is robust to outliers. This can be mitigated by winsorizing or trimming.)

In general, independence implies $\rho_{X,Y} = 0$ and invariance implies $\gamma_{X \to Y} = 0$, but the converse does not hold for many distributions.

As a simple example, Table 1 contains interventional distributions where $Y$ is not invariant to $X$, but the maximum entropy causation coefficient is zero. The natural causation coefficient may be positive, negative or zero depending on the observational (pre-intervention) distribution $P(x)$.

### Average treatment effect

Average treatment effect is defined as chickering1996:

$$
ATE(X \to Y) = P(Y = 1 \mid do(X = 1)) - P(Y = 1 \mid do(X = 0))
$$

This is the probabilistic causal model equivalent of the Rubin causal model definition of average treatment effect. Positive ATE implies that treatment is, on average, superior to non-treatment, while negative ATE implies the opposite. For Bernoulli distributed $X$ and $Y$, $\gamma_{X \to Y}$ reduces to (see appendix for proof):

$$
\gamma_{X \to Y} = ATE(X \to Y) \sqrt{\frac{\operatorname{Var}[\hat{X}]}{\operatorname{Var}[Y_{\hat{X}}]}}
$$

Since variance is strictly positive for nondegenerate Bernoulli distributions, this implies that $\gamma_{X \to Y}$ has the same sign as the average treatment effect.
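The reduction can be checked with a short calculation. The shorthand $\hat{p} = \hat{P}(X = 1)$ and $q_x = P(Y = 1 \mid do(X = x))$ is introduced here only for brevity, and $\operatorname{Var}[Y_{\hat{X}}]$ denotes the variance of $Y$ under the interventional table $P(y \mid \hat{x})\hat{P}(x)$, as in the definition of the causation coefficient. For Bernoulli variables only the terms with $x = 1$ and $y = 1$ survive in the sums:

```latex
\begin{aligned}
\sum_x \sum_y x y \, P(y \mid \hat{x}) \hat{P}(x) &= \hat{p} \, q_1 \\
\sum_x x \hat{P}(x) \sum_x \sum_y y \, P(y \mid \hat{x}) \hat{P}(x) &= \hat{p} \left( \hat{p} q_1 + (1 - \hat{p}) q_0 \right) \\
\text{numerator of } \gamma_{X \to Y} &= \hat{p} (1 - \hat{p}) (q_1 - q_0) = \operatorname{Var}[\hat{X}] \cdot ATE(X \to Y) \\
\gamma_{X \to Y} &= \frac{\operatorname{Var}[\hat{X}] \cdot ATE(X \to Y)}{\sqrt{\operatorname{Var}[\hat{X}] \operatorname{Var}[Y_{\hat{X}}]}}
  = ATE(X \to Y) \sqrt{\frac{\operatorname{Var}[\hat{X}]}{\operatorname{Var}[Y_{\hat{X}}]}}
\end{aligned}
```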

## A taxonomy of correlation/causation relationships

For Bernoulli $X$ and $Y$, $\rho_{X,Y}$ and $\gamma_{X \to Y}$ provide a natural way to classify the possible correlation/causation relationships. $\rho$ and $\gamma$ can each be positive, negative or zero, implying 9 possible relationships. These are grouped into 5 classifications in Table 2.

In Table 2, “0” is a zero value for the coefficient, and “+/-” refers to the coefficient taking on a positive or a negative value (e.g. inverse causation refers to either a model with positive $\rho$ and negative $\gamma$, or negative $\rho$ and positive $\gamma$). Note that $\rho$ and $\gamma$ are population coefficients; this taxonomy can be thought of as categorizing the possible relationships between correlation and causation, in the limit of infinite samples.
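The grouping of the nine sign combinations into five classes can be sketched as a small function. This is an illustrative reading rather than a definitive one: the labels follow the headings of the subsections below, only the coefficient for the direction $X \to Y$ is considered, and the tolerance `tol` for treating a floating-point value as zero is an assumption of the sketch.

```python
def classify(rho, gamma, tol=1e-12):
    """Name the correlation/causation relationship from the signs of the
    population coefficients rho (correlation) and gamma (causation)."""
    def sign(v):
        return 0 if abs(v) <= tol else (1 if v > 0 else -1)

    r, g = sign(rho), sign(gamma)
    if r == 0 and g == 0:
        return "independent and invariant"
    if g == 0:
        return "common causation"      # correlated, but no causal effect
    if r == 0:
        return "unfaithfulness"        # causal effect, but no correlation
    return "genuine causation" if r == g else "inverse causation"

labels = {(r, g): classify(r, g) for r in (-1, 0, 1) for g in (-1, 0, 1)}
print(len(labels), len(set(labels.values())))
```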

Many of the relationships described in the following sections are well known and existing terminology is used where appropriate. Examples of each relationship are given, as well as simple probabilistic causal models of three Bernoulli distributed variables that produce the described relationship. Notably absent is the notion of mutual causation, which is beyond the scope of this paper. Note that while $\rho$ is symmetric, i.e. $\rho_{X,Y} = \rho_{Y,X}$, at least one of $\gamma_{X \to Y}$, $\gamma_{Y \to X}$ is zero in all recursive probabilistic causal models (see appendix for proof).

### Independent and invariant

Two variables that are independent and mutually invariant are completely unrelated – neither observing nor manipulating one can provide information about or change the other. This is usually the default assumption when studying a system – in hypothesis testing, the null hypothesis is usually “no effect”. For a somewhat absurd example, researchers would not believe by default that the average gas mileage of a Prius is related in any way to the minimum width of the English Channel munroe2010 – some sort of evidence would be expected before taking such a suggestion seriously. The notion of light cones provides an example familiar to physicists – the principle of locality and the theory of special relativity imply that nothing outside of someone’s past and future light cones can ever affect them.

Independent and invariant variables can be trivially mathematically modeled. An example is provided here to introduce the conventions used throughout the rest of this section. Let $\epsilon_X$, $\epsilon_Y$ and $\epsilon_Z$ be fair coins, i.e. independent Bernoulli distributed random variables with $p = 1/2$. These are the exogenous variables of the probabilistic causal model. $X$ will generally model a cause or treatment, $Y$, an effect or response, and $Z$, a confounding variable that causally affects $X$ and $Y$. An example model with independent and invariant $X$ and $Y$ is simply:

$$
\begin{aligned}
Z &= \epsilon_Z \\
X &= \epsilon_X \\
Y &= \epsilon_Y
\end{aligned}
$$

$X$ and $Y$ are clearly independent and invariant and the correlation and causation coefficients are $\rho_{X,Y} = \gamma_{X \to Y} = 0$.

### Common causation

Reichenbach appears to be the first to propose the “Principle of the Common Cause” claiming, “If an improbable coincidence has occurred, there must exist a common cause” reichenbach1956. Elaborating on this, he suggests that correlation between events $A$ and $B$ indicates either that $A$ causes $B$, $B$ causes $A$, or $A$ and $B$ have a common cause. This philosophical claim naturally suggests the following definition:

Common Causation

$X$ and $Y$ are said to experience common causation when $X$ and $Y$ are mutually invariant but not independent.

This effect is sometimes referred to as a “spurious relationship” or “spurious correlation” – a term originally coined by Pearson pearson1896. This risks conflating several distinct concepts: the interventional distributions from which $\gamma$ is calculated, the population observational distribution from which $\rho$ is calculated, and the finite-sample observational distribution, from which the sample correlation coefficient, $r$, is calculated. Consider the following scenarios:

• A very large number of samples are taken from invariant $X$ and $Y$, but due to a latent confounding variable, $X$ and $Y$ are correlated.

• A small number of samples are taken from independent and invariant $X$ and $Y$, but due to random sampling errors, the sample correlation coefficient suggests that $X$ and $Y$ are correlated.

In both scenarios, there is a spurious relationship between $X$ and $Y$. The first scenario exhibits common causation. The second scenario is due to random sampling error and, as the number of samples increases, the observed correlation will tend to zero. The term “coincidental correlation” is suggested to distinguish this finite-sample effect from common causation.

An example of a common cause can be found in a study on myopia and ambient lighting at night quinn1999. Development of myopia (shortsightedness) is correlated with nighttime light exposure in children, although the latter does not cause the former. The common cause is that myopic parents are likely to have myopic children, and also more likely to set up night lights.

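The following is a simple common causation model: Let $\epsilon_X$, $\epsilon_Y$ and $\epsilon_Z$ be fair coins and $X$, $Y$, and $Z$ be defined by the following three equations:

$$
\begin{aligned}
Z &= \epsilon_Z \\
X &= Z \wedge \epsilon_X \\
Y &= Z \wedge \epsilon_Y
\end{aligned}
$$

From the observational distribution, it is clear that $X$ and $Y$ are correlated ($\rho_{X,Y} > 0$) and from the interventional distributions, that $X$ and $Y$ are invariant ($\gamma_{X \to Y} = 0$).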
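These two claims can be verified by enumerating the eight equally likely outcomes of the three fair coins. The sketch below is an illustration with helper names (`outcomes`, `expect`) chosen for this example; it keeps the arithmetic exact with `Fraction` and computes $\rho^2$ rather than $\rho$ to avoid a square root.

```python
from fractions import Fraction
from itertools import product

def outcomes(do_x=None):
    """Yield (x, y, probability) for Z = eps_Z, X = Z and eps_X, Y = Z and eps_Y,
    with three fair coins. do_x, if given, replaces the equation for X."""
    for eps_z, eps_x, eps_y in product((0, 1), repeat=3):
        z = eps_z
        x = (z & eps_x) if do_x is None else do_x
        y = z & eps_y
        yield x, y, Fraction(1, 8)

def expect(f, do_x=None):
    """Expectation of f(x, y) under the (possibly intervened) model."""
    return sum(p * f(x, y) for x, y, p in outcomes(do_x))

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - expect(lambda x, y: x) ** 2
var_y = expect(lambda x, y: y * y) - expect(lambda x, y: y) ** 2
rho_squared = cov ** 2 / (var_x * var_y)   # exact; rho is its positive root since cov > 0

# Average treatment effect of X on Y; zero ATE means a zero causation coefficient.
ate = expect(lambda x, y: y, do_x=1) - expect(lambda x, y: y, do_x=0)
print(cov, rho_squared, ate)
```

The covariance is positive while the average treatment effect is exactly zero, which is the signature of common causation.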

### Inverse causation

A classic veridical paradox is the relationship between tuberculosis and dry climate gardner2006. At one point, Arizona, with one of the driest climates in the United States, was found to also have the largest share of tuberculosis deaths. This is because tuberculosis patients greatly benefit from a dry climate, and many moved there. The following is proposed as a definition for this type of scenario:

Inverse causation

$X$ and $Y$ are said to experience inverse causation when the correlation coefficient and causation coefficient have opposite signs.

Inverse causation is of special importance when considering clinical treatment; a case of inverse causation is a case where the correct treatment option is the opposite of what a naive interpretation of correlation would suggest.

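The following is a simple model that exhibits inverse causation: Let $\epsilon_Z$ be a fair coin, and $\epsilon_Y$ be Bernoulli distributed with $P(\epsilon_Y = 1) = p$, where $1/2 < p < 1$. The following is an inverse causation model with $\rho_{X,Y} = 1 - 2p < 0$ and $\gamma_{X \to Y} > 0$:

$$
\begin{aligned}
Z &= \epsilon_Z \\
X &= Z \\
Y &= \begin{cases} \neg Z & \text{if } \epsilon_Y = 1 \\ X & \text{if } \epsilon_Y = 0 \end{cases}
\end{aligned}
$$

“Inverse causation” is suggested to avoid confusion with other terminology. “Anti-causation” is inappropriate, as “anti-causal filters” in digital signal processing are filters whose output depends on future inputs. “Reverse causation” is also inappropriate, as this refers to mistakenly believing that $X$ has a causal effect on $Y$ when, in fact, $Y$ causes $X$.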

### Unfaithfulness

The Markov condition entails a set of conditional independence relations between variables corresponding to nodes in a DAG. The faithfulness condition [spirtes2000] (also referred to as stability pearl2009) is the converse.

Faithfulness condition

A distribution is faithful to a DAG if no conditional independence relations other than the ones entailed by the Markov condition are present.

This is a global condition, applying to a joint probability distribution and a DAG. The following local condition is defined in terms of two random variables and in a causal model:

Unfaithful

$X$ and $Y$ are said to be unfaithful if they are independent but not mutually invariant.

This local condition can only occur if the global faithfulness condition is violated (see appendix for proof). For Bernoulli random variables, $X$ and $Y$ are unfaithful if and only if $\rho_{X,Y} = 0$ and $\gamma_{X \to Y} \neq 0$.

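The following model is a simple example where $X$ and $Y$ are unfaithful. Let $\epsilon_Y$ and $\epsilon_Z$ be fair coins. Then, in the following model, $\rho_{X,Y} = 0$ and $\gamma_{X \to Y} > 0$:

$$
\begin{aligned}
Z &= \epsilon_Z \\
X &= Z \\
Y &= \begin{cases} \neg Z & \text{if } \epsilon_Y = 1 \\ X & \text{if } \epsilon_Y = 0 \end{cases}
\end{aligned}
$$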
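The coefficients of this model can be worked out exactly as a function of $p = P(\epsilon_Y = 1)$. The sketch below is an illustration with an invented function name; it returns the correlation coefficient and the average treatment effect, whose sign is the sign of the causation coefficient. Setting $p = 1/2$ gives the unfaithful model above, while $p = 3/4$ is an arbitrary choice used to show the same equations producing a negative correlation alongside a positive causal effect.

```python
from fractions import Fraction

def coefficients(p):
    """Return (rho, ATE) for Z = eps_Z, X = Z, Y = (not Z if eps_Y else X),
    where eps_Z is a fair coin and P(eps_Y = 1) = p."""
    half = Fraction(1, 2)

    def p_y1(do_x=None):
        """P(Y = 1), with the equation for X optionally replaced by do_x."""
        total = Fraction(0)
        for z in (0, 1):
            x = z if do_x is None else do_x
            # eps_Y = 1 selects not-Z, eps_Y = 0 selects X.
            total += half * (p * (1 - z) + (1 - p) * x)
        return total

    # Observationally X = Z, so E[XY] = P(Z=1, Y=1) = (1/2)(1 - p).
    e_xy = half * (1 - p)
    cov = e_xy - half * p_y1()
    rho = cov / (half * half)   # Var[X] = Var[Y] = 1/4, so sqrt(Var[X] Var[Y]) = 1/4
    ate = p_y1(do_x=1) - p_y1(do_x=0)
    return rho, ate

print(coefficients(Fraction(1, 2)), coefficients(Fraction(3, 4)))
```

In closed form the function returns $\rho_{X,Y} = 1 - 2p$ and $ATE = 1 - p$, so the correlation vanishes at $p = 1/2$ and turns negative beyond it, while the causal effect stays positive for every $p < 1$.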

Almost all models are faithful in a formal sense – models that do not respect the faithfulness condition have Lebesgue measure zero in probability spaces where model parameters have continuous support and are independently distributed spirtes2000. However, this does not mean that such models can be dismissed out of hand; they are vanishingly unlikely to occur by chance, but can be deliberately engineered.

#### Friedman’s thermostat and the traitorous lieutenant

Consider “Friedman’s Thermostat”; a correctly functioning thermostat would keep the indoor temperature constant, regardless of the external temperature, by adjusting the furnace settings. (Friedman introduced the thermostat analogy in the context of a central bank controlling money supply friedman2003. Its use as a general analogy for correlation and causation has been popularized by Rowe rowe2010.) Observation would show external temperature and furnace settings to be anticorrelated with each other and internal temperature to be uncorrelated with both. This does not correspond to the true causal effect that external temperature and furnace settings have on internal temperature.

The sharp-eyed reader will note that the Friedman’s thermostat example is not a recursive (acyclic) causal model. An example of unfaithfulness with a recursive causal model can be seen in the following “Traitorous Lieutenant” problem. Consider the problem of a general trying to send a one-bit message. The general has two lieutenants available to act as messengers; however, one of them is a traitor and will leak whatever information they have to the enemy. The general observes the following protocol: to send a 1, the general either gives the first lieutenant a 0 and the other a 1, or the first a 1 and the second a 0, with equal probability. To send a 0, the general either gives both lieutenants a 0, or both lieutenants a 1, with equal probability. The recipient of the message XORs both lieutenants’ bits to recover the original message.

In this scenario, the traitor will see a 0 and a 1 with equal probability, regardless of the actual message. This is unfaithfulness; a lieutenant changing their bit has a causal effect on the final message, but observing a single lieutenant’s bit provides no information.
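The protocol amounts to splitting the message into two XOR shares. The sketch below is an illustration under that reading, with hypothetical function names: it encodes a bit, decodes it, and enumerates the two equally likely encodings to show that each lieutenant's bit is a fair coin whatever the message is.

```python
import random

def encode(message, rng=random):
    """Split a one-bit message into two bits, one per lieutenant."""
    first = rng.randint(0, 1)   # uniformly random bit for the first lieutenant
    second = first ^ message    # chosen so that first XOR second == message
    return first, second

def decode(first, second):
    """The recipient XORs both bits to recover the message."""
    return first ^ second

def share_probabilities(message):
    """P(bit = 1) for each lieutenant, over the two equally likely encodings."""
    encodings = [(first, first ^ message) for first in (0, 1)]
    p_first = sum(f for f, _ in encodings) / len(encodings)
    p_second = sum(s for _, s in encodings) / len(encodings)
    return p_first, p_second

for message in (0, 1):
    print(message, share_probabilities(message))
```

Flipping either lieutenant's bit always flips the decoded message, so each bit has a causal effect on the outcome even though its marginal distribution carries no information about it.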

### Genuine causation and confounding bias

“Genuine causation” is suggested for referring to models where and have the same sign. However, due to confounding bias, the strength of the true causal effect may be different than what a naive interpretation of correlation would suggest.

The causal definition of no confounding is provided by Pearl pearl1998.

No confounding

Let be a causal model. and are not confounded in if and only if .

By the definition of the causation coefficient, no confounding implies .

Genuine causation with negative confounding bias corresponds to , and can be thought of as a weaker version of the type of confounding effect that produces unfaithfulness or inverse causation. In such cases, the true causal effect will be stronger than correlation suggests. Let be a fair coin and be Bernoulli distributed with . Then, the following model exhibits genuine causation with negative confounding bias, with and :

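$$
\begin{aligned}
Z &= \epsilon_Z \\
X &= Z \\
Y &= \begin{cases} \neg Z, & \text{if } \epsilon_Y = 1 \\ X, & \text{if } \epsilon_Y = 0 \end{cases}
\end{aligned}
$$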

Genuine causation with positive confounding bias corresponds to ; in such cases, the true causal effect will be weaker than correlation suggests. Let be fair coins. In the following model, , the natural causation coefficient, , and the maximum entropy causation coefficient, :

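$$
\begin{aligned}
Z &= \epsilon_Z \\
X &= Z \wedge \epsilon_X \\
Y &= \begin{cases} Z, & \text{if } \epsilon_Y = 1 \\ X, & \text{if } \epsilon_Y = 0 \end{cases}
\end{aligned}
$$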
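This model can also be evaluated by exact enumeration. The sketch below is illustrative only and uses hypothetical function names. It assumes the causation coefficient is the Pearson correlation between an independently drawn intervention variable and the resulting Y, that the natural coefficient draws the intervention from the observational marginal of X, and that the maximum entropy coefficient draws it from a fair coin.

```python
from itertools import product
from math import sqrt

def pearson(joint):
    """Pearson correlation of a joint pmf over 0/1 pairs, given as {(x, y): prob}."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    exy = sum(p * x * y for (x, y), p in joint.items())
    return (exy - ex * ey) / sqrt((ex - ex * ex) * (ey - ey * ey))

def y_of(z, x, eps_y):
    """Structural equation for Y: Z if eps_y = 1, otherwise X."""
    return z if eps_y == 1 else x

def observational_joint():
    """Joint pmf of (X, Y) with all three noise terms fair coins."""
    joint = {}
    for z, eps_x, eps_y in product((0, 1), repeat=3):
        x = z & eps_x
        key = (x, y_of(z, x, eps_y))
        joint[key] = joint.get(key, 0.0) + 0.125
    return joint

def interventional_joint(p_hat):
    """Joint pmf of (X_hat, Y) when X is set to 1 with probability p_hat."""
    joint = {}
    for z, eps_y, x_hat in product((0, 1), repeat=3):
        weight = 0.25 * (p_hat if x_hat == 1 else 1 - p_hat)
        key = (x_hat, y_of(z, x_hat, eps_y))
        joint[key] = joint.get(key, 0.0) + weight
    return joint

obs = observational_joint()
p_x1 = sum(p for (x, _), p in obs.items() if x == 1)  # observational P(X = 1)

rho = pearson(obs)
gamma_natural = pearson(interventional_joint(p_x1))
gamma_maxent = pearson(interventional_joint(0.5))
print(rho, gamma_natural, gamma_maxent)
```

Under these assumptions the correlation exceeds both causation coefficients, which is the positive confounding bias described above.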

## Typical relationship between correlation and causation

Common intuition suggests that correlation is closely related to causation. However, the models in the previous section act as a constructive proof that the sign of the correlation coefficient provides no guarantees about the true causal effect. Some insight on this apparent discrepancy can be found by considering the following set of linear probabilistic causal models, parameterized by :

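$$
\begin{aligned}
Z &= \epsilon_Z \\
X &= \alpha_Z Z + \epsilon_X \\
Y &= \beta_X X + \beta_Z Z + \epsilon_Y
\end{aligned}
$$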

Since these models are linear and covariance is bilinear, the population correlation coefficient can be calculated analytically, regardless of the underlying distribution of the error terms:

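$$
\rho_{X,Y} = \frac{\beta_X \sigma^2_{\epsilon_X} + (\alpha_Z^2 \beta_X + \alpha_Z \beta_Z)\,\sigma^2_{\epsilon_Z}}{\sqrt{\left(\sigma^2_{\epsilon_X} + \alpha_Z^2 \sigma^2_{\epsilon_Z}\right)\left(\beta_X^2 \sigma^2_{\epsilon_X} + \sigma^2_{\epsilon_Y} + (\alpha_Z \beta_X + \beta_Z)^2 \sigma^2_{\epsilon_Z}\right)}}
$$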

The natural causation coefficient can also be calculated directly from the definitions of the causation coefficient and causal effect:

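$$
\gamma_{X \to Y} = \frac{\beta_X \sigma^2_{\epsilon_X} + \alpha_Z^2 \beta_X \sigma^2_{\epsilon_Z}}{\sqrt{\left(\sigma^2_{\epsilon_X} + \alpha_Z^2 \sigma^2_{\epsilon_Z}\right)\left(\beta_X^2 \sigma^2_{\epsilon_X} + \sigma^2_{\epsilon_Y} + (\alpha_Z^2 \beta_X^2 + \beta_Z^2)\,\sigma^2_{\epsilon_Z}\right)}}
$$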

The typical relationship between correlation and causation can be analyzed by constructing a probability distribution for the parameters of the linear model. have support over the entire real line; have support over . Given mean and variance , the maximum entropy distributions are and , respectively. Assuming jointly independent distributions over the parameters, it is straightforward to randomly sample models and compute their correlation and causation coefficients. Plotting against yields a graph where each point represents a single linear probabilistic causal model. The (smoothed) result of plotting such a graph is shown in Figure 5.

Figure 5: Causation vs correlation coefficient (kernel density estimation with $10^6$ samples). Darker shading indicates higher density of models. The curves at the top and right of the graph are the marginal densities of ρ and γ.
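The sampling procedure can be sketched in a few lines. This is an illustration under stated assumptions rather than a reproduction of the analysis: the coefficients are drawn from a standard normal distribution and the noise variances from an exponential distribution with mean 1, both chosen here for illustration, and the sample size is kept small. The function names are hypothetical.

```python
import random
from math import sqrt

def rho(alpha_z, beta_x, beta_z, var_x, var_y, var_z):
    """Population correlation of X and Y in the linear model."""
    var_big_x = var_x + alpha_z ** 2 * var_z
    cov = beta_x * var_big_x + alpha_z * beta_z * var_z
    var_big_y = beta_x ** 2 * var_x + var_y + (alpha_z * beta_x + beta_z) ** 2 * var_z
    return cov / sqrt(var_big_x * var_big_y)

def gamma(alpha_z, beta_x, beta_z, var_x, var_y, var_z):
    """Natural causation coefficient: X is replaced by an independent copy."""
    var_big_x = var_x + alpha_z ** 2 * var_z
    cov = beta_x * var_big_x
    var_big_y = beta_x ** 2 * var_big_x + var_y + beta_z ** 2 * var_z
    return cov / sqrt(var_big_x * var_big_y)

def sample_fractions(n, seed=0):
    """Fraction of sampled models in each correlation/causation relationship."""
    rng = random.Random(seed)
    counts = {"inverse": 0, "genuine, negative bias": 0, "genuine, positive bias": 0}
    for _ in range(n):
        # Assumed priors: N(0, 1) coefficients, Exp(mean 1) variances.
        params = [rng.gauss(0, 1) for _ in range(3)]
        params += [rng.expovariate(1.0) for _ in range(3)]
        r, g = rho(*params), gamma(*params)
        if r * g < 0:
            counts["inverse"] += 1
        elif abs(r) < abs(g):
            counts["genuine, negative bias"] += 1
        elif abs(r) > abs(g):
            counts["genuine, positive bias"] += 1
    return {name: count / n for name, count in counts.items()}

print(sample_fractions(10_000))
```

The resulting fractions depend on the assumed priors and on the seed, so they should be read as qualitative only.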

In the graph of γ vs ρ, the upper left and lower right quadrants contain inverse causation models and the other two quadrants contain genuine causation models. Except for the origin, which corresponds to an invariant and independent model, the horizontal line, γ = 0, contains common causation models and the vertical line, ρ = 0, contains unfaithful models.

With maximum entropy distributions over the parameters, the probability of a random linear model exhibiting inverse causation , genuine causation with negative bias , and genuine causation with positive bias . This matches closely with common intuition. Typically, a strong positive correlation indicates a strong positive causal effect – this can be seen in the upper right quadrant, with a high density of models. Inverse causation is possible, although much less likely, and unfaithful models have measure , which accounts for why they are often considered counterintuitive. However, this is an analysis of population, not sample, coefficients, and the measure of nearly unfaithful models is nonzero. (Formally, the measure of -strong-unfaithful distributions converges to 1 exponentially in the number of nodes uhler2013.) In practice, this means that unfaithfulness cannot be dismissed as irrelevant. Although the population correlation will be zero in such models, the sample correlation will often be indistinguishable from zero, despite the possibility of a nontrivial causal effect.

The choice of a maximum entropy distribution in this analysis is based on the principle of maximum entropy, which states that the appropriate prior distribution, given the absence of any other information, is the maximum entropy distribution jaynes2003. However, the choice of linear models and the particular parameterization remain somewhat subjective. The estimated share of models exhibiting inverse causation should therefore be seen as qualitatively consistent with the intuition that such situations are rare, but not as quantitatively significant.

## Estimating the causal coefficient

Randomization of an independent variable effectively cuts all incoming edges to that node in a causal diagram, removing potential confounding effects. Reporting a correlation coefficient, in the context of a randomized controlled trial, can be viewed as reporting an estimate of the causation coefficient, with the distribution of interventions, , equal to the distribution of interventions that were performed in the experiment.

When randomization is not available, it may still be possible to calculate a sample causal coefficient by estimating . Presented here is a simple example, using data from a study on the treatment of kidney stones charig1986. More advanced techniques for identifying are given in shpitser2008.

The subgroups () in Table 15 refer to kidney stone size. Group 1 is small kidney stones; group 2 is large kidney stones. This study can be modeled with binary treatment () and response () variables, with the decision to perform percutaneous nephrolithotomy (PCNL) as and surgery as .

The naive model is that there is no confounding (Figure 6a). In such a case, the population natural causation coefficient equals the population correlation coefficient and therefore the sample correlation coefficient is equal to the sample causation coefficient . (Statisticians normally denote an estimate of a parameter, , with a hat symbol, . This convention is not followed here, since the hat symbol has been used to indicate causal effect.) Given the data, . The cohorts are equal in size, so this is also an estimate of the maximum entropy causation coefficient.

Figure 6: Some of the possible causal diagrams for modeling kidney stone treatment

Hypothetically, if the subgroups were postoperative infection and the treatment affected the likelihood of postoperative infection, which, in turn, affected recovery (Figure 6b), the natural causation coefficient would still equal the correlation coefficient – adjusting for subgroups would still be incorrect. This is an immediate consequence of the do-calculus shpitser2008.

However, the correct set of causal assumptions is that kidney stone size affects treatment and recovery (Figure 6c) – doctors took kidney stone size into account when making the decision whether or not to send a patient to surgery. Correctly estimating the causation coefficient in this model can be done with an adjustment for direct causes, . With respect to the correct causal diagram, estimating the causation coefficient yields . This is a case of inverse causation and the best treatment option for patients is the opposite of what a naive interpretation of correlation would suggest.
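The adjustment can be sketched numerically. The counts below are the figures commonly reproduced from the kidney stone study and are an assumption here, since the table itself is not shown in this section; they should be checked against it before relying on any result. The encoding of PCNL as X = 1 and the function names are also assumptions. The adjusted effect is computed as the subgroup-weighted average of P(y | x, z) over P(z).

```python
from math import sqrt

# (successes, patients) per stone size and treatment; assumed counts, see lead-in.
COUNTS = {
    "small": {"surgery": (81, 87), "pcnl": (234, 270)},
    "large": {"surgery": (192, 263), "pcnl": (55, 80)},
}

def bernoulli_corr(p_x1, p_y1_given_x1, p_y1_given_x0):
    """Pearson correlation of two 0/1 variables from P(x=1) and P(y=1 | x)."""
    p_y1 = p_x1 * p_y1_given_x1 + (1 - p_x1) * p_y1_given_x0
    cov = p_x1 * p_y1_given_x1 - p_x1 * p_y1
    return cov / sqrt(p_x1 * (1 - p_x1) * p_y1 * (1 - p_y1))

def pooled_rate(treatment):
    """Observed success rate P(y=1 | x), ignoring stone size."""
    wins = sum(COUNTS[z][treatment][0] for z in COUNTS)
    total = sum(COUNTS[z][treatment][1] for z in COUNTS)
    return wins / total

def adjusted_rate(treatment):
    """P(y=1 | do(x)) by adjusting for stone size: sum_z P(y=1 | x, z) P(z)."""
    n_all = sum(n for z in COUNTS for _, n in COUNTS[z].values())
    rate = 0.0
    for z in COUNTS:
        p_z = sum(n for _, n in COUNTS[z].values()) / n_all
        wins, n = COUNTS[z][treatment]
        rate += (wins / n) * p_z
    return rate

n_pcnl = sum(COUNTS[z]["pcnl"][1] for z in COUNTS)
n_surgery = sum(COUNTS[z]["surgery"][1] for z in COUNTS)
p_x1 = n_pcnl / (n_pcnl + n_surgery)  # X = 1 means PCNL (assumed encoding)

rho_naive = bernoulli_corr(p_x1, pooled_rate("pcnl"), pooled_rate("surgery"))
gamma_adjusted = bernoulli_corr(p_x1, adjusted_rate("pcnl"), adjusted_rate("surgery"))
print(rho_naive, gamma_adjusted)
```

With these counts the naive coefficient is positive and the adjusted coefficient is negative, which is the sign reversal that defines inverse causation.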

The reversal effect seen here is well known as Simpson’s paradox, but requires causal knowledge to resolve correctly pearl2014. Adjusting for the wrong variables will produce incorrect estimates of causal effect.

## Conclusions

There are many different ways in which positive correlation can be misleading with respect to causation. Population distributions may exhibit common causation () or inverse causation (). Sampling error introduces the possibility of coincidental correlation (). Unfaithfulness () implies that the absence of correlation cannot guarantee the absence of causation. And even if there is no confounding (), human error introduces the possibility of reverse causation ().

Despite the warning that, “Correlation is not causation”, the two are easy to conflate because of the high likelihood that a random model will have . However, there remains a nontrivial possibility of encountering other correlation/causation relationships such as inverse causation, a problem that no amount of additional data sampling will mitigate. There is simply no substitute for accurate causal assumptions.

By emphasizing the population/sample and statistical/causal distinctions and explicitly naming the different ways in which correlation can relate to causation, it is hoped that these effects will become easier to recognize in practice.

## Appendix

Theorem.  For Bernoulli , , if and only if is invariant to .

Proof. Consider the definition of average treatment effect, . Average treatment effect is zero if and only if . Since the support of a Bernoulli random variable is , this is equivalent to invariant to . Since has the same sign as the average treatment effect, if and only if is invariant to .

Theorem.  For Bernoulli , :

Proof.  Consider the numerator of . For Bernoulli random variables:

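$$
\begin{aligned}
&P(y=1 \mid do(x=1))\,\hat{P}(x=1) - \hat{P}(x=1)\Big(P(y=1 \mid do(x=1))\,\hat{P}(x=1) + P(y=1 \mid do(x=0))\,\hat{P}(x=0)\Big) \\
&\quad= \hat{P}(x=1)\Big(P(y=1 \mid do(x=1)) - \hat{P}(x=1)\,P(y=1 \mid do(x=1)) - \hat{P}(x=0)\,P(y=1 \mid do(x=0))\Big) \\
&\quad= \hat{P}(x=1)\Big(P(y=1 \mid do(x=1)) - \hat{P}(x=1)\,P(y=1 \mid do(x=1)) - \big(1 - \hat{P}(x=1)\big)\,P(y=1 \mid do(x=0))\Big) \\
&\quad= \hat{P}(x=1)\Big(P(y=1 \mid do(x=1)) - P(y=1 \mid do(x=0)) - \hat{P}(x=1)\big(P(y=1 \mid do(x=1)) - P(y=1 \mid do(x=0))\big)\Big) \\
&\quad= \big(P(y=1 \mid do(x=1)) - P(y=1 \mid do(x=0))\big)\,\hat{P}(x=1)\big(1 - \hat{P}(x=1)\big) \\
&\quad= \mathrm{ATE}(X \to Y)\,\mathrm{Var}[\hat{X}]
\end{aligned}
$$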

Therefore, .

Theorem.  In all recursive probabilistic causal models, at least one of is zero.

Proof.  Assume without loss of generality that is nonzero. This implies that is not invariant to . Since is a nonconstant function of , must be an ancestor of in the associated causal diagram. Consider the submodel that results from . Since is an ancestor of in the original model, and must be d-separated in . Therefore, is invariant to and is zero.

Theorem.  If and are unfaithful in causal model , then the observational distribution and causal diagram associated with violate the faithfulness condition.

Proof.  Assume without loss of generality that is not invariant to . Therefore, is an ancestor of in the associated causal diagram and and are d-connected pearl2009. However, and are independent, an independence relation not entailed by the Markov condition. Therefore the observational distribution is not faithful to .

## Acknowledgements

Thanks to James Reggia, Brendan Good, Donald Gregorich and Richard Bruns for their comments on drafts of this paper.
