Causal Discovery Using Proxy Variables

by Mateo Rojas-Carulla, et al.

Discovering causal relations is fundamental to reasoning and intelligence. In particular, observational causal discovery algorithms estimate the cause-effect relation between two random entities X and Y, given n samples from P(X,Y). In this paper, we develop a framework to estimate the cause-effect relation between two static entities x and y: for instance, an art masterpiece x and its fraudulent copy y. To this end, we introduce the notion of proxy variables, which allow the construction of a pair of random entities (A,B) from the pair of static entities (x,y). Then, estimating the cause-effect relation between A and B using an observational causal discovery algorithm leads to an estimation of the cause-effect relation between x and y. For example, our framework detects the causal relation between unprocessed photographs and their modifications, and orders in time a set of shuffled frames from a video. As our main case study, we introduce a human-elicited dataset of 10,000 causally-linked pairs of words from natural language. Our methods discover 75% of these causal relations. Finally, we discuss the role of proxy variables in machine learning, as a general tool to incorporate static knowledge into prediction tasks.




1 Introduction

Discovering causal relations is a central task in science (Pearl, 2009; Beebee et al., 2009), and empowers humans to explain their experiences, predict the outcome of their interventions, wonder about what could have happened but never did, and plan which decisions will shape the future to their maximum benefit. Causal discovery is also essential to the development of common sense (Kuipers, 1984; Waldrop, 1987). In machine learning, it has been argued that causal discovery algorithms are a necessary step towards machine reasoning (Bottou, 2014; Bottou et al., 2013; Lopez-Paz, 2016) and artificial intelligence (Lake et al., 2016).

The gold standard for discovering causal relations is to perform active interventions (also called experiments) in the system under study (Pearl, 2009). However, interventions are in many situations expensive, unethical, or impossible to realize. In all of these situations, there is a prime need to discover and reason about causality purely from observation. Over the last decade, the state of the art in observational causal discovery has matured into a wide array of algorithms (Shimizu et al., 2006; Hoyer et al., 2009; Daniusis et al., 2012; Peters et al., 2014; Mooij et al., 2016; Lopez-Paz et al., 2015; Lopez-Paz, 2016). All of these algorithms estimate the causal relations between a set of random variables (X_1, …, X_d) by estimating various asymmetries in their joint distribution P(X_1, …, X_d). In the interest of simplicity, this paper considers the problem of discovering the causal relation between two variables X and Y, given n samples from P(X,Y).

The methods mentioned above estimate the causal relation between two random entities X and Y, but often we are interested instead in two static entities x and y. These are a pair of single objects for which it is not possible to define a probability distribution directly. Examples of such static entities include one art masterpiece and its fraudulent copy, one translated document and its original version, or one pair of causally linked words in natural language, such as "virus" and "death". Looking into the distant future, an algorithm able to discover the causal structure between static entities in natural language could read through medical journals and discover the causal mechanisms behind a new cure for a specific disease: the very goal of the ongoing $45 million Big Mechanism DARPA initiative (Cohen, 2015). Or, if we were able to establish the causal relation between two arbitrary natural language statements, we could tackle general-AI tasks such as the Winograd schema challenge (Levesque et al., 2012), which remain out of reach for current algorithms. These and many more are situations where causal discovery between static entities is in demand.

Our Contributions

First, we introduce the framework of proxy variables to estimate the causal relation between static entities (Section 3).

Second, we apply our framework to the task of inferring the cause-effect relation between pairs of images (Section 4). In particular, our methods are able to infer the causal relation between an image and its stylistic modification in the majority of the cases, and they can recover the correct ordering of a set of shuffled video frames (Section 4.2).

Third, we apply our framework to discover the cause-effect relation between pairs of words in natural language (Section 5). To this end, we introduce a novel dataset of 10,000 human-elicited pairs of words with known causal relation (Section 5.2). Our methods are able to recover 75% of the cause-effect relations (such as "accident → injury" or "trial → sentence") in this challenging task (Section 5.4).

Fourth, we discuss the role of proxy variables as a tool to incorporate external knowledge, as provided by static entities, into general prediction problems (Section 6).

All our code and data are available at anonymous.

We start the exposition by introducing the basic language of observational causal discovery, as well as motivating its role in machine learning.

2 Causal Discovery in Machine Learning

The goal of observational causal discovery is to reveal the cause-effect relation between two random variables X and Y, given n samples from P(X,Y). In particular, we say that "X causes Y" if there exists a mechanism f that transforms the values taken by the cause X into the values taken by the effect Y, up to the effects of some random noise N. Mathematically, we write Y ← f(X, N). Such an equation highlights an asymmetric assignment rather than a symmetric equality. If we were to intervene and change the value of the cause X, then a change in the value of the effect Y would follow. On the contrary, if we were to manipulate the value of the effect Y, a change in the cause X would not follow.
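The asymmetry of this assignment can be checked in a small simulation. The sketch below uses an arbitrary mechanism Y ← X² + N (an illustrative choice, not one from the paper) and compares an intervention on the cause against an intervention on the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(do_x=None, do_y=None, n=10000):
    """Simulate the assignment Y <- f(X) + N under optional interventions."""
    x = rng.normal(size=n) if do_x is None else np.full(n, float(do_x))
    y = x ** 2 + 0.1 * rng.normal(size=n)    # Y is assigned from X
    if do_y is not None:
        y = np.full(n, float(do_y))          # overriding Y does not touch X
    return x, y

x_obs, y_obs = sample()
_, y_do = sample(do_x=2.0)       # intervening on the cause shifts the effect
x_do, _ = sample(do_y=2.0)       # intervening on the effect leaves the cause alone

print(abs(y_do.mean() - y_obs.mean()) > 1.0)   # True: Y responded
print(abs(x_do.mean() - x_obs.mean()) < 0.1)   # True: X did not
```

A symmetric equality would not distinguish the two interventions; the assignment does.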

When two random variables share a causal relation, they often become statistically dependent. However, when two random variables are statistically dependent, they do not necessarily share a causal relation. This is at the origin of the famous warning “dependence does not imply causality”. This relation between dependence and causality was formalized by Reichenbach (1956) into the following principle.

Principle 1 (Principle of common cause).

If two random variables X and Y are statistically dependent (X ⊥̸ Y), then one of the following causal explanations must hold:

  • X causes Y (write X → Y), or

  • Y causes X (write Y → X), or

  • there exists a random variable Z that is the common cause of both X and Y (write X ← Z → Y).

In the third case, X and Y are conditionally independent given Z (write X ⫫ Y | Z).

Figure 1: Example of an Additive Noise Model (ANM).

In machine learning, these three types of statistical dependence are exploited without distinction, as dependence is sufficient to perform optimal predictions about identically and independently distributed (iid) data (Schölkopf et al., 2012). However, we argue that taking the Principle of common cause into account would have far-reaching benefits in non-iid machine learning. For example, assume that we are interested in predicting the values of a target variable Y, given the values taken by two features (X_1, X_2). Then, understanding the causal structure underlying (X_1, X_2, Y) brings two benefits.

First, interpretability. Explanatory questions such as "Why does Y = y when X_1 = x_1?", and counterfactual questions such as "What value would Y have taken, had X_1 = x_1'?" cannot be answered using statistics alone, since their answers depend on the particular causal structure underlying the data.

Second, robustness. Predictors that estimate the values taken by a target variable given only its direct causes are robust with respect to distributional shifts on their inputs. For example, let X_1 ∼ P(X_1), Y ← f(X_1, N), and X_2 ← g(Y, M). Then, the predictor E[Y | X_1] is invariant to changes in the joint distribution P(X_1, X_2) as long as the causal mechanism f does not change. However, the predictor E[Y | X_2] can vary wildly even if the causal mechanism f (the only one involved in computing Y) does not change (Peters et al., 2016; Rojas-Carulla et al., 2015).

The previous two points apply to the common "non-iid" situations where we have access to data drawn from some distribution P, but we are interested in some different but related distribution Q. One natural way to phrase and leverage the similarities between P and Q is in terms of shared causal structures (Peters, 2012; Lopez-Paz, 2016).

While it is indeed an attractive endeavor, discovering the causal relation between two random variables purely from observation is impossible when considered in full generality. Indeed, any of the three causal structures outlined in Principle 1 could explain the observed dependence between two random variables. However, one can in many cases impose assumptions that render the causal relation between two variables identifiable from their joint distribution. For example, consider the family of Additive Noise Models, or ANM (Hoyer et al., 2009; Peters et al., 2014; Mooij et al., 2016). In an ANM, one assumes that the causal model has the form Y ← f(X) + N, where N ⫫ X. It turns out that, under some assumptions, the reverse ANM X ← g(Y) + Ñ will not satisfy the independence assumption Ñ ⫫ Y (Fig. 1). The statistical dependence shared by the cause and the noise in the wrong causal direction is the footprint that renders the causal relation between X and Y identifiable from statistics alone.
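This ANM footprint can be verified numerically. The sketch below assumes a cubic mechanism with uniform additive noise, fits polynomial regressions in both directions, and scores residual dependence with a plain distance-correlation estimator; all of these choices are illustrative, not the paper's:

```python
import numpy as np

def dcorr(x, y):
    """Sample distance correlation: near zero when x and y are independent."""
    def centered(z):
        d = np.abs(z[:, None] - z[None, :])
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()   # squared sample distance covariance (nonnegative)
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = x ** 3 + rng.uniform(-1, 1, 500)   # ANM: Y <- f(X) + N, with N independent of X

def residual_dependence(cause, effect, degree=5):
    """Regress effect on cause, then measure how dependent the residuals remain."""
    fitted = np.polyval(np.polyfit(cause, effect, degree), cause)
    return dcorr(cause, effect - fitted)

forward = residual_dependence(x, y)    # residuals look like N: weak dependence
backward = residual_dependence(y, x)   # residuals stay entangled with Y
print(forward < backward)              # the footprint points from X to Y
```

The direction with the more independent residuals is declared the causal one; practical ANM implementations use stronger regressors and independence tests (e.g., HSIC) than this sketch.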

In situations where the ANM assumption is not satisfied (e.g., multiplicative or heteroskedastic noise), one may prefer learning-based causal discovery tools, such as the Randomized Causation Coefficient, or RCC (Lopez-Paz et al., 2015). RCC assumes access to a causal dataset {(S_i, l_i)}, where each S_i = {(x_{ij}, y_{ij})} is a bag of examples drawn from some distribution P(X_i, Y_i), l_i = +1 if X_i → Y_i, and l_i = −1 if X_i ← Y_i. By featurizing each of the training distribution samples S_i using kernel mean embeddings (Smola et al., 2007), RCC learns a binary classifier on the featurized data to reveal the causal footprints necessary to classify new pairs of random variables.

However, both ANM- and RCC-based methods need samples from P(X,Y) to classify the causal relation between the random variables X and Y. Therefore, these methods are not suited to infer the causal relation between static entities such as, for instance, one painting and its fraudulent copy. In the following section, we propose a framework that extends state-of-the-art causal discovery methods to this important case.

3 The Main Concepts: Static Entities, Proxy Variables and Proxy Projections

In the following, we consider two static entities x and y in some space 𝒳 that satisfy the relation "x causes y". Formally, this causal relation manifests the existence of a (possibly noisy) mechanism f such that the value y is computed as y ← f(x). This asymmetric assignment guarantees that changes in the static cause x would lead to changes in the static effect y, but the converse would not hold.

As mentioned previously, traditional causal discovery methods cannot be directly applied to static entities. In order to discover the causal relation between the pair of static entities x and y, we introduce two main concepts: proxy variables W, and proxy projections π.

First, a proxy random variable W is a random variable taking values in some set 𝒲, which can be understood as a random source of information related to both x and y. This definition is deliberately vague, and will be illustrated through several examples in the following sections.

Second, a proxy projection is a function π : 𝒲 × 𝒳 → ℝ. Using a proxy variable and projection, we can construct the pair of scalar random variables A = π(W, x) and B = π(W, y). A proxy variable and projection are causal if the pair of random entities (A, B) shares the same causal footprint as the pair of static entities (x, y).¹

¹The concept of causal footprint is relative to our assumptions. For instance, when assuming an ANM B ← f(A) + N, the causal footprint is the statistical independence between A and N.

If the proxy variable and projection are causal, we may estimate the cause-effect relation between the static entities x and y in three steps. First, draw w_1, …, w_n from P(W). Second, use an observational causal discovery algorithm to estimate the cause-effect relation between A and B, given the sample {(a_i, b_i)}_{i=1}^n = {(π(w_i, x), π(w_i, y))}_{i=1}^n. Third, conclude "x causes y" if A → B, or "y causes x" if B → A. This process is summarized in Figure 2.

Note that the causal relation A → B does not imply the causal relation x → y in the interventional sense: even if A is a copy of x and B is a copy of y, intervening on A will not change B! We only care here about the presence of statistically observable causal footprints between the variables. Furthermore, our framework extends readily to the case where x and y live in different modalities (say, x is an image and y is a piece of audio describing the image). In this case, all we need is a proxy variable and a pair of proxy projections with the appropriate structure. For simplicity, and throughout this paper, we choose our proxy variables and projections based on domain knowledge. Learning proxy variables and projections from data is an exciting area left for future research.
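The three-step procedure can be sketched as a generic routine; `draw_proxy`, `project`, and `causal_direction` are placeholder names, with `causal_direction` standing in for any bivariate observational causal discovery algorithm from Section 2:

```python
import numpy as np

def discover_static(x, y, draw_proxy, project, causal_direction, n=1000):
    """Three-step causal discovery between two static entities x and y.

    draw_proxy()           -- samples one realization w of the proxy variable W
    project(w, entity)     -- proxy projection pi, mapping (w, entity) to a scalar
    causal_direction(a, b) -- any bivariate observational causal discovery
                              algorithm, returning +1 for A -> B and -1 for B -> A
    """
    ws = [draw_proxy() for _ in range(n)]        # step 1: draw w_1, ..., w_n
    a = np.array([project(w, x) for w in ws])    # A = pi(W, x)
    b = np.array([project(w, y) for w in ws])    # B = pi(W, y)
    return "x causes y" if causal_direction(a, b) > 0 else "y causes x"
```

For instance, `causal_direction` could be an ANM residual-independence test or a pretrained RCC classifier.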

Figure 2: A pair of static entities (x, y) shares a causal relation of interest (thick blue arrow). A proxy variable W, together with a proxy projection π, produces the random entities (A, B), which share the causal footprint of (x, y), denoted by the dotted blue arrow.

4 Causal Discovery Using Proxies in Images

Consider the two images shown in Figure 3. The image on the left is an unprocessed photograph x of the Tübingen Neckarfront, while the one on the right is the same photograph after being stylized with the algorithm of Gatys et al. (2016), denoted y. From a causal point of view, the unprocessed image x is the cause of the stylized image y. How can we leverage the ideas from Section 3 to recover this causal relation?

The following is one possible solution. Assume that the two images are represented by pixel intensity vectors x ∈ ℝ^d and y ∈ ℝ^d, respectively. For i = 1, …, n:

  • Draw a mask-image w_i ∈ {0, 1}^d, which contains ones inside a patch at random coordinates, and zeros elsewhere.

  • Compute a_i = ⟨w_i, x⟩, and b_i = ⟨w_i, y⟩.

Figure 3: Sampling random patches at paired locations produces a proxy variable to discover the causal relation between two images.

This process returns a sample {(a_i, b_i)}_{i=1}^n drawn from P(A, B), the joint distribution of the two scalar random variables (A, B). The conversion from static entities to random variables is obtained by virtue of i) the randomness generated by the proxy variable W, which in this particular case is incarnated as random masks, and ii) a causal projection π, here a simple dot product.

At this point, if the causal footprint between the random entities (A, B) resembles the causal footprint between x and y, we can apply a regular causal discovery algorithm to {(a_i, b_i)}_{i=1}^n to estimate the causal relation between x and y.
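A minimal sketch of the mask proxy, assuming synthetic images and a hypothetical monotone "stylization" (the actual experiments use real photographs and the algorithm of Gatys et al. (2016)):

```python
import numpy as np

rng = np.random.default_rng(0)
side, k, n = 64, 8, 500      # image side, patch side, number of proxy draws

x = rng.uniform(size=(side, side))                    # stand-in original image
y = np.tanh(3 * x) + 0.05 * rng.normal(size=x.shape)  # its "stylized" version

def draw_mask():
    """Mask image: ones inside a k x k patch at random coordinates."""
    m = np.zeros((side, side))
    i, j = rng.integers(0, side - k, size=2)
    m[i:i + k, j:j + k] = 1.0
    return m

# a_i = <w_i, x>, b_i = <w_i, y>: one 2-d point per random patch
ab = np.array([((w * x).sum(), (w * y).sum())
               for w in (draw_mask() for _ in range(n))])
print(ab.shape)   # (500, 2): a sample from P(A, B)
```

The resulting sample `ab` can then be handed to any bivariate causal discovery algorithm from Section 2.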

4.1 Towards a Theoretical Understanding

The intuition behind causal discovery using proxy variables is that, although we observe x and y as static entities, these are underlyingly complex, high-dimensional, structured objects that carry rich information about their causal relation. The proxy variable introduces randomness to sample different views of the high-dimensional causal structures, and the proxy projection summarizes those views into scalar values. But why should the causal footprint of these summaries cue the causal relation between x and y?

We formalize this question for the specific case of stylized images, where x is the original image and y its stylized version. Let the causal mechanism mapping x to y operate locally. More precisely, assume that each patch y_(j) in the stylized image is computed from the corresponding patch x_(j) in the original image, as described by the ANM:

    y_(j) ← f(x_(j)) + n_(j), where n_(j) ⫫ x_(j).

Then, the stylized image satisfies y ← f(x) + n, where n ⫫ x. For simplicity, assume that f(x) = σ(F x), where F is a matrix and σ acts element-wise. Then, let P(W) be a distribution over masks extracting random patches, and let a = ⟨w, x⟩ and b = ⟨w, y⟩, to obtain:

    b = ⟨w, σ(F x)⟩ + ⟨w, n⟩ = g(a) + m,

where m = ⟨w, n⟩, and where we assume that σ is such that ⟨w, σ(F x)⟩ = g(⟨w, x⟩) for all w. Since m ⫫ a, the pair (A, B) also follows an ANM. We leave the investigation of identifiability conditions for causal inference using proxy variables for future work.

4.2 Numerical Simulations

To illustrate the use of causal discovery using proxy variables in images, we conducted two small experiments. In these experiments, we extract square patches at random coordinates as masks, and use the Additive Noise Model (Hoyer et al., 2009) to estimate the causal relation between the constructed scalar random variables A and B.

First, we collected a set of unprocessed images together with their stylizations (including the one from Figure 3), made using the algorithm of Gatys et al. (2016). When applying causal discovery using proxy variables to this dataset, we correctly identify the direction of causation from the original image to its stylized version in the majority of the cases.

Figure 4: Causal discovery using proxy variables uncovers the causal time signal to reorder a shuffled sequence of video frames.

Second, we decomposed a video of drops of ink mixing with water into its n frames, shown in Figure 4. Using the same mask proxy variable as above, we construct an n × n matrix T such that T_{ij} = 1 if frame i causes frame j according to our method, and T_{ij} = 0 otherwise. Then, we consider T to be the adjacency matrix of the causal DAG describing the causal structure between the frames. By applying a topological sort to this graph, we were able to obtain the true ordering, unique among the n! possible orderings.
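The reordering step can be sketched with the standard library's topological sorter. Here the pairwise decision matrix T is simulated from a known ordering rather than produced by an actual causal test:

```python
from graphlib import TopologicalSorter

n = 5
true_order = [3, 0, 4, 1, 2]   # hypothetical ground-truth frame ordering
pos = {frame: i for i, frame in enumerate(true_order)}

# T[i][j] = 1 if the causal test decides "frame i causes frame j".
T = [[1 if pos[i] < pos[j] else 0 for j in range(n)] for i in range(n)]

# Interpret T as the adjacency matrix of a causal DAG: node j's
# predecessors are all frames i with T[i][j] = 1.
graph = {j: {i for i in range(n) if T[i][j]} for j in range(n)}
recovered = list(TopologicalSorter(graph).static_order())
print(recovered == true_order)   # True: the total order is uniquely recovered
```

With a complete and consistent T, the topological order is unique; in practice, errors of the causal test can introduce cycles that must be resolved before sorting.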

5 Causal Discovery Using Proxies in NLP

As our main case study, consider discovering the causal relation between pairs of words appearing in a large corpus of natural language. For instance, given the pair of words (virus, death), which represents our static entities x and y, together with a large corpus of natural language, we want to recover causal relations such as "virus → death", "sun → radiation", "trial → sentence", or "drugs → addiction".

This problem is extremely challenging for two reasons. First, word pairs are extremely varied in nature (compare "avocado causes guacamole" to "cat causes purr"), and some are very rare ("wrestler causes pin"). Second, the causal relation between two words can always be tweaked in context-specific ways. For instance, one can construct sentences where "virus causes death" (e.g., the virus led to a quick death), but also sentences where "death causes virus" (e.g., the multiple deaths in the area further spread the virus). We are hereby interested in the canonical causal relation between pairs of words, assumed by human subjects when specific contexts are not provided (see Section 5.2). Furthermore, our interest lies in discovering the causal relation between pairs of words without the use of language-specific knowledge or heuristics. On the contrary, we aim to discover such causal relations by using generic observational causal discovery methods, such as the ones described in Section 2.


In the following, Section 5.1 frames this problem in the language of causal discovery between static entities. Then, Section 5.2 introduces a novel, human-generated, human-validated dataset to test our methods. Section 5.3 reviews prior work on causal discovery in language. Finally, Section 5.4 presents experiments evaluating our methods.

5.1 Static Entities, Proxies, and Projections for NLP

In the language of causal discovery with proxies, a pair of words is a pair of static entities: (x, y). In order to discover the causal relation between x and y, we are in need of a proxy variable W, as introduced in Section 3. We will use a simple proxy: let W be a random word, where P(W = w) is the probability of the word w appearing in a sentence drawn at random from a large corpus of natural language.

Using the proxy W, we need to define the pair of random variables A and B in terms of a causal projection π. Once the projection π is defined, we can sample w_1, …, w_n ∼ P(W), construct a_i = π(w_i, x) and b_i = π(w_i, y), and apply a causal discovery algorithm to the sample {(a_i, b_i)}_{i=1}^n. Specifically, we estimate P(W) from a large corpus of natural language, and sample n words without replacement.²

²This is equivalent to sampling approximately the top 10,000 most frequent words in the corpus. Due to the extremely skewed nature of word frequency distributions (Baayen, 2001), sampling with replacement would produce a list of very frequent words such as a and the, sampled many times.

Throughout our experimental evaluation, we will use and compare different proxy projections π:

  • π(w, x) = ⟨v_w, v_x⟩, where v_w is the input word2vec representation (Mikolov et al., 2013) of the word w. This dot product measures the similarity in meaning between the pair of words (w, x).

  • π(w, x) = ⟨v_w, v'_x⟩, where v'_x is the output word2vec representation of the word x. This dot product is an unnormalized estimate of a conditional probability (Melamud et al., 2015).

  • π(w, x) = ⟨v'_w, v_x⟩, an unnormalized estimate of the reverse conditional probability.

  • π(w, x) = p(w | x), where the pmf is directly estimated from counting within-sentence co-occurrences in the corpus.

  • π(w, x), similar to the one above, but computed only over sentences where one word precedes the other.

  • π(w, x) = p(w, x) / (p(w) p(x)), where the pmfs p(w, x), p(w), and p(x) are estimated by counting words and (sentence-based) co-occurrences in the corpus. The log of this quantity is known as pointwise mutual information, or PMI (Church & Hanks, 1990).

  • π(w, x), similar to the one above, but computed only over sentences where one word precedes the other.

Applying a causal projection π to our sample from the proxy W, we construct the n-vector

    a = (π(w_1, x), …, π(w_n, x)),    (1)

and similarly the n-vector b for the word y. In particular, we use the skip-gram model implementation of fastText (Bojanowski et al., 2016) to compute the word2vec representations.
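On a toy corpus, the estimation of P(W), the sampling of the vocabulary without replacement, and a count-based projection can be sketched as follows (the corpus, vocabulary size, and projection choice are illustrative):

```python
import numpy as np
from collections import Counter
from itertools import combinations

corpus = [
    "the virus spread and caused death",
    "the trial ended with a sentence",
    "death followed the virus outbreak",
    "the sentence closed the trial",
]
sentences = [s.split() for s in corpus]

# Estimate P(W): probability of each word appearing in a random sentence.
freq = Counter(w for s in sentences for w in set(s))
words = list(freq)
p = np.array([freq[w] for w in words], dtype=float)
p /= p.sum()

# Sample a small vocabulary without replacement, favoring frequent words.
rng = np.random.default_rng(0)
vocab = rng.choice(words, size=5, replace=False, p=p)

# Count-based projection: within-sentence co-occurrence counts.
cooc = Counter()
for s in sentences:
    for u, v in combinations(set(s), 2):
        cooc[(u, v)] += 1
        cooc[(v, u)] += 1

def pi_counts(w, x):
    return cooc[(w, x)]

# The n-vectors fed to the causal discovery step, as in (1).
a = np.array([pi_counts(w, "virus") for w in vocab])
b = np.array([pi_counts(w, "death") for w in vocab])
```

A PMI-style projection would additionally normalize the co-occurrence counts by the marginal word counts.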

5.2 A Real-World Dataset of Cause-Effect Words

We introduce a human-elicited, human-filtered dataset of 10,000 pairs of words with a known causal relation. This dataset was constructed in two steps:

  • We asked workers from Amazon Mechanical Turk to create pairs of words linked by a causal relation. We provided the turks with examples of words with a clear causal link (such as “sun causes radiation”) and examples of related words not sharing a causal relation (such as “knife” and “fork”). For details, see Appendix A.

  • Each of the pairs collected in the previous step was randomly shuffled and submitted to a disjoint set of turks, none of whom had created any of the word pairs. Each turk was required to classify the pair of words (x, y) as "x causes y", "y causes x", or "x and y do not share a causal relation". For more details, see Appendix B.

This procedure resulted in a dataset of 10,000 causal word pairs (x, y), each accompanied by three numbers: the number of turks who voted "x causes y", the number of turks who voted "y causes x", and the number of turks who voted "x and y do not share a causal relation".

5.3 Causal Relation Discovery in NLP

The NLP community has devoted much attention to the problem of identifying the semantic relation holding between two words, with causality as a special case. Girju et al. (2009) discuss the results of the large shared task on relation classification that they organized (their benchmark included only 220 cause-effect examples). The task required recognizing relations in context, but, as discussed by the authors, most contexts display the default relation we are after here (e.g., "The mutant virus gave him a severe flu" instantiates the default relation in which virus is the cause, flu is the effect). All participating systems used extra resources, such as ontologies and syntactic parsing, on top of corpus data. They are thus outside the scope of the purely corpus-based methods we are considering here.

Most NLP work specifically focusing on the causality relation relies on informative linking patterns co-occurring with the target pairs (such as, most obviously, the conjunction because). These patterns are extracted and processed with sophisticated methods, involving annotation, ontologies, bootstrapping, and/or manual filtering (see, e.g., Blanco et al. 2008, Hashimoto et al. 2012, Radinsky et al. 2012, and references therein). We experimented with extracting linking patterns from our corpus, but, due to the relatively small size of the latter, results were extremely sparse (note that patterns can only be extracted from sentences in which both cause and effect words occur). More recent work started looking at causal chains of events as expressed in text (see Mirza & Tonelli 2016 and references therein). Applying our generic method to this task is a direction for future work.

A semantic relation that has received particular attention in NLP is that of entailment between words (dog entails animal). As causality is intuitively related to entailment, we will apply entailment detection methods below to cause-effect classification. Most lexical entailment methods rely on distributional representations of the words in the target pair. Traditionally, entailing pairs have been identified with unsupervised asymmetric similarity measures applied to distributed word representations (Geffet & Dagan, 2005; Kotlerman et al., 2010; Lenci & Benotto, 2012; Weeds et al., 2004). We will test one of these measures, namely Weeds Precision (WS). More recently, Santus et al. (2014) showed that the relative entropy of distributed vectors representing the words in a pair is an effective cue to which word is entailing the other, and we also look at entropy for our task. However, the most effective method to detect entailment is to apply a supervised classifier to the concatenation of the vectors representing the words in a pair (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014).

5.4 Experiments

We evaluate a variety of methods to discover the causal relation between two words appearing in a large corpus of natural language. We study methods that fall into three categories: baselines, distribution-based causal discovery methods, and feature-based supervised methods. These three families of methods consider an increasing amount of information about the task at hand, and therefore exhibit increasing classification accuracy.

All our computations are based on the full English Wikipedia, as post-processed by Matt Mahoney. We study those pairs of words from the dataset described in Section 5.2 that achieved a sufficient consensus among the turks about the causal direction. We use RCC to estimate the causal relation between pairs of random variables.

5.4.1 Baselines

These are a variety of unsupervised, heuristic baselines. Each baseline computes two scores, denoted by S_{x→y} and S_{y→x}, predicting "x causes y" if S_{x→y} > S_{y→x}, and "y causes x" otherwise. The baselines are:

  • frequency: S_{x→y} is the number of sentences where x appears in the corpus, and S_{y→x} is the number of sentences where y appears in the corpus.

  • precedence: considering only sentences from the corpus where both x and y appear, S_{x→y} is the number of sentences where x occurs before y, and S_{y→x} is the number of sentences where y occurs before x.

  • counts (entropy): S_{x→y} is the entropy of the count vector a, and S_{y→x} is the entropy of the count vector b, as defined in (1).

  • counts (WS): using the WS measure of Weeds & Weir (2003), S_{x→y} = WS(a, b), and S_{y→x} = WS(b, a).

  • prec-counts (entropy): S_{x→y} is the entropy of the precedence-based count vector a, and S_{y→x} is the entropy of the precedence-based count vector b from (1).

  • prec-counts (WS): analogous to the previous one, over the precedence-based count vectors.

The baselines PMI (entropy), PMI (WS), prec-PMI (entropy), and prec-PMI (WS) are analogous to the last four, but use the PMI-based projection instead of the count-based one. Figure 5 shows the performance of these baselines in blue.
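A sketch of the two score functions underlying the counts (entropy) and counts (WS) baselines, applied to hypothetical co-occurrence vectors; the formulas follow the common definitions of Shannon entropy and Weeds Precision, and the toy vectors are not from the dataset:

```python
import numpy as np

def entropy_score(v):
    """Shannon entropy of a count vector, normalized to a distribution."""
    p = v[v > 0] / v.sum()
    return float(-(p * np.log(p)).sum())

def weeds_precision(u, v):
    """WS(u -> v): fraction of u's mass on features that are active in v."""
    return float(u[v > 0].sum() / u.sum())

# Toy co-occurrence vectors over a shared vocabulary (hypothetical counts).
a = np.array([4.0, 3.0, 2.0, 1.0, 0.0])   # candidate cause
b = np.array([2.0, 0.0, 1.0, 0.0, 0.0])   # candidate effect

s_xy, s_yx = entropy_score(a), entropy_score(b)
print(s_xy > s_yx)              # the broader distribution has higher entropy
print(weeds_precision(a, b))    # 0.6
```

The decision rule (which score configuration indicates the cause) is what each baseline in the list above fixes.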

5.4.2 Distribution-based causal discovery methods

These methods implement our framework of causal discovery using proxy variables. They classify samples from a two-dimensional probability distribution as a whole. Recall that a vocabulary w_1, …, w_n drawn from the proxy W is available. Given word pairs {(x_j, y_j)}_{j=1}^m, this family of methods constructs a dataset {(S_j, l_j)}_{j=1}^m, where S_j = {(π(w_i, x_j), π(w_i, y_j))}_{i=1}^n, with l_j = +1 if x_j → y_j, and l_j = −1 otherwise. In short, this is a dataset of "scatterplots" annotated with binary labels. The j-th scatterplot contains n two-dimensional points, obtained by applying the causal projection to both x_j and y_j against the vocabulary words drawn from the proxy.

The samples S_j are computed using a deterministic projection of iid draws from the proxy, meaning that the points within each scatterplot are themselves iid. Therefore, we could permute the points inside each scatterplot without altering the results of these methods. In principle, we could also remove some of the points in a scatterplot without a significant drop in performance. In this sense, these methods search for causal footprints at the level of the two-dimensional distribution, and we therefore term them distribution-based causal discovery methods.

The methods in this family first split the dataset into a training set and a test set. Then, they train RCC on the training set, and evaluate its classification accuracy on the test set. This process is repeated ten times, splitting the pairs at random into disjoint training and test sets. Each method builds on top of one of the causal projections listed in Section 5.1. Figure 5 shows the test accuracy of these methods in green.
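The featurization step of RCC can be sketched with random Fourier features, a standard approximation to the Gaussian-kernel mean embedding (the authors' exact implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random Fourier features, drawn once and shared across all scatterplots.
n_features, gamma = 100, 1.0
W = rng.normal(scale=np.sqrt(2 * gamma), size=(2, n_features))
c = rng.uniform(0, 2 * np.pi, n_features)

def mean_embedding(S):
    """Map a scatterplot S (an n x 2 array) to a fixed-length vector by
    averaging random Fourier features: an approximate Gaussian-kernel mean
    embedding of the empirical distribution (Smola et al., 2007)."""
    return np.sqrt(2.0 / n_features) * np.cos(S @ W + c).mean(axis=0)

S = rng.normal(size=(500, 2))   # one scatterplot {(a_i, b_i)}
phi = mean_embedding(S)
print(phi.shape)                # (100,): one training point for the classifier
```

Because the embedding averages over points, it is invariant to permutations of the scatterplot, which is precisely the property discussed above.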

5.4.3 Feature-based supervised methods

These methods use the same data generated by our causal projections, but treat them as fixed-size vectors fed to a generic classifier, rather than as random samples to be analyzed with an observational causal discovery method. They can be seen as an oracle to upper-bound the amount of causal signal (and signal correlated with causality) contained in our data. Specifically, they use the 2n-dimensional vectors given by the concatenation of the vectors a and b from (1). Given word pairs {(x_j, y_j)}_{j=1}^m, they build a dataset {((a_j, b_j), l_j)}_{j=1}^m, where l_j = +1 if x_j → y_j, l_j = −1 if y_j → x_j, and the vectors are computed with one of the projections from Section 5.1. Next, we split the dataset into a training set and a disjoint test set. To evaluate the accuracy of each method in this family, we train a random forest on the training set, and report its classification accuracy over the test set. This process is repeated ten times, splitting the dataset at random. The results are presented as red bars in Figure 5. We also report the classification accuracy of training the random forest on the raw word2vec representations of the pair of words (top three bars).

5.4.4 Discussion of results

Baseline methods are the lowest performing. We believe that the performance of the best baseline, precedence, is due to the fact that most of Wikipedia is written in the active voice, which often aligns with the temporal sequence of events, and thus correlates with causality.

The feature-based methods perform best. However, feature-based methods enjoy the flexibility of considering each of the elements in the causal projection vectors as a distinct feature. Therefore, feature-based methods do not focus on patterns to be found at a distributional level (such as causality), and are vulnerable to permutation or removal of features. We believe that feature-based methods may achieve their superior performance by overfitting to biases in our dataset, which are not necessarily related to causality.

Impressively, the best distribution-based causal discovery method achieves 75% test classification accuracy, a significant improvement over the best baseline method. Importantly, our distribution-based methods take a whole two-dimensional distribution as input to the classifier; as such, these methods are robust with respect to permutations and removals of the distribution samples. We find it encouraging that the best distribution-based method is the one based on the conditional probability p(w | x). This suggests the intuitive interpretation that the distribution of a vocabulary conditioned on the cause word causes the distribution of the vocabulary conditioned on the effect word. Even more encouragingly, Figure 6 shows a positive dependence between the test classification accuracy of RCC and the confidence of the human annotations, obtained by restricting the test set to causal pairs annotated with increasing levels of human confidence. Thus, our proxy variables and projections arguably capture a notion of causality aligned with that of the human annotators.

Figure 5: Results for all methods on the NLP experiment. Accuracies above the significance threshold of a Binomial test are statistically significant.
Figure 6: RCC accuracy versus human confidence.
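The Binomial test mentioned in the caption of Figure 5 can be sketched as follows. Under the null hypothesis of random guessing (p = 0.5), the number of correct answers over n test pairs is Binomial(n, 0.5); an accuracy is significant when the one-sided tail probability falls below the significance level. The values n = 200 and alpha = 0.05 below are illustrative, not the paper's settings.

```python
# Exact one-sided Binomial test for "accuracy better than chance".
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, alpha = 200, 0.05
# Smallest accuracy that rejects the random-guessing null at level alpha.
threshold = next(k / n for k in range(n + 1) if binomial_tail(k, n) < alpha)
print(threshold)
```

Any method whose test accuracy exceeds this threshold is unlikely to be guessing at random.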

6 Proxy Variables in Machine Learning

The central concept in this paper is that of a proxy variable: a random variable providing a source of information related to the static entities x and y.

However, we can also consider the reverse process: using a static entity to augment statistics about a pair of random variables X and Y. As it turns out, this can be a useful device in general prediction problems.

To illustrate, consider a supervised learning problem mapping a feature random variable X into a target random variable Y. Such a problem is often solved by considering a sample drawn from P(X, Y). In this scenario, we may contemplate an unpaired, external, static source of information (such as a memory), which might help solve the supervised learning problem at hand. One could incorporate the information in the static source by constructing proxy projections, adding them to the dataset as extra features, and building the predictor on the augmented data.
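A minimal sketch of this idea follows. The "memory" M of k reference vectors and the inner-product projection are illustrative choices on our part, not the paper's construction: any fixed static source and any projection of the features against it would play the same role.

```python
# Augmenting a supervised dataset with proxy features computed against a
# static external source (a "memory" of reference vectors).
import numpy as np

rng = np.random.RandomState(0)
n, d, k = 100, 5, 8
X = rng.randn(n, d)               # observed features
M = rng.randn(k, d)               # static external source: k reference vectors

def proxy_projection(X, M):
    """Project each example onto the static source via inner products."""
    return X @ M.T                # shape (n, k): one proxy feature per memory row

# The predictor is then trained on the original and proxy features jointly.
X_augmented = np.hstack([X, proxy_projection(X, M)])
print(X_augmented.shape)          # (100, 13)
```

Because M is static, the proxy features are a deterministic function of X, so any off-the-shelf learner can consume the augmented matrix directly.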

7 Conclusion

We have introduced the necessary machinery to estimate the causal relation between pairs of static entities x and y — one piece of art and its forgery, one document and its translation, or the concepts underlying a pair of words appearing in a corpus of natural language. We have done so by introducing the tools of proxy variables and projections, reducing our problem to one of observational causal inference between random entities. Through a variety of experiments, we have shown the empirical effectiveness of our proposed method, and we have connected it to the general problem of incorporating external sources of knowledge as additional features in machine learning problems.


  • Baayen (2001) Baayen, H. Word Frequency Distributions. Kluwer, 2001.
  • Baroni et al. (2012) Baroni, M., Bernardi, R., Do, N.-Q., and Shan, C.-C. Entailment above the word level in distributional semantics. In EACL, 2012.
  • Beebee et al. (2009) Beebee, H., Hitchcock, C., and Menzies, P. The Oxford handbook of causation. Oxford University Press, 2009.
  • Blanco et al. (2008) Blanco, E., Castell, N., and Moldovan, D. Causal relation extraction. In LREC, 2008.
  • Bojanowski et al. (2016) Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. arXiv, 2016.
  • Bottou (2014) Bottou, L. From machine learning to machine reasoning. Machine learning, 2014.
  • Bottou et al. (2013) Bottou, L., Peters, J., Charles, D. X., Chickering, M., Portugaly, E., Ray, D., Simard, P. Y., and Snelson, E. Counterfactual reasoning and learning systems: the example of computational advertising. JMLR, 2013.
  • Church & Hanks (1990) Church, K. and Hanks, P. Word association norms, mutual information, and lexicography. Computational linguistics, 1990.
  • Cohen (2015) Cohen, P. R. DARPA’s Big Mechanism program. Physical biology, 2015.
  • Daniusis et al. (2012) Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. Inferring deterministic causal relations. arXiv, 2012.
  • Gatys et al. (2016) Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In CVPR, 2016.
  • Geffet & Dagan (2005) Geffet, M. and Dagan, I. The distributional inclusion hypotheses and lexical entailment. In ACL, 2005.
  • Girju et al. (2009) Girju, R., Nakov, P., Nastase, V., Szpakowicz, S., Turney, P., and Yuret, D. Classification of semantic relations between nominals. Language Resources and Evaluation, 2009.
  • Hashimoto et al. (2012) Hashimoto, C., Torisawa, K., De Saeger, S., Oh, J.-H., and Kazama, J. Excitatory or inhibitory: A new semantic orientation extracts contradiction and causality from the web. In EMNLP, 2012.
  • Hoyer et al. (2009) Hoyer, P., Janzing, D., Mooij, J., Peters, J., and Schölkopf, B. Nonlinear causal discovery with additive noise models. In NIPS, 2009.
  • Kotlerman et al. (2010) Kotlerman, L., Dagan, I., Szpektor, I., and Zhitomirsky-Geffet, M. Directional distributional similarity for lexical inference. Natural Language Engineering, 2010.
  • Kuipers (1984) Kuipers, B. Commonsense reasoning about causality: deriving behavior from structure. Artificial intelligence, 1984.
  • Lake et al. (2016) Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. arXiv, 2016.
  • Lenci & Benotto (2012) Lenci, A. and Benotto, G. Identifying hypernyms in distributional semantic spaces. In *SEM, 2012.
  • Levesque et al. (2012) Levesque, H., Davis, E., and Morgenstern, L. The Winograd Schema Challenge. In KR, 2012.
  • Lopez-Paz (2016) Lopez-Paz, D. From dependence to causation. PhD thesis, University of Cambridge, 2016.
  • Lopez-Paz et al. (2015) Lopez-Paz, D., Muandet, K., Schölkopf, B., and Tolstikhin, I. Towards a learning theory of cause-effect inference. In ICML, 2015.
  • Melamud et al. (2015) Melamud, O., Levy, O., Dagan, I., and Ramat-Gan, I. A simple word embedding model for lexical substitution. In Workshop on Vector Space Modeling for Natural Language Processing, 2015.
  • Mikolov et al. (2013) Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv, 2013.
  • Mirza & Tonelli (2016) Mirza, P. and Tonelli, S. CATENA: CAusal and TEmporal relation extraction from NAtural language texts. In COLING, 2016.
  • Mooij et al. (2016) Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: methods and benchmarks. JMLR, 2016.
  • Pearl (2009) Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.
  • Peters (2012) Peters, J. Restricted structural equation models for causal inference. PhD thesis, ETH Zurich, 2012.
  • Peters et al. (2014) Peters, J., Mooij, J., Janzing, D., and Schölkopf, B. Causal discovery with continuous additive noise models. JMLR, 2014.
  • Peters et al. (2016) Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference using invariant prediction: identification and confidence intervals. JRSS B, 2016.
  • Radinsky et al. (2012) Radinsky, K., Davidovich, S., and Markovitch, S. Learning causality for news events prediction. In WWW, 2012.
  • Reichenbach (1956) Reichenbach, H. The direction of time, 1956.
  • Rojas-Carulla et al. (2015) Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. Causal transfer in machine learning. arXiv, 2015.
  • Roller et al. (2014) Roller, S., Erk, K., and Boleda, G. Inclusive yet selective: Supervised distributional hypernymy detection. In COLING, 2014.
  • Santus et al. (2014) Santus, E., Lenci, A., Lu, Q., and Schulte im Walde, S. Chasing hypernyms in vector spaces with entropy. In EACL, 2014.
  • Schölkopf et al. (2012) Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In ICML, 2012.
  • Shimizu et al. (2006) Shimizu, S., Hoyer, P., Hyvärinen, A., and Kerminen, A. A linear non-gaussian acyclic model for causal discovery. JMLR, 2006.
  • Smola et al. (2007) Smola, A., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In ALT. Springer, 2007.
  • Waldrop (1987) Waldrop, M. M. Causality, structure, and common sense. Science, 1987.
  • Weeds & Weir (2003) Weeds, J. and Weir, D. A general framework for distributional similarity. In EMNLP, 2003.
  • Weeds et al. (2004) Weeds, J., Weir, D., and McCarthy, D. Characterising measures of lexical distributional similarity. In COLING, 2004.
  • Weeds et al. (2014) Weeds, J., Clarke, D., Reffin, J., Weir, D., and Keller, B. Learning to distinguish hypernyms and co-hyponyms. In COLING, 2014.

Appendix A Instructions for word pair creators

We will ask you to write word pairs (for instance, WordA and WordB) for which you believe the statement “WordA causes WordB” is true.

To provide us with high quality word pairs, we ask you to follow these indications:

  • All word pairs must have the form “WordA WordB”. It is essential that the first word (WordA) is the cause, and the second word (WordB) is the effect.

  • WordA and WordB must be one word each (no spaces; multi-word phrases such as “recessive gene red hair” are not allowed). Avoid compound words such as “snow-blind”.

  • In many situations, you may come up with a word pair that can be justified both as “WordA WordB” and “WordB WordA”. In such situations, prefer the causal direction with the easiest explanation. For example, consider the word pair “virus death”. Most people would agree that “virus causes death”. While “death causes virus” can be true in some specific scenario (for example, “because of all the deaths in the region, a new family of viruses emerged”), the explanation “virus causes death” is preferred, because it is more general and depends less on the context.

  • We do not accept word pairs with an ambiguous causal relation, such as “book - paper”.

  • We do not accept simple variations of word pairs. For example, if you wrote down “dog bark”, we will not credit you for other pairs such as “dogs bark” or “dog barking”.

  • Use frequent words (avoid strange words such as “clithridiate”).

  • Do not rely on our examples, and use your creativity. We are grateful if you come up with diverse word pairs! Please do not add any numbers (for example, “1 - dog bark”). For your guidance, we provide you examples of word pairs that belong to different categories. Please bear in mind that we will reward your creativity: therefore, focus on providing new word pairs with an evident causal direction, and do not limit yourself to the categories shown below.

1) Physical phenomenon: there exists a clear physical mechanism that explains why “WordA WordB”.

  • sun radiation (The sun is a source of radiation. If the sun were not present, then there would be no radiation.)

  • altitude temperature

  • winter cold

  • oil energy

2) Events and consequences: WordA is an action or event, and WordB is a consequence of that action or event.

  • crime punishment

  • accident death

  • smoking cancer

  • suicide death

  • call ring

3) Creator and producer: WordA is a creator or producer, WordB is the creation of the producer.

  • writer book (the creator is a person)

  • painter painting

  • father son

  • dog bark

  • bacteria sickness

  • pen drawing (the creator is an object)

  • chef food

  • instrument music

  • bomb destruction

  • virus death

4) Other categories! Up to you, please use your creativity!

  • fear scream

  • age salary

Appendix B Instructions for word pair validators

Please classify the relation between pairs of words A and B into one of three categories: either “A causes B”, “B causes A”, or “Non-causal or unrelated”.

For example, given the pair of words “virus and death”, the correct answer would be:

  • virus causes death (correct);

  • death causes virus (wrong);

  • non-causal or unrelated (wrong).

Some of the pairs that will be presented are non-causal. This may happen if:

  • The words are unrelated, like “toilet and beach”.

  • The words are related, but there is no clear causal direction. This is the case of “salad and lettuce”, since we can eat salad without lettuce, or eat lettuce in a burger.

To provide us with high quality categorization of word pairs, we ask you to follow these indications:

  • Prefer the causal direction with the simplest explanation. Most people would agree that “virus causes death”. While “death causes virus” can be true in some specific scenario (for example, “because of all the deaths in the region, a new virus emerged”), the explanation “virus causes death” is preferred, because it is true in more general contexts.

  • If neither direction is clearly preferable, mark the pair as non-causal. Here, being conservative is good!

  • Think twice before deciding. We will present the pairs in random order!

Please classify all the presented pairs. If one or more has not been answered, the whole batch will be invalid. PLEASE DOUBLE CHECK THAT YOU HAVE ANSWERED ALL 40 WORD PAIRS.

Examples of causal word pairs:

  • “sun and radiation”: sun causes radiation

  • “energy and oil”: oil causes energy

  • “punishment and crime”: crime causes punishment

  • “instrument and music”: instrument causes music

  • “age and salary”: age causes salary

Examples of non-causal word pairs:

  • “video and games”: non-causal or unrelated

  • “husband and wife”: non-causal or unrelated

  • “salmon and shampoo”: non-causal or unrelated

  • “knife and gun”: non-causal or unrelated

  • “sport and soccer”: non-causal or unrelated