Towards Causal Representation Learning

The two fields of machine learning and graphical causality arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.




I Introduction

If we compare what machine learning can do to what animals accomplish, we observe that the former is rather limited at some crucial feats where natural intelligence excels. These include transfer to new problems and any form of generalization that is not from one data point to the next (sampled from the same distribution), but rather from one problem to the next. Both have been termed generalization, but the latter is a much harder form thereof, sometimes referred to as horizontal, strong, or out-of-distribution generalization. This shortcoming is not too surprising, given that machine learning often disregards information that animals use heavily: interventions in the world, domain shifts, and temporal structure. By and large, we consider these factors a nuisance and try to engineer them away. In accordance with this, the majority of current successes of machine learning boil down to large scale pattern recognition on suitably collected independent and identically distributed (i.i.d.) data.

To illustrate the implications of this choice and its relation to causal models, we start by highlighting key research challenges.

Issue 1 – Robustness

With the widespread adoption of deep learning approaches in computer vision [he2016deep, krizhevsky2012imagenet], natural language processing [devlin2018bert], and speech recognition [graves2013speech], a substantial body of literature explored the robustness of the prediction of state-of-the-art deep neural network architectures. The underlying motivation originates from the fact that in the real world there is often little control over the distribution from which the data come. In computer vision [geirhos2018imagenet, shetty2019not], changes in the test distribution may, for instance, come from aberrations like camera blur, noise or compression quality [hendrycks2019benchmarking, karahan2016image, michaelis2019benchmarking, roy2018effects], or from shifts, rotations, or viewpoints [azulay2019deep, barbu2019objectnet, engstrom2017exploring, zhang2019making]. Motivated by this, new benchmarks were proposed to specifically test generalization of classification and detection methods with respect to simple algorithmically generated interventions like spatial shifts, blur, changes in brightness or contrast [hendrycks2019benchmarking, michaelis2019benchmarking], time consistency [gu2019using, shankarimage], control over background and rotation [barbu2019objectnet], as well as images collected in multiple environments [beery2018recognition]. Studying the failure modes of deep neural networks from simple interventions has the potential to lead to insights into the inductive biases of state-of-the-art architectures. So far, there has been no definitive consensus on how to solve these problems, although progress has been made using data augmentation, pre-training, self-supervision, and architectures with suitable inductive biases w.r.t. a perturbation of interest [tangent_prop, djolonga2020robustness, engstrom2017exploring, kolesnikov2019big, michaelis2019benchmarking, roy2018effects]. It has been argued [PetJanSch17] that such fixes may not be sufficient, and generalizing well outside the i.i.d. setting requires learning not mere statistical associations between variables, but an underlying causal model. The latter contains the mechanisms giving rise to the observed statistical dependences, and allows us to model distribution shifts through the notion of interventions [Pearl2009, Spirtes2000, Schoelkopf2012, Bottou2013, PetJanSch17, ParKilRojSch18].

Issue 2 – Learning Reusable Mechanisms

Infants’ understanding of physics relies upon objects that can be tracked over time and behave consistently [dehaene2020we, spelke1990principles]. Such a representation allows children to quickly learn new tasks as their knowledge and intuitive understanding of physics can be re-used [battaglia2013simulation, dehaene2020we, lake2017building, teglas2011pure]. Similarly, intelligent agents that robustly solve real-world tasks need to re-use and re-purpose their knowledge and skills in novel scenarios. Machine learning models that incorporate or learn structural knowledge of an environment have been shown to be more efficient and generalize better [battaglia2016interaction, bapst2019structured, battaglia2018relational, RIMs, rahaman2021spatially, santoro2017simple, bahdanau2018systematic, zambaldi2018deep, berner2019dota, gondal2019transfer, goyal2020object, kulkarni2019unsupervised, locatello2019fairness, mrowca2018flexible, sanchez2020learning, sun2019stochastic, vinyals2019grandmaster, yi2019clevrer, dittadi2021on, parascandolo2021learning]. In a modular representation of the world where the modules correspond to physical causal mechanisms, many modules can be expected to behave similarly across different tasks and environments. An agent facing a new environment or task may thus only need to adapt a few modules in its internal representation of the world [SchJanLop16, RIMs]. When learning a causal model, one should thus require fewer examples to adapt as most knowledge, i.e., modules, can be re-used without further training.

A Causality Perspective

Causation is a subtle concept that cannot be fully described using the language of Boolean logic [lewis1974causation] or that of probabilistic inference; it requires the additional notion of intervention [Spirtes2000, Pearl2009]. The manipulative definition of causation [Spirtes2000, Pearl2009, imbens2015causal] focuses on the fact that conditional probabilities (“seeing people with open umbrellas suggests that it is raining”) cannot reliably predict the outcome of an active intervention (“closing umbrellas does not stop the rain”). Causal relations can also be viewed as the components of reasoning chains [lewis1974causation] that provide predictions for situations that are very far from the observed distribution and may even remain purely hypothetical [Lorenz73, Pearl2009] or require conscious deliberation [kahneman2011thinking]. In that sense, discovering causal relations means acquiring robust knowledge that holds beyond the support of an observed data distribution and a set of training tasks, and it extends to situations involving forms of reasoning.

Our Contributions:  

In the present paper, we argue that causality, with its focus on representing structural knowledge about the data generating process that allows interventions and changes, can contribute towards understanding and resolving some limitations of current machine learning methods. This would take the field a step closer to a form of artificial intelligence that involves thinking in the sense of Konrad Lorenz, i.e., acting in an imagined space [Lorenz73]. Despite its success, statistical learning provides a rather superficial description of reality that only holds when the experimental conditions are fixed. Instead, the field of causal learning seeks to model the effect of interventions and distribution changes with a combination of data-driven learning and assumptions not already included in the statistical description of a system. The present work reviews and synthesizes key contributions that have been made to this end (the present paper expands [1911.10500], leading to partial text overlap):


  • We describe different levels of modeling in physical systems in Section II and present the differences between causal and statistical models in Section III. We do so not only in terms of modeling abilities but also discuss the assumptions and challenges involved.

  • We expand on the Independent Causal Mechanisms (ICM) principle as a key component that enables the estimation of causal relations from data in Section IV. In particular, we state the Sparse Mechanism Shift hypothesis as a consequence of the ICM principle and discuss its implications for learning causal models.

  • We review existing approaches to learn causal relations from appropriate descriptors (or features) in Section V. We cover both classical approaches and modern re-interpretations based on deep neural networks, with a focus on the underlying principles that enable causal discovery.

  • We discuss how useful models of reality may be learned from data in the form of causal representations, and discuss several current problems of machine learning from a causal point of view in Section VI.

  • We assay the implications of causality for practical machine learning in Section VII. Using causal language, we revisit robustness and generalization, as well as existing common practices such as semi-supervised learning, self-supervised learning, data augmentation, and pre-training. We discuss examples at the intersection between causality and machine learning in scientific applications and speculate on the advantages of combining the strengths of both fields to build a more versatile AI.

II Levels of Causal Modeling

Model                  Predict in       Predict under distr.   Answer counterfactual   Obtain physical   Learn from
                       i.i.d. setting   shift/intervention     questions               insight           data
Mechanistic/physical   yes              yes                    yes                     yes               ?
Structural causal      yes              yes                    yes                     ?                 ?
Causal graphical       yes              yes                    no                      ?                 ?
Statistical            yes              no                     no                      no                yes

TABLE I: A simple taxonomy of models. The most detailed model (top) is a mechanistic or physical one, usually in terms of differential equations. At the other end of the spectrum (bottom), we have a purely statistical model; this can be learned from data, but it often provides little insight beyond modeling associations between epiphenomena. Causal models can be seen as descriptions that lie in between, abstracting away from physical realism while retaining the power to answer certain interventional or counterfactual questions.

The gold standard for modeling natural phenomena is a set of coupled differential equations modeling physical mechanisms responsible for the time evolution. This allows us to predict the future behavior of a physical system, reason about the effect of interventions, and predict statistical dependencies between variables that are generated by coupled time evolution. It also offers physical insights, explaining the functioning of the system, and lets us read off its causal structure. To this end, consider the coupled set of differential equations

$$\frac{dx}{dt} = f(x), \quad x \in \mathbb{R}^d, \tag{1}$$

with initial value $x(t_0) = x_0$. The Picard–Lindelöf theorem states that at least locally, if $f$ is Lipschitz, there exists a unique solution $x(t)$. This implies in particular that the immediate future of $x$ is implied by its past values.

If we formally write this in terms of infinitesimal differentials $dt$ and $dx = x(t + dt) - x(t)$, we get:

$$x(t + dt) = x(t) + dt \cdot f(x(t)). \tag{2}$$

From this, we can ascertain which entries of the vector $x(t)$ mathematically determine the future of others $x(t + dt)$. This tells us that if we have a physical system whose physical mechanisms are correctly described using such an ordinary differential equation (1), solved for $\frac{dx}{dt}$ (i.e., the derivative only appears on the left-hand side), then its causal structure can be directly read off. [Footnote 1]

Footnote 1: Note that this requires that the differential equation system describes the causal physical mechanisms. If, in contrast, we considered a set of differential equations that phenomenologically correctly describe the time evolution of a system without capturing the underlying mechanisms (e.g., due to unobserved confounding, or a form of coarse-graining that does not preserve the causal structure [Rubensteinetal17]), then (2) may not be causally meaningful [1911.10500, peters2020causal].
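To make the idea of reading causal structure off a discretized differential equation concrete, the following minimal sketch applies one Euler step to a toy system and checks numerically which entries of the current state affect which entries of the next state. The right-hand side `f`, the state dimension, and the helper names are hypothetical choices for illustration; the check is local (at one state) and is not a general structure-discovery method.

```python
def f(x):
    """Hypothetical right-hand side f of dx/dt = f(x) for a 3-dimensional state."""
    x1, x2, x3 = x
    return [
        -0.5 * x1,       # dx1/dt depends on x1 only
        x1 - x2,         # dx2/dt depends on x1 and x2
        x2 * x2 - x3,    # dx3/dt depends on x2 and x3
    ]


def euler_step(x, dt):
    """One step of x(t + dt) = x(t) + dt * f(x(t))."""
    return [xi + dt * fi for xi, fi in zip(x, f(x))]


def causal_edges(x, dt=1e-2, eps=1e-3, tol=1e-12):
    """Return (cause, effect) index pairs j -> i, with j != i, such that
    perturbing x_j(t) changes x_i(t + dt). This is a local numerical check
    at the state x, not a proof of (in)dependence everywhere."""
    base = euler_step(x, dt)
    edges = []
    for j in range(len(x)):
        perturbed = list(x)
        perturbed[j] += eps
        stepped = euler_step(perturbed, dt)
        for i in range(len(x)):
            if i != j and abs(stepped[i] - base[i]) > tol:
                edges.append((j, i))
    return edges


edges = causal_edges([1.0, 2.0, 3.0])
print(edges)
```

For this toy system the check recovers the chain in which the first component drives the second and the second drives the third, i.e., the index pairs (0, 1) and (1, 2). Self-influences are skipped because every component trivially carries over into its own next value.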

While a differential equation is a rather comprehensive description of a system, a statistical model can be viewed as a much more superficial one. It often does not refer to dynamic processes; instead, it tells us how some of the variables allow prediction of others as long as experimental conditions do not change. E.g., if we drive a differential equation system with certain types of noise, or we average over time, then it may be the case that statistical dependencies between components of $x$ emerge, and those can then be exploited by machine learning. Such a model does not allow us to predict the effect of interventions; however, its strength is that it can often be learned from observational data, while a differential equation usually requires an intelligent human to come up with it. Causal modeling lies in between these two extremes. Like models in physics, it aims to provide understanding and predict the effect of interventions. However, causal discovery and learning try to arrive at such models in a data-driven way, replacing expert knowledge with weak and generic assumptions. The overall situation is summarized in Table I, adapted from [PetJanSch17]. Below, we address some of the tasks listed in Table I in more detail.

II-A Predicting in the i.i.d. Setting

Statistical models are a superficial description of reality as they are only required to model associations. For a given set of input examples $X$ and target labels $Y$, we may be interested in approximating $P(Y|X)$ to answer questions like: “what is the probability that this particular image contains a dog?” or “what is the probability of heart failure given certain diagnostic measurements (e.g., blood pressure) carried out on a patient?”. Subject to suitable assumptions, these questions can be provably answered by observing a sufficiently large amount of i.i.d. data from $P(X, Y)$ [Vapnik98]. Despite the impressive advances of machine learning, causality offers an under-explored complement: accurate predictions may not be sufficient to inform decision making. For example, the frequency of storks is a reasonable predictor for human birth rates in Europe [matthews2000storks]. However, as there is no direct causal link between those two variables, a change to the stork population would not affect the birth rates, even though a statistical model may predict so. The predictions of a statistical model are only accurate within identical experimental conditions. Performing an intervention changes the data distribution, which may lead to (arbitrarily) inaccurate predictions [Pearl2009, Spirtes2000, Schoelkopf2012, PetJanSch17].
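The gap between prediction and intervention in the stork example can be illustrated with a small simulation. The sketch below assumes a hypothetical linear model in which an unobserved common cause `z` drives both stork frequency and birth rate; the coefficients, noise levels, and function names are invented for illustration and are not taken from [matthews2000storks].

```python
import random


def sample_births(n, rng, do_storks=None):
    """Sample (storks, births) pairs from a hypothetical confounded model.

    z is an unobserved common cause. If do_storks is given, storks is set
    by intervention and no longer listens to z; the mechanism for births
    is left untouched."""
    pairs = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        storks = z + rng.gauss(0.0, 0.5) if do_storks is None else do_storks
        births = 2.0 * z + rng.gauss(0.0, 0.5)   # storks is not an input here
        pairs.append((storks, births))
    return pairs


def regression_slope(pairs):
    """Least-squares slope of births on storks."""
    n = len(pairs)
    mx = sum(s for s, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((s - mx) * (b - my) for s, b in pairs) / n
    var = sum((s - mx) ** 2 for s, _ in pairs) / n
    return cov / var


def mean_births(pairs):
    return sum(b for _, b in pairs) / len(pairs)


rng = random.Random(0)
observational_slope = regression_slope(sample_births(20000, rng))
effect_of_intervention = (
    mean_births(sample_births(20000, rng, do_storks=5.0))
    - mean_births(sample_births(20000, rng, do_storks=0.0))
)
print(observational_slope, effect_of_intervention)
```

Under these assumptions the observational slope is clearly positive (about 1.6 in expectation), so a regression model would predict more births from more storks, while the difference in mean births between the two interventions stays near zero.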

II-B Predicting Under Distribution Shifts

Interventional questions are more challenging than predictions as they involve actions that take us out of the usual i.i.d. setting of statistical learning. Interventions may affect both the value of a subset of causal variables and their relations. Examples are “is increasing the number of storks in a country going to boost its human birth rate?” and “would fewer people smoke if cigarettes were more socially stigmatized?”. As interventions change the joint distribution of the variables of interest, classical statistical learning guarantees [Vapnik98] no longer apply. On the other hand, learning about interventions may allow us to train predictive models that are robust against the changes in distribution that naturally happen in the real world. Here, interventions do not need to be deliberate actions to achieve a goal. Statistical relations may change dynamically over time (e.g., people’s preferences and tastes) or there may simply be a mismatch between a carefully controlled training distribution and the test distribution of a model deployed in production. The robustness of deep neural networks has recently been scrutinized and become an active research topic related to causal inference. We argue that predicting under distribution shift should not be reduced to just the accuracy on a test set. If we wish to incorporate learning algorithms into human decision making, we need to trust that the predictions of the algorithm will remain valid if the experimental conditions are changed.

II-C Answering Counterfactual Questions

Counterfactual problems involve reasoning about why things happened, imagining the consequences of different actions in hindsight, and determining which actions would have achieved a desired outcome. Answering counterfactual questions can be more difficult than answering interventional questions. However, this may be a key challenge for AI, as an intelligent agent may benefit from imagining the consequences of its actions as well as understanding in retrospect what led to certain outcomes, at least to some degree of approximation. [Footnote 2]

Footnote 2: Note that the two types of questions occupy a continuum: to this end, consider a probability which is both conditional and interventional, $P(A \mid B, do(C))$. If $B$ is the empty set, we have a classical intervention; if $B$ contained all (unobserved) noise terms, we have a counterfactual. If $B$ is not identical to the noise terms, but nevertheless informative about them, we get something in between. For instance, reinforcement learning practitioners may describe $Q$ functions as providing counterfactuals, even though they model $P(\text{return from } t \mid \text{agent state at time } t, do(\text{action at time } t))$, and are therefore closer to an intervention (which is why they can be estimated from data).

Above, we mentioned the example of statistical predictions of heart failure. An interventional question would be “how does the probability of heart failure change if we convince a patient to exercise regularly?” A counterfactual one would be “would a given patient have suffered heart failure if they had started exercising a year earlier?”. As we shall discuss below, counterfactuals, or approximations thereof, are especially critical in reinforcement learning. They can enable agents to reflect on their decisions and formulate hypotheses that can be empirically verified in a process akin to the scientific method.
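The difference between an interventional and a counterfactual query can be sketched on a one-equation toy model. The example below assumes a hypothetical structural assignment Y := 2X + U and uses the standard three-step recipe of abduction, action, and prediction, which the text above does not spell out; the function names and numbers are illustrative only.

```python
def structural_y(x, u):
    """Hypothetical structural assignment Y := 2 * X + U."""
    return 2.0 * x + u


def counterfactual_y(x_obs, y_obs, x_new):
    """Counterfactual outcome for one observed unit."""
    u = y_obs - 2.0 * x_obs          # abduction: recover this unit's noise term
    return structural_y(x_new, u)    # action and prediction: set X := x_new, reuse u


def interventional_mean_y(x_new, noise_mean=0.0):
    """E[Y | do(X = x_new)]: averages over the noise instead of fixing it."""
    return structural_y(x_new, noise_mean)


print(counterfactual_y(1.0, 3.5, 0.0))
print(interventional_mean_y(0.0))
```

For the unit observed at X = 1, Y = 3.5, the counterfactual outcome under X = 0 is 1.5, whereas the population-level interventional mean under X = 0 is 0. The counterfactual conditions on the noise term of that specific unit, which matches the continuum described in the footnote on conditional and interventional probabilities.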

II-D Nature of Data: Observational, Interventional, (Un)structured

The data format plays a substantial role in which type of relation can be inferred. We can distinguish two axes of data modalities: observational versus interventional, and hand-engineered versus raw (unstructured) perceptual input.

Observational and Interventional Data: an extreme form of data which is often assumed but seldom strictly available is observational i.i.d. data, where each data point is independently sampled from the same distribution. Another extreme is interventional data with known interventions, where we observe data sets sampled from multiple distributions each of which is the result of a known intervention. In between, we have data with (domain) shifts or unknown interventions. This is observational in the sense that the data is only observed passively, but it is interventional in the sense that there are interventions/shifts, but unknown to us.

Hand Engineered Data vs. Raw Data: especially in classical AI, data is often assumed to be structured into high-level and semantically meaningful variables which may partially (modulo some variables being unobserved) correspond to the causal variables of the underlying graph. Raw Data, in contrast, is unstructured and does not expose any direct information about causality.

While statistical models are weaker than causal models, they can be efficiently learned from observational data alone on both hand-engineered features and raw perceptual input such as images, videos, speech etc. On the other hand, although methods for learning causal structure from observations exist [Spirtes2000, PetJanSch17, Shimizu2006, Hoyer2008, Mooij2009, PetMooJanSch14, Kpotufe14, BauSchPet16, Sun2006, Zhang2009, Mooij11, Janzing2009uai, Peters2011b, Mooijetal16, Vreeken, 1711.08936, LopMuaSchTol15], learning causal relations frequently requires collecting data from multiple environments, or the ability to perform interventions [Tian2001]. In some cases, it is assumed that all common causes of measured variables are also observed (causal sufficiency). [Footnote 3] Overall, a significant amount of prior knowledge is encoded in which variables are measured. Moving forward, one would hope to develop methods that replace expert data collection with suitable inductive biases and learning paradigms such as meta-learning and self-supervision. If we wish to learn a causal model that is useful for a particular set of tasks and environments, the appropriate granularity of the high-level variables depends on the tasks of interest and on the type of data we have at our disposal, for example which interventions can be performed and what is known about the domain.

Footnote 3: There are also algorithms that do not require causal sufficiency [Spirtes2000].

III Causal Models and Inference

As discussed, reality can be modeled at different levels, from the physical one to statistical associations between epiphenomena. In this section, we expand on the difference between statistical and causal modeling and review a formal language to talk about interventions and distribution changes.

III-A Methods Driven by i.i.d. Data

The machine learning community has produced impressive successes with machine learning applications to big data problems [LeCBenHin15, mnih2015human, schrittwieser2019mastering, silver2016mastering, deng2009imagenet]. In these successes, there are several trends at work [schoelkopf15]: (1) we have massive amounts of data, often from simulations or large scale human labeling, (2) we use high capacity machine learning systems (i.e., complex function classes with many adjustable parameters), (3) we employ high-performance computing systems, and finally (often ignored, but crucial when it comes to causality) (4) the problems are i.i.d. The latter can be guaranteed by the construction of a task including training and test set (e.g., image recognition using benchmark datasets). Alternatively, problems can be made approximately i.i.d., e.g., by carefully collecting the right training set for a given application problem, or by methods such as “experience replay” [mnih2015human] where a reinforcement learning agent stores observations in order to later permute them for the purpose of re-training.

For i.i.d. data, strong universal consistency results from statistical learning theory apply, guaranteeing convergence of a learning algorithm to the lowest achievable risk. Such algorithms do exist, for instance, nearest neighbor classifiers, support vector machines, and neural networks [Vapnik98, SchSmo02, SteChr08, farago2006strong]. Seen in this light, it is not surprising that we can indeed match or surpass human performance if given enough data. However, current machine learning methods often perform poorly when faced with problems that violate the i.i.d. assumption, yet seem trivial to humans. Vision systems can be grossly misled if an object that is normally recognized with high accuracy is placed in a context that in the training set may be negatively correlated with the presence of the object. Distribution shifts may also arise from simple corruptions that are common in real-world data collection pipelines [Baird90, hendrycks2019benchmarking, karahan2016image, michaelis2019benchmarking, roy2018effects]. An example of this is the impact of socio-economic factors in clinics in Thailand on the accuracy of a detection system for Diabetic Retinopathy [beede2020human]. More dramatically, the phenomenon of “adversarial vulnerability” [1312.6199] highlights how even tiny but targeted violations of the i.i.d. assumption, generated by adding suitably chosen perturbations to images, imperceptible to humans, can lead to dangerous errors such as confusion of traffic signs. Overall, it is fair to say that much of the current practice (of solving i.i.d. benchmark problems) and most theoretical results (about generalization in i.i.d. settings) fail to tackle the hard open challenge of generalization across problems.

To further understand how the i.i.d. assumption is problematic, let us consider a shopping example. Suppose Alice is looking for a laptop rucksack on the internet (i.e., a rucksack with a padded compartment for a laptop). The web shop’s recommendation system suggests that she should buy a laptop to go along with the rucksack. This seems odd because she probably already has a laptop, otherwise she would not be looking for the rucksack in the first place. In a way, the laptop is the cause, and the rucksack is an effect. Now suppose we are told whether a customer has bought a laptop. This reduces our uncertainty about whether she also bought a laptop rucksack, and vice versa, and it does so by the same amount (the mutual information), so the directionality of cause and effect is lost. However, the directionality is present in the physical mechanisms generating statistical dependence, for instance the mechanism that makes a customer want to buy a rucksack once she owns a laptop. [Footnote 4] Recommending an item to buy constitutes an intervention in a system, taking us outside the i.i.d. setting. We no longer work with the observational distribution, but a distribution where certain variables or mechanisms have changed.

Footnote 4: Note that the physical mechanisms take place in time, and if available, time order may provide additional information about causality.
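The symmetry of the uncertainty reduction in the laptop and rucksack example can be checked numerically. The sketch below uses a hypothetical joint distribution over the two purchase indicators (the probabilities are invented) and computes the two reductions separately from conditional entropies, so that their agreement is a result of the calculation rather than of the formula used.

```python
from math import log2

# Hypothetical joint distribution over (laptop, rucksack) purchase indicators.
joint = {(1, 1): 0.20, (1, 0): 0.40, (0, 1): 0.01, (0, 0): 0.39}


def marginal(joint, axis):
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out


def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint, target, given):
    """H(target | given) = sum over g of p(g) * H(target | given = g)."""
    p_given = marginal(joint, given)
    total = 0.0
    for g, pg in p_given.items():
        cond = {}
        for outcome, p in joint.items():
            if outcome[given] == g:
                cond[outcome[target]] = cond.get(outcome[target], 0.0) + p / pg
        total += pg * entropy(cond)
    return total


LAPTOP, RUCKSACK = 0, 1
gain_about_rucksack = (
    entropy(marginal(joint, RUCKSACK)) - conditional_entropy(joint, RUCKSACK, LAPTOP)
)
gain_about_laptop = (
    entropy(marginal(joint, LAPTOP)) - conditional_entropy(joint, LAPTOP, RUCKSACK)
)
print(gain_about_rucksack, gain_about_laptop)
```

Both quantities equal the mutual information (up to floating-point error), so the table of probabilities alone carries no hint of which purchase causes the other.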

III-B The Reichenbach Principle: From Statistics to Causality

Reichenbach [Reichenbach1956] clearly articulated the connection between causality and statistical dependence. He postulated:

Common Cause Principle: if two observables $X$ and $Y$ are statistically dependent, then there exists a variable $Z$ that causally influences both and explains all the dependence in the sense of making them independent when conditioned on $Z$.

As a special case, this variable can coincide with $X$ or $Y$. Suppose that $X$ is the frequency of storks and $Y$ the human birth rate. If storks bring the babies, then the correct causal graph is $X \rightarrow Y$. If babies attract storks, it is $X \leftarrow Y$. If there is some other variable that causes both (such as economic development), we have $X \leftarrow Z \rightarrow Y$.

Without additional assumptions, we cannot distinguish these three cases using observational data. The class of observational distributions over $X$ and $Y$ that can be realized by these models is the same in all three cases. A causal model thus contains genuinely more information than a statistical one.
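As a worked illustration of why observational data cannot separate these cases, consider a linear Gaussian example (an assumption made here for concreteness; the coefficient $a$ and variance $\sigma^2$ are arbitrary). A model in the direction $X \rightarrow Y$ and a suitably parameterized model in the direction $X \leftarrow Y$ entail exactly the same joint distribution:

```latex
% Model A (X -> Y), with independent noises:
X := U_X, \quad U_X \sim \mathcal{N}(0, 1), \qquad
Y := aX + U_Y, \quad U_Y \sim \mathcal{N}(0, \sigma^2).
% Entailed moments:
\operatorname{Var}(X) = 1, \quad
\operatorname{Cov}(X, Y) = a, \quad
\operatorname{Var}(Y) = a^2 + \sigma^2.

% Model B (X <- Y), with independent noises:
Y := V_Y, \quad V_Y \sim \mathcal{N}(0, a^2 + \sigma^2), \qquad
X := bY + V_X, \quad b = \frac{a}{a^2 + \sigma^2}, \quad
V_X \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{a^2 + \sigma^2}\right).
% Entailed moments:
\operatorname{Var}(Y) = a^2 + \sigma^2, \quad
\operatorname{Cov}(X, Y) = b \operatorname{Var}(Y) = a, \quad
\operatorname{Var}(X) = b^2 (a^2 + \sigma^2) + \frac{\sigma^2}{a^2 + \sigma^2}
                     = \frac{a^2 + \sigma^2}{a^2 + \sigma^2} = 1.
```

Both models are zero-mean and jointly Gaussian with identical covariance, so they induce the same observational distribution. They differ, however, in what they predict under an intervention on $X$: in Model A, $Y$ responds with slope $a$; in Model B, $Y$ does not respond at all.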

While causal structure discovery is hard if we have only two observables [PetMooJanSch14], the case of more observables is surprisingly easier, the reason being that in that case, there are nontrivial conditional independence properties [Spohn78, Dawid79, GeiPea90] implied by causal structure. These generalize the Reichenbach Principle and can be described by using the language of causal graphs or structural causal models, merging probabilistic graphical models and the notion of interventions [Spirtes2000, Pearl2009]. They are best described using directed functional parent-child relationships rather than conditionals. While conceptually simple in hindsight, this constituted a major step in the understanding of causality.

III-C Structural Causal Models (SCMs)

The SCM viewpoint considers a set of observables (or variables) $X_1, \dots, X_n$ associated with the vertices of a directed acyclic graph (DAG). We assume that each observable is the result of an assignment

$$X_i := f_i(\mathbf{PA}_i, U_i), \quad (i = 1, \dots, n), \tag{3}$$

using a deterministic function $f_i$ depending on $X_i$’s parents in the graph (denoted by $\mathbf{PA}_i$) and on an unexplained random variable $U_i$. Mathematically, the observables are thus random variables, too. Directed edges in the graph represent direct causation, since the parents are connected to $X_i$ by directed edges and through (3) directly affect the assignment of $X_i$. The noise $U_i$ ensures that the overall object (3) can represent a general conditional distribution $P(X_i \mid \mathbf{PA}_i)$, and the set of noises $U_1, \dots, U_n$ are assumed to be jointly independent. If they were not, then by the Common Cause Principle there should be another variable that causes their dependence, and thus our model would not be causally sufficient.

If we specify the distributions of $U_1, \dots, U_n$, recursive application of (3) allows us to compute the entailed observational joint distribution $P(X_1, \dots, X_n)$. This distribution has structural properties inherited from the graph [Lauritzen1996, Pearl2009]: it satisfies the causal Markov condition stating that conditioned on its parents, each $X_j$ is independent of its non-descendants.

Intuitively, we can think of the independent noises as “information probes” that spread through the graph (much like independent elements of gossip can spread through a social network). Their information gets entangled, manifesting itself in a footprint of conditional dependencies making it possible to infer aspects of the graph structure from observational data using independence testing. Like in the gossip analogy, the footprint may not be sufficiently characteristic to pin down a unique causal structure. In particular, it certainly is not if there are only two observables, since any nontrivial conditional independence statement requires at least three variables. The two-variable problem can be addressed by making additional assumptions, as not only the graph topology leaves a footprint in the observational distribution, but the functions $f_i$ do, too. This point is interesting for machine learning, where much attention is devoted to properties of function classes (e.g., priors or capacity measures), and we shall return to it below.
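The footprint of conditional independences left by an SCM can be seen in a small simulation. The sketch below samples from a hypothetical linear chain with three variables and independent Gaussian noises (the coefficients are arbitrary) and uses partial correlation as a stand-in for an independence test, which is only justified here because the model is linear and Gaussian.

```python
import random
from math import sqrt


def sample_chain(n, seed=0):
    """Ancestral sampling from a hypothetical linear SCM X1 -> X2 -> X3
    with jointly independent standard normal noises U1, U2, U3."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u1, u2, u3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        x1 = u1                 # X1 := f1(U1)
        x2 = 0.8 * x1 + u2      # X2 := f2(X1, U2)
        x3 = 0.8 * x2 + u3      # X3 := f3(X2, U3)
        rows.append((x1, x2, x3))
    return rows


def corr(rows, a, b):
    n = len(rows)
    ma = sum(r[a] for r in rows) / n
    mb = sum(r[b] for r in rows) / n
    cov = sum((r[a] - ma) * (r[b] - mb) for r in rows)
    va = sum((r[a] - ma) ** 2 for r in rows)
    vb = sum((r[b] - mb) ** 2 for r in rows)
    return cov / sqrt(va * vb)


def partial_corr(rows, a, b, given):
    """Correlation of a and b after linearly removing the effect of `given`."""
    r_ab, r_ag, r_bg = corr(rows, a, b), corr(rows, a, given), corr(rows, b, given)
    return (r_ab - r_ag * r_bg) / sqrt((1 - r_ag ** 2) * (1 - r_bg ** 2))


rows = sample_chain(20000)
marginal_dependence = corr(rows, 0, 2)                # X1 and X3 are dependent
conditional_dependence = partial_corr(rows, 0, 2, 1)  # close to 0 given X2
print(marginal_dependence, conditional_dependence)
```

The first and third variables are clearly correlated, yet the correlation vanishes (up to sampling noise) once the middle variable is accounted for, in line with the causal Markov condition: given its parent, the third variable is independent of its non-descendant. The same footprint would also arise from the reversed chain or from a fork with the middle variable as common cause, which is why such tests alone may not pin down a unique structure.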

Causal Graphical Models

The graph structure along with the joint independence of the noises implies a canonical factorization of the joint distribution entailed by (3) into causal conditionals that we refer to as the causal (or disentangled) factorization,

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathbf{PA}_i). \tag{4}$$

While many other entangled factorizations are possible, e.g.,

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_{i+1}, \dots, X_n), \tag{5}$$

the factorization (4) yields practical computational advantages during inference, which is in general hard, even when it comes to non-trivial approximations [russell2002artificial]. But more interestingly, it is the only one that decomposes the joint distribution into conditionals corresponding to the structural assignments (3). We think of these as the causal mechanisms that are responsible for all statistical dependencies among the observables. Accordingly, in contrast to (5), the disentangled factorization represents the joint distribution as a product of causal mechanisms.
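A two-variable numerical example may help to see what distinguishes the causal factorization from an entangled one. The sketch below assumes a hypothetical binary model in which the first variable causes the second (the probability tables are invented). It first confirms that both factorizations reproduce the same joint distribution, and then changes only the mechanism of the cause to see which factors are affected.

```python
# Hypothetical binary SCM X1 -> X2, given by its causal conditionals.
p_x1 = {0: 0.7, 1: 0.3}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # indexed [x1][x2]


def joint_from_causal(p_x1, p_x2_given_x1):
    """Causal factorization: P(X1, X2) = P(X1) * P(X2 | X1)."""
    return {(a, b): p_x1[a] * p_x2_given_x1[a][b] for a in (0, 1) for b in (0, 1)}


def entangled_factors(joint):
    """Return P(X2) and P(X1 | X2), the factors of the non-causal ordering."""
    p_x2 = {b: joint[(0, b)] + joint[(1, b)] for b in (0, 1)}
    p_x1_given_x2 = {b: {a: joint[(a, b)] / p_x2[b] for a in (0, 1)} for b in (0, 1)}
    return p_x2, p_x1_given_x2


joint = joint_from_causal(p_x1, p_x2_given_x1)
p_x2, p_x1_given_x2 = entangled_factors(joint)

# Both factorizations reproduce the same joint distribution.
rebuilt = {(a, b): p_x2[b] * p_x1_given_x2[b][a] for a in (0, 1) for b in (0, 1)}
max_gap = max(abs(joint[k] - rebuilt[k]) for k in joint)

# Change only the mechanism for X1 and see which factors have to change.
shifted_joint = joint_from_causal({0: 0.2, 1: 0.8}, p_x2_given_x1)
shifted_p_x2, shifted_p_x1_given_x2 = entangled_factors(shifted_joint)
print(max_gap, p_x1_given_x2[1], shifted_p_x1_given_x2[1])
```

In this toy case, altering the distribution of the cause leaves the causal conditional of the effect untouched by construction, while both factors of the entangled factorization change. This is one way to read the statement that only the disentangled factorization is a product of causal mechanisms.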

Latent Variables and Confounders

Variables in a causal graph may be unobserved, which can make causal inference particularly challenging. Unobserved variables may confound two observed variables so that they either appear statistically related while not being causally related (i.e., neither of the variables is an ancestor of the other), or their statistical relation is altered by the presence of the confounder (e.g., one variable is a causal ancestor for the other, but the confounder is a causal ancestor of both). Confounders may or may not be known or observed.


The SCM language makes it straightforward to formalize interventions as operations that modify a subset of assignments (3), e.g., changing $U_i$, setting $f_i$ (and thus $X_i$) to a constant, or changing the functional form of $f_i$ (and thus the dependency of $X_i$ on its parents) [Spirtes2000, Pearl2009].

Several types of interventions may be possible [eaton2007exact], which can be categorized as:

  • No intervention: only observational data is obtained from the causal model.

  • Hard/perfect: the function in the structural assignment (3) of a variable (or, analogously, of multiple variables) is set to a constant (implying that the value of the variable is fixed), and then the entailed distribution for the modified SCM is computed.

  • Soft/imperfect: the structural assignment (3) for a variable is modified by changing the function or the noise term (this corresponds to changing the conditional distribution given its parents).

  • Uncertain: the learner is not sure which mechanism/variable is affected by the intervention.
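To make the distinction between observing and intervening concrete, the following sketch simulates a hypothetical confounded SCM, $Z := U_Z$, $X := 2Z + U_X$, $Y := X + 3Z + U_Y$, with standard normal noises. The function `sample`, its parameters `do_x` and `soft_slope`, and all coefficients are assumptions made for this illustration. For this toy model, $\mathbb{E}[Y \mid X = 1] = 2.2$ whereas $\mathbb{E}[Y \mid do(X = 1)] = 1$.

```python
import numpy as np

def sample(n, rng, do_x=None, soft_slope=None):
    """Sample the toy SCM  Z := U_Z,  X := 2 Z + U_X,  Y := X + 3 Z + U_Y.

    do_x:       if given, hard intervention: X is set to this constant
    soft_slope: if given, soft intervention: X := soft_slope * Z + U_X
    """
    z = rng.normal(size=n)
    if do_x is not None:
        x = np.full(n, float(do_x))        # the parents of X are cut off
    else:
        slope = 2.0 if soft_slope is None else soft_slope
        x = slope * z + rng.normal(size=n)
    y = x + 3.0 * z + rng.normal(size=n)   # the mechanism of Y is never modified
    return z, x, y

rng = np.random.default_rng(0)
n = 200_000

# Observational regime: condition on X being close to 1.
z, x, y = sample(n, rng)
obs_mean = y[np.abs(x - 1.0) < 0.1].mean()

# Hard intervention do(X = 1): the edge Z -> X is removed.
_, _, y_do = sample(n, rng, do_x=1.0)
do_mean = y_do.mean()

# Soft intervention: X still depends on Z, but through a changed mechanism.
z_soft, x_soft, _ = sample(n, rng, soft_slope=-1.0)
soft_corr = np.corrcoef(z_soft, x_soft)[0, 1]

print(f"E[Y | X ~ 1]     = {obs_mean:.2f}")   # confounded by Z
print(f"E[Y | do(X = 1)] = {do_mean:.2f}")
print(f"corr(Z, X) under the soft intervention = {soft_corr:.2f}")
```

Only the assignment of $X$ is touched in either intervention; the assignments of $Z$ and $Y$ are reused unchanged, which is what makes the modified SCM well defined.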

One could argue that stating the structural assignments as in (3) is not yet sufficient to formulate a causal model. In addition, one should specify the set of possible interventions on the structural causal model. This may be done implicitly via the functional form of structural equations by allowing any intervention over the domain of the mechanisms. This becomes relevant when learning a causal model from data, as the SCM depends on the interventions. Pragmatically, we should aim at learning causal models that are useful for specific sets of tasks of interest [Rubensteinetal17, weichwald2019pragmatism] on appropriate descriptors (in terms of which causal statements they support) that must either be provided or learned. We will return to the assumptions that allow learning causal models and features in Section IV.

Fig. 1: Difference between statistical (left) and causal models (right) on a given set of three variables. While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each possible intervention (indicated by a marker in the figure).

III-D Difference Between Statistical Models, Causal Graphical Models, and SCMs

An example of the difference between a statistical and a causal model is depicted in Figure 1. A statistical model may be defined for instance through a graphical model, i.e., a probability distribution along with a graph such that the former is Markovian with respect to the latter (in which case it can be factorized as (4)). However, the edges in a (generic) graphical model do not need to be causal [GuyonJS2010]. For instance, the two graphs $X \to Y \to Z$ and $X \leftarrow Y \leftarrow Z$ imply the same conditional independence(s) ($X$ and $Z$ are independent given $Y$). They are thus in the same Markov equivalence class, i.e., if a distribution is Markovian w.r.t. one of the graphs, then it also is w.r.t. the other graph. Note that the above serves as an example that the Markov condition is not sufficient for causal discovery. Further assumptions are needed, cf. below and [Spirtes2000, Pearl2009, PetJanSch17].
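The equivalence of the two chains can be checked exactly in the linear Gaussian case. The sketch below, with hypothetical coefficients and a helper `implied_cov` of our own, builds the covariance entailed by a forward chain and then constructs a reversed chain that entails the very same covariance, so that no amount of observational data could tell the two apart.

```python
import numpy as np

def implied_cov(B, noise_var):
    """Covariance of X solving X = B X + U with Cov(U) = diag(noise_var)."""
    n = B.shape[0]
    A = np.linalg.inv(np.eye(n) - B)
    return A @ np.diag(noise_var) @ A.T

# Variable order (X, Y, Z). Forward chain X -> Y -> Z, hypothetical coefficients.
a, b = 0.8, -1.5
B_fwd = np.zeros((3, 3))
B_fwd[1, 0] = a      # Y := a X + U_Y
B_fwd[2, 1] = b      # Z := b Y + U_Z
cov_fwd = implied_cov(B_fwd, noise_var=[1.0, 1.0, 1.0])

# Reverse chain X <- Y <- Z, with parameters read off from cov_fwd by regression.
var_y, var_z = cov_fwd[1, 1], cov_fwd[2, 2]
c = cov_fwd[1, 2] / var_z          # Y := c Z + V_Y
d = cov_fwd[0, 1] / var_y          # X := d Y + V_X
B_rev = np.zeros((3, 3))
B_rev[1, 2] = c
B_rev[0, 1] = d
noise_rev = [cov_fwd[0, 0] - d**2 * var_y,   # Var(V_X)
             var_y - c**2 * var_z,           # Var(V_Y)
             var_z]                          # Var(V_Z)
cov_rev = implied_cov(B_rev, noise_rev)

# Same zero-mean Gaussian distribution, two different graphs.
same_distribution = np.allclose(cov_fwd, cov_rev)

# A zero entry in the precision matrix encodes X independent of Z given Y.
precision_xz = np.linalg.inv(cov_fwd)[0, 2]
print(same_distribution, abs(precision_xz) < 1e-8)
```

The two models nevertheless disagree about interventions: setting $X$ changes $Z$ in the forward chain but not in the reversed one.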

A graphical model becomes causal if the edges of its graph are causal (in which case the graph is referred to as a “causal graph”), cf. (3). This allows us to compute interventional distributions as depicted in Figure 1. When a variable is intervened upon, we disconnect it from its parents, fix its value, and perform ancestral sampling on its children.

A structural causal model is composed of (i) a set of causal variables and (ii) a set of structural equations with a distribution over the noise variables (or a set of causal conditionals). While both causal graphical models and SCMs allow us to compute interventional distributions, only SCMs allow us to compute counterfactuals. To compute counterfactuals, we need to fix the value of the noise variables. Moreover, there are many ways to represent a conditional as a structural assignment (by picking different combinations of functions and noise variables).
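Fixing the noise is the step that separates a counterfactual from an intervention. As a small worked example, assume a hypothetical SCM with $Y := 2X + U_Y$ (the function `f_y` and the numbers are ours). The usual three steps are abduction, action, and prediction:

```python
def f_y(x, u_y):
    """Structural assignment for Y in a hypothetical SCM: Y := 2 X + U_Y."""
    return 2.0 * x + u_y

# Observed unit (factual world).
x_obs, y_obs = 1.0, 3.0

# 1. Abduction: recover the noise value consistent with the observation.
u_y = y_obs - 2.0 * x_obs          # inverts the additive mechanism, gives 1.0

# 2. Action: replace the assignment of X by the constant x' = 2.
x_cf = 2.0

# 3. Prediction: push the *same* noise through the unchanged mechanism.
y_cf = f_y(x_cf, u_y)
print(y_cf)   # 5.0
```

By contrast, the interventional quantity $\mathbb{E}[Y \mid do(X = 2)]$ averages over fresh draws of $U_Y$ and equals 4 if the noise has zero mean. A causal graphical model specifies only the conditional of $Y$ given $X$ and therefore cannot carry out the abduction step.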

Causal Learning and Reasoning

The conceptual basis of statistical learning is a joint distribution $P(X_1, \dots, X_n)$ (where often one of the $X_i$ is a response variable denoted as $Y$), and we make assumptions about function classes used to approximate, say, a regression $\mathbb{E}[Y \mid X]$. Causal learning considers a richer class of assumptions, and seeks to exploit the fact that the joint distribution possesses a causal factorization (4). It involves the causal conditionals $P(X_i \mid \mathbf{PA}_i)$ (e.g., represented by the functions $f_i$ and the distribution of $U_i$ in (3)), how these conditionals relate to each other, and interventions or changes that they admit. Once a causal model is available, either by external human knowledge or a learning process, causal reasoning allows us to draw conclusions on the effect of interventions, counterfactuals and potential outcomes. In contrast, statistical models only allow us to reason about the outcome of i.i.d. experiments.

IV Independent Causal Mechanisms

We now return to the disentangled factorization (4) of the joint distribution $P(X_1, \dots, X_n)$. This factorization according to the causal graph is always possible when the $U_i$ are independent, but we will now consider an additional notion of independence relating the factors in (4) to one another.

Whenever we perceive an object, our brain assumes that the object and the mechanism by which the information contained in its light reaches our brain are independent. We can violate this by looking at the object from an accidental viewpoint, which can give rise to optical illusions [PetJanSch17]. The above independence assumption is useful because in practice, it holds most of the time, and our brain thus relies on objects being independent of our vantage point and the illumination. Likewise, there should not be accidental coincidences, such as 3D structures lining up in 2D, or shadow boundaries coinciding with texture boundaries. In vision research, this is called the generic viewpoint assumption.

If we move around the object, our vantage point changes, but we assume that the other variables of the overall generative process (e.g., lighting, object position and structure) are unaffected by that. This is an invariance implied by the above independence, allowing us to infer 3D information even without stereo vision (“structure from motion”).

For another example, consider a dataset that consists of altitude $A$ and average annual temperature $T$ of weather stations [PetJanSch17]. $A$ and $T$ are correlated, which we believe is due to the fact that the altitude has a causal effect on temperature. Suppose we had two such datasets, one for Austria and one for Switzerland. The two joint distributions $P(A, T)$ may be rather different since the marginal distributions $P(A)$ over altitudes will differ. The conditionals $P(T \mid A)$, however, may be (close to) invariant, since they characterize the physical mechanisms that generate temperature from altitude. This similarity is lost on us if we only look at the overall joint distribution, without information about the causal structure $A \to T$. The causal factorization $P(A) P(T \mid A)$ will contain a component $P(T \mid A)$ that generalizes across countries, while the entangled factorization $P(T) P(A \mid T)$ will exhibit no such robustness. Cum grano salis, the same applies when we consider interventions in a system. For a model to correctly predict the effect of interventions, it needs to be robust to generalizing from an observational distribution to certain interventional distributions.
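The invariance of the causal conditional can be illustrated with synthetic data. The sketch below does not use real weather records: the two "countries", their altitude ranges, the lapse rate of 0.0065 degrees per metre, and the noise level are all made-up values, and a least-squares slope is used as a crude summary of each conditional.

```python
import numpy as np

def simulate_country(alt_low, alt_high, n, rng):
    """Altitude A (metres) from a country-specific marginal, temperature T
    (deg C) from a shared mechanism T := 15 - 0.0065 A + noise (made-up numbers)."""
    a = rng.uniform(alt_low, alt_high, size=n)
    t = 15.0 - 0.0065 * a + rng.normal(scale=1.0, size=n)
    return a, t

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.polyfit(x, y, deg=1)[0]

rng = np.random.default_rng(0)
a1, t1 = simulate_country(100.0, 800.0, 20_000, rng)     # hypothetical country 1
a2, t2 = simulate_country(400.0, 3000.0, 20_000, rng)    # hypothetical country 2

causal = (slope(a1, t1), slope(a2, t2))       # summarizes P(T | A)
anticausal = (slope(t1, a1), slope(t2, a2))   # summarizes P(A | T)

print("T on A:", np.round(causal, 4))         # nearly identical across countries
print("A on T:", np.round(anticausal, 1))     # differs, since it absorbs P(A)
```

The regression in the causal direction transfers between the two datasets, whereas the anticausal regression mixes the mechanism with the country-specific altitude distribution and therefore does not.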

One can express the above insights as follows [Schoelkopf2012, PetJanSch17]: Independent Causal Mechanisms (ICM) Principle. The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.

This principle entails several notions important to causality, including separate intervenability of causal variables, modularity and autonomy of subsystems, and invariance [Pearl2009, PetJanSch17]. If we have only two variables, it reduces to an independence between the cause distribution and the mechanism producing the effect distribution.

Applied to the causal factorization (4), the principle tells us that the factors should be independent in the sense that

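  • changing (or performing an intervention upon) one mechanism $P(X_i \mid \mathbf{PA}_i)$ does not change any of the other mechanisms $P(X_j \mid \mathbf{PA}_j)$ ($i \neq j$) [Schoelkopf2012], and

  • knowing some other mechanisms $P(X_i \mid \mathbf{PA}_i)$ ($i \neq j$) does not give us information about a mechanism $P(X_j \mid \mathbf{PA}_j)$ [JanSch10].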

This notion of independence thus subsumes two aspects: the former pertaining to influence, and the latter to information.

The notion of invariant, autonomous, and independent mechanisms has appeared in various guises throughout the history of causality research [Haavelmo1944, Frisch1948, Hoover06, Pearl2009, JanSch10, Steudel2010a, PetJanSch17]. Early work on this was done by Haavelmo1944, stating the assumption that changing one of the structural assignments leaves the other ones invariant. Hoover06 attributes to Herb Simon the invariance criterion: the true causal order is the one that is invariant under the right sort of intervention. Aldrich89 discusses the historical development of these ideas in economics. He argues that the “most basic question one can ask about a relation should be: How autonomous is it?” [Frisch1948, preface]. Pearl2009 discusses autonomy in detail, arguing that a causal mechanism remains invariant when other mechanisms are subjected to external influences. He points out that causal discovery methods may best work “in longitudinal studies conducted under slightly varying conditions, where accidental independencies are destroyed and only structural independencies are preserved.” Overviews are provided by Aldrich89, Hoover06, Pearl2009, and PetJanSch17. These seemingly different notions can be unified [JanSch10, Steudel2010a].

We view any real-world distribution as a product of causal mechanisms. A change in such a distribution (e.g., when moving from one setting/domain to a related one) will always be due to changes in at least one of those mechanisms. Consistent with the first implication of the ICM Principle (changing one mechanism does not change the others), we state the following hypothesis:

Sparse Mechanism Shift (SMS). Small distribution changes tend to manifest themselves in a sparse or local way in the causal/disentangled factorization (4), i.e., they should usually not affect all factors simultaneously.

In contrast, if we consider a non-causal factorization, e.g., (5), then many, if not all, terms will be affected simultaneously as we change one of the physical mechanisms responsible for a system’s statistical dependencies. Such a factorization may thus be called entangled, a term that has gained popularity in machine learning [1206.5538, higgins2016beta, 1811.12359, Suter.1811.00007].
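A two-variable discrete example makes the contrast explicit. In the sketch below (probability tables and helper names are hypothetical), only the cause distribution $P(X)$ is shifted while the mechanism $P(Y \mid X)$ is held fixed; we then count how many factors change in the causal factorization $P(X) P(Y \mid X)$ and in the entangled one $P(Y) P(X \mid Y)$.

```python
import numpy as np

def factorizations(p_x, p_y_given_x):
    """Return causal factors (P(X), P(Y|X)) and entangled factors (P(Y), P(X|Y))."""
    joint = p_x[:, None] * p_y_given_x          # joint[x, y] = P(X=x) P(Y=y | X=x)
    p_y = joint.sum(axis=0)
    p_x_given_y = (joint / p_y).T               # row y, column x
    return (p_x, p_y_given_x), (p_y, p_x_given_y)

def n_changed(factors_a, factors_b):
    """Number of factors that differ between two factorizations."""
    return sum(not np.allclose(fa, fb) for fa, fb in zip(factors_a, factors_b))

p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])            # mechanism, unchanged throughout
before = factorizations(np.array([0.7, 0.3]), p_y_given_x)
after = factorizations(np.array([0.3, 0.7]), p_y_given_x)   # only P(X) shifts

print("changed causal factors:   ", n_changed(before[0], after[0]))   # 1
print("changed entangled factors:", n_changed(before[1], after[1]))   # 2
```

With more variables the gap widens: a single mechanism change still touches one causal factor, while an entangled factorization such as (5) can require re-estimating most of its terms.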

The SMS hypothesis was stated in [ParKilRojSch18, bengio2019meta, 1911.10500, JMLR:v21:19-232], and in earlier form in [Schoelkopf2012, zhang_domain_2013, SchJanLop16]. An intellectual ancestor is Simon’s invariance criterion, i.e., that the causal structure remains invariant across changing background conditions [Simon53]. The hypothesis is also related to ideas of looking for features that vary slowly [foldiak1991learning, Wiskott2002]. It has recently been used for learning causal models [ke2019learning], modular architectures [RIMs, BesSunJanSch21] and disentangled representations [locatello2020weakly].

We have informally talked about the dependence of two mechanisms $P(X_i \mid \mathbf{PA}_i)$ and $P(X_j \mid \mathbf{PA}_j)$ when discussing the ICM Principle and the disentangled factorization (4). Note that the dependence of two such mechanisms does not coincide with the statistical dependence of the random variables $X_i$ and $X_j$. Indeed, in a causal graph, many of the random variables will be dependent even if all mechanisms are independent. Also, the independence of the noise terms $U_i$ does not translate into the independence of the $X_i$. Intuitively speaking, the independent noise terms $U_i$ provide and parameterize the uncertainty contained in the fact that a mechanism $P(X_i \mid \mathbf{PA}_i)$ is non-deterministic (in the sense that the mapping from $\mathbf{PA}_i$ to $X_i$ is described by a non-trivial conditional distribution, rather than by a function), and thus ensure that each mechanism adds an independent element of uncertainty. In this sense, the ICM Principle contains the independence of the unexplained noise terms in an SCM (3) as a special case.

In the ICM Principle, we have stated that independence of two mechanisms (formalized as conditional distributions) should mean that the two conditional distributions do not inform or influence each other. The latter can be thought of as requiring that independent interventions are possible. To better understand the former, we next discuss a formalization in terms of algorithmic independence. In a nutshell, we encode each mechanism as a bit string, and require that joint compression of these strings does not save space relative to independent compressions.

To this end, first recall that we have so far discussed links between causal and statistical structures. Of the two, the more fundamental one is the causal structure, since it captures the physical mechanisms that generate statistical dependencies in the first place. The statistical structure is an epiphenomenon that follows if we make the unexplained variables random. It is awkward to talk about statistical information contained in a mechanism since deterministic functions in the generic case neither generate nor destroy information. This serves as a motivation to devise an alternative model of causal structures in terms of Kolmogorov complexity [JanSch10]. The Kolmogorov complexity (or algorithmic information) of a bit string is essentially the length of its shortest compression on a Turing machine, and thus a measure of its information content. Independence of mechanisms can be defined as vanishing mutual algorithmic information; i.e., two conditionals are considered independent if knowing (the shortest compression of) one does not help us achieve a shorter compression of the other.
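Kolmogorov complexity is not computable, but the idea of comparing joint and separate compression can be imitated with an off-the-shelf compressor. The sketch below uses `zlib` compressed length as a crude upper bound; this substitution is our assumption for illustration and not the formalization of [JanSch10]. The byte strings are arbitrary stand-ins for encoded mechanisms.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable upper bound on K(data)."""
    return len(zlib.compress(data, 9))

def shared_information(s: bytes, t: bytes) -> int:
    """Bytes saved by compressing s and t jointly rather than separately."""
    return c(s) + c(t) - c(s + t)

rng = random.Random(0)

def random_bytes(n: int) -> bytes:
    return bytes(rng.randrange(256) for _ in range(n))

mech_a = random_bytes(4000)                    # stand-in for one encoded mechanism
mech_b = random_bytes(4000)                    # generated independently of mech_a
mech_c = mech_a[:3000] + random_bytes(1000)    # shares most of its description with mech_a

print("independent pair:", shared_information(mech_a, mech_b))   # close to zero
print("dependent pair:  ", shared_information(mech_a, mech_c))   # thousands of bytes
```

Joint compression saves almost nothing for the independently generated pair, which is the compression analogue of vanishing mutual algorithmic information, while the pair with shared content compresses much better together.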

Algorithmic information theory provides a natural framework for non-statistical graphical models [JanSch10, JanChaSch16]. Just like the latter are obtained from structural causal models by making the unexplained variables $U_i$ random, we obtain algorithmic graphical models by making the $U_i$ bit strings, jointly independent across nodes, and viewing $X_i$ as the output of a fixed Turing machine running the program $U_i$ on the input $\mathbf{PA}_i$. Similar to the statistical case, one can define a local causal Markov condition, a global one in terms of d-separation, and an additive decomposition of the joint Kolmogorov complexity in analogy to (4), and prove that they are implied by the structural causal model [JanSch10]. Interestingly, in this case, independence of noises and independence of mechanisms coincide, since the independent programs play the role of the unexplained noise terms. This approach shows that causality is not intrinsically bound to statistics.

V Causal Discovery and Machine Learning

Let us turn to the problem of causal discovery from data. Subject to suitable assumptions such as faithfulness [Spirtes2000], one can sometimes recover aspects of the underlying graph from observational data by performing conditional independence tests (more precisely, one can recover the causal structure up to a Markov equivalence class, where DAGs have the same undirected skeleton and “immoralities” $X_i \to X_j \leftarrow X_k$). However, there are several problems with this approach. One is that our datasets are always finite in practice, and conditional independence testing is a notoriously difficult problem, especially if conditioning sets are continuous and multi-dimensional. So while, in principle, the conditional independencies implied by the causal Markov condition hold irrespective of the complexity of the functions appearing in an SCM, for finite datasets, conditional independence testing is hard without additional assumptions [1804.07203]. Recent progress in (conditional) independence testing heavily relies on kernel function classes to represent probability distributions in reproducing kernel Hilbert spaces [Gretton2005, Gretton2005JMLR, Fukumizu2008, Zhang2011uai, DorMuaZhaSch14, PfiBuhSchPet18, 1804.02747]. The other problem is that in the case of only two variables, the ternary concept of conditional independence collapses and the Markov condition thus has no nontrivial implications.
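The footprint that such tests look for can be seen in a simulated collider. The sketch below assumes a hypothetical linear Gaussian model $X \to Z \leftarrow Y$ and uses partial correlation, which is a valid conditional independence measure only under these linear Gaussian assumptions; the kernel-based tests cited above are needed in general. The helper `residual` is ours.

```python
import numpy as np

def residual(target, given):
    """Residual of a least-squares regression of `target` on `given`."""
    design = np.column_stack([np.ones_like(given), given])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
y = rng.normal(size=n)                    # independent of x
z = x + y + 0.5 * rng.normal(size=n)      # collider: X -> Z <- Y

marginal = np.corrcoef(x, y)[0, 1]
partial = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print(f"corr(X, Y)     = {marginal:+.3f}")   # close to zero
print(f"corr(X, Y | Z) = {partial:+.3f}")    # strongly negative
```

The pattern "independent marginally, dependent given the common effect" is specific to the immorality and is what allows its edges to be oriented, whereas chains and forks show the opposite pattern and remain indistinguishable from one another.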

It turns out that both problems can be addressed by making assumptions on function classes. This is typical for machine learning, where it is well-known that finite-sample generalization without assumptions on function classes is impossible. Specifically, although there are universally consistent learning algorithms, i.e., approaching minimal expected error in the infinite sample limit, there are always cases where this convergence is arbitrarily slow. So for a given sample size, it will depend on the problem being learned whether we achieve low expected error, and statistical learning theory provides probabilistic guarantees in terms of measures of complexity of function classes [DevGyoLug96, Vapnik98].

Returning to causality, we provide an intuition why assumptions on the functions in an SCM should be necessary to learn about them from data. Consider a toy SCM with only two observables $X \to Y$. In this case, (3) turns into

$$X = U, \tag{6}$$
$$Y = f(X, V), \tag{7}$$

with $U \perp\!\!\!\perp V$. Now think of $V$ acting as a random selector variable choosing from among a set of functions $\mathcal{F} = \{ f_v(x) \equiv f(x, v) \mid v \in \mathrm{supp}(V) \}$. If $f(x, v)$ depends on $v$ in a non-smooth way, it should be hard to glean information about the SCM from a finite dataset, given that $V$ is not observed and its value randomly selects among arbitrarily different $f_v$.

This motivates restricting the complexity with which $f$ depends on $V$. A natural restriction is to assume an additive noise model

$$X = U, \tag{8}$$
$$Y = f(X) + V. \tag{9}$$

If $f(x, v)$ in (7) depends smoothly on $v$, and if $V$ is relatively well concentrated, this can be motivated by a local Taylor expansion argument. It drastically reduces the effective size of the function class; without such assumptions, the latter could depend exponentially on the cardinality of the support of $V$. Restrictions of function classes not only make it easier to learn functions from data, but it turns out that they can break the symmetry between cause and effect in the two-variable case: one can show that given a distribution over $X, Y$ generated by an additive noise model, one cannot fit an additive noise model in the opposite direction (i.e., with the roles of $X$ and $Y$ interchanged) [Hoyer2008, Mooij2009, PetMooJanSch14, Kpotufe14, BauSchPet16], cf. also [Sun2006]. This is subject to certain genericity assumptions, and notable exceptions include the case where $U, V$ are Gaussian and $f$ is linear. It generalizes results of Shimizu2006 for linear functions, and it can be generalized to include non-linear rescalings [Zhang2009], loops [Mooij11], confounders [Janzing2009uai], and multi-variable settings [Peters2011b]. Empirically, there are a number of methods that can detect causal direction better than chance [Mooijetal16], some of them building on the above Kolmogorov complexity model [Vreeken], some on generative models [1711.08936], and some directly learning to classify bivariate distributions into causal vs. anticausal [LopMuaSchTol15].
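The asymmetry can be probed on simulated data by regressing in both directions and measuring how strongly the residual depends on the putative cause. The sketch below is a simplified illustration rather than any of the cited methods: the data-generating model ($Y = X^3 + V$ with uniform noise), the polynomial regressor, and the biased HSIC statistic with a median-heuristic bandwidth are all choices we make for the example, and the scores are compared directly instead of being turned into a calibrated test.

```python
import numpy as np

def hsic(a, b):
    """Biased HSIC estimate with Gaussian kernels (median-heuristic bandwidth)."""
    n = len(a)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / np.median(d2[d2 > 0]))
    h = np.eye(n) - 1.0 / n                      # centering matrix
    return np.trace(gram(a) @ h @ gram(b) @ h) / n**2

def anm_score(cause, effect, degree=5):
    """Fit effect = g(cause) + residual by polynomial least squares and
    return the dependence between the residual and the putative cause."""
    coeffs = np.polyfit(cause, effect, degree)
    resid = effect - np.polyval(coeffs, cause)
    return hsic(cause, resid)

def standardize(v):
    return (v - v.mean()) / v.std()

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-2.0, 2.0, size=n)
y = x**3 + rng.uniform(-1.0, 1.0, size=n)        # ground truth: X -> Y, additive noise
x, y = standardize(x), standardize(y)

forward = anm_score(x, y)     # residual should look independent of x
backward = anm_score(y, x)    # residual typically stays dependent on y
print("inferred direction:", "X -> Y" if forward < backward else "Y -> X")
```

In the backward direction the spread of the residual varies with $Y$, so no additive noise model fits, and the dependence score stays higher than in the causal direction.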

While restricting function classes is one way to make the causal structure identifiable, other assumptions or scenarios are possible. So far, we have discussed that causal models are expected to generalize under certain distribution shifts since they explicitly model interventions. By the SMS hypothesis, much of the causal structure is assumed to remain invariant. Hence distribution shifts such as observing a system in different “environments / contexts” can significantly help to identify causal structure [Tian2001, PetJanSch17]. These contexts can come from interventions [Schoelkopf2012, peters2016causal, pfister2019learning], non-stationary time series [hyvarinen2017nonlinear, halva2020hidden, pfister2019invariant] or multiple views [gresele2019incomplete, JMLR:v21:19-232]. The contexts can likewise be interpreted as different tasks, which provide a connection to meta-learning [bengio1990learning, finn2017model, schmidhuber1987evolutionary].

The work of bengio2019meta ties the generalization in meta-learning to invariance properties of causal models, using the idea that a causal model should adapt faster to interventions than purely predictive models. This was extended to multiple variables and unknown interventions in [ke2019learning], proposing a framework for causal discovery using neural networks by turning the discrete graph search into a continuous optimization problem. While [bengio2019meta, ke2019learning] focus on learning a causal model using neural networks with an unsupervised loss, the work of dasgupta2019causal explores learning a causal model using a reinforcement learning agent. These approaches have in common that semantically meaningful abstract representations are given and do not need to be learned from high-dimensional and low-level (e.g., pixel) data.

Fig. 2: Illustration of the causal representation learning problem setting. Perceptual data, such as images or other high-dimensional sensor measurements, can be thought of as entangled views of the state of an unknown causal system as described in (10). With the exception of possible task labels, none of the variables describing the causal variables generating the system may be known. The goal of causal representation learning is to learn a representation (partially) exposing this unknown causal structure (e.g., which variables describe the system, and their relations). As full recovery may often be unreasonable, neural networks may map the low-level features to some high-level variables supporting causal statements relevant to a set of downstream tasks of interest. For example, if the task is to detect the manipulable objects in a scene, the representation may separate intrinsic object properties from their pose and appearance to achieve robustness to distribution shifts on the latter variables. Usually, we do not get labels for the high-level variables, but the properties of causal models can serve as useful inductive biases for learning (e.g., the SMS hypothesis).
Fig. 3: Example of the SMS hypothesis where an intervention (which may or may not be intentional/observed) changes the position of one finger, and as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.

VI Learning Causal Variables

Traditional causal discovery and reasoning assume that the units are random variables connected by a causal graph. However, real-world observations are usually not structured into those units to begin with, e.g., objects in images [LopNisChiSchBot17]. Hence, the emerging field of causal representation learning strives to learn these variables from data, much like machine learning went beyond symbolic AI in not requiring that the symbols that algorithms manipulate be given a priori (cf. geffner). To this end, we could try to connect causal variables $S_1, \dots, S_n$ to observations

$$X = G(S_1, \dots, S_n), \tag{10}$$

where G is a non-linear function. An example can be seen in Figure 2, where high-dimensional observations are the result of a view on the state of a causal system that is then processed by a neural network to extract high-level variables that are useful on a variety of tasks. Although causal models in economics, medicine, or psychology often use variables that are abstractions of underlying quantities, it is challenging to state general conditions under which coarse-grained variables admit causal models with well-defined interventions [1512.07942, Rubensteinetal17]. Defining objects or variables that can be causally related amounts to coarse-graining of more detailed models of the world, including microscopic structural equation models [Rubensteinetal17], ordinary differential equations [MooijJ2013, RubBonMooSch18], and temporally aggregated time series [Gongetal17]. The task of identifying suitable units that admit causal models is challenging for both human and machine intelligence. Still, it aligns with the general goal of modern machine learning to learn meaningful representations of data, where meaningful can include robust, explainable, or fair [NIPS2017_6995, Kilbertusetal17, ZhaBar18, karimi2020algorithmic, vonkugelgen2020fairness].

To combine structural causal modeling (3) and representation learning, we should strive to embed an SCM into larger machine learning models whose inputs and outputs may be high-dimensional and unstructured, but whose inner workings are at least partly governed by an SCM (that can be parameterized with a neural network). The result may be a modular architecture, where the different modules can be individually fine-tuned and re-purposed for new tasks [ParKilRojSch18, RIMs] and the SMS hypothesis can be used to enforce the appropriate structure. We visualize an example in Figure 3 where changes are sparse for the appropriate causal variables (the position of the finger and the cube changed as a result of moving the finger), but dense in other representations, for example in the pixel space (as finger and cube move, many pixels change their value). At the extreme, all pixels may change as a result of a sparse intervention, for example, if the camera view or the lighting changes.

We now discuss three problems of modern machine learning in the light of causal representation learning.

Problem 1 – Learning Disentangled Representations

We have earlier discussed the ICM Principle implying both the independence of the SCM noise terms in (3) and thus the feasibility of the disentangled representation

$$P(S_1, \dots, S_n) = \prod_{i=1}^{n} P(S_i \mid \mathbf{PA}_i) \tag{11}$$

as well as the property that the conditionals $P(S_i \mid \mathbf{PA}_i)$ be independently manipulable and largely invariant across related problems. Suppose we seek to reconstruct such a disentangled representation using independent mechanisms (11) from data, but the causal variables $S_i$ are not provided to us a priori. Rather, we are given (possibly high-dimensional) $X = (X_1, \dots, X_d)$ (below, we think of $X$ as an image with pixels $X_1, \dots, X_d$) as in (10), from which we should construct causal variables $S_1, \dots, S_n$ ($n \ll d$) as well as mechanisms, cf. (3),

$$S_i := f_i(\mathbf{PA}_i, U_i), \quad (i = 1, \dots, n), \tag{12}$$

modeling the causal relationships among the $S_i$. To this end, as a first step, we can use an encoder $q: \mathbb{R}^d \to \mathbb{R}^n$ taking $X$ to a latent “bottleneck” representation comprising the unexplained noise variables $U = (U_1, \dots, U_n)$. The next step is the mapping $f(U)$ determined by the structural assignments $f_1, \dots, f_n$. Finally, we apply a decoder $p: \mathbb{R}^n \to \mathbb{R}^d$. For suitable $n$, the system can be trained using reconstruction error to satisfy $p \circ f \circ q = \mathrm{id}$ on the observed images. If the causal graph is known, the topology of a neural network implementing $f$ can be fixed accordingly; if not, the neural network decoder learns the composition $\tilde{p} = p \circ f$. In practice, one may not know $f$, and thus only learn an autoencoder $\tilde{p} \circ q$, where the causal graph effectively becomes an unspecified part of the decoder $\tilde{p}$, possibly aided by a suitable choice of architecture [Leeb-SAE].
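The three-stage pipeline (encoder, structural assignments, decoder) can be sketched as a forward pass. The code below is only a structural illustration under strong assumptions: the graph, the sizes, the tanh-linear form of the mechanisms, and all weights are hypothetical and untrained, and no reconstruction loss is optimized. It shows how a known graph fixes the wiring of the middle stage.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 3                        # pixels and causal variables (n << d), hypothetical sizes

# Hypothetical causal graph S_1 -> S_2 -> S_3 (stored with indices 0, 1, 2).
parents = {0: [], 1: [0], 2: [1]}

W_enc = rng.normal(size=(n, d)) / np.sqrt(d)    # encoder weights (random, untrained)
W_dec = rng.normal(size=(d, n))                 # decoder weights (random, untrained)
w_mech = {i: rng.normal(size=len(pa)) for i, pa in parents.items()}

def encode(x):
    """Encoder: map an observation to the noise variables U."""
    return W_enc @ x

def mechanisms(u):
    """Apply S_i := f_i(PA_i, U_i) in topological order (here tanh-linear plus U_i)."""
    s = np.zeros(n)
    for i in range(n):                           # 0, 1, 2 is a topological order
        pa = np.array(parents[i], dtype=int)
        s[i] = np.tanh(w_mech[i] @ s[pa]) + u[i]
    return s

def decode(s):
    """Decoder: render the causal variables back to observation space."""
    return W_dec @ s

x = rng.normal(size=d)
x_hat = decode(mechanisms(encode(x)))            # training would push x_hat towards x

# Perturbing only the noise of the last variable leaves its non-descendants untouched.
u = encode(x)
u_int = u.copy()
u_int[2] += 1.0
changed = mechanisms(u_int) != mechanisms(u)
print(changed)                                   # only the last entry is True
```

If the graph is unknown, the middle stage cannot be wired in this way, and the decoder has to absorb the mechanisms, which corresponds to learning the composition of decoder and structural assignments as a single map.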

Much of the existing work on disentanglement [higgins2016beta, 1811.12359, locatello2020weakly, van2019disentangled, locatello2019fairness, kim2018disentangling, ridgeway2018learning, eastwood2018framework] focuses on independent factors of variation. This can be viewed as the special case where the causal graph is trivial, i.e., $\forall i: \mathbf{PA}_i = \emptyset$ in (12). In this case, the factors are functions of the independent exogenous noise variables, and thus independent themselves. (For an example to see why this is often not desirable, note that the presence of fork and knife may be statistically dependent, yet we might want a disentangled representation to represent them as separate entities.) However, the ICM Principle is more general and contains statistical independence as a special case.

Note that the problem of object-centric representation learning [bapst2019structured, burgess2019monet, goyal2020object, greff2019multi, greff2020binding, kosiorek2018sequential, lin2019space, locatello2020object, Julius_ECON, van2018relational] can also be considered a special case of disentangled factorization as discussed here. Objects are constituents of scenes that in principle permit separate interventions. A disentangled representation of a scene containing objects should probably use objects as some of the building blocks of an overall causal factorization, complemented by mechanisms such as orientation, viewing direction, and lighting. (Objects can be represented at different levels of granularity [Rubensteinetal17], i.e., as a single entity or as a composition of other causal variables encoding parts, properties, and other factors of variation.)

The problem of recovering the exogenous noise variables is ill-defined in the i.i.d. case as there are infinitely many equivalent solutions yielding the same observational distribution [1811.12359, hyvarinen1999nonlinear, PetJanSch17]. Additional assumptions or biases can help favor certain solutions over others [1811.12359, rolinek2019variational]. Leeb-SAE propose a structured decoder that embeds an SCM and automatically learns a hierarchy of disentangled factors.

To make (12) causal, we can use the ICM Principle, i.e., we should make the $U_i$ statistically independent, and we should make the mechanisms independent. This could be done by ensuring that they are invariant across problems, exhibit sparse changes to actions, or that they can be independently intervened upon [1911.10500, 1703.07718, 1812.03253]. locatello2020weakly showed that the sparse mechanism shift hypothesis stated above is theoretically sufficient when given suitable training data. Further, the SMS hypothesis can be used as a supervision signal in practice even if $\mathbf{PA}_i \neq \emptyset$ [trauble2020independence]. However, which factors of variation can be disentangled depends on which interventions can be observed [shu2019weakly, locatello2020weakly]. As discussed by SchJanLop16, shu2019weakly, different supervision signals may be used to identify subsets of factors. Similarly, when learning causal variables from data, which variables can be extracted and their granularity depend on which distribution shifts, explicit interventions, and other supervision signals are available.

Problem 2 – Learning Transferable Mechanisms

An artificial or natural agent in a complex world is faced with limited resources. This concerns training data, i.e., we only have limited data for each task/domain, and thus need to find ways of pooling/re-using data, in stark contrast to the current industry practice of large-scale labeling work done by humans. It also concerns computational resources: animals have constraints on the size of their brains, and evolutionary neuroscience knows many examples where brain regions get re-purposed. Similar constraints on size and energy apply as ML methods get embedded in (small) physical devices that may be battery-powered. Future AI models that robustly solve a range of problems in the real world will thus likely need to re-use components, which requires them to be robust across tasks and environments [SchJanLop16]. An elegant way to do this is to employ a modular structure that mirrors a corresponding modularity in the world. In other words, if the world is indeed modular, in the sense that components/mechanisms of the world play roles across a range of environments, tasks, and settings, then it would be prudent for a model to employ corresponding modules [RIMs]. For instance, if variations of natural lighting (the position of the sun, clouds, etc.) imply that the visual environment can appear in brightness conditions spanning several orders of magnitude, then visual processing algorithms in our nervous system should employ methods that can factor out these variations, rather than building separate sets of face recognizers, say, for every lighting condition. If, for example, our nervous system were to compensate for the lighting changes by a gain control mechanism, then this mechanism in itself need not have anything to do with the physical mechanisms bringing about brightness differences. However, it would play a role in a modular structure that corresponds to the role that the physical mechanisms play in the world’s modular structure. This could produce a bias towards models that exhibit certain forms of structural homomorphism to a world that we cannot directly recognize, which would be rather intriguing, given that ultimately our brains do nothing but turn neuronal signals into other neuronal signals. A sensible inductive bias to learn such models is to look for independent causal mechanisms [ParRojKilSch17], and competitive training can play a role in this. For pattern recognition tasks, [ParKilRojSch18, RIMs] suggest that learning causal models that contain independent mechanisms may help in transferring modules across substantially different domains.

Problem 3 – Learning Interventional World Models and Reasoning

Deep learning excels at learning representations of data that preserve relevant statistical properties [1206.5538, LeCBenHin15]. However, it does so without taking into account the causal properties of the variables, i.e., it does not care about the interventional properties of the variables it analyzes or reconstructs. Causal representation learning should move beyond the representation of statistical dependence structures towards models that support intervention, planning, and reasoning, realizing Konrad Lorenz’ notion of thinking as acting in an imagined space [Lorenz73]. This ultimately requires the ability to reflect back on one’s actions and envision alternative scenarios, possibly necessitating (the illusion of) free will [Pearl2009forbes]. The biological function of self-consciousness may be related to the need for a variable representing oneself in one’s Lorenzian imagined space, and free will may then be a means to communicate about actions taken by that variable, crucial for social and cultural learning, a topic which has not yet entered the stage of machine learning research although it is at the core of human intelligence [Henrich].

VII Implications for Machine Learning

All of this discussion calls for a learning paradigm that does not rest on the usual i.i.d. assumption. Instead, we wish to make a weaker assumption: that the data on which the model will be applied comes from a possibly different distribution, but involving (mostly) the same causal mechanisms [PetJanSch17]. This raises serious challenges: (a) in many cases, we need to infer abstract causal variables from the available low-level input features; (b) there is no consensus on which aspects of the data reveal causal relations; (c) the usual experimental protocol of training and test set may not be sufficient for inferring and evaluating causal relations on existing data sets, and we may need to create new benchmarks, for example with access to environment information and interventions; (d) even in the limited cases we understand, we often lack scalable and numerically sound algorithms. Despite these challenges, we argue this endeavor has concrete implications for machine learning and may shed light on desiderata and current practices alike.

VII-A Semi-Supervised Learning (SSL)

Suppose our underlying causal graph is $X \to Y$, and at the same time we are trying to learn a mapping $X \to Y$. The causal factorization (4) for this case is

$$P(X, Y) = P(X)\, P(Y \mid X).$$

The ICM Principle posits that the modules in a joint distribution’s causal decomposition do not inform or influence each other. This means that, in particular, $P(X)$ should contain no information about $P(Y \mid X)$, which implies that SSL should be futile, insofar as it uses additional information about $P(X)$ (from unlabelled data) to improve our estimate of $P(Y \mid X)$.

In the opposite (anticausal) direction (i.e., the direction of prediction is opposite to the causal generative process), however, SSL may be possible. To see this, we refer to Daniusisetal10 who define a measure of dependence between input $P(X)$ and conditional $P(Y \mid X)$. (Other dependence measures have been proposed for high-dimensional linear settings and time series [JanHoySch10, Shajarisales15, BesShaSchJan18, JanSch18b, Janzing_NIPS2019, Janzingetal12].) Assuming that this measure is zero in the causal direction (applying the ICM assumption described in Section IV to the two-variable case), they show that it is strictly positive in the anticausal direction. Applied to SSL in the anticausal direction, this implies that the distribution of the input (now: effect) variable should contain information about the conditional of output (cause) given input, i.e., the quantity that machine learning is usually concerned with.
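The asymmetry can be made concrete with a short derivation. The following is only a sketch in our own notation, assuming a discrete cause $Y$ and the anticausal graph $Y \to X$, so that the independent modules are $P(Y)$ and $P(X \mid Y)$:

```latex
% Anticausal setting: Y (cause) -> X (effect).
% The independent modules are P(Y) and P(X | Y).
\begin{align}
P(X) &= \sum_{y} P(X \mid Y = y)\, P(Y = y), \\
P(Y \mid X) &= \frac{P(X \mid Y)\, P(Y)}{\sum_{y} P(X \mid Y = y)\, P(Y = y)} .
\end{align}
```

Both the input marginal and the predictive conditional are functions of the same two modules, so unlabelled samples from $P(X)$ can in general constrain $P(Y \mid X)$. In the causal direction $X \to Y$, by contrast, $P(X)$ and $P(Y \mid X)$ are themselves the modules, which the ICM Principle takes to be independent.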

The study [Schoelkopf2012] empirically corroborated these predictions, thus establishing an intriguing bridge between the structure of learning problems and certain physical properties (cause-effect direction) of real-world data generating processes. It also led to a range of follow-up work [zhang_domain_2013, WeiSchBalGro14, zhang_multi-source_2015, GonZhaLiuTaoSch16, HuaZhaZhaSanGlySch17, Zhangetal17, 1610.03263, 1809.09337, 1903.06256, 1812.04597, 1810.11953, 1807.08479, 1802.03916, Li_2018_ECCV, 1707.06422, RojSchTurPet18, JMLR:v21:19-232], complementing the studies of Bareinboim and Pearl [Bareinboim2014, 1503.01603], and it inspired a thread of work in the statistics community exploiting invariance for causal discovery and other tasks [peters2016causal, pfister2019learning, 1706.08576, 1710.11469, JMLR:v21:19-232].

On the SSL side, subsequent developments include further theoretical analyses [JanSch15, PetJanSch17, Section 5.1.2] and a form of conditional SSL [KugMeyLooSch19]. The view of SSL as exploiting dependencies between a marginal and a non-causal conditional is consistent with the common assumptions employed to justify SSL [ChaSchZie06]. The cluster assumption asserts that the labeling function (which is a property of $P(Y \mid X)$) should not change within clusters of $P(X)$. The low-density separation assumption posits that the area where $P(Y \mid X)$ takes the value of 0.5 should have small $P(X)$; and the semi-supervised smoothness assumption, applicable also to continuous outputs, states that if two points in a high-density region are close, then so should be the corresponding output values. Note, moreover, that some of the theoretical results in the field use assumptions well-known from causal graphs (even if they do not mention causality): the co-training theorem [BluMit98] makes a statement about learnability from unlabelled data, and relies on an assumption of predictors being conditionally independent given the label, which we would normally expect if the predictors are (only) caused by the label, i.e., an anticausal setting. This is nicely consistent with the above findings.

VII-B Adversarial Vulnerability

One can hypothesize that the causal direction should also have an influence on whether classifiers are vulnerable to adversarial attacks. These attacks have recently become popular, and consist of minute changes to inputs, invisible to a human observer yet changing a classifier’s output [1312.6199]. This is related to causality in several ways. First, these attacks clearly constitute violations of the i.i.d. assumption that underlies statistical machine learning. If all we want to do is predict in an i.i.d. setting, then statistical learning is fine. In the adversarial setting, however, the modified test examples are not drawn from the same distribution as the training examples. The adversarial phenomenon also shows that the kind of robustness current classifiers exhibit is rather different from the one a human exhibits. If we knew both robustness measures, we could try to maximize one while minimizing the other. Current methods can be viewed as crude approximations to this, effectively modeling the human’s robustness as a mathematically simple set, say, an $\ell_p$ ball of radius $\epsilon > 0$: they often try to find examples which lead to maximal changes in the classifier’s output, subject to the constraint that they lie in an $\ell_p$ ball in the pixel metric. As we think of a classifier as the approximation of a function, the large gradients exploited by these attacks are either a property of this function or a defect of the approximation.
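To illustrate the norm-ball formulation of an attack, the following minimal sketch computes the worst-case perturbation for a hypothetical linear classifier under an $\ell_\infty$ constraint, where the maximization has a closed form. The weights, input, and radius are made-up values; attacks on deep networks typically rely on gradient-based approximations instead.

```python
import numpy as np

def margin(w, b, x, y):
    """Signed margin of a linear classifier sign(w @ x + b) for a label y in {-1, +1}."""
    return float(y * (w @ x + b))

def linf_attack(w, x, y, eps):
    """Worst-case input inside the l_inf ball of radius eps around x.

    The margin is linear in x, so it is minimized over the ball by moving
    every coordinate by eps against the sign of y * w.
    """
    return x - eps * y * np.sign(w)

# Hypothetical classifier and input (illustrative values only).
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.3, -0.1, 0.2])
y = 1

x_adv = linf_attack(w, x, y, eps=0.2)
margin_clean = margin(w, b, x, y)    # positive: correctly classified
margin_adv = margin(w, b, x_adv, y)  # negative: the small perturbation flips the label
```

The margin drops by exactly `eps` times the $\ell_1$ norm of `w`, which shows how a perturbation that is small per pixel can add up to a large change in the output.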

There are different ways of relating this to causal models. As described in [PetJanSch17, Section 1.4], different causal models can generate the same statistical pattern recognition model. In one of those, we might provide a writer with a sequence of class labels $y$, with the instruction to produce a set of corresponding images $x$. Clearly, intervening on $y$ will impact $x$, but intervening on $x$ will not impact $y$, so this is an anticausal learning problem. In another setting, we might ask the writer to decide for herself which digits to write, and to record the labels alongside the digit (in this case, the classifier would try to predict one effect from another one, a situation which we might call a confounded one). In a last one, we might provide images to a person, and ask the person to generate labels by classifying them.

Let us now assume that we are in the causal setting where the causal generative model factorizes into independent components, one of which is (essentially) the classification function. As discussed in Section III, when specifying a causal model, one needs to determine which interventions are allowed, and a structural assignment will then, by definition, be valid under every possible (allowed) intervention. One may thus expect that if the predictor approximates the causal mechanism that is inherently transferable and robust, adversarial examples should be harder to find [Schoelkopf2017icml, KilParSch19arxiv]. (Adversarial attacks may still exploit the quality of the (parameterized) approximation of a structural equation.) Recent work supports this view: it was shown that a possible defense against adversarial attacks is to solve the anticausal classification problem by modeling the causal generative direction, a method which in vision is referred to as analysis by synthesis [schott2018towards]. A related defense method proceeds by reconstructing the input using an autoencoder before feeding it to a classifier [DBR].

VII-C Robustness and Strong Generalization

We can speculate that structures composed of autonomous modules, such as given by a causal factorization (4), should be relatively robust to swapping out or modifying individual components. Robustness should also play a role when studying strategic behavior, i.e., decisions or actions that take into account the actions of other agents (including AI agents). Consider a system that tries to predict the probability of successfully paying back a credit, based on a set of features. The set could include, for instance, the current debt of a person, as well as their address. To get a higher credit score, people could thus change their current debt (by paying it off), or they could change their address by moving to a more affluent neighborhood. The former probably has a positive causal impact on the probability of paying back; for the latter, this is less likely. Thus, we could build a scoring system that is more robust with respect to such strategic behavior by only using causal features as inputs [1905.09239].

To formalize this general intuition, one can consider a form of out-of-distribution generalization, which can be optimized by minimizing the empirical risk over a class of distributions induced by a causal model of the data [arjovsky2019invariant, RojSchTurPet18, meinshausen2018causality, peters2016causal, Schoelkopf2012]. To describe this notion, we start by recalling the usual empirical risk minimization setup. We have access to data from a distribution $P(X, Y)$ and train a predictor $f$ in a hypothesis space $\mathcal{H}$ (e.g., a neural network with a certain architecture predicting $Y$ from $X$) to minimize the empirical risk

$$\hat{R}_{P(X,Y)}(f) = \hat{\mathbb{E}}_{P(X,Y)}\big[\mathrm{loss}(f(X), Y)\big].$$

Here, we denote by $\hat{\mathbb{E}}_{P(X,Y)}$ the empirical mean computed from a sample drawn from $P(X, Y)$. When we refer to “out-of-distribution generalization” we mean having a small expected risk for a different distribution $P^{\dagger}(X, Y)$:

$$R^{\mathrm{OOD}}_{P^{\dagger}(X,Y)}(f) = \mathbb{E}_{P^{\dagger}(X,Y)}\big[\mathrm{loss}(f(X), Y)\big].$$

Clearly, the gap between $\hat{R}_{P(X,Y)}(f)$ and $R^{\mathrm{OOD}}_{P^{\dagger}(X,Y)}(f)$ will depend on how different the test distribution $P^{\dagger}$ is from the training distribution $P$. To quantify this difference, we call environments the collection of different circumstances that give rise to the distribution shifts such as locations, times, experimental conditions, etc. Environments can be modeled in a causal factorization (4) as they can be seen as interventions on one or several causal variables or mechanisms. As a motivating example, one environment may correspond to where a measurement is taken (for example a certain room), and from each environment, we obtain a collection of measurements (images of objects in the same room). It is nontrivial (and in some cases provably hard [ben2010impossibility]) to learn statistical models that are stable across training environments and generalize to novel testing environments [peters2016causal, RojSchTurPet18, 1707.06422, arjovsky2019invariant, akkaya2019solving] drawn from the same environment distribution.

Using causal language, one could restrict $P^{\dagger}(X, Y)$ to be the result of a certain set of interventions, i.e., $P^{\dagger}(X, Y) \in \mathbb{P}_G$, where $\mathbb{P}_G$ is a set of interventional distributions over a causal graph $G$. The worst-case out-of-distribution risk then becomes

$$R^{\mathrm{OOD}}_{\mathbb{P}_G}(f) = \max_{P^{\dagger} \in \mathbb{P}_G} \mathbb{E}_{P^{\dagger}(X,Y)}\big[\mathrm{loss}(f(X), Y)\big].$$

To learn a robust predictor, we should have available a subset of environment distributions $\mathcal{E} \subset \mathbb{P}_G$ and solve

$$f^{*} = \operatorname*{argmin}_{f \in \mathcal{H}} \; \max_{P^{\dagger} \in \mathcal{E}} \hat{\mathbb{E}}_{P^{\dagger}(X,Y)}\big[\mathrm{loss}(f(X), Y)\big]. \tag{18}$$

In practice, solving (18) requires specifying a causal model with an associated set of interventions. If the set of observed environments $\mathcal{E}$ does not coincide with the set of possible environments $\mathbb{P}_G$, we have an additional estimation error that may be arbitrarily large in the worst case [arjovsky2019invariant, ben2010impossibility].
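The difference between single-environment empirical risk minimization and the min-max problem (18) can be sketched on toy data. Everything below is an illustrative assumption rather than a method from the cited works: the squared-error loss, the two-element hypothesis space, and the two synthetic environments, in which one feature causes the target and a second feature is a copy of the target whose sign flips between environments.

```python
import numpy as np

def empirical_risk(f, X, Y):
    """Empirical mean of the squared-error loss of predictor f on one sample."""
    return float(np.mean((f(X) - Y) ** 2))

def worst_case_risk(f, environments):
    """Largest empirical risk of f over a list of (X, Y) environments."""
    return max(empirical_risk(f, X, Y) for X, Y in environments)

def robust_choice(hypotheses, environments):
    """Name of the hypothesis minimizing the worst-case empirical risk."""
    return min(hypotheses, key=lambda name: worst_case_risk(hypotheses[name], environments))

# Hypothetical toy data: feature 0 causes Y (plus small noise); feature 1
# copies Y in environment A but has its sign flipped in environment B.
x_cause = np.array([-1.0, 0.0, 1.0, 2.0])
Y = x_cause + np.array([0.1, -0.1, 0.1, -0.1])
env_a = (np.column_stack([x_cause, Y]), Y)
env_b = (np.column_stack([x_cause, -Y]), Y)

# Hypothetical two-element hypothesis space.
hypotheses = {
    "causal": lambda X: X[:, 0],
    "spurious": lambda X: X[:, 1],
}

erm_on_a = robust_choice(hypotheses, [env_a])             # ERM on one environment
minmax_on_ab = robust_choice(hypotheses, [env_a, env_b])  # min-max over both
```

On environment A alone the spurious predictor has the lower risk, whereas the min-max criterion over both environments selects the causal predictor, whose risk is the same in each environment.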

VII-D Pre-training, Data Augmentation, and Self-Supervision

Learning predictive models solving the min-max optimization problem of (18) is challenging. We now interpret several common techniques in Machine Learning as means of approximating (18).

The first approach is enriching the distribution of the training set. This does not mean obtaining more examples from $P(X, Y)$, but training on a richer dataset [sun2017revisiting, deng2009imagenet], for example, through pre-training on a huge and diverse corpus [radford2018improving, devlin2018bert, howard2018universal, kolesnikov2019big, djolonga2020robustness, brown2020language, chen2020generative, tschannen2020self]. Since this strategy is based on standard empirical risk minimization, it can achieve stronger generalization in practice only if the new training distribution is sufficiently diverse to contain information about other distributions in the set of interventional distributions $\mathbb{P}_G$.

The second approach, often coupled with the previous one, is to rely on data augmentation to increase the diversity of the data by “augmenting” it through a certain type of artificially generated interventions [Baird90, simard2003best, krizhevsky2012imagenet]. For the visual domain, common augmentations include performing transformations such as rotating the image, translating the image by a few pixels, or flipping the image horizontally, etc. The high-level idea behind data augmentation is to encourage a system to learn underlying invariances or symmetries present in the augmented data distribution. For example, in a classification task, translating the image by a few pixels does not change the class label. One may view it as specifying a set of interventions $\mathbb{P}_G$ the model should be robust to (e.g., random crops, interpolations, translations, rotations, etc.). Instead of computing the maximum over all distributions in $\mathbb{P}_G$, one can relax the problem by sampling from the interventional distributions and optimizing an expectation over the different augmented images on a suitably chosen subset [BurSch97], using a search algorithm like reinforcement learning [cubuk2019autoaugment] or an algorithm based on density matching [lim2019fast].
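The relaxation from a maximum to an expectation over sampled interventions can be sketched as a Monte Carlo average over random augmentations. The helper names, the choice of flips and small circular shifts, and the toy loss functions below are all assumptions made for illustration; they are not taken from the cited augmentation methods.

```python
import numpy as np

def random_augment(image, rng, max_shift=2):
    """Sample one label-preserving intervention on a 2-D image.

    Applies a horizontal flip with probability 0.5 and a small translation.
    np.roll wraps pixels around the border, i.e., the shift is circular.
    """
    if rng.random() < 0.5:
        image = image[:, ::-1]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))

def augmented_risk(loss_fn, image, label, rng, n_samples=32):
    """Monte Carlo estimate of the expected loss over sampled augmentations."""
    losses = [loss_fn(random_augment(image, rng), label) for _ in range(n_samples)]
    return float(np.mean(losses))

def invariant_loss(image, label):
    """Toy loss that only depends on mean brightness (invariant to flips and shifts)."""
    return (image.mean() - label) ** 2

def pixel_loss(image, label):
    """Toy loss that only looks at the top-left pixel (not invariant)."""
    return (image[0, 0] - label) ** 2

rng = np.random.default_rng(0)
image = np.arange(16.0).reshape(4, 4)
risk_invariant = augmented_risk(invariant_loss, image, 7.5, rng)
risk_pixel = augmented_risk(pixel_loss, image, 0.0, rng)
```

A predictor that is already invariant to the sampled interventions incurs the same loss on every augmented image, whereas one that relies on pixel positions is penalized on average, which is the pressure that augmentation exerts during training.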

The third approach is to rely on self-supervision to learn about $P(X)$. Certain pre-training methods [radford2018improving, devlin2018bert, howard2018universal, brown2020language, chen2020generative, tschannen2020self] have shown that it is possible to achieve good results using only very few class labels by first pre-training on a large unlabeled dataset and then fine-tuning on few labeled examples. Similarly, pre-training on large unlabeled image datasets can improve performance by learning representations that can efficiently transfer to a downstream task, as demonstrated by [oord2018representation, bachman2019learning, he2020momentum, chen2020simple, grill2020bootstrap]. These methods fall under the umbrella of self-supervised learning, a family of techniques for converting an unsupervised learning problem into a supervised one by using so-called pretext tasks with artificially generated labels without human annotations. The basic idea behind using pretext tasks is to force the learner to learn representations that contain information about $P(X)$ that may be useful for (an unknown) downstream task. Much of the work on methods that use self-supervision relies on carefully constructing pretext tasks. A central challenge here is to extract features that are indeed informative about the data generating distribution. Ideas from the ICM Principle could help develop methods that can automate the process of constructing pretext tasks. Finally, one can explicitly optimize (18), for example, through adversarial training [goodfellow2014explaining]. In that case, $\mathbb{P}_G$ would contain a set of attacks an adversary might perform, while presently, we consider a set of natural interventions.

An interesting research direction is the combination of all these techniques: large-scale training, data augmentation, self-supervision, and robust fine-tuning on the available data from multiple, potentially simulated environments.

VII-E Reinforcement Learning

Reinforcement Learning (RL) is closer to causality research than the machine learning mainstream in that it sometimes effectively directly estimates do-probabilities. E.g., on-policy learning estimates do-probabilities for the interventions specified by the policy (note that these may not be hard interventions if the policy depends on other variables). However, as soon as off-policy learning is considered, in particular in the batch (or observational) setting [Lange2012], issues of causality become subtle [1812.10576, 1805.12298]. An emerging line of work devoted to the intersection of RL and causality includes [Bareinboim2015, 1703.07718, 1812.10576, 1811.06272, dasgupta2019causal, Bareinboim_NIPS2019, ahmed2021causalworld]. Causal learning applied to reinforcement learning can be divided into two aspects, causal induction and causal inference. Causal induction (discovery) involves learning causal relations from data, for example, an RL agent learning a causal model of the environment. Causal inference learns to plan and act based on a causal model. Causal induction in an RL setting poses different challenges than the classic causal learning settings where the causal variables are often given. However, there is accumulating evidence supporting the usefulness of an appropriate structured representation of the environment [akkaya2019solving, berner2019dota, vinyals2019grandmaster].

World Models

Model-based RL [sutton1998introduction, finn2017model] is related to causality as it aims at modeling the effect of actions (interventions) on the current state of the world. Particularly relevant for causal learning are generative world models that capture some of the causal relations underlying the environment and serve as Lorenzian imagined spaces (see Introduction above) to train RL agents [kaelbling1996reinforcement, sutton1998introduction, ha2018world, chiappa2017recurrent, xie2016model, oh2015action, silver2017predictron, schmidhuber1991curious, wiering2012reinforcement]. Structured generative approaches further aim at decomposing an environment into multiple entities with causally correct relations among them, modulo the completeness of the variables, and confounding [diuk2008object, watters2019cobra, chang2016compositional, watters2017visual, battaglia2016interaction, kipf2018neural]. However, many of the current approaches (regardless of structure) only build partial models of the environment [gregor2019shaping]. Since they do not observe the environment at every time step, the environment may become an unobserved confounder affecting both the agent’s actions and the reward. To address this issue, a model can use the backdoor criterion conditioning on its policy [rezende2020causally].
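To see why confounding matters when modeling the effect of actions, consider a generic backdoor adjustment on a small discrete model. This is a textbook-style sketch with made-up probabilities, not the procedure of [rezende2020causally]: a binary variable `Z` (assumed observed here) influences both the logged action `A` and the reward `R`.

```python
import numpy as np

# Hypothetical model: Z -> A, Z -> R, A -> R (all binary).
p_z = np.array([0.5, 0.5])             # P(Z=z)
p_a_given_z = np.array([[0.9, 0.1],    # P(A=a | Z=0)
                        [0.1, 0.9]])   # P(A=a | Z=1), indexed [z, a]
p_r_given_az = np.array([[0.2, 0.8],   # P(R=1 | A=0, Z=z)
                         [0.3, 0.9]])  # P(R=1 | A=1, Z=z), indexed [a, z]

def observational(a):
    """P(R=1 | A=a): conditioning on A also shifts the distribution of Z."""
    p_z_given_a = p_z * p_a_given_z[:, a]
    p_z_given_a = p_z_given_a / p_z_given_a.sum()
    return float(p_r_given_az[a] @ p_z_given_a)

def interventional(a):
    """P(R=1 | do(A=a)) via backdoor adjustment: average over the marginal P(Z)."""
    return float(p_r_given_az[a] @ p_z)

observed_gap = observational(1) - observational(0)  # mixes the effects of A and Z
causal_effect = interventional(1) - interventional(0)  # effect of the action alone
```

In this toy model the observational contrast overstates the effect of the action severalfold, because the logged actions are correlated with `Z`; adjusting for `Z` recovers the interventional quantity.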

Generalization, Robustness, and Fast Transfer

While RL has already achieved impressive results, the sample complexity required to achieve consistently good performance is often prohibitively high. Further, RL agents are often brittle (if data is limited) in the face of even tiny changes to the environment (either visual or mechanistic changes) unseen in the training phase. The question of generalization in RL is essential to the field’s future both in theory and practice. One proposed solution towards the goal of designing machines that can extrapolate experience across environments and tasks is to learn invariances in a causal graph structure. A key requirement to learn invariances from data may be the possibility to perform and learn from interventions. Work in developmental psychology argues that there is a need to experiment in order to discover causal relationships [gopnik2004theory]. This can be modelled as an RL environment, where the agent can discover causal factors through interventions and observing their effects. Further, causal models may allow us to model the environment as a set of underlying independent causal mechanisms such that, if there is a change in distribution, not all the mechanisms need to be re-learned. However, there are still open questions about the right way to think about generalization in RL, the right way to formalize the problem, and the most relevant tasks.


Counterfactual reasoning has been found to improve the data efficiency of RL algorithms [1811.06272, 2012.09092] and to improve performance [dasgupta2019causal], and it has been applied to communicate about past experiences in the multi-agent setting [foerster2018counterfactual, su2020counterfactual]. These findings are consistent with work in cognitive psychology [epstude2008functional], arguing that counterfactuals allow one to reason about the usefulness of past actions and transfer these insights to corresponding behavioral intentions in future scenarios [roese1994functional, reichert1999reflective, landman1995missed].

We argue that future work in RL should consider counterfactual reasoning as a critical component to enable acting in imagined spaces and formulating hypotheses that can be subsequently tested with suitably chosen interventions.

Offline RL

The success of deep learning methods in the case of supervised learning can be largely attributed to the availability of large datasets and methods that can scale to large amounts of data. In the case of reinforcement learning, collecting large amounts of high-fidelity diverse data from scratch can be expensive and hence becomes a bottleneck. Offline RL [fujimoto2019off, levine2020offline] tries to address this concern by learning a policy from a fixed dataset of trajectories, without requiring any experimental or interventional data (i.e., without any interaction with the environment). The effective use of observational data (or logged data) may make real-world RL more practical by incorporating diverse prior experiences. To succeed at it, an agent should be able to infer the consequence of different sets of actions compared to those seen during training (i.e., the actions in the logged data), which essentially makes it a counterfactual inference problem. The distribution mismatch between the current policy and the policy that was used to collect offline data makes offline RL challenging, as this requires us to move well beyond the assumption of independently and identically distributed data. Incorporating invariances by factorizing knowledge in terms of independent causal mechanisms can help make progress towards the offline RL setting.
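The distribution mismatch between the logging policy and the policy being evaluated can be illustrated with a one-step importance-sampling estimator. This is a standard off-policy evaluation device used here as a sketch, not the approach of the works cited above; it assumes that the logging policy is known, that it assigns nonzero probability to every action the target policy may take, and that there is no unobserved confounding. The logged data are invented.

```python
import numpy as np

def importance_sampling_value(actions, rewards, behavior_probs, target_probs):
    """Estimate the expected reward of a target policy from logged one-step data.

    actions:        logged action indices, chosen by the behavior policy
    rewards:        logged rewards
    behavior_probs: P(action) under the policy that collected the data
    target_probs:   P(action) under the policy we want to evaluate
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    weights = target_probs[actions] / behavior_probs[actions]
    return float(np.mean(weights * rewards))

# Hypothetical log: a uniform behavior policy; action 1 always pays off.
logged_actions = [0, 0, 1, 1]
logged_rewards = [0.0, 0.0, 1.0, 1.0]
behavior = np.array([0.5, 0.5])
target = np.array([0.1, 0.9])  # policy that mostly picks action 1

naive_value = float(np.mean(logged_rewards))  # value of the behavior policy
target_value = importance_sampling_value(logged_actions, logged_rewards, behavior, target)
```

The naive average describes the policy that produced the log, while the reweighted estimate answers the question of what would have happened under the target policy. The weights grow quickly when the two policies differ, which is one way to see why the mismatch makes offline RL difficult.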

VII-F Scientific Applications

A fundamental question in the application of machine learning in natural sciences is to what extent we can complement our understanding of a physical system with machine learning. One interesting aspect is physics simulation with neural networks [grzeszczuk1998neuroanimator], which can substantially increase the efficiency of hand-engineered simulators [he2019learning, ladicky2015data, wiewel2019latent, sanchez2020learning, watters2017visual]. Significant out-of-distribution generalization of learned physical simulators may not be necessary if experimental conditions are carefully controlled, although the simulator has to be completely re-trained if the conditions change.

On the other hand, the lack of systematic experimental conditions may become problematic in other applications such as healthcare. One example is personalized medicine, where we may wish to build a model of a patient’s health state through a multitude of data sources, like electronic health records and genetic information [esteva2019guide, henry2015targeted]. However, if we train a clinical system on doctors’ actions in controlled settings, the system will likely provide little additional insight compared to the doctors’ knowledge and may fail in surprising ways when deployed [beede2020human]. While it may be useful to automate certain decisions, an understanding of causality may be necessary to recommend treatment options that are personalized and reliable [Richens2020vs, subbaswamy2018counterfactual, schulam2017reliable, yoon2018ganite, atan2018deep, alaa2018limits, bica2019time, 2012.09092].

Causality also has significant potential in helping understand medical phenomena, e.g., in the current Covid-19 pandemic, where causal mediation analysis helps disentangle different effects contributing towards case fatality rates when a textbook example of Simpson’s paradox was observed [vonkugelgen2020simpsons].
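A small numerical example shows how Simpson’s paradox can arise in case fatality rates (CFR). The counts below are hypothetical and chosen only to exhibit the reversal; they are not the data analyzed in [vonkugelgen2020simpsons].

```python
# Hypothetical case and death counts for two countries, split by age group.
cases = {"A": {"young": 100, "old": 900}, "B": {"young": 900, "old": 100}}
deaths = {"A": {"young": 1, "old": 90}, "B": {"young": 18, "old": 12}}

def cfr_by_age(country):
    """Case fatality rate within each age group."""
    return {age: deaths[country][age] / cases[country][age] for age in cases[country]}

def cfr_total(country):
    """Case fatality rate aggregated over all age groups."""
    return sum(deaths[country].values()) / sum(cases[country].values())

lower_in_every_group = all(cfr_by_age("A")[age] < cfr_by_age("B")[age] for age in ("young", "old"))
higher_overall = cfr_total("A") > cfr_total("B")
```

Country A has the lower fatality rate in every age group, yet the higher rate overall, because its cases are concentrated in the older group. Separating the effect of country on fatality that runs through the age distribution of cases from the effect within age groups is the kind of question that mediation analysis addresses.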

Another example of a scientific application is in astronomy, where causal models were used to identify exoplanets under the confounding of the instrument. Exoplanets are often detected as they partially occlude their host star when they transit in front of it, causing a slight decrease in brightness. Shared patterns in measurement noise across stars light-years apart can be removed in order to reduce the instrument’s influence on the measurement [Scholkopfetal16], which is critical especially in the context of partial technical failures as experienced in the Kepler exoplanet search mission. The application of [Scholkopfetal16] led to the discovery of 36 planet candidates [Foreman-Mackeyetal15], of which 21 were subsequently validated as bona fide exoplanets [Montet_2015]. Four years later, astronomers found traces of water in the atmosphere of the exoplanet K2-18b, the first such discovery for an exoplanet in the habitable zone, i.e., allowing for liquid water [1909.04642, Tsiaras]. This planet turned out to be one that had first been detected in [Foreman-Mackeyetal15, exoplanet candidate EPIC 201912552].
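The idea of removing shared measurement noise can be sketched with ordinary least squares: since stars light-years apart do not influence one another, whatever part of a target light curve is predictable from other stars is attributed to the instrument and subtracted. The code below is a simplified illustration in that spirit on synthetic data, with a hypothetical function name and made-up signals; it is not the implementation of [Scholkopfetal16].

```python
import numpy as np

def remove_shared_systematics(target, others):
    """Subtract from `target` its least-squares prediction from `others`.

    target: array of shape (T,), light curve of the star of interest
    others: array of shape (T, K), light curves of K other stars
    """
    design = np.column_stack([others, np.ones(len(target))])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

# Hypothetical synthetic data (not Kepler measurements).
t = np.linspace(0.0, 1.0, 200)
wobble = np.sin(6 * np.pi * t)  # shared instrument pattern 1
drift = t                       # shared instrument pattern 2
transit = np.where((t > 0.4) & (t < 0.5), -0.01, 0.0)  # small dip from a transit

others = np.column_stack([wobble + 0.5 * drift, wobble - drift])
target = transit + 1.5 * wobble + 0.3 * drift

cleaned = remove_shared_systematics(target, others)
corr_raw = np.corrcoef(target, transit)[0, 1]
corr_cleaned = np.corrcoef(cleaned, transit)[0, 1]
```

In the raw light curve the transit dip is swamped by the instrument patterns, whereas the cleaned residual tracks it closely. The regression also removes the small part of the transit that happens to be predictable from the other stars, so some signal is lost.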

VII-G Multi-Task Learning and Continual Learning

State-of-the-art AI is relatively narrow, i.e., trained to perform specific tasks, as opposed to the broad, versatile intelligence allowing humans to adapt to a wide range of environments and develop a rich set of skills. The human ability to discover robust, invariant high-level concepts and abstractions, and to identify causal relationships from observations appears to be one of the key factors allowing for a successful generalization from prior experiences to new, often quite different, “out-of-distribution” settings.

Multi-task learning refers to building a system that can solve multiple tasks across different environments [caruana1997multitask, ruder2017overview]. These tasks usually share some common traits. By learning similarities across tasks, a system could utilize knowledge acquired from previous tasks more efficiently when encountering a new task. One possibility of learning such similarities across tasks is to learn a shared underlying data-generating process as a causal generative model whose components satisfy the SMS hypothesis [SchJanLop16]. In certain cases, causal models adapt faster to sparse interventions in distribution [ke2019learning, priol2020analysis].

At the same time, we have clearly come a long way already without explicitly treating the multi-task problem as a causal one. Fuelled by abundant data and compute, AI has made remarkable advances in a wide range of applications, from image processing and natural language processing [brown2020language] to beating human world champions in games such as chess, poker and Go [schrittwieser2019mastering], improving medical diagnoses [lundervold2019overview], and generating music [dhariwal2020jukebox]. A critical question thus arises:

“Why can’t we just train a huge model that learns environments’ dynamics (e.g. in a RL setting) including all possible interventions? After all, distributed representations can generalize to unseen examples and if we train over a large number of interventions we may expect that a big neural network will generalize across them.”

To address this, we make several points. To begin with, if data was not sufficiently diverse (which is an untestable assumption a priori), the worst-case error to unseen shifts may still be arbitrarily high (see Section VII-C). While in the short term, we can often beat “out-of-distribution” benchmarks by training bigger models on bigger datasets, causality offers an important complement. The generalization capabilities of a model are tied to its assumptions (e.g., how the model is structured and how it was trained). The causal approach makes these assumptions more explicit and aligned with our understanding of physics and human cognition, for instance by relying on the Independent Causal Mechanisms principle. When these assumptions are valid, a learner that does not use them should fare worse than one that does. Further, if we had a model that was successful in all interventions over a certain environment, we may want to use it in different environments that share similar albeit not necessarily identical dynamics. The causal approach, and in particular the ICM principle, points to the need to decompose knowledge about the world into independent and recomposable pieces (recomposable depending on the interventions or changes in environment), which suggests more work on modular ML architectures and other ways to enforce the ICM principle in future ML approaches.

At its core, i.i.d. pattern recognition is but a mathematical abstraction, and causality may be essential to most forms of animate learning. Until now, machine learning has neglected a full integration of causality, and this paper argues that it would indeed benefit from integrating causal concepts. We argue that combining the strengths of both fields, i.e., current deep learning methods as well as tools and ideas from causality, may be a necessary step on the path towards versatile AI systems.

VIII Conclusion

In this work, we discussed different levels of models, including causal and statistical ones. We argued that this spectrum builds upon a range of assumptions both in terms of modeling and data collection. In an effort to bring together causality and machine learning research programs, we first presented a discussion on the fundamentals of causal inference. Second, we discussed how the independent mechanism assumptions and related notions such as invariance offer a powerful bias for causal learning. Third, we discussed how causal relations might be learned from observational and interventional data when causal variables are observed. Fourth, we discussed the open problem of causal representation learning, including its relation to recent interest in the concept of disentangled representations in deep learning. Finally, we discussed how some open research questions in the machine learning community may be better understood and tackled within the causal framework, including semi-supervised learning, domain generalization, and adversarial robustness.

Based on this discussion, we list some critical areas for future research:

Learning Non-Linear Causal Relations at Scale

Not all real-world data is unstructured, and the effect of interventions can often be observed, for example, by stratifying the data collection across multiple environments. The approximation abilities of modern machine learning methods may prove useful to model non-linear causal relations among large numbers of variables. For practical applications, classical tools are limited not only by the linearity assumptions they often make but also by their scalability. The paradigms of meta- and multi-task learning are close to the assumptions and desiderata of causal modeling, and future work should consider (1) understanding under which conditions non-linear causal relations can be learned, (2) identifying which training frameworks best exploit the scalability of machine learning approaches, and (3) providing compelling evidence of the advantages over (non-causal) statistical representations in terms of generalization, re-purposing, and transfer of causal modules on real-world tasks.
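As a toy illustration of what stratifying data across environments can reveal, the following sketch simulates a linear structural causal model X -> Y -> Z in three environments. Everything in it is an assumption made for illustration and is not taken from this paper: the function names, the coefficient 2.0, and the noise scales are hypothetical. The mechanism generating Y from X is held fixed, while the distribution of X and the noise in Z differ between environments. Regressing Y on its cause X should then give roughly the same slope everywhere, whereas regressing Y on its effect Z should give a slope that changes with the environment.

```python
import numpy as np

def sample_environment(rng, n, x_scale, z_noise):
    """Draw n samples from the toy SCM X -> Y -> Z in one environment."""
    x = rng.normal(0.0, x_scale, size=n)        # the environment shifts p(X)
    y = 2.0 * x + rng.normal(0.0, 1.0, size=n)  # mechanism for Y is kept fixed
    z = y + rng.normal(0.0, z_noise, size=n)    # the environment changes Z's noise
    return x, y, z

def ols_slope(feature, target):
    """Least-squares slope of target on a single feature, with an intercept."""
    design = np.column_stack([feature, np.ones_like(feature)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[0]

def slopes_per_environment(seed=0, n=20000):
    """Return (slope of Y on X, slope of Y on Z) for each toy environment."""
    rng = np.random.default_rng(seed)
    # Hypothetical (x_scale, z_noise) settings, one pair per environment.
    settings = [(1.0, 0.5), (2.0, 2.0), (0.5, 4.0)]
    results = []
    for x_scale, z_noise in settings:
        x, y, z = sample_environment(rng, n, x_scale, z_noise)
        results.append((ols_slope(x, y), ols_slope(z, y)))
    return results

if __name__ == "__main__":
    for i, (causal, anticausal) in enumerate(slopes_per_environment()):
        print(f"env {i}: slope(Y~X)={causal:.3f}  slope(Y~Z)={anticausal:.3f}")
```

The linear setting is chosen only to keep the sketch short; the open questions listed above concern the non-linear, large-scale case, where such invariances are much harder to identify.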

Learning Causal Variables

“Disentangled” representations learned by state-of-the-art neural network methods are still distributed in the sense that they are represented in a vector format with an arbitrary ordering of the dimensions. This fixed format implies that the representation size cannot be dynamically changed; for example, we cannot change the number of objects in a scene. Further, structured and modular representations should also arise when a network is trained for (sets of) specific tasks, not only for autoencoding. Different high-level variables may be extracted depending on the task and affordances at hand. Understanding under which conditions causal variables can be recovered could provide insights into which interventions we are robust to in predictive tasks.

Understanding the Biases of Existing Deep Learning Approaches

Scaling to massive datasets, relying on data augmentation, and using self-supervision have all been successfully explored to improve the robustness of the predictions of deep learning models. It is nontrivial to disentangle the benefits of the individual components, and it is often unclear which “trick” should be used when dealing with a new task, even if we have an intuition about useful invariances. The notion of strong generalization over a specific set of interventions may be used to probe existing methods, training schemes, and datasets in order to build a taxonomy of inductive biases. In particular, it is desirable to understand how design choices in pre-training (e.g., which datasets/tasks) positively impact both transfer and robustness downstream in a causal sense.

Learning Causally Correct Models of the World and the Agent

In many real-world reinforcement learning (RL) settings, abstract state representations are not available. Hence, the ability to derive abstract causal variables from high-dimensional, low-level pixel representations and then recover causal graphs is important for causal induction in real-world reinforcement learning settings. Moreover, building a causal description for both a model of the agent and the environment (world models) should be essential for robust and versatile model-based reinforcement learning.

IX Acknowledgments

Many thanks to the past and present members of the Tübingen causality team, without whose work and insights this article would not exist, in particular to Dominik Janzing, Chaochao Lu and Julius von Kügelgen who gave helpful comments on [1911.10500]. The text has also benefitted from discussions with Elias Bareinboim, Christoph Bohle, Leon Bottou, Isabelle Guyon, Judea Pearl, and Vladimir Vapnik. Thanks to Wouter van Amsterdam for pointing out typos in the first version. We also thank Thomas Kipf, Klaus Greff, and Alexander d’Amour for the useful discussions. Finally, we thank the thorough anonymous reviewers for highly valuable feedback and suggestions.