# Robust Learning via Cause-Effect Models

We consider the problem of function estimation in the case where the data distribution may shift between training and test time, and additional information about it may be available at test time. This relates to popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. This working paper discusses how these tasks could be tackled depending on the kind of changes of the distributions. It argues that knowledge of an underlying causal direction can facilitate several of these tasks.


## 1 Introduction

By and large, statistical machine learning exploits statistical associations or dependences between variables to make predictions about certain variables. This is a very powerful concept especially in situations where we have sizable training sets, but no detailed model of the underlying data generating process. This process is usually modelled as an unknown probability distribution, and machine learning excels whenever this distribution does not change. Most of the theoretical analysis assumes that the data are i.i.d. (independent and identically distributed) or at least exchangeable.

On the other hand, practical problems often do not have these favorable properties, forcing us to leave the comfort zone of i.i.d. data. Sometimes distributions shift over time; sometimes we might want to combine data recorded under different conditions or from different but related regularities. Researchers have developed a number of modifications of statistical learning methods to handle various scenarios of changing distributions; for an overview, see [1].

The present paper attempts to study these problems from the point of view of causal learning. Like some other recent work in the field [2, 3], it will build on the assumption that in causal structures, the distribution of the cause and the mechanism relating cause and effect tend to be independent. (Footnote 1: In the mentioned references, independence is meant in the sense of algorithmic independence, but other notions of independence can also make sense.) For instance, in the problem of predicting splicing patterns from genomic sequences, the basic splicing mechanism (driven by the spliceosome) may be assumed stable between different species [4], even though the genomic sequences and their statistical properties might differ in several respects. This is important information constraining causal models, and it can also be useful for robust predictive models, as we try to show in the present paper. Intuitively, if we learn a causal model of splicing, we could hope to be more robust with respect to changes of the input statistics, and we may be able to combine data collected from different species to get a more accurate statistical model of the splicing mechanism.

Causal graphical models as pioneered by [5, 6] are usually thought of as joint probability distributions over a set of variables, along with directed graphs (for simplicity, we assume acyclicity) whose vertices are the variables and whose arrows indicate direct causal influences. The causal Markov assumption [5] states that each vertex is independent of its non-descendants in the graph, given its parents. Here, independence is usually meant in a statistical sense, although alternative views have been developed, e.g., using algorithmic independence [3].

Crucially, the causal Markov assumption links the semantics of causality to something that has empirically measurable consequences (e.g., conditional statistical independence). Given a sufficient set of observations from a joint distribution, it allows us to test conditional independence statements and thus infer (subject to a genericity assumption referred to as “faithfulness”) which causal models are consistent with an observed distribution. However, this will typically not lead us to a unique causal model, and in the case of graphs with only two variables, there are no conditional independence statements to test and we cannot do anything.

There is an alternative view of causal models, which does not start from a joint distribution. Instead, it assumes a set of jointly independent noise variables, one at each vertex, and each vertex computes a deterministic function of its noise variable and its parents. This view, referred to as a functional causal model (or nonlinear structural equation model), entails a joint distribution which, along with the graph, satisfies the causal Markov assumption [5]. Vice versa, each causal graphical model can be expressed as a functional causal model (see, e.g., [3]). (Footnote 2: As an aside, note that the functional point of view is more specific than the graphical model view [5]. To see this consider and the following two functional models that lead to the same joint distribution: (1) with and (2) with and . Suppose one observes the sample . (1) and (2) give different answers to the counterfactual question “What would have happened if had been one?”. The causal graph and the joint distribution do not provide sufficient information to give any answer.)

The functional point of view is useful in that it allows us to come up with assumptions on causal models that would be harder to conceive in a purely probabilistic view. It has recently been shown [7] that an assumption of nonlinear functions with additive noise renders the two-variable case (and thus the multivariate case [8]) identifiable, i.e., we can distinguish between the two possible causal directions, given that one and only one of these two alternatives is true (which implicitly excludes a common cause of the two variables). Hence, we can tackle the case where conditional independence tests do not provide any information. This opens up the possibility of identifying the causal direction for input-output learning problems. The present paper assays whether this can be helpful for machine learning, and it argues that in many situations, a causal model can be more robust under distribution shifts than a purely statistical model. Perhaps somewhat surprisingly, learning problems need not always predict effect from cause, and the direction of the prediction has consequences for which tasks are easy and which tasks are hard. In the remainder of the paper, we restrict ourselves to the simplest possible case, where we have two variables only and there are no unobserved confounders.

##### Notation.

We consider the causal structure shown in Fig. 1, with two observables, modeled by random variables. When using the notation $C$ and $E$, the variable $C$ stands for the cause and $E$ for the effect. We denote their domains by $\mathcal{C}$ and $\mathcal{E}$ and their distributions by $P(C)$ and $P(E)$ (overloading the notation $P$). When using the notation $X$ and $Y$, the variable $X$ will always be the input and $Y$ the output, from a machine learning point of view (but input and output can each be either cause or effect; more below).

For simplicity, we assume that their distributions have a joint density with respect to some product measure. We write the values of this density as $p(c, e)$ and the values of the marginal densities as $p(c)$ and $p(e)$, again keeping in mind that these three are different functions; we can always tell from the argument which function is meant.

We identify a training set of size with a uniform mixture of Dirac measures, denoted as and use an analogous notation for an additional data set of size (e.g., a set of test inputs). E.g., could be a set of test inputs sampled from a distribution that need not be identical with . The following assumptions are used throughout the paper. The subsections below only mention additional assumptions that are task specific.

##### Causal sufficiency.

We further assume that there are two independent noise variables and , modeled as random variables with domains and and distributions and . In some places, we will use conditional densities, always implicitly assuming that they exist.

The function $\varphi$ and the noise term $N_E$ jointly determine the conditional $P(E \mid C)$ via

$$E = \varphi(C, N_E).$$

We think of $P(E \mid C)$ as the mechanism transforming cause $C$ into effect $E$.

##### Independence of mechanism and input.

We finally assume that the mechanism is “independent” of the distribution of the cause (i.e., independent of in Fig. 1), in the sense that contains no information about and vice versa; in particular, if changes at some point in time, there is no reason to believe that changes at the same time.333A stronger condition, which we do not need in the present context, would be to require that , and be jointly “independent.” This assumption has been used by [2, 3]. It encapsulates our belief that is a mechanism of nature that does not care what we feed into it. The assumption introduces an important asymmetry between cause and effect, since it will usually be violated in the backward direction, i.e., the distribution of the effect will inherit properties from [3, 9].

##### Richness of functional causal models

It turns out that the two-variable functional causal model is so rich that it cannot be identified. The causal Markov condition is trivially satisfied both by the forward model and the backward model, and thus both graphs allow a functional model.

To understand the richness of the class intuitively, consider the simple case where the noise can take only a finite number of values, say . This noise could affect for instance as follows: there is a set of functions , and the noise randomly switches one of them on at any point, i.e.,

$$\varphi(c, n) = \varphi_n(c).$$

The functions could implement arbitrarily different mechanisms, and it would thus be very hard to identify $\varphi$ from empirical data sampled from such a complex model. (Footnote 4: A similar construction, with the range of the noise having the cardinality of the function class, can be used [3] to argue that every causal graphical model can be expressed as a functional causal model.)
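To make the switching construction concrete, the following is a minimal sketch in Python. The three mechanisms, the uniform cause distribution, and the function name `sample_switching_model` are illustrative assumptions, not taken from the paper; the noise is simply an index that selects which function is active.

```python
import numpy as np

# Hypothetical finite family of mechanisms phi_0, phi_1, phi_2 (assumption).
MECHANISMS = [
    lambda c: c ** 2,
    lambda c: np.sin(3.0 * c),
    lambda c: -c,
]

def sample_switching_model(n_samples, rng):
    """Draw (C, N, E) with E = phi_N(C): the noise N picks the active mechanism."""
    c = rng.uniform(-1.0, 1.0, size=n_samples)             # cause
    n = rng.integers(0, len(MECHANISMS), size=n_samples)   # noise = function index
    e = np.empty(n_samples)
    for k, phi in enumerate(MECHANISMS):
        mask = n == k
        e[mask] = phi(c[mask])                              # apply the selected mechanism
    return c, n, e

rng = np.random.default_rng(0)
c, n, e = sample_switching_model(1000, rng)
```

When only the pairs `(c, e)` are observed and `n` is hidden, the sample is a mixture of three unrelated curves, which illustrates why such a model is hard to identify without further assumptions.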

As an aside, recall that for acyclic causal graphs with more than two variables, the graph structure will typically imply conditional independence properties via the causal Markov condition. However, the above construction with noises randomly switching between mechanisms is still valid, and it is thus surprising that conditional independence alone does allow us to do some causal inference of practical significance, as implemented by the well-known PC and FCI algorithms [6, 5]. It should be clear that additional assumptions that prevent the noise switching construction should significantly facilitate the task of identifying causal graphs from data. Intuitively, such assumptions need to control the complexity with which the noise can affect the mechanism.

One such assumption is referred to as ANM, standing for additive noise model [7]. This model assumes, for some function $\phi$:

$$E = \phi(C) + N_E, \tag{1}$$

and it has been shown that $\phi$ and $N_E$ can be identified in the generic case, provided that $N_E$ is assumed to have zero mean. This means that apart from some exceptions, such as the case where $\phi$ is linear and $N_E$ is Gaussian, a given joint distribution of two real-valued random variables $X$ and $Y$ can be fit by an ANM in at most one direction (which we then consider the causal one).

A similar statement has been shown for discrete data [10] and for the post-nonlinear ANM [11]

$$E = \psi(\phi(C) + N_E),$$

where $\psi$ is an invertible function.

In practice, an ANM model can be fit by regressing the effect on the cause while enforcing that the residual noise variable is independent of the cause [12]. If this is impossible, the model is incorrect (e.g., cause and effect are interchanged, the noise is not additive, or there are confounders).
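The fitting procedure can be sketched as follows. This is a simplified illustration, not the dependence-minimizing regression of [12]: it uses ordinary least-squares polynomial regression and then scores the dependence between input and residuals with a biased HSIC estimate (Gaussian kernels, fixed bandwidth). The synthetic data, the polynomial degree, and all function names are assumptions made for the example.

```python
import numpy as np

def standardize(v):
    return (v - v.mean()) / v.std()

def hsic(a, b, bandwidth=1.0):
    """Biased HSIC estimate with Gaussian kernels on standardized inputs."""
    a, b = standardize(a), standardize(b)
    n = len(a)
    K = np.exp(-(a[:, None] - a[None, :]) ** 2 / (2 * bandwidth ** 2))
    L = np.exp(-(b[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

def residual_dependence(inp, out, degree=4):
    """Regress `out` on `inp` with a polynomial; return HSIC(inp, residuals)."""
    x = standardize(inp)
    coeffs = np.polyfit(x, out, degree)
    residuals = out - np.polyval(coeffs, x)
    return hsic(inp, residuals)

# Synthetic ANM (assumption): E = C^2 + N_E with non-Gaussian noise.
rng = np.random.default_rng(0)
cause = rng.uniform(-1.0, 1.0, size=400)
effect = cause ** 2 + rng.uniform(-0.1, 0.1, size=400)

forward = residual_dependence(cause, effect)    # fit E = phi(C) + noise
backward = residual_dependence(effect, cause)   # fit C = phi(E) + noise
inferred = "cause -> effect" if forward < backward else "effect -> cause"
```

The direction with the smaller dependence score is taken as the candidate causal direction. A proper analysis would replace the raw score comparison with a statistical independence test.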

ANM plays an important role in this paper: first, because all the methods below presuppose that we know what is cause and what is effect, and second, because we will generalize ANM to handle the case where we have several models of the form (1) that share the same $\phi$.

## 2 Predicting Effect from Cause

Let us consider the case where we are trying to estimate a function or a conditional distribution in the causal direction, i.e., where $X$ is the cause and $Y$ the effect. Intuitively, this situation of causal prediction should be the 'easy' case, since there exists a functional mechanism $\varphi$ which the estimate should try to mimic. We are interested in the question of how robust (or invariant) the estimation is with respect to changes in the noise variables of the underlying functional causal model.

#### 2.1.1 Robustness w.r.t. input changes (distribution shift)

##### Given:

training points sampled from $P(X, Y)$ and an additional set of inputs sampled from $P'(X)$, with $P'(X) \neq P(X)$.

##### Goal:

estimate $P'(Y \mid X)$.

##### Assumption:

none.

##### Solution:

by independence of mechanism and input, there is no reason to assume that the observed change in $P(X)$ (i.e., in $P(C)$) entails a change in $P(Y \mid X)$, and we thus conclude $P'(Y \mid X) = P(Y \mid X)$. This scenario is referred to as covariate shift [1].

#### 2.1.2 Semi-supervised learning

##### Given:

training points sampled from $P(X, Y)$ and an additional set of inputs sampled from $P(X)$.

##### Goal:

estimate $P(Y \mid X)$.

##### Note:

by independence of the mechanism, $P(X)$ contains no information about $P(Y \mid X)$. A more accurate estimate of $P(X)$, as may be possible by the addition of the test inputs, thus does not influence an estimate of $P(Y \mid X)$, and semi-supervised learning (SSL) is pointless for the scenario in Figure 2.

#### 2.2.1 Robustness w.r.t. output changes

##### Given:

training points sampled from $P(X, Y)$ and an additional set of outputs sampled from $P'(Y)$, with $P'(Y) \neq P(Y)$.

##### Goal:

estimate $P'(Y \mid X)$.

##### Assumption:

various options, e.g., an additive Gaussian noise model where is indecomposable and is also indecomposable, if it is different from .

##### Solution:

first we need to decide whether $P(X)$ or $P(Y \mid X)$ has changed. This can be done using the method Localizing Distribution Change (Subsection 4.2) under appropriate assumptions (see above). If $P(X)$ has changed, proceed as in Subsubsection 2.1.1. If $P(Y \mid X)$ has changed, we can estimate $P'(Y \mid X)$ via Estimating Causal Conditionals (Subsection 4.3). Here, additive noise is a sufficient assumption.

#### 2.2.2 Semi-supervised learning

##### Given:

training points sampled from $P(X, Y)$ and an additional set of outputs sampled from $P(Y)$.

##### Goal:

estimate $P(Y \mid X)$.

##### Assumption:

has an additive noise model from to and has a unique decomposition as convolution of two distributions, say . This is, for instance satisfied if the noise is Gaussian and is indecomposable.

##### Solution:

The additional outputs help because the decomposition tells us that either or . The additive noise model learned from the -pairs will probably tell us which of the alternatives is true. Knowing , the conditional reduces to learning from the -pairs, which is certainly a weaker problem than learning would be in general.

#### 2.3.1 Transfer learning (only noise changes)

##### Given:

training points sampled from and an additional set of points sampled from , with .

##### Goal:

estimate $P'(Y \mid X)$.

##### Assumption:

additive noise where $\phi$ is invariant, but the noises can change.

##### Solution:

run conditional ANM to output a single function, only enforcing independence of residuals separately for the two data sets (Section 4.4).

There is also a semi-supervised learning variant of this scenario: given a training set plus two unpaired sets from the two original marginals, the extra sets help to better estimate $P(Y \mid X)$, because we have argued in Subsubsection 2.2.2 that additional $Y$-values sampled from $P(Y)$ already help.

#### 2.3.2 Concept drift (only mechanism changes)

##### Given:

training points sampled from and an additional set of points sampled from , with .

##### Goal:

estimate $P'(Y \mid X)$.

##### Assumption:

invariant, but has changed.

##### Solution:

Apply ANM to points sampled from to obtain . Then is given by

$$P'(Y \mid X) = P_{N_Y}(Y - \phi(X)).$$

## 3 Predicting Cause from Effect

We now turn to the opposite direction, where we consider the effect as observed and we try to predict the value of the cause variable that led to it. This situation of anticausal prediction may seem unnatural, but it is actually ubiquitous in machine learning. Consider, for instance, the task of predicting the class label of a handwritten digit from its image. The underlying causal structure is as follows: a person intends to write the digit 7, say, and this intention causes a motor pattern producing an image of the digit 7; in that sense, it is justified to consider the class label $Y$ the cause of the image $X$.

#### 3.1.1 Robustness w.r.t. input changes (distribution shift)

##### Given:

training points sampled from $P(X, Y)$ and an additional set of inputs sampled from $P'(X)$, with $P'(X) \neq P(X)$. (Footnote 5: A related scenario is that we do not have additional data from $P'(X)$, but we still want to use our knowledge of the causal direction to learn a model that is somewhat robust w.r.t. changes of $P(X)$ due to changes in either $P(Y)$ or $P(X \mid Y)$.)

##### Goal:

estimate $P'(Y \mid X)$.

##### Assumption:

additive Gaussian noise with invertible function and indecomposable is sufficient. Other assumptions are also possible, but invertibility of the causal conditional is necessary in any case.

##### Solution:

We apply Localizing Distribution Change (Subsection 4.2) to decide if $P(Y)$ or $P(X \mid Y)$ has changed. In the first case, we can estimate $P'(Y)$ via Inverting Conditionals (Subsection 4.1) if we assume that $P(X \mid Y)$ is an injective conditional. (Footnote 6: This term will be introduced in Subsection 4.1. Injectivity means that the input distribution can uniquely be computed from the output distribution. We will give examples of injective conditionals later.) From this we get $P'(X, Y)$, and then

$$P'(Y \mid X) = \frac{P'(X, Y)}{\int P'(X, Y)\, dY}.$$

If, on the other hand, $P(X \mid Y)$ has changed, we can estimate $P'(X \mid Y)$ via Estimating Causal Conditionals (Subsection 4.3).

#### 3.1.2 Semi-supervised learning

##### Given:

training points sampled from $P(X, Y)$ and an additional set of inputs sampled from $P(X)$.

##### Goal:

estimate $P(Y \mid X)$.

##### Assumption:

unclear.

##### Note:

by dependence of the mechanism, $P(X)$ contains information about $P(Y \mid X)$. The additional inputs thus may allow a more accurate estimate of $P(Y \mid X)$. (Footnote 7: Note that a weak form of SSL could roughly work as follows: after learning a generative model for $P(X, Y)$ from the first part of the sample, we can use the additional samples from $P(X)$ to double check whether our model generates the right distribution for $P(X)$.)

Known methods for semi-supervised learning can indeed be viewed in this way. For instance, the cluster assumption says that points that lie in the same cluster of $P(X)$ should have the same $Y$; and the low density separation assumption says that the decision boundary of a classifier (i.e., the point where $P(Y = 1 \mid X)$ crosses $0.5$) should lie in a region where $P(X)$ is small. The semi-supervised smoothness assumption says that the estimated function (which we may think of as the expectation of $P(Y \mid X)$) should be smooth in regions where $P(X)$ is large (for an overview of the common assumptions, see [13]). Some algorithms assume a model for the causal mechanism, $P(X \mid Y)$, which is usually a Gaussian distribution or mixture of Gaussians, and learn it on both labeled and unlabeled data [14]. Note that all these assumptions translate properties of $P(X)$ into properties of $P(Y \mid X)$.

Using a more accurate estimate of $P(X)$, we could also try to proceed as in Subsubsection 3.1.1. (Footnote 8: However, in this case we do not have the two alternatives of whether $P(Y)$ or $P(X \mid Y)$ has changed. The question now should be: given a better estimate of $P(X)$, does that change our estimate of $P(Y)$, or of $P(X \mid Y)$?)

#### 3.2.1 Robustness w.r.t. output changes

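##### Given:

training points sampled from $P(X, Y)$ and an additional set of outputs sampled from $P'(Y)$, with $P'(Y) \neq P(Y)$.

##### Goal:

estimate $P'(Y \mid X)$.

##### Assumption:

none.

##### Solution:

independence of mechanism implies $P'(X \mid Y) = P(X \mid Y)$, hence $P'(X, Y) = P(X \mid Y)\, P'(Y)$. From this, we compute

$$P'(Y \mid X) = \frac{P'(X \mid Y)\, P'(Y)}{\int P'(X, Y)\, dY}.$$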
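For discrete variables, the computation above reduces to Bayes' rule with the old mechanism and the new output marginal. The sketch below illustrates this; the numbers in `p_x_given_y` and `p_y_new` and the function name are made up for the example.

```python
import numpy as np

# Rows index y (the cause / class), columns index x (the effect / input).
# Hypothetical mechanism P(X|Y), assumed unchanged between training and test.
p_x_given_y = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
p_y_new = np.array([0.2, 0.8])   # shifted output marginal P'(Y)

def posterior_under_new_prior(p_x_given_y, p_y_new):
    """Return P'(Y|X) as an array indexed [y, x]."""
    joint = p_x_given_y * p_y_new[:, None]        # P'(X, Y) = P(X|Y) P'(Y)
    evidence = joint.sum(axis=0, keepdims=True)   # sum over Y gives P'(X)
    return joint / evidence                       # each column sums to one

p_y_given_x_new = posterior_under_new_prior(p_x_given_y, p_y_new)
```

In practice $P(X \mid Y)$ would be estimated from the training pairs and $P'(Y)$ from the additional outputs; here both are given directly to keep the example short.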

There may also be room for a semi-supervised learning variant: suppose we have additional output observations rather than additional inputs as in standard SSL — in which situations does this help?

#### 3.3.1 Robustness w.r.t. changes of input and output noise (transfer learning)

##### Given:

training points sampled from and an additional set of points sampled from , with .

##### Goal:

estimate $P'(Y \mid X)$.

##### Assumption:

additive noise where $\phi$ is invariant, but the noises can change.

##### Solution:

analogous to Subsection 2.3.1, but use the model backwards in the end.

#### 3.3.2 Concept drift (changes of the mechanism)

##### Given:

training points sampled from and an additional set of points sampled from , with .

##### Goal:

estimate $P'(Y \mid X)$.

##### Assumption:

invariant, but has changed to .

##### Solution:

We can learn from and then estimate the entire distribution using the estimations of our distributions and obtained from observing those pairs that were taken from .

## 4 Modules

### 4.1 Inverting Conditionals

We can think of a conditional $P(Y \mid X)$ as a mechanism that transforms $P(X)$ into $P(Y)$. In some cases, we do not lose any information by this mechanism:

###### Definition 1 (injective conditionals)

a conditional distribution $P(Y \mid X)$ is called injective if there are no two distributions $P(X) \neq P'(X)$ such that

$$\int P(y \mid x)\, P(x)\, dx = \int P(y \mid x)\, P'(x)\, dx.$$
###### Example 1 (full rank stochastic matrix)

Let $X$ and $Y$ have finite range. Then $P(Y \mid X)$ is given by a stochastic matrix and is injective if and only if this matrix has full rank. Note that injectivity is only possible if $Y$ takes at least as many values as $X$.
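In the finite case, inverting the conditional amounts to solving a linear system. The following sketch assumes a column-stochastic matrix `M` with `M[y, x]` equal to $P(Y = y \mid X = x)$; the matrix entries and the function name `invert_conditional` are illustrative choices.

```python
import numpy as np

# Column-stochastic matrix: M[y, x] = P(Y = y | X = x); 3 outputs, 2 inputs.
M = np.array([[0.8, 0.1],
              [0.1, 0.3],
              [0.1, 0.6]])

def invert_conditional(M, p_y):
    """Recover P(X) from P(Y) = M @ P(X), assuming M has full column rank."""
    if np.linalg.matrix_rank(M) < M.shape[1]:
        raise ValueError("conditional is not injective")
    p_x, *_ = np.linalg.lstsq(M, p_y, rcond=None)   # least-squares solve
    return p_x

p_x_new = np.array([0.3, 0.7])       # input distribution to be recovered
p_y_new = M @ p_x_new                # observed output distribution
recovered = invert_conditional(M, p_y_new)
```

With an empirical estimate of $P(Y)$ the system is only approximately consistent, so the least-squares solution may need to be projected back onto the probability simplex; that step is omitted here.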

###### Example 2 (Post-nonlinear model)

Let $X$ and $Y$ be real-valued and

$$Y = \psi(\phi(X) + N_Y) \quad \text{with} \quad N_Y \perp\!\!\!\perp X,$$

be a post-nonlinear model where $\phi$ and $\psi$ are injective. Then the distribution of $Y$ uniquely determines the distribution of $\phi(X) + N_Y$ because $\psi$ is invertible. This, in turn, uniquely determines the distribution of $\phi(X)$, provided that the convolution with $P(N_Y)$ is invertible. Since $\phi$ is invertible, this determines the distribution of $X$ uniquely.

Note that additive noise models with injective $\phi$ are a special case of a post-nonlinear model, obtained by setting $\psi = \mathrm{id}$.

### 4.2 Localizing distribution change

Given data points sampled from $P(C, E)$ and additional points from $P'(E)$, we wish to decide whether $P(C)$ or $P(E \mid C)$ has changed. Assume

$$E = \phi(C) + N_E,$$

with the same for both distributions and , but the distribution of the noise or the distribution of changes. Let denote the distribution of .999Explicitly, it is derived from the distribution of by .

Then the distributions of the effect are given by

$$P(E) = P(\phi(C)) * P(N_E), \qquad P'(E) = P'(\phi(C)) * P'(N_E),$$

where either $P'(\phi(C)) = P(\phi(C))$ or $P'(N_E) = P(N_E)$. To decide which of these cases is true, we first estimate $\phi$ from the first data set, and then apply a deconvolution with $P(\phi(C))$ (denoted with $*^{-1}$) or $P(N_E)$ to $P'(E)$ and check whether (1) $P(\phi(C)) *^{-1} P'(E)$ or (2) $P(N_E) *^{-1} P'(E)$ is a probability distribution. Below we will discuss one possible set of assumptions that ensure that exactly one of the alternatives is true. In case (1), $P(E \mid C)$ has changed. In case (2), $P(C)$ has changed.

To show that there are (not too artificial) assumptions that render the problem solvable, assume that $P(\phi(C))$ and $P'(\phi(C))$ are indecomposable and $P(N_E)$ and $P'(N_E)$ are Gaussian with zero mean. Then the distribution $P'(E)$ uniquely determines $P'(\phi(C))$ by deconvolving $P'(E)$ with the Gaussian of maximal possible width that still yields a probability distribution.

We are aware that there exist situations where both cases are possible. For instance, consider the example in which

follows a uniform distribution,

, while when generating , and . That is, when generating the new data, only was changed. However, applying the deconvolution with to results in , which still corresponds a valid distribution. Consequently, we have to conclude that both cases are possible.

Despite the examples where the proposed method fails, the proposed method also works in – hopefully – many situations. For instance, now let us switch the roles of and in the example above, or in other words, suppose and . In this example deconvolving with gives , which is not a valid distribution. That is, in this example we can make the decision that has changed. We are working on the conditions to guarantee that only one of the two cases is possible.
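The decision rule can be illustrated for integer-valued variables, where convolution of probability mass functions is polynomial multiplication and deconvolution is polynomial division. The sketch below is an assumption-laden toy version: the distributions are invented, `deconvolve` and `localize_change` are hypothetical helper names, and exact (noise-free) distributions are used instead of empirical estimates.

```python
import numpy as np

def deconvolve(p_sum, p_known, tol=1e-9):
    """Return q with p_known * q = p_sum if q is a valid pmf, else None."""
    q, r = np.polydiv(p_sum, p_known)            # polynomial long division
    exact = np.allclose(r, 0.0, atol=tol)        # zero remainder required
    valid = exact and np.all(q >= -tol) and np.isclose(q.sum(), 1.0)
    return q if valid else None

def localize_change(p_phi, p_noise, p_effect_new):
    """List which factor could have changed to produce p_effect_new."""
    candidates = []
    if deconvolve(p_effect_new, p_phi) is not None:
        candidates.append("noise")   # P(phi(C)) kept, P(N_E) changed
    if deconvolve(p_effect_new, p_noise) is not None:
        candidates.append("cause")   # P(N_E) kept, P(phi(C)) changed
    return candidates

p_phi = np.array([0.5, 0.0, 0.5])      # old pmf of phi(C) on {0, 1, 2}
p_noise = np.array([0.5, 0.5])         # old pmf of N_E on {0, 1}
p_noise_new = np.array([0.2, 0.8])     # only the noise is changed
p_effect_new = np.convolve(p_phi, p_noise_new)
result = localize_change(p_phi, p_noise, p_effect_new)
```

If both deconvolutions return a valid distribution, the list contains both entries, which corresponds to the ambiguous situation described above.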

### 4.3 Estimating causal conditionals

Given $P(C, E)$ and $P'(E)$, estimate $P'(E \mid C)$ under the assumption that $P(C)$ remained constant. Assume that $P(C, E)$ and $P'(C, E)$ have been generated by the additive noise model

$$E = \phi(C) + N_E,$$

with the same $\phi$ and $P(C)$, while the distribution of $N_E$ has changed. We have

$$P(E) = P(\phi(C)) * P(N_E), \qquad P'(E) = P(\phi(C)) * P'(N_E).$$

Hence, $P'(N_E)$ can be obtained by the deconvolution

$$P'(N_E) = P(\phi(C)) *^{-1} P'(E).$$

This way, we can compute the new conditional $P'(E \mid C)$.
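As a small worked instance for integer-valued variables, the new noise distribution can be recovered by polynomial division and then shifted by $\phi(c)$ to tabulate the new conditional. The cause distribution, the map `phi`, and the helper names below are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical discrete setting: C in {0, 1, 2}, phi(c) = 2c, noise N_E in {0, 1}.
p_c = np.array([0.2, 0.5, 0.3])
phi = lambda c: 2 * c

def pmf_of_phi(p_c, phi):
    """Push P(C) through phi to obtain the pmf of phi(C) on {0, ..., max}."""
    values = [phi(c) for c in range(len(p_c))]
    out = np.zeros(max(values) + 1)
    for c, v in enumerate(values):
        out[v] += p_c[c]
    return out

def new_conditional(p_c, phi, p_effect_new):
    """Return the deconvolved noise pmf and the table P'(E = e | C = c)."""
    p_phi = pmf_of_phi(p_c, phi)
    p_noise_new, remainder = np.polydiv(p_effect_new, p_phi)
    assert np.allclose(remainder, 0.0, atol=1e-9), "no exact deconvolution"
    table = np.zeros((len(p_c), len(p_effect_new)))
    for c in range(len(p_c)):
        start = phi(c)   # row c holds P'_N(e - phi(c)) as a function of e
        table[c, start:start + len(p_noise_new)] = p_noise_new
    return p_noise_new, table

# Simulate the changed effect distribution with a new noise pmf (0.6, 0.4).
p_effect_new = np.convolve(pmf_of_phi(p_c, phi), np.array([0.6, 0.4]))
p_noise_new, p_e_given_c_new = new_conditional(p_c, phi, p_effect_new)
```

With sampled data the division is generally not exact, so a regularized deconvolution would be needed; the exact-division check above is only suitable for this noise-free toy case.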

### 4.4 Conditional ANM

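Suppose we are given two data sets generated by

$$E = \phi(C) + N_E \tag{2}$$

and

$$E' = \phi(C') + N'_E, \tag{3}$$

respectively. We apply the algorithm of [12] to obtain the shared function $\phi$, enforcing the independences $C \perp\!\!\!\perp N_E$ and $C' \perp\!\!\!\perp N'_E$ separately.

This can be interpreted as an ANM enforcing conditional independence in

$$E \mid i = \phi(C \mid i) + N_E \mid i, \tag{4}$$

where $i$ is an index, and $C \perp\!\!\!\perp N_E \mid i$.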
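A rough sketch of the shared-function idea follows. It does not implement the dependence-minimizing regression of [12]; instead it fits one polynomial by pooled least squares and returns the residuals per data set, so that independence of residuals and cause could then be checked separately for each set. The synthetic data, the cubic ground truth, and the name `fit_shared_function` are assumptions.

```python
import numpy as np

def fit_shared_function(datasets, degree=3):
    """Least-squares polynomial fit of one phi, pooled over all data sets."""
    c_all = np.concatenate([c for c, _ in datasets])
    e_all = np.concatenate([e for _, e in datasets])
    coeffs = np.polyfit(c_all, e_all, degree)
    # Residuals are kept per data set: each set has its own noise law.
    residuals = [e - np.polyval(coeffs, c) for c, e in datasets]
    return coeffs, residuals

rng = np.random.default_rng(1)
phi = lambda c: c ** 3 - c                              # shared mechanism
c1 = rng.uniform(-1.5, 1.5, size=2000)
e1 = phi(c1) + rng.uniform(-0.3, 0.3, size=2000)        # first noise law
c2 = rng.normal(0.5, 0.5, size=2000)                    # shifted cause
e2 = phi(c2) + rng.laplace(0.0, 0.2, size=2000)         # second noise law

coeffs, residuals = fit_shared_function([(c1, e1), (c2, e2)])
```

Pooling both data sets gives more points for estimating the single function, while the two residual samples can have different distributions, which mirrors the conditional structure in (4).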

##### Acknowledgement

We thank Joris Mooij, Bob Williamson, Vladimir Vapnik, Jakob Zscheischler and Eleni Sgouritsa for helpful discussions.

## References

• [1] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environment. MIT Press, Cambridge, MA, 2012.
• [2] J. Lemeire and E. Dirkx. Causal models as minimal descriptions of multivariate systems. 2007.
• [3] D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.
• [4] G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21, pages 1433–1440, 2009.
• [5] J. Pearl. Causality. Cambridge University Press, 2000.
• [6] P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction, and search. Springer-Verlag. (2nd edition MIT Press 2000), 1993.
• [7] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21, pages 689–696, 2009.
• [8] J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Identifiability of causal graphs using functional models. In Proceedings of the 27th Conference on UAI, pages 589–598, 2011.
• [9] P. Daniušis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. In 26th Conference on Uncertainty in Artificial Intelligence, Corvallis, OR, USA, 07 2010. AUAI Press.
• [10] J. Peters, D. Janzing, and B. Schölkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:2436–2450, 2011.
• [11] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 2009.
• [12] J. Mooij, D. Janzing, J. Peters, and B. Schölkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In A. Danyluk, L. Bottou, and M. Littman, editors, Proceedings of the 26th International Conference on Machine Learning, New York, NY, USA, 06 2009. ACM Press.
• [13] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 09 2006.
• [14] X. Zhu and A. Goldberg. Introduction to semi-supervised learning. In Synthesis Lectures on Artificial Intelligence and Machine Learning, volume 3, pages 1–130. Morgan & Claypool Publishers, 2009.