1 Introduction
By and large, statistical machine learning exploits statistical associations or dependencies between variables to make predictions about certain variables. This is a very powerful concept, especially in situations where we have sizable training sets but no detailed model of the underlying data-generating process. This process is usually modelled as an unknown probability distribution, and machine learning excels whenever this distribution does not change. Most of the theoretical analysis assumes that the data are i.i.d. (independent and identically distributed) or at least exchangeable.
On the other hand, practical problems often do not have these favorable properties, forcing us to leave the comfort zone of i.i.d. data. Sometimes distributions shift over time; sometimes we may want to combine data recorded under different conditions or from different but related regularities. Researchers have developed a number of modifications of statistical learning methods to handle various scenarios of changing distributions; for an overview, see [1].
The present paper attempts to study these problems from the point of view of causal learning. As some other recent work in the field [2, 3], it builds on the assumption that in causal structures, the distribution of the cause and the mechanism relating cause and effect tend to be independent. (Footnote 1: In the mentioned references, independence is meant in the sense of algorithmic independence, but other notions of independence can also make sense.) For instance, in the problem of predicting splicing patterns from genomic sequences, the basic splicing mechanism (driven by the spliceosome) may be assumed stable between different species [4], even though the genomic sequences and their statistical properties might differ in several respects. This is important information constraining causal models, and it can also be useful for robust predictive models, as we try to show in the present paper. Intuitively, if we learn a causal model of splicing, we could hope to be more robust with respect to changes of the input statistics, and we may be able to combine data collected from different species to get a more accurate statistical model of the splicing mechanism.
Causal graphical models as pioneered by [5, 6] are usually thought of as joint probability distributions over a set of variables, along with a directed graph (for simplicity, we assume acyclicity) whose vertices are the variables and whose arrows indicate direct causal influences. The causal Markov assumption [5] states that each vertex is independent of its nondescendants in the graph, given its parents. Here, independence is usually meant in a statistical sense, although alternative views have been developed, e.g., using algorithmic independence [3].
Crucially, the causal Markov assumption links the semantics of causality to something that has empirically measurable consequences (e.g., conditional statistical independence). Given a sufficient set of observations from a joint distribution, it allows us to test conditional independence statements and thus infer (subject to a genericity assumption referred to as "faithfulness") which causal models are consistent with an observed distribution. However, this will typically not lead us to a unique causal model, and in the case of graphs with only two variables, there are no conditional independence statements to test, so this route tells us nothing.
There is an alternative view of causal models which does not start from a joint distribution. Instead, it assumes a set of jointly independent noise variables, one at each vertex, and each vertex computes a deterministic function of its noise variable and its parents. This view, referred to as a functional causal model (or nonlinear structural equation model), entails a joint distribution which, along with the graph, satisfies the causal Markov assumption [5]. Vice versa, each causal graphical model can be expressed as a functional causal model [3, e.g.]. (Footnote 2: As an aside, note that the functional point of view is more specific than the graphical model view [5]: two different functional models can entail the same joint distribution and yet give different answers to counterfactual questions such as "What would have happened if the noise had taken a different value?". The causal graph and the joint distribution alone do not provide sufficient information to give an answer.)
The functional point of view is rather useful in that it allows us to formulate assumptions on causal models that would be harder to conceive in a purely probabilistic view. It has recently been shown [7] that an assumption of nonlinear functions with additive noise renders the two-variable case (and thus the multivariate case [8]) identifiable, i.e., we can distinguish between the causal structures X → Y and Y → X, given that one and only one of these two alternatives is true (which implicitly excludes a common cause of X and Y). Hence, we can tackle the case where conditional independence tests do not provide any information. This opens up the possibility to identify the causal direction for input-output learning problems. The present paper assays whether this can be helpful for machine learning, and it argues that in many situations, a causal model can be more robust under distribution shifts than a purely statistical model. Perhaps somewhat surprisingly, learning problems need not always predict effect from cause, and the direction of the prediction has consequences for which tasks are easy and which are hard. In the remainder of the paper, we restrict ourselves to the simplest possible case, where we have two variables only and there are no unobserved confounders.
Notation.
We consider the causal structure shown in Fig. 1, with two observables, modeled by random variables. When using the notation C and E, the variable C stands for the cause and E for the effect. We denote their domains by 𝒞 and ℰ and their distributions by P(C) and P(E) (overloading the notation P). When using the notation X and Y, the variable X will always be the input and Y the output, from a machine learning point of view (but input and output can either be cause or effect; more below). For simplicity, we assume that their distributions have a joint density p(x, y) with respect to some product measure. We write the values of the marginal densities as p(x) and p(y), again keeping in mind that these are different functions; we can always tell from the argument which function is meant.
We identify a training set of size n with a uniform mixture of Dirac measures, and we use an analogous notation for an additional data set of size n′ (e.g., a set of test inputs). Such an additional set could, for instance, consist of test inputs sampled from a distribution P′(X) that need not be identical with P(X). The following assumptions are used throughout the paper; the subsections below only mention additional assumptions that are task-specific.
Causal sufficiency.
We assume that there are two independent noise variables N_C and N_E, modeled as random variables with domains 𝒩_C and 𝒩_E and distributions P(N_C) and P(N_E); their joint independence expresses the absence of hidden common causes. In some places, we will use conditional densities, always implicitly assuming that they exist.
The function φ and the noise N_E jointly determine the conditional P(E|C) via
E = φ(C, N_E).
We think of P(E|C) as the mechanism transforming cause C into effect E.
Independence of mechanism and input.
We finally assume that the mechanism P(E|C) is "independent" of the distribution of the cause P(C), in the sense that P(E|C) contains no information about P(C) and vice versa; in particular, if P(C) changes at some point in time, there is no reason to believe that P(E|C) changes at the same time. (Footnote 3: A stronger condition, which we do not need in the present context, would be to require that P(C), φ, and N_E be jointly "independent.") This assumption has been used by [2, 3]. It encapsulates our belief that P(E|C) is a mechanism of nature that does not care what we feed into it. The assumption introduces an important asymmetry between cause and effect, since it will usually be violated in the backward direction, i.e., the distribution of the effect and the conditional of cause given effect will typically depend on each other [3, 9].
Richness of functional causal models
It turns out that the two-variable functional causal model is so rich that it cannot be identified from data. The causal Markov condition is trivially satisfied both by the forward model C → E and the backward model E → C, and thus both graphs allow a functional model.
To understand the richness of the class intuitively, consider the simple case where the noise can take only a finite number of values, say N_E ∈ {1, …, k}. The noise could affect φ for instance as follows: there is a set of functions φ_1, …, φ_k, and the noise randomly switches one of them on at any point, i.e.,
E = φ_{N_E}(C).
The functions φ_i could implement arbitrarily different mechanisms, and it would thus be very hard to identify φ from empirical data sampled from such a complex model. (Footnote 4: A similar construction, with the range of the noise having the cardinality of the function class, can be used [3] to argue that every causal graphical model can be expressed as a functional causal model.)
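The switching construction can be made concrete in a few lines. The following sketch is our own toy illustration (the three mechanisms and the names phis and sample_effect are not from the text):

```python
import random

# Illustration of the noise-switching construction (our own toy example):
# the noise N_E selects one of k arbitrary mechanisms phi_i, so E = phi_{N_E}(C).
phis = [lambda c: c + 1.0,    # phi_1: a shift
        lambda c: -2.0 * c,   # phi_2: a sign-flipping scaling
        lambda c: c * c]      # phi_3: a nonlinearity

def sample_effect(c, rng):
    """Draw E = phi_{N_E}(C) with N_E uniform over the k mechanisms."""
    return phis[rng.randrange(len(phis))](c)

rng = random.Random(0)
samples = [sample_effect(2.0, rng) for _ in range(1000)]
# For the fixed cause c = 2, the effect is an arbitrary mixture over the
# values phi_i(2); nothing constrains the resulting conditional P(E|C).
```

Even this tiny example shows that, without further assumptions, the conditional of effect given cause can be essentially arbitrary.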
As an aside, recall that for acyclic causal graphs with more than two variables, the graph structure will typically imply conditional independence properties via the causal Markov condition. However, the above construction with noise randomly switching between mechanisms is still valid, and it is thus surprising that conditional independence alone does allow us to do some causal inference of practical significance, as implemented by the well-known PC and FCI algorithms [6, 5]. It should be clear that additional assumptions preventing the noise-switching construction would significantly facilitate the task of identifying causal graphs from data. Intuitively, such assumptions need to control the complexity with which the noise can affect the mechanism.
Additive noise models.
One such assumption is referred to as ANM, standing for additive noise model [7]. This model assumes that, for some function φ,
E = φ(C) + N_E,   (1)
and it has been shown that φ and P(N_E) can be identified in the generic case, provided that N_E is assumed to have zero mean. This means that, apart from some exceptions, such as the case where φ is linear and C and N_E are Gaussian, a given joint distribution of two real-valued random variables C and E can be fit by an ANM in at most one direction (which we then consider the causal one).
A similar statement has been shown for discrete data [10] and for the post-nonlinear model [11]
E = ψ(φ(C) + N_E),
where ψ is an invertible function.
In practice, an ANM model can be fit by regressing the effect on the cause while enforcing that the residual noise variable is independent of the cause [12]. If this is impossible, the model is incorrect (e.g., cause and effect are interchanged, the noise is not additive, or there are confounders).
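The fitting procedure can be sketched as follows. This is a heavily simplified stand-in for the method of [12]: we replace the independence test (e.g., HSIC) by a crude dependence score between squared residuals and input, and all helper names (polyfit2, dependence) are our own:

```python
import random

# Sketch of ANM-based direction inference, assuming E = phi(C) + N_E:
# regress in both directions and compare how strongly the residuals
# depend on the input.  The real method of [12] uses a proper
# independence test; |corr(res^2, input)| below is only a crude proxy.

def polyfit2(xs, ys):
    """Least-squares fit of y ~ a + b*x + c*x^2 via the normal equations."""
    feats = [[1.0, x, x * x] for x in xs]
    A = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    b = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(3)]
    for col in range(3):                      # Gaussian elimination
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            m = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= m * A[col][c]
            b[r] -= m * b[col]
    w = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):                       # back substitution
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, 3))) / A[r][r]
    return w

def dependence(xs, res):
    """|corr(res^2, x)|: crude proxy for dependence of residuals on input."""
    r2 = [r * r for r in res]
    mx, mr = sum(xs) / len(xs), sum(r2) / len(r2)
    cov = sum((x - mx) * (r - mr) for x, r in zip(xs, r2))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vr = sum((r - mr) ** 2 for r in r2) ** 0.5
    return abs(cov / (vx * vr))

rng = random.Random(1)
c = [rng.uniform(-1, 1) for _ in range(2000)]
e = [x * x + rng.uniform(-0.1, 0.1) for x in c]   # E = C^2 + N_E

w = polyfit2(c, e)   # forward fit: residuals look independent of the cause
res_fwd = [y - (w[0] + w[1] * x + w[2] * x * x) for x, y in zip(c, e)]
v = polyfit2(e, c)   # backward fit: residuals stay dependent on the input
res_bwd = [x - (v[0] + v[1] * y + v[2] * y * y) for x, y in zip(c, e)]

forward_score = dependence(c, res_fwd)
backward_score = dependence(e, res_bwd)
```

The forward fit leaves residuals that carry no visible trace of the cause, while no backward fit can make the residuals independent of the effect; this asymmetry is what identifies the direction.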
ANM plays an important role in this paper: first, because all the methods below presuppose that we know what is cause and what is effect, and second, because we will generalize ANM to handle the case where we have several models of the form (1) that share the same function φ.
2 Predicting Effect from Cause
Let us consider the case where we are trying to estimate a function or a conditional distribution in the causal direction, i.e., where the input X is the cause and the output Y the effect. Intuitively, this situation of causal prediction should be the 'easy' case, since there exists a functional mechanism φ which the estimation should try to mimic. We are interested in the question of how robust (or invariant) the estimation is with respect to changes in the distributions of the underlying functional causal model.
2.1 Additional information about the input
2.1.1 Robustness w.r.t. input changes (distribution shift)
Given:
n training points sampled from P(X, Y) and an additional set of inputs sampled from P′(X), with P′(X) ≠ P(X).
Goal:
estimate P′(Y|X).
Assumption:
none.
Solution:
by independence of mechanism and input, there is no reason to assume that the observed change of the input distribution (i.e., of P(X) to P′(X)) entails a change of P(Y|X), and we thus conclude P′(Y|X) = P(Y|X). This scenario is referred to as covariate shift [1].
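The invariance can be illustrated with a small simulation; the linear mechanism and all names below are our own toy choices, and the argument assumes the model class captures φ beyond the training region:

```python
import random

# Covariate shift sketch: because P(Y|X) is invariant when P(X) changes,
# a correctly specified model fit on training inputs keeps predicting
# well on shifted test inputs.  Mechanism and numbers are illustrative.

def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

rng = random.Random(2)
phi = lambda x: 2.0 * x - 1.0                     # invariant mechanism
xs = [rng.uniform(0, 1) for _ in range(500)]      # training inputs ~ P(X)
ys = [phi(x) + rng.gauss(0, 0.05) for x in xs]
a, b = fit_line(xs, ys)

xs_shift = [rng.uniform(2, 3) for _ in range(500)]  # test inputs ~ P'(X)
err = max(abs((a + b * x) - phi(x)) for x in xs_shift)
# err stays small: the learned conditional transfers to the new inputs.
```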
2.1.2 Semi-supervised learning
Given:
n training points sampled from P(X, Y) and an additional set of inputs sampled from P(X).
Goal:
estimate P(Y|X).
Note:
by independence of the mechanism, P(X) contains no information about P(Y|X). A more accurate estimate of P(X), as may be possible by the addition of the test inputs, thus does not influence an estimate of P(Y|X), and semi-supervised learning (SSL) is pointless for the scenario in Figure 2.
2.2 Additional information about the output
2.2.1 Robustness w.r.t. output changes
Given:
n training points sampled from P(X, Y) and an additional set of outputs sampled from P′(Y), with P′(Y) ≠ P(Y).
Goal:
estimate P′(Y|X).
Assumption:
various options are possible, e.g., an additive Gaussian noise model where P(φ(X)) is indecomposable and P′(φ(X)) is also indecomposable, if it is different from P(φ(X)).
Solution:
first we need to decide whether P(X) or P(N_Y) has changed. This can be done using the method Localizing Distribution Change (Subsection 4.2) under appropriate assumptions (see above). If P(X) has changed, proceed as in Subsubsection 2.1.1. If P(N_Y) has changed, we can estimate P′(Y|X) via Estimating Causal Conditionals (Subsection 4.3). Here, additive noise is a sufficient assumption.
2.2.2 Semi-supervised learning
Given:
n training points sampled from P(X, Y) and an additional set of outputs sampled from P(Y).
Goal:
estimate P(Y|X).
Assumption:
P(X, Y) admits an additive noise model from X to Y, and P(Y) has a unique decomposition as a convolution of two distributions, say P(Y) = P_1 * P_2. This is, for instance, satisfied if the noise is Gaussian and P(φ(X)) is indecomposable.
Solution:
the additional outputs help because the decomposition tells us that either P(N_Y) = P_1 or P(N_Y) = P_2. The additive noise model learned from the (x, y) pairs will probably tell us which of the alternatives is true. Knowing P(N_Y), learning the conditional P(Y|X) reduces to learning the function φ from the (x, y) pairs, which is certainly a weaker problem than learning P(Y|X) would be in general.
2.3 Additional information about input and output
2.3.1 Transfer learning (only noise changes)
Given:
n training points sampled from P(X, Y) and an additional set of points sampled from P′(X, Y), with P′(X, Y) ≠ P(X, Y).
Goal:
estimate P′(Y|X).
Assumption:
an additive noise model where φ is invariant, but the noise distributions can change.
Solution:
run conditional ANM to output a single function φ, enforcing independence of the residuals separately for the two data sets (Section 4.4).
There is also a semi-supervised learning variant of this scenario: given a training set plus two unpaired sets from the two original marginals, the extra sets help to better estimate P(Y|X), because we argued in Subsubsection 2.2.2 that additional values sampled from P(Y) already help.
2.3.2 Concept drift (only mechanism changes)
Given:
n training points sampled from P(X, Y) and an additional set of points sampled from P′(X, Y), with P′(X, Y) ≠ P(X, Y).
Goal:
estimate P′(Y|X).
Assumption:
P(X) invariant, but P(Y|X) has changed.
Solution:
apply ANM to the points sampled from P′(X, Y) to obtain φ′ and the new noise distribution P′(N_Y). Then P′(Y|X) is given by
p′(y|x) = p′_{N_Y}(y − φ′(x)).
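A minimal sketch of this refitting step, with an illustrative linear mechanism of our own choosing (the real procedure would fit a nonlinear φ′ and the residual noise density):

```python
import random

# Concept drift sketch: P(X) stays fixed while the mechanism changes.
# We refit on pairs drawn from the new regime; the refitted function
# plus the residuals then yield P'(Y|X).  Numbers are illustrative.

def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

rng = random.Random(3)
xs = [rng.uniform(-1, 1) for _ in range(400)]            # P(X) unchanged
ys_new = [0.5 * x + 2.0 + rng.gauss(0, 0.05) for x in xs]  # new mechanism phi'
a, b = fit_line(xs, ys_new)   # estimate of phi'; residuals estimate P'(N_Y)
```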
3 Predicting Cause from Effect
We now turn to the opposite direction, where we consider the effect as observed and try to predict the value of the cause variable that led to it. This situation of anticausal prediction may seem unnatural, but it is actually ubiquitous in machine learning. Consider, for instance, the task of predicting the class label of a handwritten digit from its image. The underlying causal structure is as follows: a person intends to write the digit 7, say, and this intention causes a motor pattern producing an image of the digit 7; in that sense, it is justified to consider the class label Y the cause of the image X.
3.1 Additional information about the input
3.1.1 Robustness w.r.t. input changes (distribution shift)
Given:
n training points sampled from P(X, Y) and an additional set of inputs sampled from P′(X), with P′(X) ≠ P(X). (Footnote 5: A related scenario is that we do not have additional data from P′(X), but we still want to use our knowledge of the causal direction to learn a model that is somewhat robust w.r.t. changes of P(X) due to changes in either P(Y) or P(X|Y).)
Goal:
estimate P′(Y|X).
Assumption:
additive Gaussian noise with invertible function φ and indecomposable P(φ(Y)) is sufficient. Other assumptions are also possible, but invertibility of the causal conditional is necessary in any case.
Solution:
we apply Localizing Distribution Change (Subsection 4.2) to decide whether P(Y) or P(X|Y) has changed. In the first case, we can estimate P′(Y) via Inverting Conditionals (Subsection 4.1) if we assume that P(X|Y) is an injective conditional. (Footnote 6: This term will be introduced in Subsection 4.1. Injectivity means that the input distribution can be uniquely computed from the output distribution. We will give examples of injective conditionals later.) From P′(Y) and the unchanged mechanism P(X|Y), we then obtain
p′(y|x) = p(x|y) p′(y) / p′(x).
If, on the other hand, P(X|Y) has changed, we can estimate it via Estimating Causal Conditionals (Subsection 4.3).
3.1.2 Semi-supervised learning
Given:
n training points sampled from P(X, Y) and an additional set of inputs sampled from P(X).
Goal:
estimate P(Y|X).
Assumption:
unclear.
Note:
since the input distribution and the backward conditional are typically dependent in the anticausal direction, P(X) contains information about P(Y|X). The additional inputs thus may allow a more accurate estimate of P(Y|X). (Footnote 7: Note that a weak form of SSL could roughly work as follows: after learning a generative model for P(X, Y) from the first part of the sample, we can use the additional samples from P(X) to double-check whether our model generates the right distribution for X.)
Known methods for semi-supervised learning can indeed be viewed in this way. For instance, the cluster assumption says that points lying in the same cluster of P(X) should have the same label y; the low density separation assumption says that the decision boundary of a classifier (i.e., the point where P(Y|X) crosses 1/2) should lie in a region where p(x) is small; and the semi-supervised smoothness assumption says that the estimated function (which we may think of as the expectation of P(Y|X)) should be smooth in regions where p(x) is large (for an overview of the common assumptions, see [13]). Some algorithms assume a model for the causal mechanism, usually a Gaussian distribution or a mixture of Gaussians, and learn it on both labeled and unlabeled data [14]. Note that all these assumptions translate properties of P(X) into properties of P(Y|X). Using a more accurate estimate of P(X), we could also try to proceed as in Subsubsection 3.1.1. (Footnote 8: However, in this case we do not have the two alternatives of whether P(Y) or P(X|Y) has changed. The question now should be: given a better estimate of P(X), does that change our estimate of P(Y), or of P(X|Y)?)
3.2 Additional information about the output
3.2.1 Robustness w.r.t. output changes
Given:
n training points sampled from P(X, Y) and an additional set of outputs sampled from P′(Y), with P′(Y) ≠ P(Y).
Goal:
estimate P′(Y|X).
Assumption:
none.
Solution:
independence of the mechanism implies P′(X|Y) = P(X|Y), hence P′(X, Y) = P(X|Y) P′(Y). From this, we compute
p′(y|x) = p(x|y) p′(y) / p′(x), where p′(x) = ∫ p(x|y) p′(y) dy.
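This computation is just Bayes' rule with the new prior. A discrete sketch, with purely illustrative numbers (the mechanism matrix and the new prior are our own):

```python
# Anticausal prediction under a shift of the cause distribution:
# the mechanism P(X|Y) is unchanged, the new prior P'(Y) comes from the
# additional output samples, and Bayes' rule gives P'(Y|X).
p_x_given_y = {0: {0: 0.9, 1: 0.1},   # mechanism, invariant
               1: {0: 0.2, 1: 0.8}}
p_y_new = {0: 0.3, 1: 0.7}            # estimated new prior P'(Y)

def posterior(x):
    """P'(Y = y | X = x) = P'(y) P(x|y) / sum_y' P'(y') P(x|y')."""
    joint = {y: p_y_new[y] * p_x_given_y[y][x] for y in p_y_new}
    z = sum(joint.values())
    return {y: v / z for y, v in joint.items()}

post = posterior(1)   # e.g. P'(Y | X = 1)
```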
There may also be room for a semi-supervised learning variant: suppose we have additional output observations rather than additional inputs as in standard SSL; in which situations does this help?
3.3 Additional information about input and output
3.3.1 Robustness w.r.t. changes of input and output noise (transfer learning)
Given:
n training points sampled from P(X, Y) and an additional set of points sampled from P′(X, Y), with P′(X, Y) ≠ P(X, Y).
Goal:
estimate P′(Y|X).
Assumption:
an additive noise model where φ is invariant, but the noise distributions can change.
Solution:
analogous to Subsubsection 2.3.1, but use the model backwards in the end.
3.3.2 Concept drift (changes of the mechanism)
Given:
n training points sampled from P(X, Y) and an additional set of points sampled from P′(X, Y), with P′(X, Y) ≠ P(X, Y).
Goal:
estimate P′(Y|X).
Assumption:
P(Y) invariant, but P(X|Y) has changed to P′(X|Y).
Solution:
we can learn P′(X|Y) from the additional pairs sampled from P′(X, Y), and then estimate the entire distribution P′(X, Y) by combining it with the estimate of the invariant P(Y) obtained from those pairs that were taken from P(X, Y).
4 Modules
4.1 Inverting Conditionals
We can think of a conditional P(E|C) as a mechanism that transforms P(C) into P(E). In some cases, we do not lose any information by this mechanism:
Definition 1 (injective conditionals)
A conditional distribution P(E|C) is called injective if there are no two different input distributions that induce the same output distribution, i.e., if P(C) ≠ P̃(C) implies P(E) ≠ P̃(E).
Example 1 (full rank stochastic matrix)
Let C and E have finite ranges. Then P(E|C) is given by a stochastic matrix A, and it is injective if and only if A has full column rank. Note that this is only possible if the range of E is at least as large as the range of C.
Example 2 (post-nonlinear model)
Let C and E be real-valued and E = ψ(φ(C) + N_E) be a post-nonlinear model where φ and ψ are injective. Then the distribution of E uniquely determines the distribution of φ(C) + N_E because ψ is invertible. This, in turn, uniquely determines the distribution of φ(C), provided that the convolution with P(N_E) is invertible. Since φ is invertible, this determines the distribution of C uniquely.
Note that additive noise models with injective φ are a special case of post-nonlinear models, obtained by setting ψ = id.
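Example 1 can be checked mechanically; the two matrices below are our own illustrations of a full-rank and a rank-deficient conditional:

```python
# With finite ranges, P(E|C) acts on input distributions as a
# column-stochastic matrix A; injectivity = full column rank.

def apply(A, p):
    """Output distribution: (A p)_i = sum_j A[i][j] p[j]."""
    return [sum(A[i][j] * p[j] for j in range(len(p))) for i in range(len(A))]

A_full = [[0.9, 0.2],      # full column rank: columns are distinct
          [0.1, 0.8]]
A_deficient = [[0.5, 0.5], # rank one: both columns identical
               [0.5, 0.5]]

p, q = [0.3, 0.7], [0.6, 0.4]
out_full = (apply(A_full, p), apply(A_full, q))
out_def = (apply(A_deficient, p), apply(A_deficient, q))
# A_full maps the two distinct input distributions to distinct outputs;
# the rank-deficient matrix collapses them, so the input is lost.
```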
4.2 Localizing distribution change
Given data points sampled from P(C, E) and additional points from P′(E), we wish to decide whether P(C) or P(N_E) has changed. Assume E = φ(C) + N_E with the same φ for both distributions P and P′, but where either the distribution of the noise or the distribution of C changes. Let P(φ(C)) denote the distribution of φ(C). (Footnote 9: Explicitly, it is derived from the distribution of C as the image measure under the map φ.)
Then the distributions of the effect are given by
P(E) = P(φ(C)) * P(N_E), and either (1) P′(E) = P′(φ(C)) * P(N_E) or (2) P′(E) = P(φ(C)) * P′(N_E).
To decide which of these cases is true, we first estimate P(φ(C)) and P(N_E) from the first data set, and then apply a deconvolution of P′(E) with P(N_E) or with P(φ(C)) and check whether the result in case (1) or in case (2) is a probability distribution. Below we will discuss one possible set of assumptions that ensure that exactly one of the alternatives is true. In case (1), P(C) has changed; in case (2), P(N_E) has changed.
To show that there are (not too artificial) assumptions that render the problem solvable, assume that P(φ(C)) and P′(φ(C)) are indecomposable and N_E and N′_E are Gaussian with zero mean. Then the distribution P′(E) uniquely determines P′(φ(C)), by deconvolving P′(E) with the Gaussian of maximal possible width that still yields a probability distribution.
We are aware that there exist situations where both cases are possible. One can, for instance, construct an example in which only P(N_E) was changed when generating the new data, yet applying the deconvolution with P(N_E) to P′(E) still results in a valid distribution; consequently, we have to conclude that both cases are possible. Despite such examples where the proposed method fails, it works in, hopefully, many situations. For instance, switching the roles of P(φ(C)) and P(N_E) in such an example, one of the two deconvolutions no longer yields a valid distribution, so the corresponding case can be ruled out and the decision becomes unique. We are working on conditions that guarantee that only one of the two cases is possible.
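The feasibility check behind this decision can be caricatured at the level of variances alone (a heavy simplification of our own; real deconvolution operates on full densities): with E = C + N_E and known Gaussian noise, deconvolving P′(E) by P(N_E) can only succeed if the remaining variance is nonnegative.

```python
# Variance-only caricature of "localizing distribution change":
# subtracting the noise variance must leave a valid (nonnegative)
# variance for some input distribution, otherwise that case is ruled out.

def can_deconvolve(var_effect_new, var_noise):
    """Whether subtracting the noise variance leaves a valid variance."""
    return var_effect_new - var_noise >= 0.0

var_noise = 1.0
feasible_input_change = can_deconvolve(3.0, var_noise)  # effect widened
feasible_if_narrower = can_deconvolve(0.5, var_noise)   # narrower than noise
# In the second case no input distribution can explain the data, so the
# noise itself must have changed.
```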
4.3 Estimating causal conditionals
Given P′(E), estimate P′(E|C) under the assumption that P(C) remained constant. Assume that P(C, E) and P′(C, E) have been generated by the additive noise model E = φ(C) + N_E with the same φ and P(C), while the distribution of the noise has changed from P(N_E) to P′(N_E). We have
P′(E) = P(φ(C)) * P′(N_E).
Hence, P′(N_E) can be obtained by deconvolving P′(E) with P(φ(C)). This way, we can compute the new conditional P′(E|C).
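On integer grids, this deconvolution is exactly polynomial long division of the distributions' weight vectors. The example distributions below are our own:

```python
# Discrete sketch of "estimating causal conditionals": P'(E) is the
# convolution of the unchanged P(phi(C)) with the new noise law P'(N_E),
# so P'(N_E) is recovered by deconvolution (polynomial long division).

def convolve(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def deconvolve(r, p):
    """Solve r = convolve(p, q) for q, assuming p[0] != 0."""
    q = [0.0] * (len(r) - len(p) + 1)
    r = list(r)
    for i in range(len(q)):
        q[i] = r[i] / p[0]
        for j, a in enumerate(p):
            r[i + j] -= q[i] * a
    return q

p_phi_c = [0.5, 0.5]            # unchanged P(phi(C)) on {0, 1}
p_noise_new = [0.2, 0.3, 0.5]   # new noise law P'(N_E), to be recovered
p_effect_new = convolve(p_phi_c, p_noise_new)   # observed P'(E)
recovered = deconvolve(p_effect_new, p_phi_c)
```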
4.4 Conditional ANM
Given two data sets generated by
Y = φ(X) + N_Y   (2)
and
Y = φ(X) + N′_Y,   (3)
respectively, we apply the algorithm of [12] to obtain the shared function φ, enforcing independence of the residuals from the input separately for each data set (N_Y ⊥ X and N′_Y ⊥ X).
This can be interpreted as an ANM model enforcing conditional independence in
Y = φ(X) + N_Y^{(I)},   (4)
where I is an index selecting the data set, and the noise is required to be independent of X given I.
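A simplified sketch of the shared fit (our own toy version: a linear φ fit on the pooled sample, with per-data-set residuals inspected afterwards, in place of the dependence-minimizing regression of [12]):

```python
import random

# Conditional ANM sketch: two data sets share the same function phi but
# have different input and noise distributions.  We fit one function on
# the pooled sample and look at each data set's residuals separately.

def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

rng = random.Random(4)
phi = lambda x: 1.5 * x                       # shared mechanism
d1 = [(x, phi(x) + rng.gauss(0, 0.05))        # data set 1: small noise
      for x in (rng.uniform(0, 1) for _ in range(300))]
d2 = [(x, phi(x) + rng.gauss(0, 0.3))         # data set 2: large noise
      for x in (rng.uniform(-2, 0) for _ in range(300))]

xs = [x for x, _ in d1 + d2]
ys = [y for _, y in d1 + d2]
a, b = fit_line(xs, ys)                       # single shared estimate of phi

res1 = [y - (a + b * x) for x, y in d1]       # residuals, data set 1
res2 = [y - (a + b * x) for x, y in d2]       # residuals, data set 2
# Because phi is truly shared, one pooled fit leaves unbiased residuals
# in both regimes, despite the different noise levels.
```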
Acknowledgement
We thank Joris Mooij, Bob Williamson, Vladimir Vapnik, Jakob Zscheischler and Eleni Sgouritsa for helpful discussions.
References
 [1] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments. MIT Press, Cambridge, MA, 2012.
 [2] J. Lemeire and E. Dirkx. Causal models as minimal descriptions of multivariate systems. http://parallel.vub.ac.be/jan/, 2007.
 [3] D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.
 [4] G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21, pages 1433–1440, 2009.
 [5] J. Pearl. Causality. Cambridge University Press, 2000.
 [6] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993 (2nd edition: MIT Press, 2000).
 [7] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21, pages 689–696, 2009.
 [8] J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Identifiability of causal graphs using functional models. In Proceedings of the 27th Conference on UAI, pages 589–598, 2011.

 [9] P. Daniušis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, Corvallis, OR, USA, 2010. AUAI Press.
 [10] J. Peters, D. Janzing, and B. Schölkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:2436–2450, 2011.
 [11] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 2009.
 [12] J. Mooij, D. Janzing, J. Peters, and B. Schölkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In A. Danyluk, L. Bottou, and M. Littman, editors, Proceedings of the 26th International Conference on Machine Learning, New York, NY, USA, 2009. ACM Press.
 [13] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
 [14] X. Zhu and A. Goldberg. Introduction to semisupervised learning. In Synthesis Lectures on Artificial Intelligence and Machine Learning, volume 3, pages 1–130. Morgan & Claypool Publishers, 2009.