1 Introduction
Predicting unknown values based on observed data is a problem central to many sciences, and well studied in statistics and machine learning. This problem becomes significantly harder if the training and test data do not have the same distribution, for example because they come from different domains. Such a distribution shift can happen whenever the circumstances under which the training data were gathered are different from those for which the predictions are to be made. A rich literature exists on this problem of domain adaptation, a particular task in the field of transfer learning; see e.g. Quiñonero-Candela et al. (2009); Pan and Yang (2010) for overviews.
When the domain changes, so may the relations between the different variables under consideration. While a function learned in one domain on some sets of variables may continue to offer good predictions of the target variable in a different domain, this may not be true of other sets of variables. Causal graphs (e.g., Pearl, 2009; Spirtes et al., 2000) allow us to reason about this in a principled way when the domains correspond to different external interventions on the system, or more generally, to different contexts in which a system has been measured. Knowledge of the causal graph that describes the data generating mechanism, and of which parts of the model are invariant across the different domains, allows one to transfer knowledge from one domain to the other in order to address the problem of domain adaptation (Spirtes et al., 2000; Storkey, 2009; Schölkopf et al., 2012; Bareinboim and Pearl, 2016).
In recent years, various methods have been proposed to exploit the causal structure of the data generating process in order to address certain domain adaptation problems, each relying on different assumptions. For example, Bareinboim and Pearl (2016) provide theory for identifiability under transfer (“transportability”) assuming that the causal graph is known, that interventions are perfect, and that the intervention targets are known. Hyttinen et al. (2015) also assume perfect interventions with known targets but do not rely on complete knowledge of the causal graph, instead inferring the relevant aspects of it from the data. Rojas-Carulla et al. (2018) make the assumption that if the conditional distribution of the target given some subset of covariates is invariant across different source domains, then this conditional distribution must also be the same in the target domain. The methods proposed in (Schölkopf et al., 2012; Zhang et al., 2013, 2015; Gong et al., 2016) all address challenging settings in which conditional independences that follow from the usual Markov and faithfulness assumptions alone do not suffice to solve the problem, but additional assumptions on the data generating process have to be made.
In this work, we will make no such additional assumptions, and address the setting in which both the causal graph and the intervention types and targets may be (partially) unknown. Our contributions are the following. We consider a set of relatively weak assumptions that make the problem well-posed. We propose an approach to solve this class of causal domain adaptation problems that can deal with the presence of latent confounders. The main idea is to select the subset of features that leads to the best predictions of the target variable in the source domains, while satisfying invariance (i.e., the conditional distribution of the target variable given these features is the same in the source and target domains). To test whether the invariance condition is satisfied, we apply the recently proposed Joint Causal Inference (JCI) framework (Mooij et al., 2018) to exploit the information provided by multiple domains corresponding to different interventions. The basic idea is as follows. First, a standard feature selection method is applied to source domains data to find sets of features that are predictive of a target variable, trading off bias and variance, but unaware of changes in the distribution across domains. A causal inference method then draws conclusions from all given data about the possible causal graphs, avoiding sets of features for which the predictions would not transfer to the target domains. We propose a proof-of-concept implementation of our approach building on a causal discovery algorithm by Hyttinen et al. (2014). We evaluate the method on synthetic data and a real-world example.
2 Theory
Before giving a precise definition of the class of domain adaptation problems that we consider in this work, we begin with a motivating example.
Example 1.
We are given three variables $X_1, X_2, X_3$ describing different aspects of a system (for example, certain blood cell phenotypes in mice). We have observational measurements of these three variables (the source domain, designated with $C_1 = 0$), and in addition, measurements of $X_1$ and $X_3$ under an intervention (the target domain, designated with $C_1 = 1$), e.g., in which the mice have been exposed to a certain drug. The domain adaptation task is to predict the values of $X_2$ in the interventional target domain (i.e., when $C_1 = 1$). Let us assume for this example that the causal graph in Figure 1a applies, i.e., we assume that $X_2$ is affected by $X_1$ and affects $X_3$, while $C_1$ affects both $X_1$ and $X_3$ (i.e., the intervention targets the variables $X_1$ and $X_3$). This causal graph implies that $X_2$ is conditionally independent of $C_1$ given $X_1$. Suppose further that the relation between $X_1$ and $X_2$ is about equally strong as the relation between $X_2$ and $X_3$, but considerably more noisy. Then a feature selection method using only available source domain data, and aiming to select the best subset of features to use for prediction of $X_2$, will prefer both $\{X_3\}$ and $\{X_1, X_3\}$ over $\{X_1\}$ (because predicting $X_2$ from $X_1$ leads to larger variance than predicting $X_2$ from $X_3$, and to a larger bias than predicting $X_2$ from both $X_1$ and $X_3$). However, under the intervention ($C_1 = 1$), the conditional distributions of $X_2$ given $X_3$ and of $X_2$ given $\{X_1, X_3\}$ both change, so that using those features to predict $X_2$ in the target domain could lead to extreme bias, as illustrated in Figure 1c. [Footnote 1: More precisely, we should say that $p(X_2 \mid X_3, C_1 = 0)$ may differ from $p(X_2 \mid X_3, C_1 = 1)$, and similarly when conditioning on $\{X_1, X_3\}$.] Because the conditional distribution of $X_2$ given $X_1$ is invariant across domains, as illustrated in Figure 1b, predictions of $X_2$ based only on $X_1$ can be safely transferred to the target domain.
This example provides an instance of a domain adaptation problem where feature selection methods that do not take into account the causal structure would pick a set of features that does not generalize to the target domain, and may lead to arbitrarily bad predictions (even asymptotically, as the number of data points tends to infinity). On the other hand, correctly taking into account the causal structure and the possible distribution shift from source to target domain allows us to upper bound the prediction error in the target domain, as we will see in Section 2.3.
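The effect described in Example 1 can be reproduced numerically. The following is a minimal sketch, not the authors' code: it assumes a linear Gaussian version of the example with a binary context `c`, system variables named X1, X2, X3 (X2 being the target), and an arbitrary shift of 5 under the intervention. All coefficients, noise levels, and names are illustrative assumptions.

```python
import random

def simulate(n, c, rng, shift=5.0):
    """Draw n samples (x1, x2, x3) in context c (0 = source, 1 = target)."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0) + shift * c       # the context shifts X1
        x2 = x1 + rng.gauss(0.0, 1.0)              # noisy, invariant mechanism X1 -> X2
        x3 = x2 + shift * c + rng.gauss(0.0, 0.1)  # precise mechanism X2 -> X3, shifted by the context
        rows.append((x1, x2, x3))
    return rows

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given data."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
source = simulate(5000, 0, rng)
target = simulate(5000, 1, rng)

results = {}
for name, idx in (("X1", 0), ("X3", 2)):
    # Fit on source data only, then evaluate in both domains.
    model = fit_line([r[idx] for r in source], [r[1] for r in source])
    results[name] = (
        mse(model, [r[idx] for r in source], [r[1] for r in source]),  # source risk
        mse(model, [r[idx] for r in target], [r[1] for r in target]),  # target risk
    )
    print(f"predict X2 from {name}: source MSE = {results[name][0]:.3f}, "
          f"target MSE = {results[name][1]:.3f}")
```

Under these assumptions, the feature with the smaller source risk (X3) is the one whose predictions degrade in the target domain, while the source and target risks of the predictor based on X1 stay close to each other.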
2.1 Problem Setting
We now formalize the domain adaptation problems that we address in this paper. We will make use of the terminology of the recently proposed Joint Causal Inference (JCI) framework (Mooij et al., 2018).
Let us consider a system of interest described by a set of system variables . In addition, we model the domain in which the system has been measured by context variables (we will use “context” as a synonym for “domain”). We will denote the tuple of all system and context variables as . System and context variables can be discrete or continuous. As a concrete example, the system of interest could be a mouse. The system variables could be blood cell phenotypes such as the concentration of red blood cells, the concentration of white blood cells, and the mean red blood cell volume. The context variables could indicate for example whether a certain gene has been knocked out, the dosage of a certain drug administered to the mice, the age and gender of the mice, or the lab in which the measurements were done. The important underlying assumption is that context variables are exogenous to the system, whereas system variables are endogenous. The interventions are not limited to the perfect (“surgical”) interventions modeled by the do-operator of Pearl (2009), but can also be other types of interventions such as mechanism changes (Tian and Pearl, 2001), soft interventions (Markowetz et al., 2005), fat-hand interventions (Eaton and Murphy, 2007), activity interventions (Mooij and Heskes, 2013), and stochastic versions of all these. Knowledge of the intervention targets is not necessary (but is certainly helpful). For example, administering a drug to the mice may have a direct causal effect on an unknown subset of the system variables, but we can simply model it as a binary exogenous variable (indicating whether or not the drug was administered) or a continuous exogenous variable (describing the dosage of the administered drug) without specifying in advance on which variables it has a direct effect. We can now formally state the domain adaptation task that we address in this work:
Task 1 (Domain Adaptation Task).
We are given data for a single or for multiple source domains, in each of which , and for a single or for multiple target domains, in each of which . Assume the source domains data is complete (i.e., no missing values), and the target domains data is complete with the exception of all values of a certain target variable . The task is to predict these missing values of the target variable given the available source and target domains data.
An example is provided in Figure 2. In the next subsection, we will formalize our assumptions to turn this task into a well-posed problem.
2.2 Assumptions
Our first main assumption is that the data generating process (on both system and context variables) can be represented as a Structural Causal Model (SCM) (see e.g., (Pearl, 2009)):
(1) 
Here, we introduced exogenous latent independent “noise” variables that model latent causes of the context and system variables. The parents of each variable are denoted by . Each context and system variable is related to its parent variables by a structural equation. In addition, we assume a factorizing probability distribution on the exogenous variables. There could be cyclic dependencies, for example due to feedback loops, but for simplicity of exposition we will discuss only the acyclic case here, noting that the extension to the cyclic case is straightforward given recent theoretical advances on cyclic SCMs (Bongers et al., 2018). This SCM provides a causal model for the distributions of the various domains, and in particular, it induces a joint distribution on the context and system variables. Note that we will assume that the data generating process can be modeled by some model of this form, but we do not rely on knowing the precise model.
The SCM can be represented graphically by its causal graph , a graph with nodes (i.e., the labels of both system and context variables), directed edges for iff , and bidirected edges for iff there exists a . In the acyclic case, this causal graph is an Acyclic Directed Mixed Graph (ADMG), and is also known as a Semi-Markov Causal Model (see e.g., (Pearl, 2009)). The directed edges represent direct causal relationships, and the bidirected edges may represent hidden confounders (both relative to the set of variables in the ADMG). The (causal) Markov assumption holds (Richardson, 2003), i.e., any d-separation between sets of random variables in the ADMG implies a conditional independence in the distribution induced by the SCM . A standard assumption in causal discovery is that the joint distribution is faithful with respect to the ADMG , i.e., that there are no other conditional independences in the joint distribution than those implied by d-separation.
We will make the following assumptions on the causal structure (where henceforth we will simply write instead of ), which are discussed in detail by Mooij et al. (2018):
Assumption 1 (JCI Assumptions).
Let be a causal graph with variables (consisting of system variables and context variables ).

(i) No system variable directly causes any context variable (“exogeneity”);
(ii) No system variable is confounded with a context variable (“randomization”);
(iii) Every pair of context variables is purely confounded (“genericity”).
The first assumption is the most crucial one that captures what we mean by “context”. The other two assumptions are less crucial and could be omitted, depending on the application. For a more in-depth discussion of these modeling assumptions and on how they compare with other possible causal modeling approaches, we refer the reader to (Mooij et al., 2018). Any causal discovery method can in principle be used in the JCI setting, but identifiability greatly benefits from taking into account the background knowledge on the causal graph from Assumption 1.
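The Markov assumption above links d-separation in the causal graph to conditional independence in the data, so it can help to see how d-separation is checked when a graph is given. The following is a minimal sketch, not part of the authors' implementation: it handles directed acyclic graphs through the standard moralized ancestral graph criterion, and assumes that a bidirected edge of an ADMG is encoded by adding an explicit latent common parent. The function name and the node names (taken from the graph described in Example 1, with hypothetical labels) are illustrative.

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """True iff node sets xs and ys are d-separated given zs in the DAG
    described by `parents` (dict: node -> set of parent nodes)."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Restrict to the ancestors of xs, ys and zs (including themselves).
    anc, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: link each node to its parents, marry co-parents, drop directions.
    nbrs = {v: set() for v in anc}
    for child in anc:
        pa = list(parents.get(child, ()))
        for p in pa:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(pa, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # 3. Delete zs and test whether any node of ys is reachable from xs.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        if v in seen:
            continue
        seen.add(v)
        stack.extend(n for n in nbrs[v] if n not in zs and n not in seen)
    return True

# Graph of Example 1 (hypothetical labels): C -> X1 -> X2 -> X3 <- C
graph = {"X1": {"C"}, "X2": {"X1"}, "X3": {"X2", "C"}}
for cond in (set(), {"X1"}, {"X3"}, {"X1", "X3"}):
    print(sorted(cond), d_separated(graph, {"X2"}, {"C"}, cond))
```

In this graph only the conditioning set containing X1 alone separates the target X2 from the context C; conditioning on X3 opens the collider path through X3.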
In addition, in order to be able to address the causal domain adaptation task, we will assume:
Assumption 2.
Let be a causal graph with variables (consisting of system variables and context variables ), and be the corresponding distribution on . Let be the source/target domains indicator and the target variable.

(i) The distribution is Markov and faithful w.r.t. ;
(ii) Any conditional independence involving in the source domains also holds in the target domains, i.e., if contains but not then: [Footnote 2: Here, with we mean , i.e., the conditional independence of from given in the mixture of the source domains , and similarly for the target domains.]
(iii) has no direct effect on w.r.t. , i.e., .
The Markov and faithfulness assumptions are standard in constraint-based causal discovery on a single domain; we apply them here on the “meta-system” composed of system and context.
Assumption 2(ii) may seem non-intuitive, but as we show in the Supplementary Material, it follows from more intuitive (but stronger) assumptions, for example if both the pooled source domains distribution and the pooled target domains distribution are Markov and faithful to the subgraph of which excludes . These stronger assumptions imply that the causal structure (i.e., presence or absence of causal relationships and confounders) of the other variables is invariant when going from source to target domains. Assumption 2(ii) is a weakened version of these more natural assumptions, allowing additional independences to hold in the target domains compared to the source domains, e.g., when models a perfect surgical intervention.
Assumption 2(iii) is strong, yet some assumption of that type seems necessary to make the task well-defined. Without any information at all about the target(s) of , or the causal mechanism that determines the values of in the target domains, predicting the values of for the target domains seems generally impossible. Note that the assumption is more likely to be satisfied if the interventions are believed to be precisely targeted, and gets weaker the more relevant system variables are observed. [Footnote 3: This assumption can be weakened further: in some circumstances one can infer from the data and the other assumptions that cannot have a direct effect on . For example: if there exists a descendant , and if there exists a set , such that , then is not a direct cause of w.r.t. .] For some proposals on alternative assumptions that can be made when this assumption is violated, see e.g., (Schölkopf et al., 2012; Zhang et al., 2013, 2015; Gong et al., 2016).
As one example of a real-world setting in which these assumptions are reasonable, consider a genomics experiment, in which gene expression levels of many different genes are measured in response to knockouts of single genes. Given our present-day understanding of the biology of gene expression, it is very reasonable to assume that the knockout of gene only has a direct effect on the expression level of gene itself. As long as we do not ask to predict the expression level of under a knockout of , but only the expression level of other genes with , Assumption 2(iii) seems justified. It is also reasonable (based on present-day understanding of biology) to expect that a single gene knockout does not change the causal mechanisms in the rest of the system. This justifies Assumption 2(ii) in this setting if one is willing to assume faithfulness.
In the next subsections, we will discuss how these assumptions enable us to address the domain adaptation task.
2.3 Separating Sets of Features
Our approach to addressing Task 1 is based on finding a separating set of (context and system) variables that satisfies . If such a separating set can be found, then the distribution of conditional on is invariant under transferring from the source domains to the target domains, i.e., . As the former conditional distribution can be estimated from the source domains data, we directly obtain a prediction for the latter, which then enables us to predict the values of from the observed values of in the target domains. [Footnote 4: This trivial observation is not novel; see e.g. (Ch. 7, p. 164, Spirtes et al., 2000). It also follows as a special case of (Theorem 2, Pearl and Bareinboim, 2011). The main novelty of this work is the proposed strategy to identify such separating sets.]
We will now discuss the effect of the choice of on the quality of the predictions. For simplicity of the exposition, we make use of the squared loss function and look at the asymptotic case, ignoring finite-sample issues. When predicting from a subset of features (that may or may not be separating), the optimal predictor is defined as the function mapping from the range of possible values of to the range of possible values of that minimizes the target domains risk , and is given by the conditional expectation (regression function) . Since is not observed in the target domains, we cannot directly estimate this regression function from the data.
One approach that is often used in practice is to ignore the difference in distribution between source and target domains, and use instead the predictor , which minimizes the source domains risk . This approximation introduces a bias that we will refer to as the transfer bias (when predicting from ). When ignoring that source domains and target domains have different distributions, any standard machine learning method can be used to predict from . As the transfer bias can become arbitrarily large (as we have seen in Example 1), the prediction accuracy of this solution strategy may be arbitrarily bad (even in the infinite-sample limit).
Instead, we propose to only predict from when the set of features satisfies the following separating set property:
(2) 
i.e., it d-separates from in . By the Markov assumption, this implies . In other words (as already mentioned above), for separating sets, the distribution of conditional on is invariant under transferring from the source domains to the target domains, i.e., . By virtue of this invariance, regression functions are identical for the source domains and target domains, i.e., , and hence also the source domains and target domains risks are identical when using the predictor :
(3) 
The r.h.s. can be estimated from the source domains data, and the l.h.s. equals the generalization error to the target domains when using the predictor trained on the source domains (which equals the predictor that one could obtain if all target domains data, including the values of , were observed). [Footnote 5: Note that this equation only holds asymptotically; for finite samples, in addition to the transfer from source domains to target domains, we have to deal with the generalization from empirical to population distributions and from the covariate shift if (see e.g. Mansour et al., 2009).] Although this approach leads to zero transfer bias, it introduces another bias: by using only a subset of the features , rather than all available features , we may miss relevant information to predict . We refer to this bias as the incomplete information bias, .
The total bias when using to predict is the sum of the transfer bias and the incomplete information bias:
For some problems, one may be better off by simply ignoring the transfer bias and minimizing the incomplete information bias, while for other problems, it is crucial to take the transfer into account to obtain small generalization errors. In that situation, we could use any subset for prediction that satisfies the separating set property (2), implying zero transfer bias; obviously, the best predictions are then obtained by selecting a separating subset that also minimizes the source domains risk (i.e., minimizes the incomplete information bias). We conclude that this strategy of selecting a subset to predict may yield an asymptotic guarantee on the prediction error by (3), whereas simply ignoring the shift in distribution may lead to unbounded prediction error, since the transfer bias could be arbitrarily large in the worst case scenario.
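The bias decomposition discussed in this subsection can be written compactly. The notation below is an assumption made for this sketch and is not taken from the text: $Y$ is the target variable, $A$ a feature subset, $F$ the set of all available features, and $C_1 = 0$ and $C_1 = 1$ index the source and target domains, respectively.

```latex
% Assumed notation: \hat{Y}^{c}_{S} is the regression function of Y on the
% feature set S in the domains with C_1 = c (c = 0: source, c = 1: target).
\begin{align*}
\hat{Y}^{0}_{A} &:= \mathbb{E}[\, Y \mid A,\ C_1 = 0 \,], \qquad
\hat{Y}^{1}_{A} := \mathbb{E}[\, Y \mid A,\ C_1 = 1 \,], \\[4pt]
\underbrace{\hat{Y}^{1}_{F} - \hat{Y}^{0}_{A}}_{\text{total bias}}
&= \underbrace{\hat{Y}^{1}_{A} - \hat{Y}^{0}_{A}}_{\text{transfer bias}}
 \;+\; \underbrace{\hat{Y}^{1}_{F} - \hat{Y}^{1}_{A}}_{\text{incomplete information bias}} .
\end{align*}
```

Written this way, the identity is a telescoping sum: adding and subtracting $\hat{Y}^{1}_{A}$ splits the total bias into the two named terms. For a separating set the transfer bias term vanishes, so only the incomplete information bias remains.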
2.4 Identifiability of Separating Feature Sets
For the strategy of selecting the best separating sets of features as discussed in Section 2.3, we need to find one or more sets that satisfy (2). Of course, the problem is that we cannot directly test this in the data, because the values of are missing for . Note that Assumption 2(ii) also cannot be directly used here, because it only applies when is not in . When the causal graph is known, it is easy to verify whether (2) holds directly using d-separation. Here we address the more challenging setting in which the causal graph and the targets of the interventions are (partially) unknown. [Footnote 6: Another option, proposed by Rojas-Carulla et al. (2018), is to assume that if is invariant across all source domains (i.e., for all ), then the same holds across all source and target domains (i.e., for all ). This assumption can be violated in some simple cases, e.g. see Example 2.] Conceptually, one could estimate a set of possible causal graphs by using a causal discovery algorithm (for example, extending any standard method to deal with the missing conditional independence tests in ), and then read off separating sets from these graphs. In practice, it is not necessary to completely estimate these causal graphs: we only need to know enough about them to verify or falsify whether a given set of features separates from . The following example (with details in the Supplementary Material) illustrates a case where such reasoning allows us to identify a separating set.
Example 2.
Assume that Assumptions 1 and 2 hold for two context variables and three system variables with . If the following conditional (in)dependences all hold in the source domains:
(4) 
then , i.e., is a separating set for and . One possible causal graph leading to those (in)dependences is provided in Figure 2 (the others are shown in Figure 1(a) in the Supplementary Material). For that ADMG, and given enough data, feature selection applied to the source domains data will generically select as the optimal set of features for predicting , which can lead to an arbitrarily large prediction error. On the other hand, the set is separating in any ADMG satisfying (4), so using it to predict leads to zero transfer bias, and therefore provides a guarantee on the target domains risk (i.e., it provides an upper bound on the optimal target domains risk, which can be estimated from the source domains data).
Rather than characterizing by hand all possible situations in which a separating set can be identified (like in Example 2), in this work we delegate the causal inference to an automatic theorem prover. Intuitively, the idea is to provide the automatic theorem prover with the conditional (in)dependences that hold in the data, in combination with an encoding of Assumptions 1 and 2 into logical rules, and ask the theorem prover whether it can prove that holds for a candidate set from the assumptions and provided conditional (in)dependences. There are three possibilities: either it can prove the query (and then we can proceed to predict from and get an estimate of the target domains risk), or it can disprove the query (and then we know will generically give predictions that suffer from an arbitrarily large transfer bias), or it can do neither (in which case hopefully another subset can be found that does provably satisfy (2)).
2.5 Algorithm
A simple (brute-force) algorithm that finds the best separating set as described in Section 2.3 is the following. By using a standard feature selection method, produce a ranked list of subsets , in ascending order of their empirical source domains risks. Going through this list of subsets (starting with the one with the smallest empirical source domains risk), test whether the separating set property can be inferred from the data by querying the automated theorem prover. If (2) can be shown to hold, use that subset for prediction of and stop; if not, continue with the next candidate subset in the list. If no subset satisfies (2), abstain from making a prediction. [Footnote 7: Abstaining from predictions can be advantageous when trading off recall and precision. If a prediction has to be made, we can fall back on some other method or simply accept the risk that the transfer bias may be large.]
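The brute-force search described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the theorem prover is replaced by a hypothetical callable `proves_separating`, the empirical source domains risk by a hypothetical callable `source_risk`, and the toy risks and oracle are made up to mirror Example 1.

```python
from itertools import combinations

def rank_subsets(features, source_risk):
    """All subsets of `features`, sorted by increasing empirical source risk."""
    subsets = [frozenset(c)
               for k in range(len(features) + 1)
               for c in combinations(features, k)]
    return sorted(subsets, key=source_risk)

def select_separating_subset(ranked_subsets, proves_separating):
    """Return the first subset that is provably separating, or None to abstain."""
    for subset in ranked_subsets:
        if proves_separating(subset):
            return subset
    return None

# Toy inputs (made up): source risks for each subset of two features,
# and a stand-in oracle that certifies only the subset {X1}.
toy_risk = {
    frozenset(): 2.0,
    frozenset({"X1"}): 1.0,
    frozenset({"X3"}): 0.01,
    frozenset({"X1", "X3"}): 0.009,
}

def toy_oracle(subset):
    return subset == frozenset({"X1"})

ranked = rank_subsets(["X1", "X3"], toy_risk.__getitem__)
chosen = select_separating_subset(ranked, toy_oracle)
abstained = select_separating_subset(ranked, lambda subset: False)
print(chosen, abstained)
```

The search stops at the first certified subset, so the cost is dominated by the number of oracle queries; since the number of subsets grows exponentially with the number of features, this only suits small problems.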
An important consequence of Assumption 2(ii) is that it enables us to transfer conditional independence involving the target variable from the source domains to the target domains (proof provided in the Supplementary Material):
Proposition 1.
To test the separating set condition (2), we use the approach proposed by Hyttinen et al. (2014), where we simply add the JCI assumptions (Assumption 1) as constraints on the optimization problem, in addition to the domain-adaptation-specific assumption that (Assumption 2(iii)). As inputs we use all directly testable conditional independence test p-values in the pooled data (when ) and all those resulting from Proposition 1 from the source domains data only (if ). If background knowledge on intervention targets or the causal graph is available, it can easily be added as well. We use the method proposed by Magliacane et al. (2016) to query for the confidence of whether some statement (e.g., ) is true or false. The results of Magliacane et al. (2016) show that this approach is sound under oracle inputs, and asymptotically consistent whenever the statistical conditional independence tests used are asymptotically consistent. In other words, in this way the probability of wrongly deciding whether a subset is a separating set converges to zero as the sample size increases. We chose this approach because it is simple to implement on top of existing open source code. [Footnote 8: We build on the source code provided by Magliacane et al. (2016) which in turn extends the source code provided by Hyttinen et al. (2014). The full source code of our implementation and the experiments is available online at https://github.com/causam/dom_adapt.] Note that the computational cost quickly increases with the number of variables, limiting the number of variables that can be considered simultaneously.
One remaining issue is how to predict when an optimal separating set has been found. As the distribution of may shift when transferring from source domains to target domains, there is a covariate shift to be taken into account when predicting . Any method (e.g., least-squares regression) could in principle be used to predict from a given set of covariates, but it is advisable to use a prediction method that works well under covariate shift, e.g., (Sugiyama et al., 2008).
3 Evaluation
We perform an evaluation on both synthetic data and a real-world dataset based on a causal inference challenge. [Footnote 9: Part of the CRM workshop on Statistical Causal Inference and Applications to Genetics, Montreal, Canada (2016). See also http://www.crm.umontreal.ca/2016/Genetics16/competition_e.php] The latter dataset consists of hematology-related measurements from the International Mouse Phenotyping Consortium (IMPC), which collects measurements of phenotypes of mice with different single-gene knockouts.
In both evaluations we compare a standard feature selection method (which uses Random Forests) with our method that builds on top of it and selects from its output the best separating set. First, we score all possible subsets of features by their out-of-bag score using the implementation of Random Forest Regressor from scikit-learn (Pedregosa et al., 2011) with default parameters. For the baseline we then select the best performing subset and predict . For our proposed method, we instead try to find a subset of features that is also a separating set, starting from the subsets with the best scores. To test whether is a separating set, we use the method described in Section 2.5, using the ASP solver clingo 4.5.4 (Gebser et al., 2014). We provide as inputs the independence test results from a partial correlation test with significance level and combine it with the weighting scheme from Magliacane et al. (2016). We then use the first subset in the ranked list of predictive sets of features found by the Random Forest method for which the confidence that holds is positive. If there is no set that satisfies this criterion, then we abstain from making a prediction.
For the synthetic data, we randomly generate 200 linear acyclic models with latent variables and Gaussian noise, each with three system variables, and sample data points each for the observational and two experimental domains, where we simulate soft interventions on randomly selected targets, with different sizes of perturbations. We randomly select which of the two context variables will be and which of the three system variables will be . We disallow direct effects of on , and enforce that no intervention can directly affect all variables simultaneously. More details on how the data were simulated are provided in the Supplementary Material. Figure 2(a) shows a boxplot of the loss of the predicted values with respect to the true values for both the baseline and our method, considering the 121 cases out of 200 in which our method does produce an answer. In particular, Figure 2(a) considers the case of samples per regime and interventions that all produce a large perturbation.
In the Supplementary Material we show that results improve with more samples, both for the baseline, but even more so for our method, since the quality of the conditional independence tests improves. We also show that, according to expectations, if the target distribution is very similar to the source distributions, i.e., the transfer bias is small, our method does not provide any benefit and seems to perform worse than the baseline. Conversely, the larger the intervention effect, the bigger the advantage of using our method.
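The conditional independence inputs mentioned above come from partial correlation tests. As a rough illustration of what such a test computes, the following sketch uses the standard Fisher z-transform with at most one conditioning variable. It is a simplification, not the test code used in the experiments: the function names, the simulated data, and the restriction to a single conditioning variable are all assumptions of this example.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr_test(xs, ys, zs=None):
    """Return (r, p) for the null hypothesis that X and Y are uncorrelated,
    conditionally on Z when `zs` is given (one conditioning variable only)."""
    r, k = pearson(xs, ys), 0
    if zs is not None:
        rxz, ryz = pearson(xs, zs), pearson(ys, zs)
        r = (r - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
        k = 1
    stat = math.sqrt(len(xs) - k - 3) * math.atanh(r)  # Fisher z-transform
    p = math.erfc(abs(stat) / math.sqrt(2))            # two-sided normal p-value
    return r, p

# Simulated chain C -> X1 -> X2 (illustrative coefficients).
rng = random.Random(1)
n = 5000
c = [rng.randint(0, 1) for _ in range(n)]
x1 = [2.0 * ci + rng.gauss(0.0, 1.0) for ci in c]
x2 = [xi + rng.gauss(0.0, 1.0) for xi in x1]

r_marg, p_marg = partial_corr_test(c, x2)      # C and X2 are dependent
r_cond, p_cond = partial_corr_test(c, x2, x1)  # C and X2 are independent given X1
print(f"marginal:    r = {r_marg:.3f}, p = {p_marg:.3g}")
print(f"given X1:    r = {r_cond:.3f}, p = {p_cond:.3g}")
```

Partial correlation only detects linear dependence, which matches the linear Gaussian synthetic models but is an approximation for real data.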
For the real-world dataset, we select a subset of the variables considered in the CRM Causal Inference Challenge. Specifically, for simplicity we focus on 16 phenotypes that are not deterministically related to each other. The dataset contains measurements for 441 “wild type” mice and for about 10 “mutant” mice for each of 13 different single gene knockouts. We then generate 1000 datasets by randomly selecting subsets of 3 variables and 2 gene knockout contexts, and always include also “wild type” mice. For each dataset we randomly choose and , and leave out the observed values of for . Figure 2(b) shows a boxplot of the loss of the predicted values with respect to the real values for the baseline and our method. Given the small size of the datasets, this is a very challenging problem. In this case, our method abstains from making a prediction for 170 cases out of 1000 but performs similarly to the baseline on the remaining cases.
4 Discussion and Conclusion
We have defined a general class of causal domain adaptation problems and proposed a method that can identify sets of features that lead to transferable predictions. Our assumptions are quite general and in particular do not require the causal graph or the intervention targets to be known. The method gives promising results on simulated data. It is straightforward to extend our method to the cyclic case by making use of the results by Forré and Mooij (2018). More work remains to be done on the implementation side, for example, scaling up to more variables. Currently, our approach can handle about seven variables on a laptop computer, and with recent advances in exact causal discovery algorithms (e.g., Rantanen et al., 2018), a few more variables would be feasible. For scaling up to dozens of variables, we plan to adapt constraint-based causal discovery algorithms like FCI (Spirtes et al., 2000) to deal with the missing-data aspect of the domain adaptation task. We hope that this work will also inspire further research on the interplay between bias, variance and causality from a statistical learning theory perspective.
Acknowledgments
We thank Patrick Forré for proofreading a draft of this work. We thank Renée van Amerongen and Lucas van Eijk for sharing their domain knowledge about the hematology-related measurements from the International Mouse Phenotyping Consortium (IMPC). SM, TC, SB, and PV were supported by NWO, the Netherlands Organization for Scientific Research (VIDI grant 639.072.410). SM was also supported by the Dutch programme COMMIT/ under the Data2Semantics project. TC was also supported by NWO grant 612.001.202 (MoCoCaDi), and EU-FP7 grant agreement n.603016 (MATRICS). TvO and JMM were supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 639466).
References
 Bareinboim and Pearl (2016) E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
 Bongers et al. (2018) S. Bongers, J. Peters, B. Schölkopf, and J. M. Mooij. Theoretical aspects of cyclic structural causal models. arXiv.org preprint, arXiv:1611.06221v2 [stat.ME], Aug. 2018. URL https://arxiv.org/abs/1611.06221v2.
 Cooper (1997) G. F. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1(2):203–224, 1997.

 Eaton and Murphy (2007) D. Eaton and K. Murphy. Exact Bayesian structure learning from uncertain interventions. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, (AISTATS07), volume 2 of Proceedings of Machine Learning Research, pages 107–114, 2007.
 Forré and Mooij (2018) P. Forré and J. M. Mooij. Constraint-based causal discovery for nonlinear structural causal models with cycles and latent confounders. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI18), 2018.
 Gebser et al. (2014) M. Gebser, R. Kaminski, B. Kaufmann, and T. Schaub. Clingo = ASP + control: Extended report. Technical report, University of Potsdam, 2014. URL http://www.cs.unipotsdam.de/wv/pdfformat/gekakasc14a.pdf.
 Gong et al. (2016) M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf. Domain adaptation with conditional transferable components. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), volume 48 of JMLR Workshop and Conference Proceedings, pages 2839–2848, 2016.
 Hyttinen et al. (2014) A. Hyttinen, F. Eberhardt, and M. Järvisalo. Constraint-based causal discovery: Conflict resolution with answer set programming. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, (UAI14), pages 340–349, 2014.
 Hyttinen et al. (2015) A. Hyttinen, F. Eberhardt, and M. Järvisalo. Do-calculus when the true graph is unknown. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI 2015), pages 395–404, 2015.
 Magliacane et al. (2016) S. Magliacane, T. Claassen, and J. M. Mooij. Ancestral causal inference. In Proceedings of Advances in Neural Information Processing Systems, (NIPS16), pages 4466–4474, 2016.
 Mansour et al. (2009) Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the Twenty-Second Annual Conference on Learning Theory (COLT 2009), 2009.
 Markowetz et al. (2005) F. Markowetz, S. Grossmann, and R. Spang. Probabilistic soft interventions in conditional Gaussian networks. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, (AISTATS05), pages 214–221, 2005.
 Mooij and Heskes (2013) J. M. Mooij and T. Heskes. Cyclic causal discovery from continuous equilibrium data. In Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence (UAI13), pages 431–439, 2013.
 Mooij et al. (2018) J. M. Mooij, S. Magliacane, and T. Claassen. Joint causal inference from multiple contexts. arXiv.org preprint, arXiv:1611.10351v3 [cs.LG], Mar. 2018. URL https://arxiv.org/abs/1611.10351v3.
 Pan and Yang (2010) S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, Oct. 2010.
 Pearl (2009) J. Pearl. Causality: models, reasoning and inference. Cambridge University Press, 2009.
 Pearl and Bareinboim (2011) J. Pearl and E. Bareinboim. Transportability of causal and statistical relations: A formal approach. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 247–254, 2011.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Quiñonero-Candela et al. (2009) J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, editors. Dataset Shift in Machine Learning. MIT Press, 2009.
 Rantanen et al. (2018) K. Rantanen, A. Hyttinen, and M. Järvisalo. Learning optimal causal graphs with exact search. In Proceedings of the 9th International Conference on Probabilistic Graphical Models (PGM 2018), volume 72 of Proceedings of Machine Learning Research, pages 344–355, 2018.
 Richardson (2003) T. Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30:145–157, 2003.
 Rojas-Carulla et al. (2018) M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(36):1–34, 2018.
 Schölkopf et al. (2012) B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. M. Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pages 1255–1262, 2012.
 Spirtes et al. (2000) P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.
 Storkey (2009) A. Storkey. When training and test sets are different: Characterizing learning transfer. In Dataset Shift in Machine Learning, chapter 1, pages 3–28. MIT Press, 2009.
 Sugiyama et al. (2008) M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of Advances in Neural Information Processing Systems (NIPS08), pages 1433–1440, 2008.
 Tian and Pearl (2001) J. Tian and J. Pearl. Causal discovery from changes. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, (UAI01), 2001.
 Zhang et al. (2013) K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 819–827, 2013.
 Zhang et al. (2015) K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 3150–3157, 2015.
Appendix A Supplementary material
A.1 Stronger assumption
We prove that Assumption 2(ii) is a weakened version of two more standard assumptions, namely the causal Markov and faithfulness assumptions in the source and target domains separately. Note that making these two assumptions instead of Assumption 2(ii) rules out perfect interventions in the target domain, which Assumption 2(ii) otherwise allows.
Proposition 2.
Assumption 2(ii) is implied by the following assumption:

the pooled source domains distribution is Markov and faithful to , and

the pooled target domains distribution is Markov and faithful to ,
where denotes the induced subgraph of the causal graph on the nodes (i.e., it is obtained by removing and all edges involving from the causal graph ).
A.2 Other proofs

First of all, implies (by definition) . Second, implies (by assumption) , and taken together, we get . By the Markov and faithfulness assumption (Assumption 2(i)), this holds iff . ∎

Figure 4: ADMGs for proof of Example 2. Each dashed edge can either be present or absent. In the JCI setting, we assume that in the full ADMG over variables , and are confounded and not caused by system variables . Furthermore, no pair consisting of a system variable and a context variable is confounded.
In the context , if the conditional independences and hold, then we can also derive that , for example using Rule (9) from Magliacane et al. (2016). Moreover, we know that is not caused by and , or in other words and . Thus we conclude that is an LCD triple (Cooper, 1997) in the context . Since in addition, in this case and are unconfounded, the marginal ADMG on (in the context , and hence by Proposition 1 in all contexts) must be given by Figure 3(a).
Therefore, the extended marginal ADMG on variables must also have a directed path from to and from to . cannot be on these paths, as none of the variables causes , and therefore also contains the directed edges and . Moreover, cannot contain any edge between and , nor a bidirected edge between and , because that would violate the conditional independence. By construction, in the JCI setting there is a bidirected edge between and , and that is the only bidirected edge connecting to or . As we assumed there is no direct effect of on target , there is no edge between and in . There is also no directed edge in , as the JCI assumption implies none of the other variables causes . Therefore, the marginal ADMG is given by Figure 3(b), either with the directed edge present, or without that edge.
If it additionally holds that , we have two possibilities:

if holds, then is not caused by . This means it cannot be on any directed path from to , from to , or be a descendant of . Therefore the full ADMG also necessarily contains the directed edges and .

if holds, then in conjunction with we can derive , for example using Rule (5) from Magliacane et al. (2016). This means must be a descendant of in the full ADMG , which implies it cannot be on the directed path from to , or on the one from to . Therefore the full ADMG also necessarily contains the directed edges and .
Because of the independence statements and JCI assumptions, there cannot be a bidirected edge between and , , or . Similarly, there cannot be directed edges from to one of those nodes. The edges and must also be absent.
In both cases, there can be a directed edge from to . Therefore, the full ADMG is of the form given in Figure 3(c). In all cases we see that , and we conclude that is a valid separating set.
If the ADMG is as in Figure 2, then a standard feature selection method would asymptotically prefer the subset to predict over the subset (note that the Markov blanket of in context is ). As a result, any prediction method trained on all available features using source domain data (i.e., in context ) may incur a possibly unbounded prediction error when used to predict in the target domain (for example, if is an almost deterministic copy of if , but has a drastically different distribution if ).∎
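The failure mode described in this argument can be illustrated numerically. The sketch below is a toy example under explicit assumptions, not the graph of Figure 2 itself: the names `x1`, `y`, `x3`, the chain x1 → y → x3, the additive shift of 10 on `x3` in the target domain, and the noise levels are all hypothetical. It compares a one-feature least-squares predictor based on the near-copy `x3` with one based on the invariant feature `x1`.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(n, shifted, rng):
    """Hypothetical chain x1 -> y -> x3; the context shifts only x3."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    y = [a + rng.gauss(0, 1) for a in x1]
    offset = 10.0 if shifted else 0.0          # assumed intervention effect on x3
    x3 = [b + offset + rng.gauss(0, 0.1) for b in y]  # near copy of y in the source
    return x1, y, x3

rng = random.Random(0)
x1_s, y_s, x3_s = sample(2000, False, rng)   # source domain
x1_t, y_t, x3_t = sample(2000, True, rng)    # target domain

copy_model = fit_line(x3_s, y_s)        # looks best on source data
invariant_model = fit_line(x1_s, y_s)   # worse on source, but transferable

source_errors = (mse(copy_model, x3_s, y_s), mse(invariant_model, x1_s, y_s))
target_errors = (mse(copy_model, x3_t, y_t), mse(invariant_model, x1_t, y_t))
```

In this toy setting the predictor based on `x3` has the smaller error in the source domain, yet its error in the target domain grows with the size of the shift, while the predictor based on `x1` keeps roughly the same error in both domains.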

A.3 Additional results on synthetic data
We provide more information and experimental results for the synthetic data. We adapted the simulator of Hyttinen et al. (2014) to our setting. We randomly generate 200 acyclic models with three system variables, two context variables, and at most two latent variables (chosen randomly, so that the number of latent variables equals 1 or 2 each with probability , and 0 otherwise). Each latent variable has two system variables as children, while the other variables have a random number of system variables as children, where system variables must be consistent with a chosen topological ordering, and where we enforce that a context variable may not simultaneously affect all system variables. The system and latent variables are each described by a linear structural equation with independent noise terms distributed as . In these equations, each variable is multiplied by a coefficient sampled from or (each with probability per variable). The context variables each correspond to an experimental domain; in their domain, that variable equals 1, otherwise it equals 0. This way, we simulate soft interventions. In order to scale the effect of these interventions, we multiply the coefficients of the context variables by the parameter , varying it from 0.1 to 100. We sample data points each for the observational and two experimental domains. Moreover, we randomly select and from context and system variables respectively. We disallow direct effects of on .
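A simulator of this kind can be sketched as follows. This is a simplified illustration, not the adapted simulator of Hyttinen et al. (2014): latent variables are omitted, and the standard normal noise, the coefficient range (magnitude in [0.5, 1.5] with random sign), the edge probability, and all function names are assumptions introduced here because the corresponding values are not recoverable from the text.

```python
import random

def sample_coefficient(rng):
    # Assumed range: magnitude in [0.5, 1.5] with a random sign.
    return rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 1.5)

def random_model(rng, n_system=3, n_context=2, edge_prob=0.5):
    """Random linear acyclic model over system variables with context effects."""
    # B[i][j]: effect of system variable i on j; nonzero only if i < j (acyclic).
    B = [[sample_coefficient(rng) if i < j and rng.random() < edge_prob else 0.0
          for j in range(n_system)] for i in range(n_system)]
    # A[k][j]: effect of context variable k on system variable j.
    A = []
    for _ in range(n_context):
        targets = [j for j in range(n_system) if rng.random() < edge_prob]
        if len(targets) == n_system:
            # A context variable may not affect all system variables at once.
            targets.remove(rng.choice(targets))
        A.append([sample_coefficient(rng) if j in targets else 0.0
                  for j in range(n_system)])
    return B, A

def sample_domain(B, A, context, scale, n, rng):
    """Draw n samples of the system variables for a fixed context vector."""
    rows = []
    for _ in range(n):
        x = []
        for j in range(len(B)):
            value = rng.gauss(0.0, 1.0)                      # assumed noise
            value += sum(B[i][j] * x[i] for i in range(j))   # system parents
            value += scale * sum(A[k][j] * context[k]        # soft intervention
                                 for k in range(len(context)))
            x.append(value)
        rows.append(x)
    return rows

def simulate(scale, n_per_domain, seed=0):
    """One observational domain and two experimental domains."""
    rng = random.Random(seed)
    B, A = random_model(rng)
    contexts = [(0, 0), (1, 0), (0, 1)]
    return {c: sample_domain(B, A, c, scale, n_per_domain, rng) for c in contexts}
```

Varying `scale` between 0.1 and 100 in `simulate` mimics the scaling of intervention effects described above: each context variable is an indicator for its experimental domain, and its coefficients are multiplied by the scale parameter.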
As expected, our method performs well when the target distribution is significantly different from the source distributions. Figure 5 shows different settings with different scales of intervention effects. (In most graphs, the vertical axis has been adjusted to show the boxplot clearly, leaving out the larger outliers.) In Figure 4(a) the intervention effects are all scaled by 0.1, resulting in very similar distributions in all domains. In this case, using our method does not offer any advantage with respect to the baseline and it actually performs worse. In the other cases, using our method starts to pay off in terms of prediction accuracy, and the difference increases with the scale of the interventions, as seen in Figure 4(d).
In Figure 6, we vary the number of samples for each regime. The results improve with more samples, especially for our method, since the quality of the conditional independence test improves, but also for the baseline. In particular, as shown in Figure 5(a), the accuracy is low for samples, but it improves substantially with samples (Figure 4(b)).