1 Introduction
As machine learning systems are increasingly deployed in practice, system developers are being faced with deployment environments that systematically differ from the training environment. However, models are typically evaluated by splitting a single dataset into train and test subsets such that training and evaluation data are, by default, drawn from the same distribution. When evaluated beyond this initial dataset, say in the deployment environment, model performance may significantly deteriorate and potentially cause harm in safety-critical applications such as healthcare (see, e.g., Schulam and Saria (2017); Zech et al. (2018)). Because access to deployment environment data may not be available during training, it is not always feasible to employ domain adaptation techniques to directly optimize the model for the target domain. This motivates the need for proactive approaches which anticipate and address the differences between training and deployment environments without using deployment data (Subbaswamy and Saria, 2018). As a step towards building more reliable systems, in this paper we address the problem of proactively training models that are robust to expected changes in environment.
In order to ensure robustness, we must first be able to identify the sources of the changes. One way to do this is to reason about the differences in the underlying data generating processes (DGPs) that produce the data. For example, suppose we wish to diagnose a target condition, say lung cancer, using information about patient chest pain symptoms and whether or not they take aspirin. From our prior knowledge of the DGP we know that lung cancer leads to chest pain and that aspirin can relieve chest pain. We also know that smoking (unrecorded) is a risk factor for both lung cancer and heart disease, and aspirin is prescribed to smokers as a result. A diagnostic tool for this problem will be trained from one dataset before being deployed in hospitals that may not be represented in the data. Still, a modeler can reason about which aspects of the DGP are likely to differ across hospitals. For example, while the effects of lung cancer or aspirin on chest pain will not vary across hospitals, the policy used to prescribe aspirin to smokers is practice dependent and will vary.
What can the modeler do after identifying potential sources of unreliability in the data? Because the modeler does not know which prescription policies will be in place at deployment locations or by how much the deployment DGP will differ from the training DGP, the modeler should design the system to be stable (i.e., invariant) to the differences in prescription policy. This means the model should predict using only the pieces of the DGP that are expected to stay the same across environments while not learning relationships that make use of the varying parts of the DGP. If the model predictions somehow depend on the prescription policy, then when the deployment policy strongly differs from the training policy, model performance will significantly degrade and aspirin-taking subpopulations will be systematically misclassified.
To ensure that a model does not make use of unreliable relationships in its predictions (i.e., relationships involving prescription policy), it helps to have a representation of the DGP that makes explicit our assumptions about the DGP and what we expect will vary. A natural representation is to use selection diagrams (Pearl and Bareinboim, 2011), which consist of two types of nodes: nodes representing variables relevant to the DGP (e.g., smoking or lung cancer) and auxiliary selection variables (denoted by square nodes) which identify sources of unreliability in the DGP. For example, the selection diagram in Figure 1a represents the DGP underlying the diagnosis example, with the selection variable pointing to the piece of the DGP we expect to vary: the aspirin prescription policy. The selection variables point to mutable variables that are generated by mechanisms that are expected to differ across environments, such that a selection diagram represents a family of DGPs which differ only in the mechanisms that generate the mutable variables.
Checking model stability graphically using selection diagrams is straightforward: a model is stable if its predictions are independent of the selection variables. If predictions are not independent of the selection variables, then they depend on the environment-varying mechanisms that do not generalize. To illustrate, suppose smoking status is not recorded in the data (denoted by the dashed edges in Figure 1a). A discriminative model of the target which conditions on all recorded features will be dependent on the selection variable, which indicates this distribution is unstable (i.e., it differs by environment). One solution (see e.g., Subbaswamy and Saria (2018); Magliacane et al. (2018)), which we term graph pruning, is to perform feature selection to find a stable subset of features such that the conditional distribution of the target given that subset transfers across environments. However, for the problem in Figure 1a, the only stable set is the empty set, because by d-separation (Koller and Friedman, 2009) conditioning on either recorded feature activates a path between the target and the selection variable, inducing a dependence between them. Further, in cases where the mechanism that generates the target varies across environments (i.e., the target shift scenario in which a selection variable is a parent of the target), no stable feature set exists. While without more assumptions or external data there is no stable discriminative model for such problems, in this paper we relax these limitations and propose an approach which can recover stable predictive relationships in cases where graph pruning fails.
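The graphical stability check can be illustrated with a short sketch. It encodes the diagnosis example as a DAG using hypothetical node names (`K` for smoking, `T` for lung cancer, `A` for aspirin, `C` for chest pain, `S` for the selection variable; these labels are assumptions of the sketch) and tests d-separation with the standard ancestral moral graph construction. It is meant as an illustration of graph pruning under these assumptions, not as the implementation used by the cited methods.

```python
from itertools import combinations

# Assumed encoding of the diagnosis DAG: (parent, child) pairs.
EDGES = [("K", "T"), ("K", "A"), ("T", "C"), ("A", "C"), ("S", "A")]


def ancestors_of(nodes, edges):
    """Return the given nodes together with all of their ancestors."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    result, stack = set(nodes), list(nodes)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def d_separated(x, y, given, edges):
    """Check x _||_ y | given in a DAG via the ancestral moral graph."""
    keep = ancestors_of({x, y} | set(given), edges)
    sub_edges = [(p, c) for p, c in edges if p in keep and c in keep]
    neighbors = {node: set() for node in keep}
    parents = {}
    for p, c in sub_edges:
        neighbors[p].add(c)
        neighbors[c].add(p)
        parents.setdefault(c, set()).add(p)
    # Moralize: connect every pair of parents that share a child.
    for parent_set in parents.values():
        for p, q in combinations(parent_set, 2):
            neighbors[p].add(q)
            neighbors[q].add(p)
    # Remove the conditioning set and test whether x can still reach y.
    blocked, seen, stack = set(given), {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for nb in neighbors[node]:
            if nb not in seen and nb not in blocked:
                seen.add(nb)
                stack.append(nb)
    return True


# Graph pruning: search subsets of the recorded features for stable sets.
recorded = ["A", "C"]
stable_sets = [
    set(z)
    for r in range(len(recorded) + 1)
    for z in combinations(recorded, r)
    if d_separated("T", "S", z, EDGES)
]
print(stable_sets)
```

Under this encoding, only the empty conditioning set leaves the target d-separated from the selection variable, which matches the claim that pruning cannot use either recorded feature here.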
The proposed solution, which we term the Graph Surgery estimator (henceforth graph surgery, surgery estimator, or surgery), is to directly remove any possible dependence on environment-varying mechanisms by using interventional (Pearl, 2009) rather than observational distributions to predict. Specifically, we consider a hypothetical intervention in which for each individual the mutable variables are set to the values they were actually observed to be. Graphically, the intervention results in a mutilated graph (Figure 1b) in which the edges into the mutable variables, including edges from selection variables, are “surgically” removed (Pearl, 1998). The resulting interventional distribution (for which we use two equivalent notations interchangeably) is invariant to changes in how the mutable variables are generated, reflecting the “independence of cause and mechanism” (Peters et al., 2017), ensuring stability, and allowing us to use information about aspirin use and chest pain that the graph pruning solution does not. Graph surgery can be seen as learning a predictor from an alternate, hypothetical DGP in which the mutable variables were generated by direct assignment rather than by the environment-specific mechanisms. This severs dependence on selection variables to yield a stable predictor. One challenge is that when the DAG contains hidden variables (as is common in reality), interventional distributions are not always uniquely expressible as a function of the observational training data (Pearl, 2009). To address this we use the previously derived ID algorithm (Tian and Pearl, 2002; Shpitser and Pearl, 2006b) for determining identifiability of interventional distributions.
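A small numerical sketch may help make the invariance concrete. It assumes a binary version of the diagnosis example with hypothetical names (`K` smoking, `T` lung cancer, `A` aspirin, `C` chest pain) and made-up probability tables, and it assumes that the interventional distribution for this graph takes the truncated-factorization form P(T) P(C | T, A), normalized over T. Both the numbers and that form are assumptions of the sketch rather than quotations from the text.

```python
import numpy as np

# Hypothetical binary mechanisms that are shared by every environment.
P_K = np.array([0.7, 0.3])                                # P(K)
P_T_GIVEN_K = np.array([[0.95, 0.05], [0.70, 0.30]])      # indexed [k, t]
P_C_GIVEN_TA = np.array([[[0.90, 0.10], [0.98, 0.02]],
                         [[0.20, 0.80], [0.60, 0.40]]])   # indexed [t, a, c]


def observed_joint(p_a_given_k):
    """P(T, A, C) in an environment with prescription policy P(A | K)."""
    full = np.einsum("k,kt,ka,tac->ktac",
                     P_K, P_T_GIVEN_K, p_a_given_k, P_C_GIVEN_TA)
    return full.sum(axis=0)  # K is unrecorded, so marginalize it out


def conditional_estimator(p_tac):
    """Observational P(T | A, C), indexed [t, a, c]."""
    return p_tac / p_tac.sum(axis=0, keepdims=True)


def surgery_estimator(p_tac):
    """Normalized P(T) P(C | T, A), computed from observed data only."""
    p_t = p_tac.sum(axis=(1, 2))
    p_c_given_ta = p_tac / p_tac.sum(axis=2, keepdims=True)
    unnormalized = p_t[:, None, None] * p_c_given_ta
    return unnormalized / unnormalized.sum(axis=0, keepdims=True)


# Two environments that differ only in the prescription policy P(A | K).
train = observed_joint(np.array([[0.9, 0.1], [0.4, 0.6]]))
deploy = observed_joint(np.array([[0.2, 0.8], [0.9, 0.1]]))

surgery_is_stable = np.allclose(surgery_estimator(train),
                                surgery_estimator(deploy))
conditional_is_stable = np.allclose(conditional_estimator(train),
                                    conditional_estimator(deploy))
print(surgery_is_stable, conditional_is_stable)
```

In this toy setting the surgery-style predictor is identical in both environments, while the observational conditional changes with the prescription policy.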
Contributions: We propose graph surgery, an algorithm for estimating stable predictive models that can generalize even when train and test distributions differ. Graph surgery depends on a causal DAG to encode prior information about how the distribution of data might change. Given this prior information, it produces a predictor that does not depend on these unreliable parts of the data generating process. We show that graph surgery relaxes limiting assumptions made by existing methods for learning stable predictors. In addition, we connect the optimality of graph surgery to recently proposed adversarial distributional robustness problems.
2 Related Work
Differences between training and test distributions have been previously studied as the problem of dataset shift (Quiñonero-Candela et al., 2009). Many specific forms of dataset shift have been characterized by dividing the variables into the input features and the target prediction outcome. Then, by reasoning about the causal relationship between the inputs and target, various forms of dataset shift can be defined (Storkey, 2009; Schölkopf et al., 2012), which has led to methods for tackling specific instances such as covariate shift (e.g., Sugiyama et al. (2007); Gretton et al. (2009)), target shift (Zhang et al., 2013; Lipton et al., 2018), conditional shift (Zhang et al., 2015; Gong et al., 2016), and policy shift (Schulam and Saria, 2017). By using selection diagrams, we can consider complex dataset shift scenarios beyond these two variable-type settings.
One issue is that methods for addressing dataset shift have mainly been reactive: they make use of unlabeled data from the target domain to reweight training data during learning and optimize the model specifically for the target domain (e.g., Storkey (2009); Gretton et al. (2009)). However, if we do not have target domain data available during learning, we must instead use proactive approaches in which the target domain remains unspecified (Subbaswamy and Saria, 2018).
One class of proactive solutions considers bounded distributional robustness. These methods assume that the possible test distributions are in some way centered around the training distribution. For example, in adversarial learning, Sinha et al. (2018) consider a Wasserstein ball around the training distribution. Rothenhäusler et al. (2018) assume that differences between train and test distributions are bounded-magnitude shift perturbations. However, these methods fail to give robustness guarantees on perturbations that are beyond the prespecified magnitude used during training. In safety-critical applications where correctness is crucial, we require unbounded invariance to perturbations, which motivates the use of causal-based methods (Meinshausen, 2018).
To achieve stable models with complete invariance to perturbations, graph pruning methods consider a feature selection problem in which the goal is to find the optimal subset that makes the target independent from the selection variables. Rojas-Carulla et al. (2018) and Magliacane et al. (2018) accomplish this by empirically determining a stable conditioning set by hypothesis testing the stability of the set across multiple source domains and assuming that the target variable is not generated by a varying mechanism (no edge from a selection variable into the target). Extending this, Subbaswamy and Saria (2018) consider also adding counterfactual variables to stable conditioning sets, which allow the model to make use of more stable information than by using observed variables alone. However, this requires the strong parametric assumption that causal mechanisms are linear. By using interventional distributions rather than counterfactuals, graph surgery is able to relax this assumption and nonparametrically use more stable information than observational conditional distributions. Additionally, graph surgery allows for the target to be generated by a varying mechanism.
3 Methods
Our goal is to find a safe predictive distribution that generalizes even when train and test distributions differ. Derivation of the surgery estimator requires explicitly reasoning about the aspects of the DGP that can change and results in an interventional distribution in which the corresponding terms have been deleted from the factorization of the training distribution. In Section 3.1 we introduce requisite prior work on identifying interventional distributions before presenting the surgery estimator in Section 3.2 and establishing its soundness and completeness in Section 3.3.
3.1 Preliminaries
Notation: Throughout the paper sets of variables are denoted by bold capital letters while their particular assignments are denoted by bold lowercase letters. We will consider graphs with directed or bidirected edges. Acyclic will be taken to mean that there exists no purely directed cycle. We will refer to the sets of parents, children, ancestors, and descendants of a node in a graph. Our focus will be causal DAGs whose nodes can be partitioned into sets of observed variables, unobserved variables, and selection variables. The observed and unobserved variables are the variables in the DGP, while the selection variables are auxiliary variables that denote mechanisms of the DGP that vary across domains.
Interventional Distributions: We now build up to the Identification (ID) algorithm (Tian and Pearl, 2002; Shpitser and Pearl, 2006b), a sound and complete algorithm (Shpitser and Pearl, 2006b) for determining whether or not an interventional distribution is identifiable, and if so, its form as a function of observational distributions. The ID algorithm operates on a special class of graphs known as acyclic directed mixed graphs (ADMGs). Any hidden variable DAG can be converted to an ADMG by taking its latent projection onto the observed variables (Verma and Pearl, 1991). In the latent projection of a DAG over its observed variables, there is a directed edge from one observed variable to another if there exists a directed path from the first to the second in the DAG where all internal nodes are unobserved, and a bidirected edge between them if there exists a divergent path between them in the DAG such that all internal nodes are unobserved. The bidirected edges represent unobserved confounding. Figure 1c shows the latent projection of the DAG in Figure 1a. The joint distribution of an ADMG factorizes as:
(1) 
An intervention on a set of variables sets these variables to constants. As constants, the terms that generate the intervened variables are deleted from (1) to yield the interventional distribution.
Graphically, the intervention results in the mutilated graph in which the edges into the intervened variables have been removed. (Similarly, a mutilated graph in which the edges out of a set of variables are removed is denoted analogously.) When the ADMG contains bidirected edges, interventional distributions are not always identifiable.
Definition 1 (Causal Identifiability).
For disjoint variable sets , the effect of an intervention on is said to be identifiable from in if is (uniquely) computable from in any causal model which induces .
The ID algorithm (a version of it is shown in Appendix A) determines if a particular interventional distribution is identified. Specifically, given disjoint variable sets and an ADMG, a function call to ID returns an expression (in terms of observational distributions) for the interventional distribution if it is identified; otherwise it throws a failure exception. The ID algorithm is nonparametric, so the terms in the expression it returns can be learned from training data with arbitrary black box approaches.
In Shpitser and Pearl (2006a), the ID algorithm was extended to answer conditional effect queries over disjoint sets of variables by showing that every conditional ID query can be reduced to an unconditional ID query using the procedure shown in Algorithm 1. This procedure finds the maximal subset of variables in the conditioning set to bring into the intervention set using Rule 2 of do-calculus (action/observation exchange) (Pearl, 2009, Chapter 3). The resulting conditional interventional distribution is then proportional to the joint distribution of the outcome variables and the remaining variables in the conditioning set. A call to ID can then determine the identifiability of the resulting unconditional query.
Transportability: Transportability is a framework for the synthesis of experimental and observational data from multiple domains to answer a statistical or causal query in a prespecified target domain (Pearl and Bareinboim, 2011; Bareinboim and Pearl, 2012). In order to build safe and reliable models, we restrict our attention to learning predictive models that can be directly transported from the source domain to an unspecified target domain without any adjustment.
Definition 2 (Selection diagram).
A selection diagram is a causal DAG or ADMG augmented with auxiliary selection variables (denoted by square nodes) such that an edge from a selection variable into a variable denotes that the causal mechanism that generates that variable may vary arbitrarily in different domains. Selection variables may have at most one child.
We refer to the children of the selection variables as mutable variables. Selection diagrams define a family of distributions over domains such that the terms in (1) that generate the mutable variables can differ arbitrarily in each domain. Constructing a selection diagram generally requires domain knowledge to specify the mechanisms and the placement of selection variables. Without prior knowledge, causal discovery methods can potentially be used (Spirtes et al., 2000).
We now define stability as a predictive analog of direct transportability (Pearl and Bareinboim, 2011), in which a source domain relationship holds in the target domain without adjustment.
Definition 3 (Stable estimator).
An estimator for predicting a variable is said to be stable if it is independent of all selection variables.
Graph pruning and graph surgery can both produce stable estimators, but pruning estimators will always be observational conditional distributions while surgery estimators will be the identified form of an interventional distribution.
3.2 The Graph Surgery Estimator
Graph surgery assumes the data modeler has constructed or been given a causal DAG of the DGP with a target prediction variable, observed variables, and unobserved variables that has been augmented with selection variables using prior knowledge about mechanisms that are expected to differ across domains (e.g., prescription policy). An overview of the procedure is as follows: The selection DAG is converted to a selection ADMG so it is compatible with the ID algorithm. Children of the selection variables in the selection ADMG form the set of mutable variables. The proposed algorithm then searches all possible interventional distributions (which intervene on the mutable variables) for the optimal (with respect to held-out source domain data) identifiable distribution, which is normalized and returned as the surgery estimator. We now cover each step in detail.
Only observed variables can be intervened on, so to determine the mutable variables, we take the latent projection of the selection DAG to turn it into an ADMG. If a selection variable has multiple children in the ADMG, then it should be split into multiple selection variables, one per child, with the new selection variables added to the set of selection variables. Any disconnected variables can be removed. The mutable variables are then given by the children of the selection variables. We now establish that intervening on (at least) the mutable variables results in a stable estimator.
Proposition 1.
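For disjoint sets of intervened variables and target variables such that the intervened set contains the mutable variables, the interventional distribution of the target variables is stable.
Proof.
The intervention results in the mutilated graph in which all edges into the intervened variables are removed. Since the mutable variables are the children of the selection variables and are contained in the intervened set, the intervention removes all edges out of the selection variables. This means the selection variables are disconnected (and thus d-separated) from the target variables in the mutilated graph, which gives us stability. ∎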
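The preprocessing step and the argument of Proposition 1 can be sketched in a few lines. The code below computes a latent projection, reads off the mutable variables as the children of the selection variable, and checks that removing edges into the mutable variables disconnects the selection variable. The node names (`K`, `T`, `A`, `C`, `S`) are a hypothetical encoding of the diagnosis example, and treating the selection variable as a retained node during projection is an assumption of this sketch.

```python
from itertools import combinations


def latent_projection(edges, kept):
    """Project a DAG (list of (parent, child)) onto the kept nodes.

    Returns (directed, bidirected): directed is a set of (v, w) pairs and
    bidirected is a set of frozensets {v, w}.
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    def kept_reach(start):
        """Kept nodes reachable from start through hidden nodes only."""
        found, seen, stack = set(), set(), list(children.get(start, ()))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in kept:
                found.add(node)  # stop here: internal nodes must be hidden
            else:
                stack.extend(children.get(node, ()))
        return found

    nodes = {n for edge in edges for n in edge}
    directed = {(v, w) for v in kept for w in kept_reach(v)}
    bidirected = set()
    for hidden in nodes - set(kept):
        # Any two kept nodes reached from a hidden node share a divergent path.
        for v, w in combinations(sorted(kept_reach(hidden)), 2):
            bidirected.add(frozenset((v, w)))
    return directed, bidirected


def remove_edges_into(directed, bidirected, targets):
    """Mutilated graph: drop every edge with an arrowhead into a target."""
    new_directed = {(v, w) for v, w in directed if w not in targets}
    new_bidirected = {e for e in bidirected if not (e & set(targets))}
    return new_directed, new_bidirected


dag = [("K", "T"), ("K", "A"), ("T", "C"), ("A", "C"), ("S", "A")]
directed, bidirected = latent_projection(dag, kept={"T", "A", "C", "S"})
mutable = {w for v, w in directed if v == "S"}

cut_directed, cut_bidirected = remove_edges_into(directed, bidirected, mutable)
selection_is_disconnected = not any("S" in edge for edge in cut_directed)
print(sorted(directed), [sorted(e) for e in bidirected], mutable)
```

In this example the hidden smoking node becomes a bidirected edge between `T` and `A`, the only mutable variable is `A`, and after the surgery no edge touches `S`, which is the disconnection used in the proof.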
What interventional distribution should we use to predict the target? A natural idea is to use the full conditional interventional distribution, which can be turned into a corresponding unconditional query (so we can call ID) using a call to Algorithm 1: UQ. However, this has two issues. First, if the target variable is itself mutable, then the conditional interventional distribution is ill-defined since the three variable sets must be disjoint. If the target is mutable, then we must intervene on it, graphically represented by deleting all edges into the target. Variables related to the target through edges out of it (e.g., children and their bidirected neighborhoods) can still be used to predict the target. Thus, if the target is mutable, we can generate an unconditional query from UQ, noting that we are using the mutilated graph in which edges into the target are deleted. We must further modify the result to account for the fact that we are also intervening on the target by moving it to the intervention set. Importantly, there is never a stable pruning estimator when the target is mutable, which shows that graph surgery can provide stability in cases where existing pruning solutions cannot.
Second, the full conditional interventional distribution may not be identifiable. We propose an exhaustive search over possible conditioning sets, trying each member of the power set of the candidate conditioning variables. In the interest of identifiability, even if the target is not mutable we may want to consider intervening on it, since deleting edges in a graph generally helps identifiability (Pearl, 2009); Figure 2(a) gives an example. Thus, we should consider the unconditional query returned by Algorithm 1 in both the original graph and the graph in which edges into the target are deleted (with the modification of moving the target to the intervention set). The full procedure is given as Algorithm 2. Note that it returns the estimator that performs the best on held-out source-domain validation data with respect to some loss function. If there is no identifiable interventional distribution, the algorithm throws a failure exception.
3.3 Soundness and Completeness
We first show the soundness of Algorithm 2.
Theorem 1 (Soundness).
When Algorithm 2 returns an estimator, the estimator is stable.
Proof.
Any query Algorithm 2 makes to ID considers intervening on a superset of the mutable variables . By Proposition 1 this means the target interventional distribution is stable. From the soundness of the ID algorithm (Shpitser and Pearl, 2006b, Theorem 5), the resulting functional of observational distributions that Algorithm 2 returns will be stable. ∎
Completeness follows from the exhaustive nature of the Algorithm.
Theorem 2 (Completeness).
If Algorithm 2 fails, then there exists no stable surgery estimator for predicting the target.
Proof.
Algorithm 2 is an exhaustive search over interventional distributions that intervene on supersets of and are functions of . Thus, if there is a stable surgery estimator, the procedure will find one. ∎
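The control flow of the exhaustive search analyzed above can be sketched as follows. This is a simplified skeleton, not Algorithm 2 itself: the callables `identify` and `validation_loss` are hypothetical stand-ins for the calls to UQ and ID and for evaluation on held-out source data, and the `NotIdentifiable` exception is an assumed way of signalling an ID failure.

```python
from itertools import chain, combinations


class NotIdentifiable(Exception):
    """Raised when a query has no identifiable form (assumed convention)."""


def powerset(items):
    """All subsets of items, from the empty set up to the full set."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def surgery_search(candidates, identify, validation_loss):
    """Return the identifiable estimator with the lowest validation loss.

    candidates: variables that may enter the conditioning set.
    identify: maps a frozenset conditioning set to an estimator, or raises
        NotIdentifiable (hypothetical wrapper around UQ and ID).
    validation_loss: maps an estimator to a loss on held-out source data.
    """
    best = None
    for subset in powerset(candidates):
        try:
            estimator = identify(frozenset(subset))
        except NotIdentifiable:
            continue  # skip queries that ID cannot express
        loss = validation_loss(estimator)
        if best is None or loss < best[0]:
            best = (loss, estimator)
    if best is None:
        raise NotIdentifiable("no identifiable stable interventional query")
    return best[1]


# Toy usage: conditioning on the hypothetical variable "W" is not identifiable.
def toy_identify(conditioning_set):
    if "W" in conditioning_set:
        raise NotIdentifiable(conditioning_set)
    return conditioning_set


toy_losses = {frozenset(): 2.0, frozenset({"C"}): 0.7}
chosen = surgery_search(["C", "W"], toy_identify, toy_losses.__getitem__)
print(chosen)
```

Because every candidate passed to `identify` is assumed to intervene on a superset of the mutable variables, any estimator the search returns inherits stability, and the search fails only when every candidate query fails.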
4 Connections with Existing Approaches
We establish connections between graph surgery and existing proactive approaches, showing that graph pruning (which finds stable conditional relationships) is a special case of surgery and that surgery has an optimal distributionally robust interpretation.
4.1 Relationship with Graph Pruning
We show that graph pruning estimators are in fact surgery estimators, so graph surgery does not fail on problems graph pruning can solve.
Lemma 1.
Let a target variable of prediction and a selection ADMG with its selection variables be given. If there exists a stable conditioning set, i.e., a set of observed variables given which the target is independent of the selection variables, then Algorithm 2 will not fail on this input.
Proof.
Assume that is a stable graph pruning estimator. Partition into and such that and , and let . It must be that in . If this were not the case then there would be some such that there was a backdoor path from to , and since there is a path . Because is conditioned upon, this collider path would be active and , implying is not stable (a contradiction). Now by Rule 2 of do-calculus, . Next consider the remaining mutable variables . Letting denote the subset of nodes that are not ancestors of any nodes in , we will show that in . First consider . For the independence to not hold, there must be an active forward path from to . But because , the path is active since is not conditioned upon, implying contradictorily that was not stable. Now consider . For the independence to not hold, either there is an active forward path from to , or there is an active backdoor path from to . We previously showed the first case. In the second case, because is an ancestor of some that is conditioned upon, the collider path is active, so is not stable (contradiction). Thus, by Rule 3 of do-calculus, we have that . This is one of the conditional interventional queries that Algorithm 2 considers, so Algorithm 2 will not fail. ∎
In the proof of Lemma 1 we derived that graph pruning is a special case of graph surgery:
Corollary 1.
Graph pruning estimators are graph surgery estimators since they can be expressed as conditional interventional distributions.
Lemma 2.
There exists a problem for which graph pruning cannot find a nonempty stable conditioning set but for which graph surgery does not fail.
Proof.
As one such example, see Figure 1(c). ∎
From the previous two Lemmas the following Corollary is immediate:
Corollary 2.
There exists a stable graph surgery estimator for a strict superset of the problems for which there exists a stable graph pruning estimator.
We have now shown that graph surgery strictly generalizes graph pruning.
4.2 Surgery As Distributional Robustness
We now show the surgery estimator is optimal under a robust Bayes (Berger, 1985) decision-theoretic view of generalization in the presence of unstable mechanisms.
Suppose a selection ADMG defines a prediction problem with a target label and input features. For simplicity, suppose all variables are discrete with finite sample spaces such that the prediction problem is classification. Also let a hypothesis be a function which maps the domain of the input features to the domain of the target label, and let the loss be some bounded loss function (e.g., 0-1 loss or Brier score). Under classical assumptions that training and test distributions are the same, the quantity of interest is the expected loss or risk. For classification problems we define the Bayes optimal predictor to be the predictor that picks the label that maximizes the true conditional probability of the label given the features. The risk of the Bayes predictor is a tight lower bound on the performance of any classifier (Shalev-Shwartz and Ben-David, 2014).
In our setting, however, the data distribution varies across domains. Recall that a selection diagram defines a family of distributions such that for any particular domain (i.e., setting of the selection variables) there exists a member of the family that factorizes according to (1), and members of the family differ only in the mechanisms that generate the mutable variables. Now consider the game in which a data modeler (DM) is interested in picking the distribution with corresponding Bayes predictor that achieves the minimum worst-case risk across all domains. This game can be written as
(2) 
with the goal of the DM being to pick the distribution that achieves the infimum: the minimax optimal act. In this style of decision problem, known as robust Bayes or minimax (Berger, 1985), we minimize the risk taken with respect to the worst-case prior.
Lemma 3.
For finite discrete and , suppose is identified. Then is proportional to a version of in in which have been replaced by their maximum entropy distributions.
Proof.
For any , we know , and that defines a proper distribution (sums to 1) over . Without loss of generality, suppose each can take one of values. Then the interventional distribution is ill-defined over because . However, we can make this proper by normalizing it such that . Thus, is within a constant factor from in with maximum entropy distributions of . ∎
The significance of this is that because interventional distributions are proportional to their maximum entropy counterparts (with respect to the mechanisms that generate the mutable variables), the Bayes optimal predictor for interventional and maximum entropy distributions will be the same.
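The proportionality in Lemma 3 can be checked numerically on a toy model. The sketch below reuses a hypothetical binary diagnosis model (`K` hidden, `T` target, `A` mutable, `C` symptom; all probability tables are made up) and compares the interventional distribution, viewed as a table over all observed variables, with the joint distribution in which the mechanism for `A` is replaced by a uniform distribution.

```python
import numpy as np

# Hypothetical binary mechanisms (illustrative numbers only).
P_K = np.array([0.7, 0.3])                                # P(K)
P_T_GIVEN_K = np.array([[0.95, 0.05], [0.70, 0.30]])      # indexed [k, t]
P_C_GIVEN_TA = np.array([[[0.90, 0.10], [0.98, 0.02]],
                         [[0.20, 0.80], [0.60, 0.40]]])   # indexed [t, a, c]

# P(t, c | do(A = a)) as a table [t, a, c]: the factor for A is deleted.
interventional = np.einsum("k,kt,tac->tac", P_K, P_T_GIVEN_K, P_C_GIVEN_TA)

# Joint P(t, a, c) when A is generated by its maximum entropy (uniform) law.
uniform_policy = np.full((2, 2), 0.5)                     # indexed [k, a]
maxent_joint = np.einsum("k,kt,ka,tac->tac",
                         P_K, P_T_GIVEN_K, uniform_policy, P_C_GIVEN_TA)

# The interventional table sums to the number of values A can take.
total_mass = interventional.sum()
normalized = interventional / total_mass

# Both tables induce the same most-probable label for every (a, c).
same_bayes_predictor = np.array_equal(interventional.argmax(axis=0),
                                      maxent_joint.argmax(axis=0))
print(total_mass, np.allclose(normalized, maxent_joint), same_bayes_predictor)
```

Since the two tables differ only by a constant factor, the label that maximizes one also maximizes the other, which is why the Bayes predictors coincide.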
Theorem 3.
For finite discrete variables and bounded loss ,
the maximum entropy distribution attains this supremum, and achieves (2).
Proof.
Follows directly from (Grünwald et al., 2004, Theorem 6.1). ∎
Finally, from Lemma 3 we have the following:
Corollary 3.
The Bayes predictor of the interventional distribution is equal to the Bayes predictor of the maximum entropy distribution that achieves (2).
We have shown that Bayes predictors for interventional distributions provide tight lower bounds on the worst-case generalization error of the transfer learning problem defined by a selection ADMG in which the intervened variables correspond to the mutable variables. Absent identification (and sampling) issues, this means the minimax performance of the full conditional surgery estimator is a tight lower bound on the worst-case performance of any predictor.
5 Experiments
We evaluate the graph surgery estimator in proactive transfer settings in which data from the target distribution is unavailable. The goal of our experiments is to demonstrate that the surgery estimator is stable in situations in which existing methods are either not applicable or suboptimal. To this end, we first consider a simulated experiment for which the true selection diagram is known. Then we apply the surgery estimator to real data to demonstrate its practical utility even when the true selection diagram is unknown. We compare against a naive pooled ordinary least squares baseline (OLS) and causal transfer learning (CT), a state-of-the-art pruning approach (Rojas-Carulla et al., 2018; code at https://github.com/mrojascarulla/causal_transfer_learning). If CT fails to find a nonempty subset, we predict using the pooled source data mean. On real data we also compare against Anchor Regression (AR), a distributionally robust method for bounded-magnitude shift interventions (Rothenhäusler et al., 2018) which requires a causal “anchor” variable with no parents in the graph. All performance is measured using mean squared error (MSE).
5.1 Simulated Data
We simulate data from zero-mean linear Gaussian systems using the DAG in Figure 1(a), considering two variations (full details in Appendix B). This DAG contains no anchor, so we cannot apply AR. The first considers the selection problem in Figure 1(a) in which a single observed variable is mutable, defining a family of DGPs which vary in one coefficient of the structural equation for the mutable variable. We generate 10 source domains and evaluate on test domains in which we vary the coefficient on a grid. Recall that in this DAG the empty set is the only stable conditioning set and CT should model the marginal distribution of the target. While this is stable, we expect the performance to be worse than that of the surgery estimator, which is able to use additional stable factors.
The MSE as we vary the test coefficient is shown in Figure 3a. As expected, the stable models CT and Surgery are able to generalize beyond the training domains (vertical dashed lines), while the unstable OLS loss grows quickly. However, for small deviations from the training domain OLS outperforms the stable methods, which shows that there is a tradeoff between stability and performance in and near the training domain. The gap between CT and Surgery is expected since Surgery models an extra stable, informative factor.
We repeat this experiment but consider the target shift scenario in which the target is the mutable variable, and the DGPs across domains differ in a coefficient of the structural equation for the target. Now there is no stable conditioning set, which violates the assumption of CT. Again, CT uses the empty conditioning set, but in this case the marginal distribution of the target is unstable, so the loss grows quickly in Figure 3b. As before, OLS is unstable but performs best near the source domains. The surgery estimator is stable and the loss appears constant compared to the unstable alternatives. These experiments demonstrate that stability is an important property when differences in mechanisms can be arbitrarily large. In the Appendix we aggregate results for many repetitions of the simulations.
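The qualitative behavior of the first simulated variation can be reproduced with a small sketch. Everything specific here is an assumption: the structural coefficients, the unit noise variances, the single source domain (instead of 10), and the hypothetical names `K` (hidden), `T` (target), `A` (mutable), and `C`. The surgery-style predictor is taken to be the posterior mean of T under a fitted marginal for T and a fitted linear model for C given T and A; this is not the configuration from Appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(policy_coef, n=20_000):
    """Draw (T, A, C) from an assumed zero-mean linear Gaussian DGP."""
    k = rng.normal(size=n)                    # hidden confounder
    t = k + rng.normal(size=n)                # target
    a = policy_coef * k + rng.normal(size=n)  # mutable variable
    c = t - a + rng.normal(size=n)            # child of T and A
    return t, a, c


# One source domain with policy coefficient 1.0.
t_tr, a_tr, c_tr = simulate(policy_coef=1.0)

# Unstable baseline: OLS regression of T on (A, C).
ols_coef = np.linalg.lstsq(np.column_stack([a_tr, c_tr]), t_tr, rcond=None)[0]

# Surgery-style predictor: fit the marginal of T and the model C | T, A,
# then use the posterior mean of T given C with A held fixed.
b, g = np.linalg.lstsq(np.column_stack([t_tr, a_tr]), c_tr, rcond=None)[0]
var_t = np.mean(t_tr ** 2)
var_c = np.mean((c_tr - b * t_tr - g * a_tr) ** 2)
gain = b * var_t / (b ** 2 * var_t + var_c)

TEST_COEFS = [-6.0, -3.0, 1.0, 5.0, 8.0]
ols_mse, surgery_mse, marginal_mse = {}, {}, {}
for w in TEST_COEFS:
    t, a, c = simulate(policy_coef=w)
    ols_pred = np.column_stack([a, c]) @ ols_coef
    surgery_pred = gain * (c - g * a)
    ols_mse[w] = float(np.mean((t - ols_pred) ** 2))
    surgery_mse[w] = float(np.mean((t - surgery_pred) ** 2))
    marginal_mse[w] = float(np.mean(t ** 2))  # empty conditioning set

for w in TEST_COEFS:
    print(w, round(ols_mse[w], 3), round(surgery_mse[w], 3),
          round(marginal_mse[w], 3))
```

In this parametrization the surgery-style error stays roughly flat across test coefficients and below the empty-set baseline, while the OLS error grows as the test policy moves away from the training policy.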
5.2 Real Data: Bike Rentals
Following Rothenhäusler et al. (2018), we use the UCI Bike Sharing dataset (Fanaee-T and Gama, 2013; Dheeru and Karra Taniskidou, 2017) in which the goal is to predict the number of hourly bike rentals from weather data including temperature, feeling temperature, wind speed, and humidity. As in Rothenhäusler et al. (2018), we transform the rental count from a count to a continuous variable using the square root. The data contains 17,379 examples with temporal information such as season and year. We partition the data by season (1-4) and year (1-2) to create domains with different mechanisms. We posit the causal diagram in Figure 2(b) with confounding caused by unobserved temporal factors, and hypothesize that differences in season result in unstable mechanisms for weather: the mutable variables are the weather variables. If this diagram is true, then no stable pruning estimator exists, so we expect surgery to outperform CT and OLS if the differences in mechanisms are large. The full conditional interventional distribution is identified, and the surgery estimator is given by its identified form. We posit linear Gaussians for each term and compute the prediction using 10,000 Monte Carlo samples. Since AR and CT require data from multiple source domains, for each year (Y), we select one season as the target domain, using the other three seasons as source domains. Since OLS and Surgery do not make use of the season indicator, we simply pool the data for these methods.
We subsample the training/test data 20 times and report the average MSE in Table 1 (intervals are one standard error). The surgery estimator performs competitively, achieving the lowest average MSE in 3 of 8 test cases. When the OLS MSE is high (seasons 3 and 4 in each year), Surgery tends to outperform it, which we attribute to Surgery’s stability. We also see that CT tends to perform poorly, which lends some credibility to our hypothesized selection diagram which dictates that no stable pruning estimator exists. AR’s very good performance is expected, since the shift-perturbation assumption seems reasonable in this problem. However, AR requires tuning of a hyperparameter for the maximum magnitude shift perturbation to protect against, which is less preferable than stable estimators such as surgery when the target domain is unknown and very different from the source.
Table 1: MSE on the held-out season (mean ± one standard error).
Test Data        OLS           AR            CT            Surgery
(Y1) Season 1    20.8 ± 0.10   20.5 ± 0.10   42.2 ± 2.04   20.7 ± 0.36
     Season 2    23.2 ± 0.05   23.2 ± 0.05   29.9 ± 0.09   23.8 ± 0.09
     Season 3    32.2 ± 0.14   31.4 ± 0.13   32.2 ± 0.14   29.9 ± 0.26
     Season 4    29.2 ± 0.08   29.1 ± 0.08   29.1 ± 0.08   28.2 ± 0.07
(Y2) Season 1    32.5 ± 0.11   32.2 ± 0.11   32.6 ± 0.15   36.1 ± 0.37
     Season 2    39.3 ± 0.11   39.2 ± 0.11   46.1 ± 0.12   39.5 ± 0.13
     Season 3    47.7 ± 0.17   46.7 ± 0.16   48.2 ± 0.22   54.8 ± 0.73
     Season 4    46.2 ± 0.16   46.0 ± 0.16   46.1 ± 0.16   44.4 ± 0.16
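As a small illustration of how one cell of such a table can be computed, the sketch below aggregates the MSEs from repeated resampling into a mean and one standard error. The helper names and the example numbers are made up for illustration, and the use of the sample standard deviation (`ddof=1`) is our assumption.

```python
import numpy as np

def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Assumption: sample standard deviation (ddof=1) divided by sqrt(n).
    standard_error = values.std(ddof=1) / np.sqrt(values.size)
    return mean, standard_error

def format_cell(values):
    """Format repeated MSEs as 'mean ± standard error'."""
    mean, standard_error = mean_and_standard_error(values)
    return f"{mean:.1f} ± {standard_error:.2f}"

# Made-up MSEs from four hypothetical repetitions (the experiments use 20).
example_mses = [20.5, 20.9, 21.1, 20.7]
cell = format_cell(example_mses)
print(cell)
```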
6 Conclusion
Since the very act of deploying a system can result in shifts that bias the system in practice (e.g., Lum and Isaac (2016)), machine learning practitioners need to become increasingly aware of how deployment and training environments can differ. To this end, we have introduced a framework for identifying and expressing desired invariances to changes in the DGP, and the Graph Surgery estimator as an approach for learning a model stable to such changes. The graph surgery estimator finds a stable and identifiable interventional distribution which is expressible as a function of the training data and can be fit using arbitrarily complex models. Further, the interventional distributions are strictly more applicable than the conditional distributions used by existing graph pruning approaches and are optimal from a distributionally robust perspective. In future work we wish to consider methods for when the selection diagram does not entail any identifiable stable predictors. In particular, some form of sensitivity analysis for dealing with uncertainty in the DGP, such as infusing bounded-magnitude distributional robustness with prior knowledge of the DGP, seems promising.
References
 Bareinboim and Pearl (2012) Bareinboim, E. and Pearl, J. (2012). Transportability of causal effects: Completeness results. In AAAI, pages 698–704.
 Berger (1985) Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media.
 Dheeru and Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.

 Fanaee-T and Gama (2013) Fanaee-T, H. and Gama, J. (2013). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pages 1–15.
 Gong et al. (2016) Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C., and Schölkopf, B. (2016). Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pages 2839–2848.
 Gretton et al. (2009) Gretton, A., Smola, A. J., Huang, J., Schmittfull, M., Borgwardt, K. M., and Schölkopf, B. (2009). Covariate shift by kernel mean matching. In Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D., editors, Dataset shift in machine learning, chapter 2, pages 131–160. The MIT Press.
 Grünwald et al. (2004) Grünwald, P. D., Dawid, A. P., et al. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367–1433.
 Jaber et al. (2018) Jaber, A., Zhang, J., and Bareinboim, E. (2018). Causal identification under Markov equivalence. In Uncertainty in Artificial Intelligence.
 Koller and Friedman (2009) Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
 Lipton et al. (2018) Lipton, Z. C., Wang, Y.-X., and Smola, A. (2018). Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning.
 Lum and Isaac (2016) Lum, K. and Isaac, W. (2016). To predict and serve? Significance, 13(5):14–19.
 Magliacane et al. (2018) Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. (2018). Domain adaptation by using causal inference to predict invariant conditional distributions. In Proceedings of the Thirty-Second Conference on Neural Information Processing Systems.

 Meinshausen (2018) Meinshausen, N. (2018). Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), pages 6–10. IEEE.
 Pearl (1998) Pearl, J. (1998). Graphical models for probabilistic and causal reasoning. In Quantified representation of uncertainty and imprecision, pages 367–389. Springer.
 Pearl (2009) Pearl, J. (2009). Causality. Cambridge university press.
 Pearl and Bareinboim (2011) Pearl, J. and Bareinboim, E. (2011). Transportability of causal and statistical relations: a formal approach. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 247–254. AAAI Press.
 Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms. MIT press.
 Quiñonero-Candela et al. (2009) Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009). Dataset shift in machine learning.
 Rojas-Carulla et al. (2018) Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. (2018). Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(36).
 Rothenhäusler et al. (2018) Rothenhäusler, D., Bühlmann, P., Meinshausen, N., and Peters, J. (2018). Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229.
 Schölkopf et al. (2012) Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and anticausal learning. In International Conference on Machine Learning, pages 459–466.
 Schulam and Saria (2017) Schulam, P. and Saria, S. (2017). Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems, pages 1697–1708.
 Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
 Shpitser and Pearl (2006a) Shpitser, I. and Pearl, J. (2006a). Identification of conditional interventional distributions. In 22nd Conference on Uncertainty in Artificial Intelligence, UAI 2006, pages 437–444.
 Shpitser and Pearl (2006b) Shpitser, I. and Pearl, J. (2006b). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 1219.
 Sinha et al. (2018) Sinha, A., Namkoong, H., and Duchi, J. (2018). Certifying some distributional robustness with principled adversarial training. In ICLR.
 Spirtes et al. (2000) Spirtes, P., Glymour, C. N., Scheines, R., Heckerman, D., Meek, C., Cooper, G., and Richardson, T. (2000). Causation, prediction, and search. MIT press.
 Storkey (2009) Storkey, A. (2009). When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, pages 3–28.
 Subbaswamy and Saria (2018) Subbaswamy, A. and Saria, S. (2018). Counterfactual normalization: Proactively addressing dataset shift and improving reliability using causal mechanisms. In Uncertainty in Artificial Intelligence.
 Sugiyama et al. (2007) Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005.
 Tian and Pearl (2002) Tian, J. and Pearl, J. (2002). A general identification condition for causal effects. In AAAI.
 Verma and Pearl (1991) Verma, T. and Pearl, J. (1991). Equivalence and synthesis of causal models. In Proceedings of Sixth Conference on Uncertainty in Artificial Intelligence, pages 220–227.

 Zech et al. (2018) Zech, J. R., Badgeley, M. A., Liu, M., Costa, A. B., Titano, J. J., and Oermann, E. K. (2018). Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS medicine, 15(11):e1002683.
 Zhang et al. (2015) Zhang, K., Gong, M., and Schölkopf, B. (2015). Multi-source domain adaptation: A causal view. In AAAI, pages 3150–3157.
 Zhang et al. (2013) Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. (2013). Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827.
Appendix A ID Algorithm
We now restate the identification algorithm (ID) (Tian and Pearl, 2002; Shpitser and Pearl, 2006b) using the modified presentation in Jaber et al. (2018). When the interventional distribution of a set of variables is identified, the ID algorithm returns it in terms of observational distributions (i.e., if the intervention is represented using do notation, then the resulting expression contains no do terms). The ID algorithm is complete (Shpitser and Pearl, 2006b), so if the interventional distribution is not identifiable, then the algorithm throws a failure exception. Note that denotes an induced subgraph which consists of only the variables in and the edges between variables in .
We will need the following definition:
Definition 4 (C-component).
In an ADMG, a c-component consists of a maximal subset of observed variables that are connected to each other through bidirected paths. A vertex with no incoming bidirected edges forms its own c-component.
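To make the definition concrete, the following sketch computes c-components as the connected components of the graph formed by the bidirected edges alone, optionally restricted to an induced subgraph. The function names and the edge-list representation are our own illustrative choices and are not taken from any cited implementation.

```python
def induced_bidirected_edges(bidirected_edges, subset):
    """Keep only bidirected edges whose endpoints both lie in `subset`."""
    subset = set(subset)
    return [(u, w) for u, w in bidirected_edges if u in subset and w in subset]

def c_components(vertices, bidirected_edges):
    """Partition `vertices` into c-components using union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, w in bidirected_edges:
        parent[find(u)] = find(w)  # merge the two components

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

# Hypothetical ADMG with bidirected edges A <-> B and B <-> C; D has none.
vertices = ["A", "B", "C", "D"]
bidirected = [("A", "B"), ("B", "C")]
components = c_components(vertices, bidirected)

# C-components in the subgraph induced by {A, C, D}: the path through B is gone.
sub_vertices = ["A", "C", "D"]
sub_components = c_components(
    sub_vertices, induced_bidirected_edges(bidirected, sub_vertices)
)
```

In the full graph A, B, and C share one c-component while D forms its own. In the induced subgraph every vertex is a singleton, which shows why the c-component of a vertex depends on the subgraph under consideration.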
We also restate the following Corollary (Jaber et al., 2018, Corollary 1):
Corollary 4.
Given an ADMG with observed variables and unobserved variables , , and , if is not in the same c-component with a child of in , then is identifiable and is given by
where denotes the c-component of in the induced subgraph .
This Corollary allows us to derive the post-intervention distribution after intervening on from the post-intervention distribution after intervening on the variables in . The modified presentation of Tian’s ID algorithm given in Jaber et al. (2018) is in Algorithm 3, which computes the identifying functional for the post-interventional distribution of the variables in after intervening on the variables in by recursively finding the identifying functional for each c-component in the post-intervention subgraph.
Appendix B Experiment Details
B.1 Hyperparameters for Baselines
Causal transfer learning (CT) has hyperparameters dictating how much data to use for validation, the significance level, and which hypothesis test to use. In all experiments we set valid split , delta=0.05, and use hsic = False (using HSIC did not improve performance and was much slower).
Anchor regression requires an “anchor” variable. In the real data experiment we use season as the anchor. It also has a hyperparameter which dictates the magnitude of perturbation shifts it protects against. We set this to twice the maximum standard deviation of any variable in the training data (including the target).
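The perturbation hyperparameter described above can be computed as in the sketch below. The function name is hypothetical, and the choice of the population standard deviation (NumPy's default `ddof=0`) is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def perturbation_strength(X_train, y_train):
    """Twice the largest per-variable standard deviation, target included."""
    data = np.column_stack([X_train, y_train])
    # Assumption: population standard deviation (ddof=0).
    return 2.0 * data.std(axis=0).max()

# Toy example with two features and a target.
X_toy = np.array([[0.0, 0.0], [2.0, 10.0]])
y_toy = np.array([0.0, 4.0])
strength = perturbation_strength(X_toy, y_toy)
```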
B.2 Simulated Experiment
We generate data from linear Gaussian structural equation models (SEMs) defined by the DAG in Figure 1a:
We generate the coefficients and take .
In simulated experiment 1, is the mutable variable so across source and target domains we vary the value of . Similarly, in experiment 2 (target shift) is the mutable variable so we vary the value of .
We perform both experiments as follows: In each domain we sample 1000 examples. We generate coefficients , and take 1000 samples. This is used as the training data for Graph Surgery. Then we generate 1000 samples for each of 9 other randomly generated values of or for experiments 1 and 2, respectively. The 10,000 total samples from 10 domains are used to train the OLS and CT baselines. Then we evaluate on 1000 samples from each of 100 test domains. The (or ) values are taken from an equally spaced grid. For experiment 1 we consider in while for experiment 2 we consider . This process is repeated 500 times to yield results on 50,000 test domains.
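The flavor of this procedure can be illustrated with a small simulation. Everything specific in the sketch is an assumption made for illustration: the graph (an unobserved confounder of the target and a treatment, both of which cause a symptom), the unit coefficients and noise variances, the test grid, and the use of a single source domain for both methods. The stable predictor is derived for this assumed graph as E[target | do(treatment), symptom], obtained from P(target) P(symptom | target, treatment) under linear Gaussian models. It is not the authors' implementation of the surgery estimator.

```python
import numpy as np

def sample_domain(rng, n, policy_coef):
    """Sample (target, treatment, symptom) from the assumed linear Gaussian SEM."""
    confounder = rng.normal(size=n)                            # unobserved
    target = 1.0 * confounder + rng.normal(size=n)             # stable mechanism
    treatment = policy_coef * confounder + rng.normal(size=n)  # mutable mechanism
    symptom = target - treatment + rng.normal(size=n)          # stable mechanism
    return target, treatment, symptom

def fit_linear(features, response):
    """Least squares with intercept; returns coefficients and residual variance."""
    design = np.column_stack([np.ones(len(response))] + list(features))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid_var = np.mean((response - design @ coef) ** 2)
    return coef, resid_var

rng = np.random.default_rng(0)
t, a, c = sample_domain(rng, 10_000, policy_coef=1.0)  # single source domain

# Unstable baseline: regress the target on both observed covariates.
ols_coef, _ = fit_linear([a, c], t)

# Stable predictor: fit P(target) and P(symptom | target, treatment), then use
# the Gaussian posterior mean of the target with the treatment held fixed.
mu_t, var_t = t.mean(), t.var()
(g0, g_t, g_a), var_c = fit_linear([t, a], c)
gain = var_t * g_t / (g_t ** 2 * var_t + var_c)

test_policies = [-8.0, -4.0, 1.0, 4.0, 8.0]  # values of the mutable coefficient
ols_mse, stable_mse = [], []
for w in test_policies:
    t_te, a_te, c_te = sample_domain(rng, 5_000, policy_coef=w)
    ols_pred = ols_coef[0] + ols_coef[1] * a_te + ols_coef[2] * c_te
    stable_pred = mu_t + gain * (c_te - g0 - g_a * a_te - g_t * mu_t)
    ols_mse.append(np.mean((t_te - ols_pred) ** 2))
    stable_mse.append(np.mean((t_te - stable_pred) ** 2))

for w, m_ols, m_stable in zip(test_policies, ols_mse, stable_mse):
    print(f"policy coef {w:5.1f}: OLS MSE {m_ols:.2f}, stable MSE {m_stable:.2f}")
```

Under these assumptions the error of the stable predictor does not depend on the mutable coefficient, whereas the OLS error grows as the test coefficient moves away from the value seen in training.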
The boxplot of the test domain MSEs across the 50,000 test domains for Experiment 1 is shown in Figure 4. In this example, Surgery is the only consistently stable model. CT is stable when it selects the empty conditioning set, but in of the 500 runs CT picks all features (i.e., it is equivalent to OLS). We see that the two (at least sometimes) stable methods have much lower variance in performance. Thus, stability implies less variance across domains, which is desirable in the proactive transfer setting.
The boxplot of the test domain MSEs across the 50,000 test domains for Experiment 2 is shown in Figure 5. In this example, Surgery is the only consistently stable model. CT has no stable conditioning set. In of runs CT conditioned on all features. The other times it tended to use the empty set. However, in this experiment is not stable and uses less information than (which OLS models) which is what causes it to have worse performance than OLS. Thus, even in the challenging target shift scenario, graph surgery allows us to estimate a stable model when no stable pruning or conditional model exists.