1 Introduction
Understanding the causal relationships in a system or application of interest is perhaps the most desirable goal in terms of understanding and interpretability. There is a rich history of developments from various disciplines, dating back to ancient times: “Felix, qui potuit rerum cognoscere causas” – Fortunate who was able to know the causes of things (Georgics, Virgil, 29 BC). One might think that for pure prediction tasks, without any ambition of interpretability, knowing the causes or the causal structure is not important. We will explain here how these problems are related and as a consequence: (i) one can obtain “better” predictions when incorporating causal aspects and (ii) one can infer causal structure from a certain predictive perspective.
Inferring causal structure and effects from data is a rapidly growing area. When having access to data from fully randomized studies, Jerzy Neyman made a pioneering contribution using a potential outcome model (SplawaNeyman, 1990).
Randomized studies serve as the gold standard and the corresponding inference of causal effects can be viewed as “confirmatory” due to the fact that the underlying model assumptions are not substantially more restrictive than for say a standard regression type problem, see for example Dawid (2000); Pearl (2009); Hernán and Robins (2010); Rubin and Imbens (2015); VanderWeele (2015). Often though, the data at hand does not come from a (fully) randomized study: the question is now whether one can still infer causal effects and under what kind of assumptions this is possible. A range of different approaches have been suggested, see for example Greenland et al. (1999); Robins et al. (2000); Spirtes et al. (2000); Richardson et al. (2002); Hernán and Robins (2006); Tchetgen and VanderWeele (2012); Chickering (2002); Kalisch and Bühlmann (2007); Maathuis et al. (2009); Hauser and Bühlmann (2015), exhibiting different degrees of “confirmatory” nature for inferring causal effects. Since causal inference is very ambitious, these techniques should be thought as “geared towards causality” but not necessarily able to infer the underlying true causal effects. Still, the point is that they do something “more intelligent towards causality” than an analysis based on a standard potentially nonlinear regression or classification framework. We believe that this is an important area in statistics and machine learning: in particular, these techniques often have a more “causaltype” and thus more interesting interpretation than standard machine learning methods and therefore, this topic is important in the advent of “interpretable machine learning”.
1.1 A framework based on invariance properties
We will focus here on a particular framework with corresponding methods which are “geared towards” causal solutions: with stronger assumptions (but less strong than for some competitor methods) they infer causal effects while under more relaxed and perhaps more realistic assumptions, they are still providing solutions for a “diluted form of causality”^{3}^{3}3I am grateful to Ed George who suggested this term. which are often more meaningful than what is provided by regression or classification techniques. This can be made mathematically more rigorous, in terms of a novel form of robustness.
The construction of methods is relies on exploiting invariance from heterogeneous data. The heterogeneity can be unspecific perturbations and in this sense, the current work adds to the still yet quite small literature on statistics for perturbation data.
1.2 Our contribution
2 Predicting potential outcomes, heterogeneity and worst case risk optimization
Predicting potential outcomes is a relevant problem in many application areas. Causality deals with a quantitative answer (a prediction) to a “What if I do question” or a “What if I perturb question”.
2.1 Two examples for prediction of potential outcomes
The first example is from genomics (Stekhoven et al., 2012)
. The response variable of interest is the flowering time of the
Arabidopsis thaliana plant (the time it takes until the plant is flowering) and the covariates are gene expressions from 21’326 genes, that is from a large part of the genome. The problem is to predict the flowering time of the plant when making single gene interventions, that is, when single genes are “knocked out”. The data is from the observational state of the system only without any interventions. Therefore, this is a problem of predicting a potential outcome which has never been observed in the data. Even if one fails to infer the true underlying causal effects when making interventions, our viewpoint is that a prediction being better than from a statetothe art regression method is still very useful for e.g. prioritizing experiments which can be done subsequently in a biology lab, see for example Maathuis et al. (2010) and the Editorial (2010).The second example is about predicting behavior of individuals when being treated by advertisement campaigns. Such an advertisement could happen on social media for political campaigns or various commercial products. Consider the latter, namely an advertisement for commercial products on social media. The response of interest is how deep an individual user clicks on the advertisement and the subsequent webpages, the covariates are attributes of the user. The task is to predict the response if one would intervene and show to a certain user “X” a certain advertisement “A”: but there is no data for user “X” or similar users as “X” being exposed (or treated) with advertisement “A”. Thus, it is a problem of predicting a potential outcome which has never been observed in the data. As mentioned in the genomic example above, even if we cannot infer the underlying true causal effect of treatment with an advertisement, it is still informative and valuable to come up with a good prediction for the response under an intervention which we have never seen in the data. See for example Bottou et al. (2013) or also Brodersen et al. (2015).
2.2 The heterogeneous setting with different environments
We consider data from different observed (known) environments, and we sometimes refer to them also as experimental settings or subpopulations or perturbations:
(1) 
with
response vectors
, covariate design matrices and denotes an environment from the space of observed environments. Here, denotes the sample size in environment . We assume that the samples in environmentare i.i.d. realizations of a univariate random variable
and a dimensional random vector .Examples. As a first example, consider data from 10 different countries where we know the correspondence of each data point to one of the 10 countries. Then the space of observed (known) environments can be encoded by the labels from . As a second example, consider economical data which is collected over time. Different environments or subpopulations then correspond to different blocks of consecutive time points. Assuming that these blocks are known and given, the space then contains the labels for these different blocks of subpopulations.
Heterogeneity can also occur outside the observed data. Thus, we consider a space of unobserved environments
(2) 
which is typically much larger than the space of observed environments .
Examples (cont.). In the examples from above, the space could be: the 10 countries which are observed in the data and all other countries in the world; or the subpopulations of economical scenarios which we have observed in the data until today and all subpopulations of future scenarios which we have not seen in the data.
A main task is to make predictions for new unseen environments as discussed next.
2.3 A prediction problem and worst case risk optimization
We consider the following prediction problem. [linewidth=0.5mm] Predict given such that the prediction “works well” or is “robust” for all based on data from much fewer environments . Note that is nonobserved. The meaning of the aim above is that one is given in the future new covariates from and the goal is to predict the corresponding
. The terminology “works well” or is “robust” is understood here in the sense of performing well in worstcase scenarios. We note that the problem above is also related to transfer learning
(Pratt, 1993; Pan and Yang, 2010; RojasCarulla et al., 2018).In a linear model setting, this prediction task exhibits a relation to the following worst case risk optimization:
(3) 
This problem has an interesting connection to causality. Before giving a more rigorous formulation, we describe the connection in a more loose sense, for the purpose of easier understanding. We consider the class which includes all heterogeneities or perturbations fulfilling two main assumptions:
 adhoc condition 1:

does not act directly on .
 adhoc condition 2:

does not change the mechanism between and .
 adhoc aim:

ideally, should change the distribution of .
The adhoc conditions 1 and 2 are formulated precisely in assumption (B) in Section 3.1. Regarding the adhoc aim: if there are many which change the distribution of , this introduces more observed heterogeneities in which in term is favorable for better identification of causal effects.
Figure 2 is a graphical illustration of the adhoc conditions and aim above. For this purpose, we may think that the environments are generated from a random variable . We remark that there could be also hidden confounding variables between and : more details are given in Section 4.
An interesting connection to causality is then as follows:
(4) 
where . The definition of the causal parameter and the precise description of the result is given later in Section 3.1. The point here is to emphasize that the causal parameter or the causal solution is optimizing a certain worst case risk. This opens the door to think about causality in terms of optimizing a certain (worst case) risk. We believe that this is a very useful way which might ease some of the more complicated issues on structure search for causal graphs and structural equation models.
3 Invariance of conditional distributions
A key assumption for inferring causality from heterogeneous data as in (1) is an invariance assumption. It reads as follows:
 (A):

There exists a subset of the covariate indices (including the empty set) such that
That is, when conditioning on the covariates from (denoted by ), the conditional distribution is invariant across all environments from .
 (A):

Analogous but now for the much larger set of environments .
In a linear model setting, the invariance assumption translates as follows. There exists a subset and a regression coefficient vector with such that:
where denotes the same distribution for all . That is, when conditioning (regressing) on , the resulting regression parameter and error term distribution are the same for all environments . From a practical point of view, this is an interesting invariance or stability property and the set of covariates plays a key component in “stabilizing across the environments”, see also Figure 12 in Section 6.
If the invariance assumption holds, we are sometimes interested in describing the sets which fulfill invariance. We then denote by
 (A):

The subset fulfills invariance saying that
 (A):

analogous but now for the set of environments .
When considering the invariance assumption for the set of unknown (future) environments , the sets for which (A) holds are particularly interesting as they lead to invariance and stability for new, future environments which are not observed in the data. This is a key for solving worst case risk optimization with respect to a class of perturbations which can be arbitrarily strong as in (4).
3.1 Invariance and causality
A main question is whether there are sets for which (A) holds and if so, whether there are many such sets and how one can describe them. Obviously, this depends on and the problem then becomes as follows: under what model can we have an interesting description of sets which satisfy the invariance assumption (A).
To address this at least in part, we consider structural equation models (SEMs):
(5) 
where denotes the parental set of the response variable in the corresponding causal influence diagram, and the distribution of
can be arbitrary but assuming finite second moments and positive definite covariance matrix. Often in the literature, a SEM is considered for all the variables:
(6) 
with mutually independent. The model in (3.1) is a special case of the SEM in (3.1), the former now assuming a structural equation part for the variables. Furthermore, the formulation in (3.1) also allows other hidden variables which may act on but do not have a confounding effect on . The case of hidden confounders will be discussed later in Section 4.
The (direct) causal variables for are defined to be
The environments or perturbations change the distributions of and in model (3.1) and we denote the corresponding random variables by and . The adhoc conditions 1 and 2 from Sections 2.3 are now formulated as follows:
 (B)

The structural equation in (3.1) remains the same, that is for all
 (B)

analogous but now for the set of environments .
We note that the distributions of are allowed to change.
The following simple result describes the special role of causality with respect to invariance.
Proposition 1.
Assume a partial structural equation model as in (3.1). Consider the set of environments such that (B) holds. Then, the set of causal variables satisfies the invariance assumption with respect to , that is (A) holds.
The proof is trivial. The conditional distribution of given is given by and the distribution of , and these quantities do not depend on .
In presence of hidden confounder variables, invariance and causal structures can still be linked under certain assumptions: this will be discussed in Section 4. Proposition 1 says that causal variables lead to invariance: this has been known since a long time, dating back to Haavelmo (1943), see Figure 3. The result in Proposition 1 does not say anything about other sets of variables which satisfy the invariance assumption.
3.2 Invariant causal prediction
Roughly speaking, Haavelmo (1943) already realized that
The reverse relation
(7) 
has not been considered until recently (Peters et al., 2016). This might be due to the fact that with nowadays largescale data, it is much easier to infer invariance from data and thus, the implication from invariance to causal structures becomes much more interesting and useful.
The problem with the reverse implication (7) is the wellknown identifiability issue in causal inference. We typically cannot identify the causal variables , unless we have very many environments (or perturbations) or making specific assumptions on nonlinearities (Hoyer et al., 2009; Bühlmann et al., 2014)
, nonGaussian distributions
(Shimizu et al., 2006)or error variances
(Peters and Bühlmann, 2014). We will address the identifiability issue, which is often complicated in practice, in a fully “automatic” way as discussed next.The starting point is to perform a statistical test whether a subset of covariates satisfies the invariance assumption for the observed environments in
. The nullhypothesis for testing is:
and the alternative is the logical complement, namely that assumption (A) does not hold. It is worthwhile to point out that we only test with respect to the environments which are observed in the data. To address the identifiability issue, we intersect all subsets of covariates which lead to invariance, that is:
(8) 
The specification of a particular test is discussed below in Section
3.2.2. The procedure in (8) is called Invariant Causal
Prediction (ICP). The method is implemented in the Rpackage
InvariantCausalPrediction for linear models (Meinshausen, 2018b) and
nonlinearICP for nonlinear models (HeinzeDeml and Peters, 2017), see also
Section 3.2.2 below.
The computation of ICP in (8) can be expensive. There is an algorithm which provably computes ICP without necessarily going through all subsets (Peters et al., 2016): in the worstcase though, this cannot be avoided. If the dimension is large, we advocate some preliminary variable screening procedure based on regression for the data pooled over all environments, see also Section 3.2.2 for the case with linear models below. Such regressiontype variable screening procedures are valid when assuming a faithfulness condition: it ensures that the causal variables must be a subset of the relevant regression variables , see for example Spirtes et al. (2000).
We first highlight the property of controlling against false positive causal selections.
Theorem 1.
(Peters et al., 2016) Assume a structural equation model for the response as in (3.1) and that the environments or perturbations in satisfy the assumption (B). Furthermore, assume that the tests used in (8
) are valid, controlling the type I error. Then, for
we have thatThe interesting fact is that one does not need to care about identifiability: it is addressed automatically in the sense that if a variable is in , it must be identifiable as causal variable for
, at least with controllable probability
(e.g. being equal to ). For example, even if the environments in correspond to ineffective heterogeneities (e.g. no actual perturbations), the statement is still valid.Theorem 1 does not say anything about power. The power depends on the observed environments , besides sample size and the choice of a test. Roughly speaking, the power increases as becomes larger: the more heterogeneities or perturbations, the better we can identify causal effects and this is also true for the procedure in (8). In fact, Peters et al. (2016, Th.2) discuss cases where the ICP method in (8) is able to identify the all the causal variables, i.e., where asymptotically as sample size tends to infinity. These special cases, where essentially all the variables are perturbed, are far from a complete understanding of necessary and sufficient conditions for identifiability of the causal variables. Furthermore, the construction with the intersection in (8) might be often conservative. In terms of power, one wants to reject as many sets of covariates which are violating invariance. This seems awkward at first sight but the fact that the tests are highly dependent helps to increase the probability that all sets which do not fulfill the invariance hypothesis are rejected. A more quantitative statement of the latter and of power properties for ICP in general is difficult.
3.2.1 Some robustness properties.
The ICP procedure exhibits two robustness properties (and a third one is mentioned in Section 3.2.2 below).
Hidden confounding variables. Even in presence of hidden confounding variables (as in the scenario of Figure 5), we have the following: assuming a faithfulness condition (Spirtes et al., 2000, cf.),
where denotes the ancestor variables of . The details are given in Peters et al. (2016, Prop.5). In practice, this is interesting as we would still pick up some variables which indirectly have a total causal effect on .
Direct effects of environments on . The adhoc conditions 1 and 2 in Section 2.3 or the condition (B) are violated if the environments directly affect . With a faithfulness condition (Spirtes et al., 2000, cf.), we would then always infer that no set would fulfill the invariance assumption and therefore, as sample size gets sufficiently large, rejecting for all , we would obtain that . Therefore, even under violation of the assumption that the environments or perturbations should not act directly on , the ICP procedure gives a conservative answer and claims no variable to be causal. In the literature, this scenario is also known under the name of socalled invalid instrumental variables (Guo et al., 2018, cf.).
3.2.2 Concrete tests.
We first assume that the structural equation for in (3.1) is linear with Gaussian error:
and is independent of . The invariance hypotheses in then becomes:
 :

for all its holds that,
Thanks to the Gaussian assumption, exact tests for this nullhypothesis exist, for example with the Chow test (Chow, 1960). This is implemented in the Rpackage InvariantCausalPrediction (Meinshausen, 2018b).
The variable prescreening methods, mentioned above after the introduction of the ICP estimator (8), become also much simpler in linear models. One can use e.g. the Lasso (Tibshirani, 1996) on the data pooled over all observed environments and then employ the ICP estimator for all subsets of . To justify this, one needs to establish that, under , asymptotically: sufficient conditions for this are given in e.g. Bühlmann and van de Geer (2011).
When the true underlying model for the response in (3.1) is sufficiently nonlinear but if one uses ICP in (8) with invariance tests based on a misspecified Gaussian linear model, it typically happens that no set satisfies the invariance assumption resulting in . One could use instead a testing methodology for nonlinear and nonGaussian models to infer whether a subset of variables fulfills the invariance assumption. A corresponding proposal is given in HeinzeDeml et al. (2018) with the accompanying Rpackage nonlinICP (HeinzeDeml and Peters, 2017).
3.2.3 Application: single gene knockout experiments.
We briefly summarize here the results from an application to predict single gene interventions in yeast (Saccharomyces cerevisiae); for details we refer to Meinshausen et al. (2016). The data consists of mRNA expression measurements of genes in yeast. We have 160 observational measurements from wildtype yeast and 1479 interventional data arising from single gene perturbation, where a single gene has been deleted from a strain (Kemmeren et al., 2014). The goal is to predict the expression level of a new unseen gene perturbation, i.e., the potential outcome of a new unseen perturbation.
More specifically, and using the terminology of the framework outlined before, we aim to infer some of the (direct) causal variables of a target gene. Denote the gene expression measurements by . We consider as a response variable the expression of the th gene and the corresponding covariates the expressions of all other genes:
The index and the covariate dimension is . The aim is now to infer , assuming a linear structural equation model as in (3.1) but now with functions and being linear.
We construct the environments in a crude way: where the labels “1” and “2” denote the 160 observational and the interventional sample points, respectively. Thus, we pool all the interventional samples into one environment since we have no replicates of single gene perturbations. Other pooling schemes could be used as well: it is typically only an issue of power how to create good environments while the type I error control against false positive causal selections (Theorem 1) is still guaranteed, see also Section 3.2.4 below.
For validation, we do a trainingtest data splitting with a fold validation scheme of the interventional data: that is, we use all observational and of the interventional data with or . We only consider true strong intervention effects (SIEs) where an expression has a strong effect on (the perturbed value are outside of the observed data range).
The predictions are based on ICP (with Bonferroni correction due to using the ICP procedure many times, once for each gene being the response variable). One then finds that 8 significant genes at corrected significance level and 6 of them are true positive strong intervention effects (Peters et al., 2016). What sticks out is that only very few causal genes have been found: in a graph with 6170 nodes (corresponding to all the genes), only 8 significant directed edges are found. When prioritizing the most promising causal genes, Figure 4 describes ROCtype curves: here ICP is supplemented with stability selection (Meinshausen and Bühlmann, 2010) on top of it for creating a “stabilized” ranking.
3.2.4 Unknown environments.
If the environments are not known, one can try to estimate them from data. The type I error control against false positive causal selections holds as long as the estimated partition does not involve descendant variables of the response : for example, one could use some clustering algorithm based on nondescendants of .
In practice, it is sometimes reasonable to assume that certain variables are nondescendants of . A canonical case is with timesequential data: then, the environments can be estimated as different blocks of data at consecutive time points: this is some kind of a change point problem but now aimed for most powerful discovery with ICP. The methodology with timesequential data is developed and analyzed in Pfister et al. (2018) and implemented in the Rpackage seqICP (Pfister and Peters, 2017).
4 Anchor regression: relaxing conditions
The main concern with ICP in (8) and the underlying invariance principle is the violation of the assumption in (B), and thus also of the adconditions 1 and 2 from Section 2.3. Such a violation can happen under various scenarios and we mention a few in the following.
It could happen that only approximate instead of exact invariance holds. This would imply that we should only search for approximate invariance, something which we will incorporate in the anchor regression methodology described below in Section 4.2. Another scenario, say in a linear model, is that invariance occurs in the nullhypothesis for the parameter but with residual distributions which change for varying ; or viceversa with invariant residual distribution but different regression parameters for varying . This could be addressed in ICP by testing only either the parameter or residualpart, where invariance is assumed to hold for the causal variables.
Perhaps the most prominent violation is in terms of hidden confounding variables . The influence diagram from Figure 2 can then be extended to the situation of an instrumental variables (IV) regression model (Bowden and Turkington, 1990; Angrist et al., 1996; Rubin and Imbens, 2015), illustrated in Figure 5.
Now, it is more convenient to think of the environments (or instruments) as random variables and we also model all random variables in the system in terms of a structural equation model (unlike as in (3.1), where we have only one structural equation for ): instead of (3.1), we consider now the IV regression model
where are mutually independent of each other. The variable is hidden (not observed) and possibly confounding between and (if some components of are descendants of or , these are not relevant for inferring the effect from to and hence w.l.o.g. is a source node). Here, denotes the parental variables variables (variables which are parents of and from the set ). The main assumption in the IV regression model requires that the instruments or environments do not directly influence the response variable nor the hidden confounders (this is an extension of the adhoc conditions 1 and 2 from Section 2.3); and ideally, they would influence or change the variables in a sufficiently strong way (as with the adhoc aim in Section 2.3). We are not going into more details from the vast literature on IV models with e.g. weak instruments, invalid instruments or partially identifiable parameters, see for example Stock et al. (2002); Murray (2006); Kang et al. (2016); Guo et al. (2018). Instead, we will relax this main assumption for an instrument as discussed next.
4.1 The anchor regression model
We will allow now that the environments can act directly also on and , relaxing a main assumption in IV regression models. In the terminology of IV regression, we thus consider the case with socalled invalid instruments (Guo et al., 2018, cf.). This is an illposed situation for causal inference (from to ), yet it is still possible to obtain more meaningful results than what is obtained from standard regression methodology, see Section 4.3.2.
Instead of using the terminology “environment” we now us the word “anchor” (or anchor variable), for reasons which become more clear below. The structure of an anchor regression model is given by the graph in Figure 6.
The anchor regression model, for simplicity here only in linear form, is defined as follows: it is a structural equation model of the form,
(9) 
We assume that all the variables are centered with mean zero. There could be feedback cycles and the graph of the structure could have cycles. In the latter case, we assume that is invertible (which always holds if the graph is acyclic). The main assumption is that is a source node and thus, the contribution of enters as an additional linear term . Because of this, we use the terminology “anchor”: it is the anchor which is not influenced by other variables in the system and thus, it remains as the “static pole”.
As mentioned above, one cannot identify the causal parameter , the row and columns corresponding to and in the matrix . However, we we will discuss in Section 4.3, that we can still get an interesting solution which optimizes a worst case risk over a class of scenarios or perturbations , using the terminology and spirit of (3).
4.2 Causal regularization and the anchor regression estimator
We are particularly interested in the structural equation for in the model (9) which we write as
with and .
In the instrumental variables regression model where would directly influence only, it holds that are mutually independent (but not so in the anchor regression model) and this then implies that
Actually, we could substitute uncorrelatedness with independence in the IV model.
In cases where the causal parameter is nonidentifiable, one could look for the solution
(10) 
This leads to a unique parameter (assuming that is positive definite), and there is a simple pragmatic principle behind it. This principle and the property of uncorrelatedness of the residual term with the anchor variables also plays a key role in the anchor regression model for a class of socalled shift perturbations.
Similar to the idea in (10), we define the anchor regression estimator by using a regularization term, referred to as causal regularization, which encourages orthogonality or uncorrelatedness of the residuals with the anchor variables . We denote the data quantities by the response vector , the covariate design matrix and the matrix of the observed anchor variables. Let the projection in onto the column space of . In practice, if the columns of and are not centered, we would include an intercept column in . We then define
(11) 
We implicitly assume here that . For ,
equals the ordinary least squares estimator, for
we obtain the twostage least squares procedure from IV regression and for we adjust for the anchor variables in . The properties of the anchor regression estimator in (11) are discussed next and make the role of the tuning parameter more clear.The criterion function on the righthand side of (11) is a convex function in ; for highdimensional scenarios, we can add an norm penalty, or any other sparsity inducing penalty:
(12) 
4.3 Shift perturbations and robustness of the anchor regression estimator
The anchor regression estimator solves a worst case risk optimization problem over a class of shift perturbations.
We define the system under shift perturbations by the same equations as in (9) but replacing the term from the contributions of the anchor variables by a deterministic or stochastic perturbation vector . That is, the system under shift perturbations satisfies:
The shift vector is assumed to be in the span of , that is for some vector . The class of considered perturbations, denoted earlier as , are shift perturbations as follows:
(13)  
Thus, contains shift perturbations whose length is typically as .
For the case with one can characterize shiftinvariance of residuals as follows.
Proposition 2.
(Rothenhäusler et al., 2018, Th.3) Assume that is positive definite. Consider
Since and have mean zero it follows also that and hence . Then,
With the goal to make the residuals invariant (for the class of shift perturbations), we aim to estimate the regression parameter such that the residuals are encouraged to be fairly uncorrelated with . This leads to the construction of the anchor regression estimator in (11) or (12).
A more general result than Proposition 2 is possible, uncovering a robustness property of the anchor regression estimator. We focus first on the population case. We denote by the projection operator, namely . In the anchor regression model (9) with being invertible, and are linear functions in . The population version of the anchor regression estimator is
(14) 
Then, the following fundamental result holds.
Theorem 2.
(Rothenhäusler et al., 2018, Th.1) For any it holds that
Thus, Theorem 2 establishes an exact duality between the causal regularized risk (which is the population version of the objective function for the estimator in (11)) and worst case risk over the class of shift perturbations. The regularization parameter equals the “strength” of the shift, as defined in (13): regarding its choice, see Section 4.3.1. A useful interpretation of the theorem is as follows. The worst case risk over shift perturbations can be considered as the one corresponding to future unseen data: this risk for future unseen data can be represented as a regularized risk for the data which we observe in the training sample. We further note that Theorem 2 holds for any , and thus it also holds when taking the “argmin” on both sides of the equation. We then obtain that the population version in (14) is the minimizer of the worst case risk:
One can argue that also in the finite sample highdimensional sparse scenario, the anchor regression estimator in (11) or (12) are asymptotically optimizing the worst case risk:
where, under suitable conditions, as , or in the highdimensional scenario. The details are given in Rothenhäusler et al. (2018, Sec.4).
4.3.1 Choosing the amount of causal regularization.
The value of in the estimator (11) or (12) relates to the class of shift perturbations over which we achieve the best protection against the worst case, see Theorem 2. Thus, we could decide apriori how much protection we wish to have or how much perturbation we expect to have in new test data.
Alternatively, we could do some sort of crossvalidation. If the anchor variable encodes discrete environments, we could leave out data from one or several environments and predict on the leftout testdata optimizing the worst case error. If the anchor variables are continuous, the following characterization is useful.
The value
in the causal regularization has also an interpretation as a quantile. Assuming a joint Gaussian distribution of the variables
and in the model (9), it holds that(15)  
The righthand side also equals a worst case risk over shift perturbations, as stated in Theorem 2. Therefore, the relation above links the insample (for the nonperturbed data) quantiles to the outsample (for the perturbed data) worst case risk. The exact correspondence of and might not hold for more general situations. However, the qualitative correspondence is that a large (high quantile) corresponds to a high value for the regularization term. Thus, one could choose a quantile value , e.g. , for the quantile of the conditional expectation of the squared error and then calculate the which optimizes this quantile. Under a Gaussian assumption, we can estimate the quantities replacing expectations by mean squared test samples. This result also indicates that anchor regression with a large value of should result in good values for the high quantile of the squared prediction error (unconditional on ).
4.3.2 Diluted form of causality.
In the anchor regression model in general, it is impossible to infer the direct causal parameter from to . If the assumptions for instrumental variables regression are fulfilled, i.e., no direct effects from to and to and , then the anchor regression estimator with equals the unique two stage least squares estimator and consistently infers ; in particular, we also have that .
If the IV assumptions do not hold, for example in presence of invalid instruments where the anchor variables directly affect or , the parameter with or being large is still a much more meaningful quantity than the standard regression parameter (with ). For large values of , the corresponding is minimizing a worst case risk over a class of large shift perturbations. This parameter and its entries with large values corresponding to important variables is interesting in many applications: the variables (with corresponding large parameter components of ) are “key drivers” in a system of interest to explain the response in a stable manner over many perturbations. In fact, for , we define
to be the set of variables which are “diluted causal” for the response (the variables which are relevant for in the framework of “diluted causality”).
4.4 Some empirical illustrations
We illustrate the performance and behavior of the estimator (11) in the linear anchor regression model (9). We consider the case where the anchors are invalid instruments and lowdimensional and hence, inferring the causal effects from to is impossible. The model for the variables and is as in model (M3) described later in Section 5.4, with , and . The structural equation for the response is
The training sample size is chosen as . The test sample is constructed with the same structure but now the anchor variables are multiplied by the factor and the test sample size is chosen as . It is instructive to describe here how the anchor variables act on : the model is
where are coefficients which have been sampled i.i.d. from . The equation above changes in the test sample where we multiply and with the factor which results in perturbations for the variables. Figure 7 describes the quantiles of the absolute outsample prediction error where has been prespecified. We also show the empirical relation between the insample quantile in (15) and the outsample mean squared prediction error .
We conclude from the left panel of Figure 7 that the anchor regression estimator exhibits a substantially better prediction performance under perturbation outsample scenarios than the ordinary least squares estimator. If the outsample data would be generated as the insample training data, that is without new perturbations in the test data, there would be no gain, or actually a slight loss, of anchor regression over OLS (empirical results not shown here). Anchor regression only paysoff for prediction if some perturbations happen in new future data points which amplify the effect of heterogeneity (generated from the anchor variables ) in the future test data. This is briefly discussed next.
4.5 Distributional robustness
Anchor regression and causality can be viewed from the angle of distributional robustness (HeinzeDeml and Meinshausen, 2017; Meinshausen, 2018a). Distributional robustness refers to optimizing a worst case risk over a class of distributions:
where
denotes a loss function,
is the random variable generating a data point (e.g. ), is an unknown (potentially high or infinitedimensional) parameter andis a class of probability distributions.
A typical choice for the class of distributions is
where is the reference distribution, for example being the empirical measure, is a metric, for example the Wasserstein distance, and is a prespecified radius, see for example Sinha et al. (2017); Gao et al. (2017).
For causality and anchor regression, the class of distributions is given by a causal or “anchortype” model consisting of perturbation distributions. Theorem 2 describes the connection more explicitly: the class consists of amplifications of the observed heterogeneity in the data. This, because the perturbation distributions arise from shifts , thus being shifts in the direction of the effects from the observed anchor variable contribution which equals the term in the model (9); and the strength of the shifts in the perturbations is given by the parameter which has an analogous role as the radius in the definition of above. Thus, with anchor regression, the class of distributions is not predefined via a metric and a radius but rather through the observed heterogeneities in and a strength of perturbations or “radius” .
5 Nonlinear anchor regression
We present here a methodology for anchor regression generalizing the linear case. A core motivation is to design an algorithm for which any “machine learning” technology for regression can be pluggedin, including for example Random Forests or even Deep Neural Nets. We will argue that in presence of heterogeneity, essentially “any” of these machine learning methods can be improved using additional causal regularization.
We consider a nonlinear anchor regression structural equation model where the dependence of on is a nonlinear function:
(16) 
We can consider more general functions although a too high degree of nonlinearity can become more difficult with the anchor regression algorithm presented below. This is discussed after Corollary 1. We note that with and being equal to zero we have a nonlinear instrumental variables regression model with linear dependences on the instruments (or the anchors) and the hidden variables .
5.1 The objective function and the algorithm
Consider a nonlinear regression function , defined as
which is a map from to , where and are the domains of and , respectively. Given data , many estimation methods or algorithms for can be written in the form
(17) 
where denotes a suitable subclass which incorporates certain restrictions such as smoothness or sparsity. On the righthand side we have used a slight abuse of notation where denotes the vector of function values at the observed . As will be seen below, the estimation algorithm is not necessarily of the form as in (17) but we use this formulation for the sake of simplicity.
Using the abbreviated notation with mentioned above, the nonlinear anchor regression estimator is defined as:
(18) 
As in (11), denotes the linear projection onto the column space of the observed anchor variable matrix . If the anchor variables have a linear effect onto , and the hidden confounders , it is reasonable to consider the estimator with the linear projection operator . We will give some justification for it in Section 5.5.
It is straightforward to see that the objective function can be represented as
5.2 Anchor Boosting: a “regularized” approximation of the estimator
The question is how to compute or approximate the estimator in (5.1). We aim here for a solution where standard existing software can be used,
Our proposal is to use boosting. For this, we consider the negative gradient
and pursue iterative fitting of the negative gradient. The negative gradient fitting is done with a prespecified base learner (or “weak learner”): it is a regression estimator based on input data with denoting a response vector (e.g.
corresponds to the estimator applied to the original data). This is the standard recipe of gradient boosting
(Breiman, 1999; Friedman, 2001; Bühlmann and Yu, 2003; Bühlmann and Hothorn, 2007) and the method is summarized in Algorithm 1.The choice of the stopping iteration is discussed in Section 5.2.1. The stopping iteration is a regularization parameter: it is governing a biasvariance tradeoff, on top of the regularization of causal regularization from anchor regression which is encoded in the matrix .
From another view point for regularization, we can think of the regression estimator as an operator (when evaluated at the observed ; it is a “hat” operator):
Then, it is straightforward to see that
that is, the boosting operator at iteration is equal to . If has a suitable norm being strictly , then there is geometrical convergence to the identity which would fit the data perfectly. This indicates, that we should stop the boosting procedure to avoid overfitting. However, especially for large values of in , the norm of will be larger than one and geometrical contraction, i.e. overfitting, to the response vector will not happen.
5.2.1 Some criteria for choosing the stopping iteration.
In connection with a Random Forests (Breiman, 2001) learner for the estimator in step 5 of Algorithm 1 we found that we can choose the stopping iteration such as to minimize the objective function or even overshooting the minimum by say 10%. Formally, the two stopping rules are:
(19)  
(20) 
In general, we propose a rule based on the following observation. Consider the population version of and denote it by
where denotes the best linear projection onto . That is,
Thus, is a function of and random.
The population version of the optimization is
due to linearity of ; here denotes the class of measurable functions of . We decompose this problem into two parts:
This motivates a finite sample criterion guarding against overfitting. Whenever we estimate by with boosting as in Algorithm 1, the residual sum of squares should be at least as large as the residual sum of squares of of any good and reasonably tuned machine learning estimator . That is, we should choose the number of boosting iterations such that it minimizes the objective function under the given constraint:
(21) 
This guards against overfitting and avoids choosing a too large boosting iteration. We could also modify the rule to choose as in (20) under the constraint that .
5.2.2 Random Forests learner with a linear model component.
Besides using Random Forests as a base learner in the AnchorBoosting Algorithm 1, we consider also a modification which fits a linear model first and applies Random Forests on the resulting residuals. This modification is denoted by “LM+RF”, standing for Linear Models+Random Forests. The LM+RF procedure is built on the idea that the linear model part is the “primary part” and the remaining nonlinearities are then estimated by Random Forests. This base leaner is able to cope well with estimating partial linear functions.
The “LM+RF” algorithms is defined as follows. Given a response variable and some covariates :

Fit a linear model of versus
, by default including an intercept. The fitted (linear) regression function is denoted by
. 
Compute the residuals from step 1, denoted by . Fit a Random Forests of (being now the response variable) versus : the fitted regression function is denoted by .

The final estimator is .
The LM+RF base learner is typically outperforming plain Random Forests if the underlying regression function is an additive combination of a linear and a nonlinear function.
5.2.3 Plugin of any “machine learning” algorithm.
5.3 Some empirical results
We consider the following structural equation model for the insample data used for training with sample size .
Comments
There are no comments yet.