1 Introduction
Deep neural networks (DNNs) have achieved outstanding performance on prediction tasks like visual object and speech recognition (Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2015). Issues can arise when the learned representations rely on dependencies that vanish in test distributions; see, for example, Quionero-Candela et al. (2009); Torralba and Efros (2011); Csurka (2017)
and references therein. Such domain shifts can be caused by changing conditions such as color, background or location changes. Predictive performance is then likely to degrade. The “Russian tank legend” is an example where the training data was subject to sampling biases that were not replicated in the real world. Concretely, the story relates how a machine learning system was trained to distinguish between Russian and American tanks from photos. The accuracy was very high but only due to the fact that all images of Russian tanks were of bad quality while the photos of American tanks were not. The system learned to discriminate between images of different qualities but would have failed badly in practice
(Emspak, 2016); a different version of this story can be found in Yudkowsky (2008). For a directly equivalent example, see §7.2. Existing biases in datasets used for training machine learning algorithms tend to be replicated in the estimated models
(Bolukbasi et al., 2016). For another example involving Google’s photo app, see Crawford (2016) and Emspak (2016). In §7 we show many examples where unwanted biases in the training data are picked up by the trained model. As any bias in the training data is in general used to discriminate between classes, these biases will persist in future classifications, raising also considerations of fairness and discrimination (Barocas and Selbst, 2016).

To address the issues outlined above, we propose Conditional variance Regularization (CoRe) to give differential weight to different latent features. Conceptually, we take a causal view of the data generating process and categorize the latent data generating factors into ‘conditionally invariant’ (core) and ‘orthogonal’ (style) features, as in Gong et al. (2016)
. It is desirable that a classifier uses only the
core features as they pertain to the target of interest in a stable and coherent fashion. Basing a prediction on the core features alone yields a stable predictive accuracy even if the style features are altered. CoRe yields an estimator which is approximately invariant under changes in the conditional distribution of the style features (conditional on the class labels). Consequently, it is robust with respect to adversarial domain shifts, arising through arbitrarily strong interventions on the style features. CoRe relies on the fact that for certain datasets we can observe ‘grouped observations’ in the sense that we observe the same object under different conditions. Rather than pooling over all examples, CoRe exploits knowledge about this grouping, i.e. that a number of instances relate to the same object. By penalizing between-object variation of the prediction less than variation of the prediction for the same object, we can steer the prediction to be based more on the latent core features and less on the latent style features.

The remainder of this manuscript is structured as follows: §2 starts with a few motivating examples, showing simple settings where the style features change in the test distribution such that standard empirical risk minimization approaches would fail. In §3 we review related work; we introduce notation in §4 and in §5 we formally introduce conditional variance regularization (CoRe). In §6, CoRe is shown to be equivalent to minimizing the risk under noise interventions in a regression setting and is shown to lead to adversarial risk consistency in a partially linear classification setting. In §7 we evaluate the performance of CoRe in a variety of experiments.
To summarize, our contributions are the following:

Causal framework. We extend the causal framework of Gong et al. (2016) to address situations where the domain variable itself is latent.

Conditional variance penalties and distributional robustness. We introduce conditional variance penalties, which are equivalent to a suitable graph Laplacian penalty. For classification, we show in Theorem 1 that we can achieve consistency under a risk definition that allows adversarial domain changes. For regression, we show in §6.2 that the estimator achieves distributional robustness against intervention distributions where the variance of domain-specific noise is increased. A one-to-one correspondence between the penalty parameter and the set of distributions we are protected against is shown.

Software.
We illustrate our ideas using synthetic and real-data experiments. A TensorFlow implementation of CoRe as well as code to reproduce some of the experimental results are available at https://github.com/christinaheinze/core.
2 Motivating examples
To motivate the methodology we propose, consider the examples shown in Figure 1. Example 1 shows a setting where a linear decision boundary is suitable. Panel (a) shows the training data, where class 1 is associated with red points and dark blue points correspond to class 0. If we were asked to draw a decision boundary based on the training data, we would probably choose one that is approximately horizontal. The style feature here corresponds to a linear direction in the plane. Panel (b) shows the test set, where the style feature is intervened upon for class 1 observations: class 1 is associated with orange squares, cyan squares correspond to class 0. Clearly, a horizontal decision boundary would misclassify all test points of class 1.

Example 2 shows a setting where a nonlinear decision boundary is required. Here, the core feature corresponds to the distance of a point from the origin while the style feature corresponds to the angle between the horizontal axis and the vector from the origin to the point. Panel (c) shows the training data and panel (d) additionally shows the test data, where the style, i.e. the distribution of the angle, is intervened upon. Clearly, a circular decision boundary yields optimal performance on both training and test set but is unlikely to be found by a standard classification algorithm when only using the training set for the estimation. We will return to these examples in §5.4.

Lastly, we mimic the Russian tank legend in the third example by manipulating the face images from the CelebA dataset (Liu et al., 2015): in the training set, images of class “wearing glasses” are associated with a lower image quality than images of class “not wearing glasses”. Examples are shown in panel (e). In the test set, this relation is reversed, i.e. images showing persons wearing glasses are of higher quality than images of persons without glasses, with examples in panel (f). We will return to this example in §7.2 and show that training a convolutional neural network to distinguish between people wearing glasses or not works well on test data drawn from the same distribution as the training data (with error rates below 2%) but fails entirely on the shown test data, with error rates worse than 65%.
3 Related work
For general distributional robustness, the aim is to learn

  $\operatorname{argmin}_{\theta} \; \sup_{F \in \mathcal{F}} \; E_F\big[\ell\big(Y, f_\theta(X)\big)\big]$   (1)

for a given set of distributions $\mathcal{F}$, loss $\ell$, and prediction $f_\theta(x)$. The set $\mathcal{F}$ is the set of distributions on which one would like the estimator to achieve a guaranteed performance bound and is often taken to be of the form

  $\mathcal{F} = \{ F : D(F, F_0) \le \rho \}$   (2)

with $\rho > 0$ a small constant and $D$ being, for example, a divergence (Namkoong and Duchi, 2017; Ben-Tal et al., 2013; Bagnell, 2005) or a Wasserstein distance (Shafieezadeh-Abadeh et al., 2017; Sinha et al., 2017; Gao et al., 2017). The distribution $F_0$ can be the true (but generally unknown) population distribution from which the data were drawn or its empirical counterpart. The distributionally robust targets (1) can often be expressed in penalized form; see Gao et al. (2017); Sinha et al. (2017); Xu et al. (2009).
In this work, we do not try to achieve robustness with respect to a set of distributions that is predefined by a Kullback-Leibler divergence or a Wasserstein metric as in (2). We try to achieve robustness against a set of distributions that are generated by interventions on latent style variables. As such, the relevant distribution set in (1) has to be learnt from data and we need a causal model to define the set of distributions we would like to protect ourselves against.

Similar to this work in terms of their goals are the work of Gong et al. (2016) and Domain-Adversarial Neural Networks (DANN) proposed in Ganin et al. (2016), an approach motivated by the work of Ben-David et al. (2007). The main idea of Ganin et al. (2016) is to learn a representation that contains no discriminative information about the origin of the input (source or target domain). This is achieved by an adversarial training procedure: the loss on domain classification is maximized while the loss of the target prediction task is minimized simultaneously. The data generating process assumed in Gong et al. (2016) is similar to our model, introduced in §4.2, where we detail the similarities and differences between the models (cf. Figure 2). Gong et al. (2016) identify the conditionally independent features by adjusting a transformation of the variables to minimize the squared MMD distance between distributions in different domains. (The distinction between ‘conditionally independent’ features and ‘conditionally transferable’ features, the latter being the former modulo location and scale transformations, is for our purposes not relevant as we do not make a linearity assumption in general.) The fundamental difference between these very promising methods and our approach is that we use a different data basis: the domain identifier is explicitly observable in Gong et al. (2016) and Ganin et al. (2016), while it is latent in our approach. Instead, we exploit the presence of an identifier variable that relates to the identity of an object (for example identifying a person). In other words, we do not assume that we have data from different domains but just different realizations of the same object under different interventions.
Causal modeling has related aims to the setting of transfer learning and guarding against adversarial domain shifts. Specifically, causal models have the defining advantage that the predictions will be valid even under arbitrarily large interventions on all predictor variables (Haavelmo, 1944; Aldrich, 1989; Pearl, 2009; Schölkopf et al., 2012; Peters et al., 2016; Zhang et al., 2013, 2015; X. Yu, 2017; M. Rojas-Carulla, 2017; Magliacane et al., 2017). There are two difficulties in transferring these results to the setting of adversarial domain changes in image classification. The first hurdle is that the classification task is typically anti-causal, since the image we use as a predictor is a descendant of the true class of the object we are interested in, rather than the other way around. The second challenge is that we do not want to guard against arbitrary interventions on any or all variables but only against a shift of the style features. It is hence not immediately obvious how standard causal inference can be used to guard against large domain shifts. Recently, various approaches have been proposed that leverage causal motivations for deep learning or use deep learning for causal inference.
Chalupka et al. (2014) characterize learning the visual causes for a certain target behavior and thereby model perceiving systems. Various approaches focus on cause-effect inference, where the goal is to find the causal relation between two random variables $X$ and $Y$ (Lopez-Paz et al., 2017; Lopez-Paz and Oquab, 2017; Goudet et al., 2017). Lopez-Paz et al. (2017) propose the Neural Causation Coefficient (NCC) to estimate the probability of $X$ causing $Y$ and apply it to finding the causal relations between image features. Specifically, the NCC is used to distinguish between features of objects and features of the objects’ contexts. Lopez-Paz and Oquab (2017) note the similarity between structural equation modeling and CGANs (Mirza and Osindero, 2014). One CGAN is fitted in the direction $X \to Y$ and another one in the direction $Y \to X$. Based on a two-sample test statistic, the estimated causal direction is returned.
Goudet et al. (2017) use generative neural networks for cause-effect inference, to identify structures and to orient the edges of a given graph skeleton. Bahadori et al. (2017) devise a regularizer that combines an $\ell_1$ penalty with weights corresponding to the estimated probability of the respective feature being causal for the target. The latter estimates are obtained by causality detection networks or scores such as those estimated by the NCC. Besserve et al. (2017) draw connections between GANs and causal generative models, using a group-theoretic framework. Kocaoglu et al. (2017) propose causal implicit generative models to sample from conditional as well as interventional distributions, using a conditional GAN architecture (CausalGAN). The generator structure needs to inherit its neural connections from the causal graph, i.e. the causal graph structure must be known. Louizos et al. (2017) propose the use of deep latent variable models and proxy variables to estimate individual treatment effects. Kilbertus et al. (2017) exploit causal reasoning to characterize fairness considerations in machine learning. Distinguishing between the protected attribute and its proxies, they derive causal non-discrimination criteria. The resulting algorithms avoiding proxy discrimination require classifiers to be constant as a function of the proxy variables in the causal graph, thereby bearing some structural similarity to our style features.

Distinguishing between core and style features can be seen as a form of disentangling factors of variation. Estimating disentangled factors of variation has gathered a lot of interest in the context of generative modeling (Higgins et al., 2017; Chen et al., 2016; Bouchacourt et al., 2017). For example, Matsuo et al. (2017) propose a “Transform Invariant Autoencoder” whose goal is to reduce the dependence of the latent representation on a specified transform of the object in the original image. Specifically, Matsuo et al. (2017) predefine location as the style feature and the goal is to learn a latent representation that does not include location. Our approach is different as we do not predefine which features are considered style features. The style features in our approach could be location but also image quality, posture, brightness, background and contextual information, or something entirely unknown. We try to learn a representation of style and core features from data by exploiting the grouping of training samples. Additionally, the approach in Matsuo et al. (2017) cannot effectively deal with a confounding situation where the distribution of the style features differs conditional on the class (this is a natural restriction for their work, however, as the class label is not even observed in the autoencoder setting). As in CoRe, Bouchacourt et al. (2017) exploit grouped observations. In a variational autoencoder framework, they aim to separate style and content; they assume that samples within a group share a common but unknown value for one of the factors of variation while the style can differ. Denton and Birodkar (2017) propose an autoencoder framework to disentangle style and content in videos using an adversarial loss term, where the grouping structure induced by clip identity is exploited. Here we try to solve a classification task directly, without estimating the latent factors explicitly as in a generative framework.

4 Setting
We first describe the general notation used before describing the causal graph that allows us to compare the setting of adversarial domain shifts to transfer learning, domain adaptation and adversarial examples.
4.1 Notation
Let $Y \in \mathcal{Y}$ be a target of interest. Typically $\mathcal{Y} = \mathbb{R}$ for regression or $\mathcal{Y} = \{1, \ldots, K\}$ in classification with $K$ classes. Let $X \in \mathbb{R}^p$ be a predictor, for example the pixels of an image. The prediction for $Y$, given $X = x$, is of the form $f_\theta(x)$ for a suitable function $f_\theta$ with parameters $\theta$, where the parameters correspond to the weights in a DNN. For regression, $f_\theta(x) \in \mathbb{R}$, whereas for classification $f_\theta(x) \in \mathbb{R}^K$ corresponds to the estimated conditional probability distribution of $Y$, given $X = x$. Let $\ell$ be a suitable loss that maps $y$ and $f_\theta(x)$ to $\mathbb{R}_{\ge 0}$. A standard goal is to minimize the expected loss or risk $E[\ell(Y, f_\theta(X))]$.

Let $(x_i, y_i)$ for $i = 1, \ldots, n$ be the samples that constitute the training data and $f_\theta(x_i)$ the prediction for $y_i$. The standard approach is to simply pool over all available observations, ignoring any grouping information that might be available. The pooled estimator thus treats all examples identically by summing over the empirical loss as
  $\hat\theta^{\text{pool}} = \operatorname{argmin}_\theta \; \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f_\theta(x_i)\big) + \gamma \cdot \text{pen}(\theta)$,   (3)
where $\text{pen}(\theta)$ is a complexity penalty, for example a ridge term on the weights of a convolutional neural network. All examples that compare to the pooled estimator will include a ridge penalty as default. Different penalties can exploit underlying geometries, such as the Laplacian regularized least squares of Belkin et al. (2006). In fact, the proposed estimator will be of this form, exploiting grouping information in the data.
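As a toy illustration of the pooled objective (3), the following sketch evaluates a logistic loss plus a ridge term for a linear model. The function names, data and the value of `gamma` are our own illustrative choices, not part of the paper:

```python
import numpy as np

def logistic_loss(y, logits):
    # Numerically stable logistic loss for labels y in {0, 1}:
    # max(z, 0) - y*z + log(1 + exp(-|z|))
    return np.maximum(logits, 0) - y * logits + np.log1p(np.exp(-np.abs(logits)))

def pooled_objective(theta, X, y, gamma=1.0):
    """Empirical risk plus ridge penalty, cf. (3): mean loss + gamma * ||theta||^2."""
    logits = X @ theta
    return logistic_loss(y, logits).mean() + gamma * np.sum(theta ** 2)
```

Any grouping information in the data is ignored here; the CoRe estimator of §5 adds a conditional variance term to exactly this kind of objective.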
4.2 Causal graph
The full causal structural model for all variables is shown in panel (b) of Figure 2. The domain variable $D$ is latent, in contrast to Gong et al. (2016), whose model is shown in panel (a) of Figure 2. We add the variable $\mathrm{ID}$ (the identity of a person, for example), whose distribution can change conditional on $Y$. In Figure 2, $Y \to \mathrm{ID}$, but in some settings it might be more plausible to consider $\mathrm{ID} \to Y$. For our proposed method both options are possible. The variable $\mathrm{ID}$ is used to group observations; it is typically discrete and relates to the identity of the underlying object. The variable $\mathrm{ID}$ can be assumed to be latent in the setting of Gong et al. (2016).
The rest of the graph is in analogy to Gong et al. (2016). The prediction is anti-causal, that is, the image $X$ that we use as a predictor is non-ancestral to $Y$. In other words, the class label is causal for the image and not the other way around. The causal effect from the class label on the image is mediated via two types of latent variables: the so-called core or ‘conditionally invariant’ features $X^{\text{core}}$ and the orthogonal or style features $X^{\text{style}}$. The distinguishing factor between the two is that external interventions are possible on the style features but not on the core features. If the interventions have different distributions in different domains, then the conditional distribution of $X^{\text{core}}$, given $Y$, is constant across domains while the conditional distribution of $X^{\text{style}}$, given $Y$, can change across domains. The style features and the class label are confounded, in other words, by the latent domain $D$. In contrast, the core or ‘conditionally invariant’ features satisfy $X^{\text{core}} \perp D \mid Y$. The style variable can include point of view, image quality, resolution, rotations, color changes, body posture, movement etc. and will in general be context-dependent. (The type of features we regard as style and which ones we regard as core features can conceivably change depending on the circumstances; for instance, is the color “gray” an integral part of the object “elephant” or can it be changed so that a colored elephant is still considered to be an elephant?) The style intervention variable $\Delta$ influences the latent style $X^{\text{style}}$ and hence also the image $X$. In potential outcome notation, we let $X^{\text{style}}(\Delta)$ be the style under intervention $\Delta$ and $X(y, \mathrm{id}, \Delta)$ the image for class $y$, identity $\mathrm{id}$ and style intervention $\Delta$. The latter is sometimes abbreviated as $X(\Delta)$ for notational simplicity. Finally, $f_\theta(X(\Delta))$ is the prediction under the style intervention $\Delta$. For a formal justification of using a causal graph and potential outcome notation simultaneously, see Richardson and Robins (2013).
To be specific, if not mentioned otherwise we will assume a causal graph as follows. For independent noise variables $\varepsilon_{\mathrm{ID}}, \varepsilon_{\text{core}}, \varepsilon_{\text{style}}$ with positive density on their support and continuously differentiable functions $k_{\mathrm{id}}, k_{\text{core}}, k_{\text{style}}$ and $k_x$,

  identifier: $\mathrm{ID} \gets k_{\mathrm{id}}(Y, \varepsilon_{\mathrm{ID}})$
  core or conditionally invariant features: $X^{\text{core}} \gets k_{\text{core}}(Y, \mathrm{ID}, \varepsilon_{\text{core}})$
  style or orthogonal features: $X^{\text{style}} \gets k_{\text{style}}(D, Y, \mathrm{ID}, \varepsilon_{\text{style}})$
  image: $X \gets k_x(X^{\text{core}}, X^{\text{style}})$   (4)

Of these, $Y$, $\mathrm{ID}$ and $X$ are observed, whereas $D$, $X^{\text{core}}$, $X^{\text{style}}$ and the noise variables are latent. The model can be generalized by allowing further independent noise terms inside $k_{\text{core}}$ and $k_x$, but the model above is already fairly general and keeps notation simpler than the fully general version.
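A toy simulation of a structural equation model of this form may help fix ideas. All functional forms, dimensions and coefficients below are illustrative assumptions of ours, not specifications from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, style_shift=0.0):
    """Sample (y, id_, x) from a toy analogue of model (4).

    The identity distribution depends on y, the core feature is stable
    given y, and style_shift mimics an external intervention on the
    style feature (a domain change).
    """
    y = rng.integers(0, 2, size=n)                      # class label (observed)
    id_ = y * 50 + rng.integers(0, 50, size=n)          # identifier (observed)
    core = y + 0.1 * rng.standard_normal(n)             # latent core feature
    style = 0.5 * y + style_shift + 0.1 * rng.standard_normal(n)  # latent style
    x = np.stack([core + style, core - style], axis=1)  # observed "image"
    return y, id_, x
```

Sampling with a nonzero `style_shift` changes the distribution of the observed `x` while leaving the mechanism from `core` to `y` intact, which is exactly the kind of shift CoRe is designed to withstand.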
4.3 Data
To summarize, we assume we have samples $(x_i, y_i, \mathrm{id}_i)$ for $i = 1, \ldots, n$, where the observations $\mathrm{id}_i$ of the variable $\mathrm{ID}$ can also contain unobserved values. Let $m$ be the number of unique realizations of $\mathrm{ID}$ and let $\{S_j\}_{j = 1, \ldots, m}$ be a partition of $\{1, \ldots, n\}$ such that, for each $j$, the realizations $\mathrm{id}_i$ are identical for all $i \in S_j$. (Observations where the variable $\mathrm{ID}$ is unobserved are not grouped, that is, each such observation is counted as a unique observation of $\mathrm{ID}$.) The cardinality of $S_j$ is denoted by $n_j = |S_j|$. Then $n = \sum_{j=1}^m n_j$ is again the total number of samples and $c = \sum_{j: n_j \ge 2} n_j$ the total number of grouped observations. Typically $n_j = 1$ for most samples and occasionally $n_j = 2$, but one can also envisage scenarios with larger groups of the same identifier.
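The partition into groups can be computed by bucketing sample indices by their observed identifier; this is a generic sketch with our own variable names:

```python
from collections import defaultdict

def group_by_id(ids):
    """Partition sample indices into groups sharing the same observed ID.

    Samples with an unobserved ID (None) each form their own singleton
    group. Returns the list of groups and the number of samples that
    belong to groups of size >= 2 (the grouped observations).
    """
    buckets = defaultdict(list)
    for i, id_ in enumerate(ids):
        if id_ is None:
            buckets[("unobs", i)].append(i)   # unobserved IDs are not grouped
        else:
            buckets[("obs", id_)].append(i)
    groups = list(buckets.values())
    n_grouped = sum(len(g) for g in groups if len(g) >= 2)
    return groups, n_grouped
```

Only the groups of size at least two carry information for the conditional variance penalty of §5; singleton groups behave exactly as in pooled estimation.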
4.4 Domain adaptation, adversarial examples and adversarial domain shifts
In this work, we are interested in guarding against adversarial domain shifts. We use the causal graph to explain the related but not identical goals of domain adaptation, transfer learning and guarding against adversarial examples.

Domain adaptation and transfer learning. Assume we have data from several domains, each with a new joint distribution of $(X, Y)$. The shift in the distribution of the style features across domains causes a shift in both the marginal distribution of $X$ and in the conditional distribution of $Y$, given $X$. If we consider domain adaptation and transfer learning together, the goal is generally to give the best possible prediction in each domain. In contrast, we do not aim to give the best possible prediction in each domain, as we aim to infer a single prediction that should work as well as possible in a worst-case sense over a set of distributions generated by domain changes. Some predictive accuracy needs to be sacrificed compared to the best possible prediction in each domain.
Standard adversarial examples. The setting of adversarial examples in the sense of Szegedy et al. (2014) and Goodfellow et al. (2015) can also be described by the causal graph above, by identifying the intervention $\Delta$ with pixel-by-pixel additive effects on the image. The magnitude of the intervention is then typically assumed to be within an $\epsilon$-ball in $\ell_q$-norm around the origin, with $q = \infty$ or $q = 2$ for example. If the input dimension is large, many imperceptible changes in the coordinates of $x$ can cause a large change in the output, leading to a misclassification of the sample. The goal is to devise a classification in this graph that minimizes the adversarial loss

  $E\big[\, \sup_{\|\delta\| \le \epsilon} \ell\big(Y, f_\theta(X(\delta))\big) \big]$,   (5)

where $X(\delta)$ is the image under the intervention $\delta$ and $f_\theta(X(\delta))$ is the estimated conditional distribution of $Y$, given the image under the chosen intervention. See Sinha et al. (2017) for recent work on achieving robustness to a predefined class of distributions.
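For a linear logistic model with labels in $\{-1, +1\}$, the inner supremum in (5) over an $\ell_\infty$-ball of additive perturbations has a well-known closed form: the worst case shifts the margin by $\epsilon \|\theta\|_1$ against the label. The following sketch is our own illustration of that fact, not code from the paper:

```python
import numpy as np

def adversarial_logistic_loss(theta, x, y, eps):
    """Worst-case logistic loss over ||delta||_inf <= eps for a linear model.

    For y in {-1, +1}: sup_delta log(1 + exp(-y theta^T (x + delta)))
                     = log(1 + exp(-(y theta^T x - eps * ||theta||_1))).
    """
    margin = y * (x @ theta) - eps * np.sum(np.abs(theta))
    return np.log1p(np.exp(-margin))
```

With `eps = 0` this reduces to the ordinary logistic loss; the loss grows monotonically with `eps`, which is the quantitative version of the vulnerability described above.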

Adversarial domain shifts. Here we are interested in arbitrarily strong interventions on the style features $X^{\text{style}}$, which are not known explicitly in general. Analogously to (5), the adversarial loss under arbitrarily large style interventions is

  $E\big[\, \sup_{\Delta} \ell\big(Y, f_\theta(X(\Delta))\big) \big]$.   (6)

In contrast to (5), the interventions can be arbitrarily strong, but we assume that the style features can only change certain aspects of the image, while other aspects of the image (mediated by the core features) cannot be changed. In contrast to Ganin et al. (2016), we use the term “adversarial” to refer to adversarial interventions on the style features, while the notion of “adversarial” in domain-adversarial neural networks describes the training procedure. Nevertheless, the motivation of Ganin et al. (2016) is equivalent to ours, that is, to protect against shifts in the distribution(s) of test data, which we characterize by distinguishing between core and style features. We also look at random interventions $\Delta$. Each distribution of the random interventions induces a distribution for $(X, Y)$. Let $\mathcal{F}$ be the set of all such induced distributions. We then try to minimize the worst case across this distribution class, as in (1), with the difference to standard distributional robustness being that the set $\mathcal{F}$ takes a specific form induced by the causal graph.
The adversarial loss of the pooled estimator (3) will in general be infinite; see §6.1 for a concrete example. Using panel (b) in Figure 2, one can show that the pooled estimator will work well in terms of the adversarial loss if both (i) the class label $Y$ is conditionally independent of the style features $X^{\text{style}}$, given the core features $X^{\text{core}}$, and (ii) the relations between $Y$, $\mathrm{ID}$, $X^{\text{core}}$ and $X^{\text{style}}$ are not deterministic. The first condition (i) implies that if the estimator learns to extract $X^{\text{core}}$ from the image $X$, there is no further information in $X^{\text{style}}$ that explains $Y$ and, therefore, the direction corresponding to $X^{\text{style}}$ is not required for predicting $Y$. The second condition (ii) ensures that $X^{\text{style}}$ cannot simply replace $X^{\text{core}}$ in the first condition. From (i) and (ii), we see that the pooled estimator will work well in terms of the adversarial loss if the style features carry no information about the class label once the core features are known (cf. Figure 2).
5 Conditional variance regularization
5.1 Invariant parameter space
In order to minimize the adversarial loss (6), we have to ensure that the prediction $f_\theta(X(\Delta))$ is as constant as possible as a function of the style intervention $\Delta$. Let $\Theta_{\text{inv}}$ be the invariant parameter space

  $\Theta_{\text{inv}} = \big\{ \theta : f_\theta\big(x(y, \mathrm{id}, \delta)\big) \text{ is constant as a function of } \delta \text{ for all } y, \mathrm{id} \big\}$.

For all $\theta \in \Theta_{\text{inv}}$, the adversarial loss (6) is identical to the loss under no interventions at all. More precisely, let $X(0)$ be a shorthand notation for $X(y, \mathrm{id}, 0)$, the images in absence of external interventions: for all $\theta \in \Theta_{\text{inv}}$,

  $E\big[\, \sup_{\Delta} \ell\big(Y, f_\theta(X(\Delta))\big) \big] = E\big[\ell\big(Y, f_\theta(X(0))\big)\big]$.

The optimal predictor in the invariant space is

  $\theta^{\text{opt}} = \operatorname{argmin}_{\theta \in \Theta_{\text{inv}}} E\big[\ell\big(Y, f_\theta(X(0))\big)\big]$.   (7)

If $f_\theta$ is only a function of the core features $X^{\text{core}}$, then $\theta \in \Theta_{\text{inv}}$. The challenge is that the core features are not directly observable and we have to infer the invariant space from data.
5.2 CoRe estimator
To get an approximation to the optimal invariant parameter vector (7), we use empirical risk minimization under an invariance constraint:

  $\hat\theta^{\text{core}} = \operatorname{argmin}_{\theta \in \hat\Theta_{\text{inv}}(\tau)} \; \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f_\theta(x_i)\big)$,   (8)

where the first part is the empirical version of the expectation in (7). The unknown invariant parameter space $\Theta_{\text{inv}}$ is approximated by an empirically invariant space $\hat\Theta_{\text{inv}}(\tau)$. For all structural equation models of the form (4), the invariant space is constrained by the space of models that have vanishing expected conditional variance in the sense that $\Theta_{\text{inv}} \subseteq \{\theta : C(\theta) = 0\}$, where

  $C(\theta) = E\big[ \mathrm{Var}\big( f_\theta(X) \,\big|\, Y, \mathrm{ID} \big) \big]$

is the expected conditional variance of $f_\theta(X)$, given $(Y, \mathrm{ID})$. As empirical approximation we use

  $\hat\Theta_{\text{inv}}(\tau) = \big\{ \theta : \hat C(\theta) \le \tau \big\}$,   (9)

where $\hat C(\theta)$ is an estimate of the expected conditional variance (details below). Setting $\tau = 0$ is equivalent to demanding that the conditional variance vanishes, which implies that the estimated predictions for the class labels are identical across all images that share the same identifier, while slightly larger values of $\tau$ allow for some small degree of variation. Under the right assumptions we get $\hat\Theta_{\text{inv}}(\tau) \to \Theta_{\text{inv}}$ for $n \to \infty$ and $\tau \to 0$. We return to this question in §6.1. One can equally use the Lagrangian form of the constrained optimization in (8), with a penalty parameter $\lambda \ge 0$ instead of a constraint $\hat C(\theta) \le \tau$, to get

  $\hat\theta^{\text{core}} = \operatorname{argmin}_\theta \; \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f_\theta(x_i)\big) + \lambda \, \hat C(\theta)$.   (10)

We will give an explicit interpretation of this conditional variance penalty in §6.2. We can also add a standard ridge penalty in addition to the conditional variance penalty.
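A minimal sketch of the Lagrangian objective (10) for a linear model with squared loss, using a naive within-group variance estimate; the setup, variable names and normalization here are our own choices for illustration:

```python
import numpy as np

def core_objective(theta, X, y, ids, lam):
    """Pooled squared loss plus lam times a within-group variance penalty, cf. (10).

    Groups are samples sharing the same id; singleton groups contribute
    nothing to the penalty, so with lam = 0 (or no grouped data) this is
    just the pooled empirical risk.
    """
    preds = X @ theta
    risk = np.mean((y - preds) ** 2)
    penalty = 0.0
    for id_ in np.unique(ids):
        group = preds[ids == id_]
        if len(group) >= 2:
            penalty += np.sum((group - group.mean()) ** 2)
    return risk + lam * penalty / len(y)
```

Minimizing this over `theta` (e.g. by gradient descent) trades off fit against the requirement that predictions agree within each group, which is the mechanism by which the estimator is steered away from the style directions.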
Before showing numerical examples, we first discuss the estimation of the expected conditional variance in §5.3 and return to the simple examples of §2 in §5.4. Adversarial risk consistency in a classification setting for a partially linear version of (4) is shown in §6.1. Furthermore, we discuss the population limit of the penalized version in §6.2, where we show that the regularization parameter is proportional to the size of the future style interventions (or rather proportional to the magnitude of the noise on the style variables) that we want to guard against for future test data.
5.3 Estimating expected conditional variance as a graph Laplacian
Recall that $S_j$ contains the indices of samples with identical realizations of $\mathrm{ID}$. For each $j$, define the average prediction across all $i \in S_j$ as $\bar f_{\theta, j} = n_j^{-1} \sum_{i \in S_j} f_\theta(x_i)$. As estimator of the conditional variance we use

  $\hat C(\theta) = \frac{1}{c} \sum_{j : n_j \ge 2} \; \sum_{i \in S_j} \big( f_\theta(x_i) - \bar f_{\theta, j} \big)^2$,

where the right-hand side can also be interpreted as the graph Laplacian (Belkin et al., 2006) of an appropriately weighted graph that fully connects all observations $i \in S_j$ for each $j$. If there are no groups of samples that share the same identifier, the graph Laplacian is zero and we also define $\hat C(\theta)$ to vanish in this case. The CoRe estimator is then identical to pooled estimation in this special case.
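The estimator can be written directly as a sum of squared deviations of predictions from their group means; singleton groups drop out, so the penalty vanishes without grouped observations. In this sketch we normalize by the number of grouped samples, which is one plausible choice:

```python
import numpy as np

def conditional_variance(preds, ids):
    """Within-group variance of predictions (graph-Laplacian-style penalty).

    Groups are samples sharing the same id. Singleton groups contribute
    zero, so without grouped data the penalty vanishes and CoRe reduces
    to pooled estimation.
    """
    total, n_grouped = 0.0, 0
    for id_ in np.unique(ids):
        group = preds[ids == id_]
        if len(group) >= 2:
            total += np.sum((group - group.mean()) ** 2)
            n_grouped += len(group)
    return total / n_grouped if n_grouped else 0.0
```

In a deep-learning setting, `preds` would be the network outputs for a mini-batch and the same computation can be expressed with segment/grouped means in TensorFlow.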
As an alternative to penalizing with the expected conditional variance of the predicted response, we can penalize with the expected conditional variance of the loss,

  $C^{\ell}(\theta) = E\big[ \mathrm{Var}\big( \ell(Y, f_\theta(X)) \,\big|\, Y, \mathrm{ID} \big) \big]$,

and get an empirical estimate as

  $\hat C^{\ell}(\theta) = \frac{1}{c} \sum_{j : n_j \ge 2} \; \sum_{i \in S_j} \big( \ell(y_i, f_\theta(x_i)) - \bar \ell_{\theta, j} \big)^2$,   (11)

where $\bar \ell_{\theta, j}$ is the average loss within group $S_j$. The penalty then takes a similar form to that of Namkoong and Duchi (2017). A crucial difference of our approach is that we penalize with the expected conditional variance. That we take a conditional variance is important here, as we try to achieve distributional robustness with respect to interventions on the style variables. Conditioning on $(Y, \mathrm{ID})$ allows us to guard specifically against these interventions. An unconditional variance penalty, in contrast, can achieve robustness against a predefined class of distributions such as a ball of distributions defined in a Kullback-Leibler or Wasserstein metric; see for example Sinha et al. (2017) for an application in the context of adversarial examples. Some further discussion can be found in §6.2. If not mentioned otherwise, we use the conditional variance of the predictions as in (9) as a conditional variance penalty.
5.4 Classification example
We revisit the first and the second example from §2. Figure 3 shows the respective training and test sets with the estimated decision boundaries for different values of the penalty parameter $\lambda$. Additionally, grouped examples that share the same $\mathrm{ID}$ are visualized: two grouped observations are connected by a line or curve, respectively. In each example, there are eight such groups (only clearly visible in the nonlinear example). Panel (a) shows the linear decision boundaries for $\lambda = 0$, equivalent to the pooled estimator, and for CoRe with $\lambda > 0$. The pooled estimator misclassifies all test points of class 1, as can be seen in panel (b). In contrast, the decision boundary of the CoRe estimator aligns with the direction along which the grouped observations vary, classifying the test set with almost perfect accuracy. Panels (c) and (d) show the corresponding plots for the second example for a range of penalty values $\lambda$. While all of them yield good performance on the training set, only a sufficiently large value of $\lambda$, which is associated with a circular decision boundary, achieves almost perfect accuracy on the test set.
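The geometry of the linear example can be checked numerically: with pairs that share an $\mathrm{ID}$ and differ only along a style direction, a linear score using only the orthogonal (core) coordinate incurs zero conditional variance penalty, while one along the style direction does not. The construction below is a toy of our own making:

```python
import numpy as np

# Pairs sharing an ID differ only along the style direction (1, 0);
# the core direction (0, 1) separates the two classes.
X = np.array([[0.0, 1.0], [2.0, 1.0],     # group 1 (class 1)
              [0.0, -1.0], [2.0, -1.0]])  # group 2 (class 0)
ids = np.array([1, 1, 2, 2])

def penalty(theta):
    """Sum of squared within-group deviations of the linear score X @ theta."""
    preds = X @ theta
    out = 0.0
    for id_ in np.unique(ids):
        g = preds[ids == id_]
        out += np.sum((g - g.mean()) ** 2)
    return out

core_theta = np.array([0.0, 1.0])   # uses only the core coordinate
style_theta = np.array([1.0, 0.0])  # uses only the style coordinate
```

The penalty therefore pushes the decision boundary to depend on the core coordinate, which is exactly the behavior of the CoRe boundary in Figure 3.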
6 Adversarial risk consistency and distributional robustness
We show two properties of the CoRe estimator. First, adversarial risk consistency is shown for logistic models. Second, we show that the population CoRe estimator protects optimally against an increase in the variance of the noise in the style variable in a regression setting.
6.1 Adversarial risk consistency for classification and logistic loss
We analyze the adversarial loss, defined in Eq. (6), for the pooled and the CoRe estimator in a one-layer network for binary classification (logistic regression). The proof is given in §A.

Assume the structural equation for the image $X \in \mathbb{R}^p$ is linear in the style features $X^{\text{style}} \in \mathbb{R}^q$ (with generally $q \ll p$) and we use logistic regression to predict the class label $Y$. Let the interventions $\Delta$ act additively on the style features (this is only for notational convenience) and let the style features act in a linear way on the image via a matrix $W \in \mathbb{R}^{p \times q}$ (this is an important assumption without which results are more involved). The core or ‘conditionally invariant’ features are $X^{\text{core}} \in \mathbb{R}^r$, where in general $r \ne q$, but this is not important for the following. For independent noise variables $\varepsilon_Y, \varepsilon_{\mathrm{ID}}, \varepsilon_{\text{core}}, \varepsilon_{\text{style}}$ with positive density on their support and continuously differentiable functions $k_{\mathrm{id}}, k_{\text{core}}$ and $k_{\text{style}}$,
  class: $Y \gets \varepsilon_Y$
  identifier: $\mathrm{ID} \gets k_{\mathrm{id}}(Y, \varepsilon_{\mathrm{ID}})$
  core or conditionally invariant features: $X^{\text{core}} \gets k_{\text{core}}(Y, \mathrm{ID}, \varepsilon_{\text{core}})$
  style or orthogonal features: $X^{\text{style}} \gets k_{\text{style}}(Y, \mathrm{ID}, \varepsilon_{\text{style}}) + \Delta$
  image: $X \gets k_x(X^{\text{core}}) + W X^{\text{style}}$   (12)
Of these, $Y$, $\mathrm{ID}$ and $X$ are observed, whereas $X^{\text{core}}$, $X^{\text{style}}$ and the noise variables are latent. The distribution of the interventions $\Delta$ can depend on the unobserved domain.
We assume a logistic regression model for predicting the class label from the image data:
Given training data with n samples, we estimate the regression parameters and use here a logistic loss for training and testing. We want to compare the following losses on test data:
where the first loss is evaluated on the images in the absence of interventions on the style variables and is thus a standard logistic loss without adversarial interventions. The second loss is the loss under adversarial style or domain interventions, as we allow arbitrarily large interventions on the style variables here. The corresponding benchmarks are
The formulation of Theorem 1 relies on the following assumptions.
Assumption 1
We require the following conditions:

Assume the style intervention in the training data is sampled from a distribution with positive density (with respect to the Lebesgue measure) in a ball (in some norm) around the origin.

Assume the matrix that maps the style features to the image has full rank.

For a fixed number of samples, the identifiers are drawn iid from a distribution such that, with probability tending to one as the sample size grows, the number of unique realizations of the identifier is smaller than the sample size minus the dimension of the style variables.
The last assumption guarantees that the number of grouped examples is at least as large as the dimension of the style variables. If we have too few or no grouped examples, we cannot estimate the conditional variance accurately. Under these assumptions we can prove adversarial risk consistency.

Theorem 1 (Adversarial risk consistency). Under model (12) and Assumption 1, with probability 1 with respect to the training data, the pooled estimator (3) has infinite adversarial loss, whereas the CoRe estimator (8) is adversarial loss consistent in the sense that its adversarial loss converges to the optimal adversarial loss as the sample size grows.
A proof is given in §A. The respective ridge penalties in both estimators (3) and (10) are assumed to be vanishing for the proof, but the proof can easily be generalized to include ridge penalties that vanish sufficiently fast for large sample sizes. The Lagrangian regularizer is assumed to be infinite for the CoRe estimator. Again, this could be generalized to finite values if the adversarial interventions are constrained to be in a region with finite norm. An equivalent result can be derived for misclassification loss instead of logistic loss, where the adversarial misclassification error of the pooled estimator is then 1 while the adversarial misclassification error of the CoRe estimator will converge to the optimal adversarial value.
6.2 Population limit: optimal robustness against increases of the style-noise variance
We look at a partially linear version of the causal graph and least squares loss as a special case, using the marginalized version of the causal graph as in panel (c) of Figure 2. Let the target variable be continuous, the identity variable integer-valued, and consider the style or orthogonal features and the observed predictor vector. Let the noise variables be independent mean-zero random vectors with positive density on their respective supports, with finite variance and nonsingular covariance (an additional independent noise term for the target could be added but is omitted here to retain notational simplicity). We look at the population limit of the CoRe estimator in its penalized form (10):
θ* = argmin_θ E[(Y − f_θ(X))²] + λ · E[Var(f_θ(X) | ID, Y)]  (13)

where the second term is the expected conditional variance of the prediction and λ ≥ 0. We analyze the case where interventions are random and follow the same distribution as the noise in the style variable, just with a different scaling that can depend on the domain. Specifically, as a special case of the marginalized version of the causal graph in panel (c) of Figure 2, consider a partially linear version of (4) with a constant marginal distribution of the target in all domains
[Display (14): a partially linear structural model in which the style features depend on the target and identifier plus scaled noise, and enter the observed vector linearly through a matrix.] (14)

for suitable functions and a matrix mapping the style features into the observed vector. As mentioned above, the interventions are modeled as random interventions: the intervention noise has the same distribution as the style noise, but the two random variables are independent. The scaling of the style noise is variable: in a standard setting it might equal one for the training data, but we suppose it can increase in the future. In a new domain, for example, it might be larger. We would like to have a prediction that works well even if the scaling of the ‘style noise’ increases substantially. Let the expectation be taken under model (14) with the respective noise-scaling parameter.
Theorem 6.2 (Distributional robustness). Under model (14), the population CoRe estimator (13) is optimal against the class of distributions generated by varying the style noise level up to a bound determined by the penalty λ.
A proof is given in §B.
The CoRe estimator hence optimizes the worst case among all noise scalings of the style variable. The value of the penalty λ determines the level up to which we are protected when the noise variance increases. More precisely, the penalty mimics an increase in the variance of the noise in the style variable and allows using the current training data to optimize the loss under larger values of the style variance. In this sense, there is a clear interpretation of the penalty factor in the CoRe estimator (10). Choosing λ = 0 means that we expect the variance of the style variable to remain unchanged, whereas using a strong penalty assumes that the variance of the style variable will grow very large in the future; the performance of the estimator will then not be affected even under arbitrarily strong interventions on the style variable.
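The worst-case interpretation above can be illustrated numerically. The following is a toy sketch, not the paper's simulation: the two-dimensional linear model, all constants, and the function names are illustrative assumptions. Grouped pairs share the core feature and target and differ only in the style-noise draw, so the conditional-variance penalty reduces to the within-pair variance of the linear prediction, which keeps the objective quadratic with a closed-form minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(m, tau, rng):
    """Grouped pairs from a toy, two-dimensional linear model.

    The core feature C drives the target y; the style feature
    S = y + tau * noise is predictive in training, but its noise
    scale tau may grow at test time.  The two observations of a
    pair share (C, y) and differ only in the style-noise draw.
    """
    c = rng.normal(size=m)
    y = c + 0.1 * rng.normal(size=m)

    def observe():
        s = y + tau * rng.normal(size=m)
        return np.column_stack([c, s])

    return observe(), observe(), y

def fit(X1, X2, y, lam):
    # Squared loss plus lam times the within-pair variance of the
    # linear prediction.  For a pair, the sample variance of the two
    # predictions is ((x1 - x2) @ theta)**2 / 2, so the objective is
    # quadratic in theta and has a closed-form minimizer.
    X = np.vstack([X1, X2])
    yy = np.concatenate([y, y])
    D = X1 - X2
    A = X.T @ X / len(X) + lam * D.T @ D / (2 * len(D))
    return np.linalg.solve(A, X.T @ yy / len(X))

X1, X2, y = sample_pairs(2000, tau=0.1, rng=rng)
pooled = fit(X1, X2, y, lam=0.0)   # lam = 0: the pooled estimator
core = fit(X1, X2, y, lam=10.0)    # conditional-variance penalty on

# Test data: style-noise scale increased tenfold.
Xt, _, yt = sample_pairs(2000, tau=1.0, rng=rng)
mse = lambda theta: float(np.mean((Xt @ theta - yt) ** 2))
```

In this sketch the pooled estimator places substantial weight on the style coordinate and degrades once the style-noise scale grows, while the penalized estimator shrinks that coefficient; increasing `lam` shrinks it further, mirroring the interpretation of the penalty as the anticipated growth of the style-noise variance.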
A similar result is shown in Rothenhäusler et al. (2018), who propose “anchor regression”. Anchor regression penalizes the ordinary least squares objective with a quantity that relies on so-called “anchors”, which are exogenous variables. This quantity corresponds to the change in loss under shift interventions of a given strength, and anchor regression is shown to yield optimal predictive performance under such interventions. While the theoretical results have a similar form, as both estimators are shown to be distributionally robust, they do not follow as corollaries from each other. Both CoRe and anchor regression rely on the presence of “auxiliary variables”: CoRe exploits the grouping information given by the identifier while anchor regression relies on the exogenous anchor variable. However, the two play almost orthogonal roles. In anchor regression, the aim is to achieve predictive stability if the variance term explained by the anchors is varying. In CoRe, the aim is to retain the variance term explained by the grouping variable, as we expect the variance under a constant group to grow in the future. The interventions considered in anchor regression are shift interventions, and it protects against specific distributional shifts up to a given strength; in Theorem 6.2, we consider noise interventions on the latent style variable.

While Theorem 6.2 was derived for regression under squared error loss, a similar result can be obtained for classification under (truncated) squared error loss. The (truncated) quadratic loss is classification-calibrated (Bartlett et al., 2003) and the truncation is even unnecessary in our case. For binary classification, for example, consider the predicted probability of the positive class given the image. Taking a first-order Taylor approximation of this predicted probability, one can derive an analogous result to Theorem 6.2, where the approximation error of the Taylor expansion hinges on the magnitude of the future interventions and hence on the penalty level of the CoRe estimator. For loss functions other than truncated squared error loss one could make a similar argument, but one would have to use the conditional variance of the loss as a penalty, as in (11). This approach would then be similar to Namkoong and Duchi (2017), with the important difference that we work with conditional variances instead of unconditional variances. Conditioning on the grouping variable is crucial in our context, as we do not want to protect against general shifts in distribution but specifically against shifts in the distribution of the style variable. Conditioning on it allows us to distinguish between the conditional variance caused by the unknown style variable (which we assume will change in the future) and the conditional variance caused by the remaining randomness of the image given the group (which we expect to stay constant in the future). Exploring the possibility of using the conditional variance of the loss instead of the prediction for general loss functions would be interesting follow-up work.

7 Experiments
We perform an array of different experiments, demonstrating the applicability and advantages of the conditional variance penalty in two broad settings:

Settings where we do not know what the style variables correspond to but still want to protect against a change in their distribution in the future. In the examples, the style variable ranges from brightness (§7.7) and image quality (§7.2) to movement (§7.3) and fashion (§7.4), all of which are unknown to the method. We also include genuinely unknown style variables in §7.1 (in the sense that they are unknown not only to the methods but also to us, as we did not explicitly create the style interventions).

Settings where we do know what type of style interventions we would like to protect against. This is usually dealt with by data augmentation: adding images which are, say, rotated or shifted compared to the training data if we want to protect against rotations or translations in the test data (Schölkopf et al., 1996). The conditional variance penalty here exploits the fact that some augmented samples were generated from the same original sample, and we use the index of the original image as the grouping variable. We show that this approach generalizes better than simply pooling the augmented data, in the sense that we need fewer augmented samples to achieve the same test error. This setting is shown in §7.5.
Details of the network architectures can be found in Appendix §C. All reported error rates are averaged over five runs of the respective method. A TensorFlow (Abadi et al., 2015) implementation of CoRe can be found at https://github.com/christinaheinze/core.
7.1 Eyeglasses detection with small sample size
We use a subsample of the CelebA dataset (Liu et al., 2015), without editing the images. We try to classify images according to whether the person in the image is wearing glasses or not. For construction of the grouping variable we exploit the fact that several photos of the same person are available and set it to be the identifier of the person in the dataset. Figure 4 shows examples from both the training and the test data set. The conditional variance is estimated across groups of observations that share a common identifier, which here corresponds to pictures of the same person, where all pictures show the person either with glasses or all without glasses.
The standard approach would be to pool all examples. The only additional information we exploit is that some observations can be grouped. We include a small number of identities in the training set; as there are approximately 30 images of each person, this results in a modest total sample size. Using a 5-layer convolutional neural network (details can be found in Table C.1) and pooling all data with a standard ridge penalty, the test error on unseen images is 24.76%. Using ImageNet-pretrained features from Inception V3 does not yield lower error rates. Exploiting the group structure with the CoRe penalty (in addition to a ridge penalty) reduces the test error to 16.89%. Results are not very sensitive to the specific choice of the penalty, as discussed further in §D.6.

The surprising aspect here is that both training and test data are drawn from the same distribution, so we would not expect a distributional shift. The distributional shift in this example is caused by statistical fluctuations alone (by chance, the background of eyeglass wearers might, for example, be darker in the training samples than in the test samples, the eyeglass wearers might be photographed outdoors more often, there might be more women than men among them, etc.). The following examples are more concerned with biases that will persist even if the number of training and test samples is very large. A second difference from the subsequent examples is the grouping structure: in this example, we consider only a few identities, each with a relatively large number of associated observations. In the following examples, the number of identities is much larger while the number of observations per identity is typically smaller than five.
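The grouping used here can be made concrete with a small sketch of the conditional-variance term (a minimal stand-in, not the paper's TensorFlow implementation; the function and argument names are our own): predictions of observations sharing the same (label, identity) pair form a group, and the penalty averages the within-group sample variance, so singleton groups, i.e. ungrouped observations, contribute nothing.

```python
from collections import defaultdict
from statistics import variance

def conditional_variance_penalty(preds, group_ids):
    """Average within-group sample variance of the predictions.

    A group collects all observations sharing the same (label, ID)
    pair, encoded here as a single hashable group id.  Groups with a
    single member are skipped, mirroring the fact that ungrouped
    observations carry no conditional-variance information.
    """
    groups = defaultdict(list)
    for p, g in zip(preds, group_ids):
        groups[g].append(p)
    within = [variance(v) for v in groups.values() if len(v) >= 2]
    return sum(within) / len(within) if within else 0.0
```

During training this quantity, computed on the network outputs of a mini-batch, would be added to the classification loss with a multiplier λ; predictions that agree within each group drive the penalty to zero.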
7.2 Eyeglasses detection with known and unknown image quality intervention
We revisit the third example from §2. We again use the CelebA dataset and consider the problem of classifying whether the person in the image is wearing eyeglasses. In contrast to §7.1, we modify the images in the following way: in the training set and in test set 1, we sample the image quality from a Gaussian distribution for all samples that show glasses (we use ImageMagick, https://www.imagemagick.org, to change the quality of the compression through convert -quality q_ij input.jpg output.jpg); samples without glasses are unmodified. In other words, if the image shows a person wearing glasses, the image quality tends to be lower. In test set 2, the quality is reduced in the same way for the samples without glasses, while images showing glasses are not changed. Figure 5 shows examples from the training set and test sets 1 and 2. This setting mimics the confounding that occurred in the Russian tank legend (cf. §1). For the CoRe penalty, we calculate the conditional variance across images that share the same identifier and show glasses, that is, across images showing the same person wearing glasses in all images. Observations without glasses are not grouped. Two examples are shown in the red box of Figure 5. Here, only a subset of the total sample consists of grouped observations.

Figure 5 shows misclassification rates for CoRe and the pooled estimator on test sets 1 and 2. The pooled estimator (penalized only with a ridge penalty) achieves a low error rate of 2% on test set 1, but suffers from a 65% misclassification error on test set 2, as now the relation between the class label and the implicit style variable (image quality) has been flipped. The CoRe estimator has a larger error of 13% on test set 1, as image quality as a feature is implicitly penalized by CoRe and the signal is less strong once image quality has been removed as a dimension. However, on test set 2 the error of the CoRe estimator is 28%, improving substantially on the 65% error of the pooled estimator. The reason is again the same: the CoRe penalty ensures that image quality is not used as a feature to the same extent as for the pooled estimator. This increases the test error slightly if the samples are generated from the same distribution as the training data (as for test set 1) but substantially improves the test error if the distribution of image quality, conditional on the class label, is changed on test data (as for test set 2).
Eyeglasses detection with known image quality intervention
To compare to the above results, we repeat the experiment with a different construction of the grouped observations. Above, we grouped images of the same person when that person was wearing glasses; we refer to this scheme of grouping observations with the same identifier as ‘Grouping setting 2’. Here, we use an explicit augmentation scheme and augment images showing glasses in the following way: each image is paired with a copy of itself whose image quality is adjusted as described above. In other words, the only difference between the two images is that the image quality differs slightly, depending on the value drawn from the Gaussian distribution that determines the strength of the image quality intervention. Both the original and the copy get the same value of the identifier variable. We call this grouping scheme ‘Grouping setting 1’. Compare the left panels of Figures 5 and 6 for examples.
While we used explicit changes in image quality both above and here, we referred to grouping setting 2 as ‘unknown image quality interventions’ since the training sample, as in the left panel of Figure 5, does not immediately reveal that image quality is the important style variable. In contrast, the augmented data samples we use here (grouping setting 1) differ only in their image quality for a constant identifier.
Figure 6 shows examples and results. The pooled estimator performs more or less identically to the previous dataset: the explicit augmentation did not help, as the association between image quality and whether eyeglasses are worn is not changed in the pooled data after including the augmented samples. The misclassification error of the CoRe estimator is substantially better than the error rate of the pooled estimator. Its error rate of 13% on test set 2 also improves on the 28% achieved by the CoRe estimator in grouping setting 2. Grouping setting 1 works best since we could explicitly control that only image quality varies between grouped examples. In grouping setting 2, different images of the same person can vary in many factors, making it more challenging to isolate image quality as the factor to be invariant against.
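Grouping setting 1 can be sketched as follows. This is a hypothetical helper of our own; the quality parameters q_mean and q_sd are illustrative placeholders for the Gaussian described above. Every glasses image is paired with a quality-degraded copy of itself, and both records receive the same identifier so that only image quality varies within a group.

```python
import random

def make_quality_pairs(labels, q_mean=30.0, q_sd=10.0, seed=0):
    """Build (source index, identifier, jpeg quality) records.

    Each original image keeps full quality (encoded here as None).
    Images with label 1 (glasses) additionally get a degraded copy
    whose JPEG quality is drawn from a Gaussian, clipped to the
    valid range 1..100; original and copy share the identifier so
    the conditional variance penalty can group them.
    """
    rng = random.Random(seed)
    records = []
    for i, y in enumerate(labels):
        records.append({"source": i, "id": i, "quality": None})
        if y == 1:
            q = max(1, min(100, round(rng.gauss(q_mean, q_sd))))
            records.append({"source": i, "id": i, "quality": q})
    return records
```

Each record with a non-None quality would then be materialized, e.g. with ImageMagick's convert -quality command mentioned in §7.2, before training.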
7.3 Stickmen imagebased age classification with unknown movement interventions
In this example we consider synthetically generated stickmen images; see Figure 7 for some examples. The target of interest is age, and the core feature is the height of each person. Age is causal for height, and height cannot easily be intervened on or change between domains; height is thus a robust predictor for differentiating between children and adults. As style feature we have the movement of a person (the distribution of angles between body, arms and legs). For the training data we created a dependence between age and the style feature ‘movement’, which can be thought to arise through a hidden common cause, namely the place of observation; the data generating process is illustrated in Figure D.6. For instance, the images of children might mostly show children playing, while the images of adults typically show them in more “static” postures. The left panel of Figure 7 shows examples from the training set, where large movements are associated with children and small movements with adults. Test set 1 follows the same distribution, as shown in the middle panel. A standard CNN will exploit this relationship between movement and the label of interest, whereas this is discouraged by the conditional variance penalty of CoRe, which pairs images of the same person in slightly different movements, as shown by the red boxes in the leftmost panel of Figure 7. If the learned model exploits the dependence between movement and age for prediction, it will fail when presented with images of, say, dancing adults. The right panel of Figure 7 shows such examples (test set 2). The standard CNN suffers in this case from a 41% misclassification rate, as opposed to 3% on test set 1 data. With a small number of paired observations, the network with an added CoRe penalty, in contrast, also achieves 4% on test set 1 data and succeeds with a 9% error on test set 2, whereas the pooled estimator fails on this dataset with a test error of 41%.
These results suggest that the learned representation of the pooled estimator uses movement as a predictor for age, while CoRe does not use this feature due to the conditional variance regularization. Importantly, including more grouped examples would not improve the performance of the pooled estimator, as these would be subject to the same bias and hence also predominantly contain examples of heavily moving children and “static” adults (also see Figure D.7).
7.4 Gender classification with unknown confounding
We work again with the CelebA dataset. This time we consider the problem of classifying whether the person in the image is male or female. We introduce confounding in the training set and test set 1 by including mostly images of men wearing glasses and women not wearing glasses. In test set 2 the association between gender and glasses is flipped: women always wear glasses while men never wear glasses. Examples from the training and test sets 1 and 2 are shown in Figure 8.
To compute the conditional variance penalty, we again use images of the same person. The grouping variable is, in other words, the identity of the person, and gender is constant across all examples sharing the same identity; conditioning on identity and class label is hence identical to conditioning on identity alone. Another difference to the other experiments is that we consider a binary style feature here.
For this example, we computed the relevant results both with a 5-layer CNN trained end-to-end and with Inception V3 pretrained features, retraining only the last softmax layer. Interestingly, the results do not change much, and both models lead to misclassification error rates above 40% on test set 2 data. Adding the CoRe penalty has the desired effect in both models, as the performance is much more stable across all data sets. Additional results for different sample sizes and different numbers of paired examples can be found in Appendix §D.2.

7.5 MNIST: more sample-efficient data augmentation
The goal of using CoRe in this example is to make data augmentation more efficient in terms of the required samples. In data augmentation, one creates additional samples by modifying the original inputs, e.g. by rotating, translating, or flipping the images (Schölkopf et al., 1996); in other words, additional samples are generated by interventions on style features. Using such an augmented data set for training makes the estimator invariant with respect to the transformations (style features) of interest. For CoRe we can use the grouping information that the original and the augmented samples belong to the same object, which enforces the invariance with respect to the style features more strongly than standard data augmentation, which just pools all samples. We assess this for the style feature ‘rotation’ on MNIST (LeCun et al., 1998) and only include augmented training examples for a subset of the original samples. The degree of the rotations is sampled uniformly at random. Figure 9 shows examples from the training set. By using CoRe, the average test error on rotated examples is reduced from 22% to 10%. Very few augmented samples are thus sufficient to achieve much stronger rotational invariance. The standard approach of creating augmented data and pooling all images requires, in contrast, many more samples to achieve the same effect. Additional results for sample sizes ranging from 100 to 5000 can be found in Figure D.5 in Appendix §D.3.
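The augmentation-with-grouping scheme can be sketched as follows. This is an illustrative sketch, not the paper's preprocessing: the nearest-neighbour rotation and the angle range are our own simplifying assumptions. Each selected original receives one rotated copy, and copy and original share the original's index as grouping identifier.

```python
import numpy as np

def rotate_nn(img, deg):
    # Nearest-neighbour rotation of a 2-d array about its centre;
    # pixels mapped from outside the frame stay zero.
    t = np.deg2rad(deg)
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse map: for each output pixel, rotate backwards to find
    # the source coordinate and round to the nearest pixel.
    sy = np.rint(cy + (ys - cy) * np.cos(t) - (xs - cx) * np.sin(t)).astype(int)
    sx = np.rint(cx + (ys - cy) * np.sin(t) + (xs - cx) * np.cos(t)).astype(int)
    out = np.zeros_like(img)
    ok = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out[ys[ok], xs[ok]] = img[sy[ok], sx[ok]]
    return out

def augment_rotations(images, augment_idx, deg_lo=35.0, deg_hi=70.0, seed=0):
    # Add one rotated copy for each selected original; the copy
    # keeps the original's index as its identifier so that the
    # pair can be grouped for the conditional variance penalty.
    rng = np.random.default_rng(seed)
    out, ids = list(images), list(range(len(images)))
    for i in augment_idx:
        deg = rng.uniform(deg_lo, deg_hi) * rng.choice([-1, 1])
        out.append(rotate_nn(images[i], deg))
        ids.append(i)
    return out, ids
```

Pooling would train on `out` alone; CoRe additionally feeds `ids` to the conditional-variance term so that the original and its rotated copy are forced to receive similar predictions.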
7.6 Elmer the Elephant
In this example, we want to assess whether invariance with respect to the style feature ‘color’ can be achieved. In the children’s book ‘Elmer the Elephant’ (https://en.wikipedia.org/wiki/Elmer_the_Patchwork_Elephant), one instance of a colored elephant suffices to recognize it as an elephant, making the color ‘gray’ no longer an integral part of the concept ‘elephant’. Motivated by this process of concept formation, we would like to assess whether CoRe can exclude ‘color’ from its learned representation by penalizing the conditional variance appropriately.
We work with the ‘Animals with attributes 2’ (AwA2) dataset (Xian et al., 2017) and consider classifying images of horses and elephants. We include additional grouped examples by adding grayscale versions of some elephant images. These additional examples do not differ strongly from the original training data, as the elephant images are already close to grayscale. The total training sample size is 1850.
Figure 10 shows examples from the training set as well as misclassification rates for CoRe and the pooled estimator on different test sets. Examples from these and more test sets can be found in Figure D.10. Test set 1 contains original, colored images only. In test set 2, images of horses are in grayscale and the color space of elephant images is modified, effectively changing the color gray to red-brown. We observe that the pooled estimator does not perform well on test set 2, as its learned representation seems to exploit the fact that ‘gray’ is predictive for ‘elephant’ in the training set; this association is no longer valid for test set 2. In contrast, the predictive performance of CoRe is hardly affected by the changing color distributions. More details can be found in Appendix §D.6.
It is noteworthy that a colored elephant can be recognized as an elephant after adding a few grayscale examples to the otherwise lightly colored pictures of natural elephants. If we just pool over these examples, there is still a strong bias that elephants are gray. The CoRe estimator, in contrast, demands invariance of the prediction for instances of the same elephant, and we can thus learn color invariance with a few added grayscale images.
7.7 Eyeglasses detection: unknown brightness intervention
As in §7.2 we work with the CelebA dataset and try to classify whether the person in the image is wearing eyeglasses. Here we analyze a confounded setting that could arise as follows. Say the hidden common cause of the class label and the style feature is a binary variable indicating whether the image was taken outdoors or indoors. If it was taken outdoors, the person tends to wear (sun)glasses more often and the image tends to be brighter; if it was taken indoors, the person tends not to wear (sun)glasses and the image tends to be darker. In other words, the style variable is here equivalent to brightness, and the structure of the data generating process is equivalent to the one shown in Figure D.6. Figure 11 shows examples from the training set and test sets. As previously, we compute the conditional variance over images of the same person sharing the same class label (the CoRe estimator hence does not use the knowledge that brightness is important). Two alternatives for constructing grouped observations in this setting are discussed in §D.1. For the brightness intervention, we sample the magnitude of the brightness increase or decrease from an exponential distribution. In the training set and test set 1, images showing glasses tend to be made brighter and images without glasses darker (specifically, we use ImageMagick, https://www.imagemagick.org, and modify the brightness of each image by applying convert -modulate b_ij,100,100 input.jpg output.jpg); for test set 2, the relation between brightness and glasses is flipped.

Figure 11 shows misclassification rates for CoRe and the pooled estimator on the different test sets. Examples from all test sets can be found in Figure D.2. First, we notice that the pooled estimator performs better than CoRe on test set 1. This can be explained by the fact that it can exploit the predictive information contained in the brightness of an image, while CoRe is restricted not to do so. Second, we observe that the pooled estimator does not perform well on test set 2, as its learned representation seems to use the image’s brightness as a predictor for the response, which fails when the brightness distribution in the test set differs significantly from the training set. In contrast, the predictive performance of CoRe is hardly affected by the changing brightness distributions. Results for additional settings can be found in Figure D.3 in Appendix §D.1.
8 Conclusion
Distinguishing the latent features in an image into core and style features, we have proposed conditional variance regularization (CoRe) to achieve robustness with respect to arbitrarily large interventions on the style or ‘orthogonal’ features. The main idea of the CoRe estimator is to exploit the fact that we often have instances of the same object in the training data. By demanding invariance of the classifier amongst a group of instances that relate to the same object, we can achieve invariance of the classification performance with respect to adversarial interventions on style features such as image quality, fashion type, color, or body posture. The training also works despite sampling biases in the data.
There are two main application areas:

If the style features are known explicitly, we can achieve the same classification performance as standard data augmentation approaches with substantially fewer augmented samples, as shown for example in §7.5. Additionally, the augmented images do not need to be balanced carefully for the CoRe estimator, as shown for example in §7.6, where adding grayscale images to a set of grayish elephants leads to invariance to color with the CoRe approach while a pooled estimator is still using color to predict the animal class with the same dataset.

Perhaps more interesting are settings in which it is unknown what the style features are, with examples in §7.1, §7.2, §7.3, §7.4, and §7.7. CoRe regularization forces predictions to be based on features that do not vary strongly between instances of the same object. We showed in the examples and in Theorems 1 and 6.2 that this regularization achieves distributional robustness with respect to changes in the distribution of the (unknown) style variables.
An interesting line of work would be to use larger models such as Inception or large ResNet architectures (Szegedy et al., 2015; He et al., 2016). These models have been trained to be invariant to an array of explicitly defined style features. In §7.4 we include results which show that using Inception V3 features does not guard against interventions on more implicit style features. We would thus like to assess what benefits CoRe can bring for training Inceptionstyle models endtoend, both in terms of sample efficiency and in terms of generalization performance. While we showed some examples where the necessary grouping information is available, an interesting possible future direction would be to use video data since objects display temporal constancy and the temporal information can hence be used for grouping and conditional variance regularization.
Acknowledgments
We thank Brian McWilliams, Jonas Peters, and Martin Arjovsky for helpful comments and discussions and CSCS for provision of computational resources. A preliminary version of this work was presented at the NIPS 2017 Interpretable ML Symposium and we thank participants of the symposium for very helpful discussions.
References
 Abadi et al. (2015) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Aldrich (1989) J. Aldrich. Autonomy. Oxford Economic Papers, 41:15–34, 1989.

 Bagnell (2005) J. Bagnell. Robust supervised learning. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 714. AAAI Press, 2005.
 Bahadori et al. (2017) M. T. Bahadori, K. Chalupka, E. Choi, R. Chen, W. F. Stewart, and J. Sun. Causal regularization. ArXiv e-prints, 2017. URL http://arxiv.org/abs/1702.02604.
 Barocas and Selbst (2016) S. Barocas and A. D. Selbst. Big Data’s Disparate Impact. 104 California Law Review 671, 2016.
 Bartlett et al. (2003) P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical report, Department of Statistics, U.C. Berkeley, 2003.
 Belkin et al. (2006) M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399–2434, 2006.
 Ben-David et al. (2007) S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19. 2007.
 Ben-Tal et al. (2013) A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
 Besserve et al. (2017) M. Besserve, N. Shajarisales, B. Schölkopf, and D. Janzing. Group invariance principles for causal generative models. ArXiv e-prints, 2017. URL http://arxiv.org/abs/1705.02212.
 Bolukbasi et al. (2016) T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems 29. 2016.
 Bouchacourt et al. (2017) D. Bouchacourt, R. Tomioka, and S. Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. ArXiv e-prints, 2017. URL http://arxiv.org/abs/1705.08841.
 Chalupka et al. (2014) K. Chalupka, P. Perona, and F. Eberhardt. Visual Causal Feature Learning. Uncertainty in Artificial Intelligence, 2014.
 Chen et al. (2016) X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems 29. 2016.
 Crawford (2016) K. Crawford. Artificial intelligence’s white guy problem. The New York Times, June 25 2016, 2016. URL https://www.nytimes.com/2016/06/26/opinion/sunday/artificial-intelligences-white-guy-problem.html.

 Csurka (2017) G. Csurka. A comprehensive survey on domain adaptation for visual applications. In Domain Adaptation in Computer Vision Applications, pages 1–35. 2017.
 Denton and Birodkar (2017) E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems 30. 2017.
 Emspak (2016) J. Emspak. How a machine learns prejudice. Scientific American, December 29 2016, 2016. URL https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/.
 Ganin et al. (2016) Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1), 2016.
 Gao et al. (2017) R. Gao, X. Chen, and A. Kleywegt. Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050, 2017.
 Gong et al. (2016) M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, 2016.
 Goodfellow et al. (2015) I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 Goudet et al. (2017) O. Goudet, D. Kalainathan, P. Caillou, D. Lopez-Paz, I. Guyon, M. Sebag, A. Tritas, and P. Tubaro. Learning Functional Causal Models with Generative Neural Networks. ArXiv e-prints, 2017. URL https://arxiv.org/abs/1709.05321.
 Haavelmo (1944) T. Haavelmo. The probability approach in econometrics. Econometrica, 12:S1–S115 (supplement), 1944.
 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing HumanLevel Performance on ImageNet Classification. ICCV, 2015.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
 Higgins et al. (2017) I. Higgins, L. Matthey, A. Pal, C. Burges, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. International Conference on Learning Representations, 2017.
 Kilbertus et al. (2017) N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. Advances in Neural Information Processing Systems, 2017.
 Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 Kocaoglu et al. (2017) M. Kocaoglu, C. Snyder, A. G. Dimakis, and S. Vishwanath. CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training. ArXiv e-prints, 2017. URL https://arxiv.org/abs/1709.02023.
 Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. 2012.
 LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 1998.
 Liu et al. (2015) Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 Lopez-Paz and Oquab (2017) D. Lopez-Paz and M. Oquab. Revisiting Classifier Two-Sample Tests. International Conference on Learning Representations (ICLR), 2017.

 Lopez-Paz et al. (2017) D. Lopez-Paz, R. Nishihara, S. Chintala, B. Schölkopf, and L. Bottou. Discovering causal signals in images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017.
 Louizos et al. (2017) C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with deep latent-variable models. Advances in Neural Information Processing Systems, 2017.
 Rojas-Carulla et al. (2017) M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters. Causal transfer in machine learning. ArXiv e-prints, 2017. URL https://arxiv.org/abs/1507.05333.
 Magliacane et al. (2017) S. Magliacane, T. van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J. M. Mooij. Causal transfer learning. ArXiv e-prints, 2017. URL https://arxiv.org/abs/1707.06422.
 Matsuo et al. (2017) T. Matsuo, H. Fukuhara, and N. Shimada. Transform invariant autoencoder. ArXiv e-prints, 2017. URL http://arxiv.org/abs/1709.03754.
 Mirza and Osindero (2014) M. Mirza and S. Osindero. Conditional Generative Adversarial Nets. ArXiv e-prints, 2014. URL https://arxiv.org/abs/1411.1784.
 Namkoong and Duchi (2017) H. Namkoong and J.C. Duchi. Variancebased regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2975–2984, 2017.
 Pearl (2009) J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, USA, 2nd edition, 2009.

 Peters et al. (2016) J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (with discussion), to appear, 2016.
 Quiñonero-Candela et al. (2009) J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.
 Richardson and Robins (2013) T. Richardson and J. M. Robins. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper 128, 30 April 2013, 2013.
 Rothenhäusler et al. (2018) D. Rothenhäusler, P. Bühlmann, N. Meinshausen, and J. Peters. Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229, 2018.
 Schölkopf et al. (2012) B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1255–1262, 2012.
 Schölkopf et al. (1996) B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In Artificial Neural Networks (ICANN 96), pages 47–52. Springer, 1996.
 Shafieezadeh-Abadeh et al. (2017) S. Shafieezadeh-Abadeh, D. Kuhn, and P. Esfahani. Regularization via mass transportation. arXiv preprint arXiv:1710.10016, 2017.
 Sinha et al. (2017) A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
 Szegedy et al. (2014) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 Szegedy et al. (2015) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
 Torralba and Efros (2011) A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011.
 Yu et al. (2017) X. Yu, T. Liu, M. Gong, K. Zhang, and D. Tao. Transfer learning with label noise. ArXiv e-prints, 2017. URL https://arxiv.org/abs/1707.09724.
 Xian et al. (2017) Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning: A comprehensive evaluation of the good, the bad and the ugly. ArXiv e-prints, 2017. URL http://arxiv.org/abs/1707.00600.
 Xu et al. (2009) H. Xu, C. Caramanis, and S. Mannor. Robust regression and lasso. In Advances in Neural Information Processing Systems, pages 1801–1808, 2009.
 Yudkowsky (2008) E. Yudkowsky. Artificial intelligence as a positive and negative factor in global risk. Global catastrophic risks, 1, 2008.
 Zhang et al. (2013) K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, 2013.
 Zhang et al. (2015) K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Appendix A Proof of Theorem 1
First part. To show the first part, namely that with probability 1,
we need to show that with probability 1. The reason this is sufficient is as follows: if , then as we can then find a such that . Setting for , we get . Hence for either or .
To show that with probability 1, let be the oracle estimator that is constrained to be orthogonal to the column space of :
(15) 
We show by contradiction. Assume therefore that . If this were indeed the case, then the constraint in (15) would be inactive and we would have . This would imply that the directional derivative of the training loss with respect to any in the column space of vanishes at the solution . In other words, define the gradient as . The implication is then that for all in the column space of ,
(16) 
and we will show the latter condition is violated.
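Since the displayed equations of this proof did not survive extraction, the generic first-order optimality fact the argument appeals to can be stated as follows (the notation here is illustrative and not necessarily the paper's own):

```latex
\nabla_\delta L_n(\hat\theta)
  \;=\; \lim_{t \to 0} \frac{L_n(\hat\theta + t\delta) - L_n(\hat\theta)}{t}
  \;=\; \big\langle \nabla L_n(\hat\theta),\, \delta \big\rangle
  \;=\; 0
  \qquad \text{for all admissible directions } \delta .
```

If the constraint is inactive at the minimizer, the directional derivative of the empirical loss vanishes in every admissible direction; restricting the directions to the column space in question yields the condition that the proof then shows to be violated.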
As we work with the logistic loss and , the loss is given by . Define . For all we have . Then
(17) 
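For reference, with labels in {-1, +1} and a linear score, the logistic loss and its gradient take the standard form below (a generic reminder, since the paper's exact display was lost in extraction):

```latex
\ell(\theta; x, y) \;=\; \log\!\left(1 + e^{-y\, x^\top \theta}\right),
\qquad
\nabla_\theta\, \ell(\theta; x, y)
  \;=\; -\,y\, x\, \sigma\!\left(-y\, x^\top \theta\right),
\qquad
\sigma(u) \;=\; \frac{1}{1 + e^{-u}} .
```

The gradient follows from the chain rule, and summing it over the training sample gives the gradient of the empirical loss used in the displayed condition.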
Let for be the training data in the absence of any interventions, that is, under . In the following we call these the (counterfactual) intervention-free training data. Since the interventions only have an effect on the column space of in , the oracle estimator is identical under the true training data and the intervention-free training data . By assumption, and (17) can hence also be written as
(18) 
Since is in the column space of , there exists such that and we can write (18) as
(19) 
From (A2) we have that the eigenvalues of
are all positive. Also is not a function of the interventions since, as above, the estimator is identical whether trained on the original data or on the intervention-free data . If we condition on everything except for the random interventions by conditioning on for , then the right-hand side of (19) can be written as
where is fixed (again conditional on the intervention-free training data) and is a random vector and with probability 1 as the interventions