1 Introduction
Current machine learning models often fail to generalize when the distribution of the test data differs from that of the training data. This phenomenon has been repeatedly observed and deliberately exposed in many examples [45, 40, 15, 33, 17]. Among the explanations, shortcut learning [14] is considered a main cause. A well-known example is the classification of images of cows and camels: a trained convolutional network tends to recognize cows or camels by learning spurious features from image backgrounds (e.g., green pastures for cows and deserts for camels) rather than the causal shape features of the animals [5]; decisions based on such spurious features make the learned models fail when cows or camels appear in unusual environments. Machine learning models are expected to be capable of out-of-distribution (OOD) generalization and to avoid shortcut learning.
To achieve OOD generalization, recent theories [3, 22, 2, 37, 1] are motivated by the causality literature [34, 36]; they resort to extracting invariant, causal features and to establishing the conditions under which machine learning models are guaranteed to generalize. Among these works, invariant risk minimization (IRM) [3] is a notable learning paradigm that puts the invariance principle [35] into practice. Despite its theoretical promise, the guarantee of IRM applies only to linear regression. For other problems such as linear classification, Ahuja et al. [1] first show that OOD generalization in linear classification is more difficult when the invariant features capture all information about the label, and propose a new learning method, information bottleneck-based invariant risk minimization (IBIRM). In this work, we closely investigate the conditions identified in [1] and propose improved results for OOD generalization of linear classification. Our technical contributions are as follows.
Contributions. In [1], support overlap of invariant features is assumed in order to make OOD generalization of linear classification succeed. In this work, we first show that this assumption is rather strong and that the goal can still be achieved without it. Then, we examine whether the IBIRM method proposed in [1] is sufficient to learn invariant features for linear classification, and find that IBIRM still fails in several cases, whether or not the invariant features capture all information about the label. We then analyze two failure modes of IBIRM, in particular the case where the spurious features in training environments capture sufficient information about the label but less than the invariant features do. Based on these analyses, we propose a new method, termed counterfactual supervision-based information bottleneck (CSIB), to address such failures. We prove that, under the proposed weaker assumptions, CSIB is theoretically guaranteed to succeed at OOD generalization in linear classification. Notably, CSIB works even when only data from a single environment are available, and has theoretically consistent results for both binary and multi-class problems. Finally, we design three synthetic datasets based on our motivating examples; experiments verify the proposed method empirically. All proofs and experimental details are given in the appendices.
2 OOD generalization for linear classification: background and failures
We consider the same learning setting as in [3, 1], but focus on the linear classification problem only. Let $D = \{D^e\}_{e \in \mathcal{E}_{tr}}$ be the training data gathered from a set of training environments $\mathcal{E}_{tr}$, where $D^e$ is the dataset from environment $e$ with each instance $(x^e_i, y^e_i)$ drawn i.i.d. from $\mathbb{P}^e$. Let $\mathcal{X}^e$ and $\mathcal{Y}^e$ be the support sets of the input feature values and output labels in environment $e$, respectively. Given the observed data $D$, the goal of OOD generalization is to find a predictor $f$ from $D$ that performs well across a large set of unseen but related environments $\mathcal{E}_{all} \supseteq \mathcal{E}_{tr}$. Formally, it is expected to minimize
(1) $\min_{f} \max_{e \in \mathcal{E}_{all}} R^e(f)$,
where $R^e(f) = \mathbb{E}_{(X^e, Y^e) \sim \mathbb{P}^e}[\ell(f(X^e), Y^e)]$ is the risk under environment $e$ with $\ell$ a well-defined loss function. Clearly, without any restrictions on $\mathcal{E}_{all}$, it is impossible to achieve OOD generalization. We first follow the structural equation model (SEM) used in [1] and review some of its results.
Assumption 1 (Linear classification SEM (FIIF)).
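In each $e \in \mathcal{E}_{all}$,
(2) $Y^e \leftarrow \mathbb{1}(w^*_{inv} \cdot Z^e_{inv}) \oplus N^e, \qquad X^e \leftarrow S(Z^e_{inv}, Z^e_{spu})$,
where $w^*_{inv} \in \mathbb{R}^m$ with $\|w^*_{inv}\| = 1$ is the labeling hyperplane, $Z^e_{inv} \in \mathbb{R}^m$, $Z^e_{spu} \in \mathbb{R}^o$, $N^e$ is binary noise with identical distribution across environments, $\oplus$ is the XOR operator, $S$ is invertible, and $\mathbb{1}(a) = 1$ if $a \geq 0$ and $0$ otherwise.
Following [1], we call the invariant features $Z^e_{inv}$ in a directed acyclic graph (DAG) fully informative invariant features (FIIF) if $Y^e \perp X^e \mid Z^e_{inv}$; otherwise we call them partially informative invariant features (PIIF). The above assumption shows how the environment data are generated from the latent invariant features $Z^e_{inv}$ and spurious features $Z^e_{spu}$; the invariant features in Assumption 1 are fully informative about $Y^e$. The DAG corresponding to this assumption is illustrated in Fig. 1.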
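The generative process in Assumption 1 can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: the function name `sample_fiif_env`, the uniform distributions, and the choice of spurious features as a linear function of the invariant features plus bounded zero-mean noise are hypothetical and are not taken from the source.

```python
import numpy as np

def sample_fiif_env(n, w_inv, A, noise_scale, S, q=0.0, seed=0):
    """Draw n samples from a toy linear-classification SEM (FIIF case).

    All distributional choices below are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    m = w_inv.shape[0]
    # Bounded invariant features (uniform on a cube is an arbitrary choice).
    z_inv = rng.uniform(-1.0, 1.0, size=(n, m))
    # Clean label: which side of the labeling hyperplane z_inv falls on.
    clean = (z_inv @ w_inv >= 0).astype(int)
    # Binary label noise with the same flip rate q in every environment, applied via XOR.
    flips = rng.binomial(1, q, size=n)
    y = clean ^ flips
    # Spurious features: linear in z_inv plus environment-specific bounded zero-mean noise.
    o = A.shape[0]
    z_spu = z_inv @ A.T + noise_scale * rng.uniform(-1.0, 1.0, size=(n, o))
    # Observed input: invertible linear mixing S of the concatenated latents.
    x = np.concatenate([z_inv, z_spu], axis=1) @ S.T
    return x, y, z_inv, z_spu

# Two environments that differ only in the scale of the spurious noise.
w = np.array([1.0, 0.0])
A = np.array([[1.0, 1.0]])
x1, y1, _, _ = sample_fiif_env(500, w, A, noise_scale=0.1, S=np.eye(3), seed=1)
x2, y2, _, _ = sample_fiif_env(500, w, A, noise_scale=2.0, S=np.eye(3), seed=2)
```

With `q=0` the label is a deterministic function of the invariant features, which is the fully informative case; the spurious block can still correlate with the label through `A`.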
Figure 2: (a) Example 1. (b) Example 2. (c) Example 3. (d) Here, we assume there are only two environments in total. The blue and black regions represent the support sets in the training and test environments, respectively. Although the support sets do not overlap, a zero-error classifier on the training environment data would clearly make the test error zero, thus enabling OOD generalization.
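To solve Eq. (1), the objectives of IRM [3] and IBIRM [1] are as follows (we consider a homogeneous linear classifier here for convenience):
(3) $\min_{w, \Phi} \frac{1}{|\mathcal{E}_{tr}|} \sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi) \quad \text{s.t.} \quad w \in \arg\min_{\tilde{w}} R^e(\tilde{w} \circ \Phi), \; \forall e \in \mathcal{E}_{tr}$,
(4) $\min_{w, \Phi} \sum_{e \in \mathcal{E}_{tr}} h^e(\Phi) \quad \text{s.t.} \quad \frac{1}{|\mathcal{E}_{tr}|} \sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi) \leq r^{th}, \quad w \in \arg\min_{\tilde{w}} R^e(\tilde{w} \circ \Phi), \; \forall e \in \mathcal{E}_{tr}$,
where $h^e(\Phi) = H(\Phi(X^e))$ with $H$ the Shannon entropy (or a lower-bounded differential entropy) and $r^{th}$ is the threshold on the average risk. If we drop the invariance constraint from IRM and IBIRM, we get standard empirical risk minimization (ERM) and information bottleneck-based empirical risk minimization (IBERM), respectively. The entropy constraint in IBIRM is inspired by the information bottleneck principle [47], where mutual information is used for information compression. Since the representation $\Phi(X^e)$ is a deterministic mapping of $X^e$, minimizing the entropy of $\Phi(X^e)$ is equivalent to minimizing the mutual information $I(X^e; \Phi(X^e))$. In brief, IBIRM picks the predictor with the least entropy among all highly predictive invariant predictors. Let us define the support set of the invariant (resp., spurious) features $Z^e_{inv}$ (resp., $Z^e_{spu}$) in environment $e$ as $\mathcal{Z}^e_{inv}$ (resp., $\mathcal{Z}^e_{spu}$).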
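The selection rule just described (least entropy among highly predictive predictors) can be illustrated on discrete data. The sketch below is a simplification under our own assumptions: it enumerates a hand-written list of candidate linear predictors instead of optimizing, it omits the invariance constraint (so it is closer in spirit to IBERM than to IBIRM), and the names `empirical_entropy` and `ib_select` are hypothetical.

```python
import numpy as np
from collections import Counter

def empirical_entropy(features):
    """Shannon entropy (in bits) of the empirical distribution of feature rows."""
    rows = np.round(np.asarray(features, dtype=float).reshape(len(features), -1), 8)
    counts = np.array(list(Counter(map(tuple, rows)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def ib_select(candidates, X, y, risk_threshold):
    """Among candidates (name, Phi, w) whose 0-1 risk is at most risk_threshold,
    return the name and entropy of the one whose representation X @ Phi.T
    has the smallest empirical entropy."""
    best_name, best_h = None, np.inf
    for name, Phi, w in candidates:
        z = X @ Phi.T                                    # learned representation
        risk = np.mean((z @ w >= 0).astype(int) != y)    # 0-1 risk of w composed with Phi
        if risk <= risk_threshold:
            h = empirical_entropy(z)
            if h < best_h:
                best_name, best_h = name, h
    return best_name, best_h

# Column 0 plays the role of an invariant feature (four values),
# column 1 a spurious feature that happens to take only two values.
X = np.array([[-2.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])
candidates = [
    ("invariant", np.array([[1.0, 0.0]]), np.array([1.0])),
    ("spurious", np.array([[0.0, 1.0]]), np.array([1.0])),
]
print(ib_select(candidates, X, y, risk_threshold=0.0))  # ('spurious', 1.0)
```

Both candidates reach zero training error, but the two-valued spurious column has 1 bit of entropy against 2 bits for the invariant column, so the entropy criterion alone prefers the spurious predictor in this toy.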
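Assumption 2 (Bounded invariant features).
The support $\cup_{e \in \mathcal{E}_{all}} \mathcal{Z}^e_{inv}$ of the invariant features over all environments is a bounded set. (Footnote 1: A set $\mathcal{Z}$ is bounded if $\exists M < \infty$ such that $\forall z \in \mathcal{Z}$, $\|z\| \leq M$.)
Assumption 3 (Bounded spurious features).
The support $\cup_{e \in \mathcal{E}_{all}} \mathcal{Z}^e_{spu}$ of the spurious features over all environments is a bounded set.
Assumption 4 (Invariant feature support overlap).
$\forall e \in \mathcal{E}_{all}$, $\mathcal{Z}^e_{inv} \subseteq \cup_{e' \in \mathcal{E}_{tr}} \mathcal{Z}^{e'}_{inv}$.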
Assumption 5 (Strictly separable invariant features).
Under the above assumptions, we now present the main OOD generalization results from [1] for linear classification.
Theorem 1 (Impossibility of guaranteed OOD generalization for linear classification [1]).
Based on the above theorem, it is claimed in [1] that without the support overlap assumption (Assumption 4) on the invariant features, OOD generalization is impossible for linear classification, and Assumption 4 is therefore required in all of their subsequent analyses. However, we show that this assumption is rather strong. For example, consider a generalization task in a single environment without spurious features: requiring support overlap between training and test data would reward memorization rather than generalization, which trivializes the problem. With such concerns, we propose a much weaker assumption that is closely tied to the learning hypothesis space and connects to traditional generalization theory [42]. To illustrate, we first give an intuitive example in Figure 2, where the learning hypothesis space is the set of linear hyperplanes; although the support sets of invariant features differ between training and test environments, OOD generalization can still succeed if the invariant features are learned.
Let be the mixture distribution of invariant features in the training environments, , and (denote as 0-1 loss for convenience). We now present a weaker assumption on the invariant features.
Assumption 6.
, where with the 0-1 loss function.
Clearly, under the assumption of separable invariant features (Assumption 5), if Assumption 4 holds, Assumption 6 also holds, but not vice versa. We will show later that Assumption 6 can be substituted for Assumption 4 in the guarantee of OOD generalization for our proposed method. Before that, we review another main result presented in [1].
Theorem 2 (IBIRM and IBERM vs. IRM and ERM [1]).
Suppose each follows Assumption 1. Assume that the invariant features are strictly separable, bounded, and satisfy support overlap, i.e., Assumptions 2, 4, and 5 hold. Also, for each , where , is continuous (or discrete with each component taking at least two distinct values), bounded, and zero-mean noise. Each solution to IBIRM (Eq. (4) with as 0-1 loss and ), and IBERM solves the OOD generalization (Eq. (1)) but ERM and IRM (Eq. (3)) fail.
We will show that IBIRM can fail in many cases; here we present an illustrative example of such a failure.
Example 1.
Following Assumption 1, and let with and varies in different environments. For any environment , we assume that the distribution of and does not change. As shown in Fig. 2, and is a discrete distribution uniformly on six points ((5,4),(5,5),(5,6),(5,4),(5,5),(5,6)). Now we can construct the training data of two environments:
Then, by applying IBIRM to the above example with as 0-1 loss and , we would get a model of as . Consider the prediction made by this model as (we ignore the classifier bias here for convenience)
(5) 
It is trivial to show that the of and is an invariant predictor across training environments with classification error , and it achieves the least entropy of for each training environment , and therefore, it is a solution of IBIRM. However, since the predictor of relies on spurious features that may change arbitrarily on unseen environments, it thus fails to solve the OOD generalization problem (Eq. (1)).
Understanding the failures: Although Example 1 satisfies Assumptions 2, 3, 4, and 5, these are still insufficient for IBIRM to succeed, which appears to contradict Theorem 2. It is worth noting that this apparent contradiction is due to a different condition on . In Example 1, is a delta distribution or a discrete distribution supported on only one point, whereas Theorem 2 requires to be a variable with at least two distinct values and zero-mean noise in each component, which usually does not hold in practice, especially when the dimension of the spurious features is high. For example, if we replace the in Example 1 with two distinct values of and , with probability 0.5 each, and set two environments of and , we can still conclude that IBIRM may fail by setting and . What, then, is the true reason behind this? The reason is that the invariant features are not the minimal sufficient statistics for the label. In this case, IBIRM may fail when the spurious features in the training environments capture less information yet remain sufficient for the label. We state this formally as follows.
Theorem 3.
Following Assumption 1, assume that a) the invariant features are strictly separable, bounded, and satisfy support overlap (Assumptions 2, 4, and 5 hold), and b) the invariant features are discrete variables with at least three distinct points (not on the same line) and support overlap in each training environment. Then, there exists at least one pair of such that the transformed variables are still bounded and satisfy support overlap, and satisfy and for any .
When there exists a component of (assume the th component ) such that for all training environments, is a delta distribution or a discrete distribution supported on only one point, then we have for any . From the above theorem, we can see that when the spurious features in the training environments include the information and are supported on only one point for some , IBIRM and IBERM would prefer to select these spurious features (also with a nonzero weight on ) because of their smaller entropy, and thus may fail to generalize to unseen environments where the spurious features change (due to the change of ). Note that we only analyze the discrete distribution of here for convenience; the conclusions for the continuous case are similar.
In the above example, we have shown a failure case of IBIRM when the condition assumed on is slightly violated. We now move to another failure mode of IBIRM that occurs even when this assumption is satisfied, i.e., is zero-mean with each component taking at least two distinct points, in both the FIIF and PIIF cases. We first show the FIIF case below.
Example 2.
Following Assumption 1, and let with and , and is zero-mean with at least two distinct points, where and may vary in different environments. For any environment , we assume that the distribution of does not change. As shown in Fig. 2, is the generated classifier. Now we can construct the training data of two environments:
and set or with probability 0.5 each in .
Then, by applying IBIRM to the above example with as 0-1 loss and , we would get a model of as . Consider the prediction made by this model as (we ignore the bias here for convenience)
(6) 
It is trivial to show that the of and is an invariant predictor across training environments with classification error , and it achieves the least entropy of for each training environment , and therefore, it is a solution of IBIRM. However, since the predictor of relies on spurious features which may change arbitrarily on unseen environments, it thus fails to solve the OOD generalization problem (Eq. (1)).
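The kind of failure described in Example 2 can be reproduced on a toy problem. The numbers below are our own and are not those of the example (which are not recoverable here); the helper names `make_env`, `zero_one_risk`, and `worst_case_risk` are hypothetical. The point is only that a predictor using the spurious coordinate alone can have zero error in every training environment and still have the worst possible error once the spurious correlation flips.

```python
import numpy as np

def zero_one_risk(w, X, y):
    """Average 0-1 loss of the homogeneous linear classifier 1(X @ w >= 0)."""
    return float(np.mean((X @ w >= 0).astype(int) != y))

def make_env(z_inv, spu_sign, spu_scale):
    """Toy environment: the label is the sign of z_inv; the spurious feature
    copies the label with an environment-specific sign and magnitude."""
    y = (z_inv >= 0).astype(int)
    z_spu = spu_sign * spu_scale * (2 * y - 1)
    return np.stack([z_inv, z_spu], axis=1), y

def worst_case_risk(w, envs):
    """Maximum risk over a collection of environments, as in the min-max objective."""
    return max(zero_one_risk(w, X, y) for X, y in envs)

z = np.array([-3.0, -1.0, 1.0, 3.0])
train_envs = [make_env(z, +1, 1.0), make_env(z, +1, 2.0)]
test_env = make_env(z, -1, 1.0)     # spurious correlation reversed at test time

w_inv = np.array([1.0, 0.0])        # relies on the invariant coordinate only
w_spu = np.array([0.0, 1.0])        # relies on the spurious coordinate only

print(worst_case_risk(w_spu, train_envs))               # 0.0
print(worst_case_risk(w_spu, train_envs + [test_env]))  # 1.0
print(worst_case_risk(w_inv, train_envs + [test_env]))  # 0.0
```

Because the spurious-only predictor is optimal in each training environment, an invariance constraint evaluated on the training environments cannot rule it out.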
We now move to the PIIF case as follows.
Assumption 7 (Linear classification SEM (PIIF)).
In each
(7)  
where with is the labeling hyperplane, , , , is binary noise with identical distribution across environments, is continuous (or discrete variable with each component at least two distinct values), bounded, and zero mean, which varies in different environments, is the XOR operator, and is invertible.
The DAG of the above assumption can be seen in Fig. 1.
Example 3.
Then, by applying IBIRM to the above example with as 0-1 loss and , we would get a model of as . Consider the prediction made by this model as (we ignore the classifier bias here for convenience)
(8) 
Next, we show that and , and therefore IBIRM fails to address the OOD generalization problem (Eq. (1)). Clearly, the results for the above example can be divided into four categories: case 1): and ; case 2): and ; case 3): and ; case 4): and .
First, it is trivial to show that case 4) is impossible since the classification error of such a predictor is , while for each of the remaining three cases the classification error is equal to or smaller than . For example, for case 1): and , the error is 0; for case 2): and , the error is ; for case 3): and , the error is .
Second, we show that the resulting predictor of case 1) of and is an invariant predictor across training environments. This is because the predictor of and makes invariant across the two training environments for any , and thus is an invariant predictor (see the Appendix for details of the proof).
Finally, it is trivial to show that the predictor of case 1) has the least entropy among cases 1) to 3). Clearly, in case 1): ; in cases 2) and 3): for each environment e.
Understanding the failures: In the above two examples, the invariance constraint fails to remove the spurious features because the spurious features generated from different label values are strictly linearly separable across all training environments. A predictor relying only on spurious features can then achieve zero training error and thus be an invariant predictor across training environments. Since the label set is finite in classification problems (with only two values in binary classification), such a phenomenon may occur, whereas it would not happen in regression problems. We state this failure mode formally below.
Theorem 4.
Theorem 4 is intuitive: when the spurious features in the training environments with different labels are linearly separable, no algorithm can distinguish spurious features from invariant features. Although the assumption behind this failure seems strong, we show in the Appendix that for high-dimensional data, i.e., when is large (a common case in practice, such as image data), if the number of environments , the conditions in Theorem 4 are satisfied with high probability, and thus OOD generalization fails when optimizing IBIRM. This is because, in dimensional space, randomly drawn distinct points are linearly separable for any two subsets with high probability.
3 Counterfactual supervision-based information bottleneck
In the above analyses, we have shown several failure cases of IBIRM for OOD generalization in the linear classification problem. The key failure arises when the learned features rely only on spurious features. To prevent such failure, we present the counterfactual supervision-based information bottleneck (CSIB) learning principle, which removes the spurious features iteratively.
Overall, CSIB first uses IBERM to extract features, then applies two counterfactual interventions to the learned features of a single example to obtain two counterfactual examples. If these two counterfactual examples are assigned the same class by human supervision, the learned features may be spurious. We then remove these spurious features and apply IBERM again, until only invariant features are learned. The details of the CSIB algorithm are given below (let be an identity matrix initially):
Step 1 (IBERM): Apply IBERM algorithm to all the training environment data as:
(9) 
with the 0-1 loss function.
Step 2 (SVD decomposition): Assume and are the feature extractor and classifier learned by IBERM and the rank of is . We first apply singular value decomposition (SVD) to and get with orthogonal matrices and , which can be partitioned by with and is the diagonal matrix with nonzero elements, and and .
Step 3 (Counterfactual supervision): Pick a random sample from training environment with ground truth label . Assume that the input data are bounded by , then construct two new features and by the following operation: and ; and . Map the new features and back to the input space as and . If the label of is different from that of according to human supervision, end the algorithm and return the feature extractor and classifier ; otherwise set , update the environment data variable for each , and go to Step 1.
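The linear-algebra part of Steps 2 and 3 can be sketched as follows. This is not the authors' exact construction, whose formulas are not recoverable here: the rule of setting the coordinate along every learned direction to +M and -M, the projection used to discard the learned directions, and all function names are assumptions of this sketch. The human labeling step is outside the code.

```python
import numpy as np

def split_by_svd(Phi, tol=1e-10):
    """SVD of a (possibly rank-deficient) linear feature extractor Phi.

    Returns the rank-r factors together with V1 (input directions Phi reads)
    and V2 (input directions Phi ignores)."""
    U, s, Vt = np.linalg.svd(Phi)        # Phi = U[:, :len(s)] @ diag(s) @ Vt[:len(s)]
    r = int(np.sum(s > tol))
    V1 = Vt[:r].T                        # shape (n, r)
    V2 = Vt[r:].T                        # shape (n, n - r)
    return U[:, :r], s[:r], V1, V2

def counterfactual_pair(x, V1, M):
    """Two inputs that agree with x outside span(V1) and whose coordinates along
    every learned direction are set to +M and -M (assumed intervention rule)."""
    rest = x - V1 @ (V1.T @ x)           # component of x that Phi does not read
    shift = V1 @ (M * np.ones(V1.shape[1]))
    return rest + shift, rest - shift

def remove_learned_directions(X, V1):
    """Project the rows of X onto the orthogonal complement of span(V1)."""
    return X - (X @ V1) @ V1.T

# A feature extractor that only reads the second input coordinate.
Phi = np.array([[0.0, 1.0, 0.0]])
_, _, V1, V2 = split_by_svd(Phi)
x = np.array([1.0, 2.0, 3.0])
x_a, x_b = counterfactual_pair(x, V1, M=5.0)
# If a human gives x_a and x_b the same label, the learned direction is treated
# as spurious and projected out before the next round of training.
X_next = remove_learned_directions(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]), V1)
```

In this toy, `x_a` and `x_b` differ only in the second coordinate, and after the projection the old extractor returns zero on every row of `X_next`, so a retrained extractor must rely on other directions.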
Theorem 5 (Guarantee of CSIB).
Suppose each follows Assumption 1 or 7 with the orthogonal (invertible) transformation. Assume that the invariant features are strictly separable (Assumption 5), bounded (Assumption 2), and satisfy Assumption 6. Also, for each in Assumption 1, with and or with and , where is continuous (or a discrete variable with each component taking at least two distinct values), bounded, and zero-mean. Each solution to CSIB with as 0-1 loss, , and solves the OOD generalization (Eq. (1)).
Significance of Theorem 5. CSIB succeeds without assuming support overlap for the invariant features and applies to multiple cases where IBIRM (as well as ERM, IRM, and IBERM) could fail, while requiring further supervision on only a single sample. Through this counterfactual intervention, CSIB works even when only data from a single environment are available, which is especially significant when data from multiple environments are not available.
4 Experiments
Following the motivating examples, we perform experiments on three synthetic datasets covering both the FIIF and PIIF cases to verify our method, counterfactual supervision-based information bottleneck (CSIB), and compare it with ERM, IBERM, IRM, and IBIRM. We follow the same protocol for tuning hyperparameters as [3, 4, 1] and report the classification error for all experiments. In the following, we first briefly describe the designed datasets and then report the main results. More experimental details can be found in the Appendix.
4.1 Datasets
Example 1/1S (FIIF). The example is a modified one from the linear unit tests introduced in [4], which generalizes the cow/camel classification task with relevant backgrounds.
The dataset of each environment is sampled from the following distribution
We set for the first three environments, and for . The scrambling matrix is an identity matrix in Example 1 and a random unitary matrix in Example 1S. Here, we set and for all environments so that the spurious features and the invariant features are both linearly separable and can be confused with each other. Experiments on different values of and are presented in the Appendix, where we have found interesting observations related to the inductive bias of neural networks.
Example 2/2S (FIIF). This example extends Example 2 to show one of the failure cases of IBIRM (as well as ERM, IRM, and IBERM) and how our method improves on them through intervention. Given , each instance in the environment data is sampled by
where we set and be the identity matrix in our experiments. We set , , , and if for different training environments. In this example the spurious features have clearly smaller entropy than the invariant features, which is the opposite of Example 1/1S.
Example 3/3S (PIIF). This example extends Example 3 and is constructed similarly to Example 2/2S, but in the PIIF setting. Let for different training environments. Each instance in the environments is sampled by
where we set in our experiments. The spurious features have smaller entropy than the invariant features in this example, as in Example 2/2S, but the invariant features enjoy a much larger margin than the spurious features, unlike in the above two examples. The properties of these three datasets are summarized in Table 1.
4.2 Summary of results
Table 2 shows the classification errors of different methods when the training data come from one, three, and six environments. ERM and IRM fail to recognize the invariant features in Example 1/1S, where the invariant features have a smaller margin than the spurious features, while the information bottleneck-based methods (IBERM, IBIRM, and CSIB) show improved results owing to the smaller entropy of the invariant features. CSIB gives results consistent with IBIRM in Example 1/1S, where the invariant features are extracted in the first run, which verifies the effectiveness of the information bottleneck for invariant feature learning in this case. In the other FIIF setting, Example 2/2S, where the invariant features have larger entropy than the spurious features, only CSIB removes the spurious features among all compared methods, whereas the information bottleneck-based method IBERM degrades the performance of ERM by focusing more on the spurious features. In the third experiment, Example 3/3S in the PIIF setting, although ERM performs reasonably well owing to the significantly larger margin of the invariant features, CSIB still improves on it by removing more spurious features. Notably, compared with IBERM and IBIRM when only spurious features are extracted (Example 2/2S, Example 3/3S), CSIB effectively removes them by intervention and then refocuses on the invariant features. Note that the nonzero average error and the fluctuating results of CSIB in some experiments arise because the entropy minimization during training is inexact: entropy is replaced by variance for ease of optimization. Nevertheless, there always exists a case where the entropy is truly minimized and the error reaches zero (see (min) in the table) in Example 2/2S and Example 3/3S.
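Why a variance surrogate can rank features differently from entropy is easy to see on two small discrete distributions. The example below is our own illustration with made-up numbers, not an experiment from the paper; `entropy_bits` and `variance` are hypothetical helper names.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def variance(values, p):
    """Variance of a discrete random variable with the given support and probabilities."""
    values, p = np.asarray(values, dtype=float), np.asarray(p, dtype=float)
    mean = (p * values).sum()
    return float((p * (values - mean) ** 2).sum())

# Feature A: two widely separated points. Feature B: four tightly packed points.
a_vals, a_p = [-10.0, 10.0], [0.5, 0.5]
b_vals, b_p = [-0.3, -0.1, 0.1, 0.3], [0.25, 0.25, 0.25, 0.25]

print(entropy_bits(a_p), variance(a_vals, a_p))  # 1 bit of entropy, variance 100
print(entropy_bits(b_p), variance(b_vals, b_p))  # 2 bits of entropy, variance about 0.05
```

Feature A has the lower entropy but by far the higher variance, so a variance penalty would favor feature B while an entropy penalty would favor feature A.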
In summary, CSIB consistently improves on the other methods in both FIIF and PIIF settings and is especially more effective than IBERM and IBIRM when the spurious features have much smaller entropy than the invariant features.
5 Related works
We divide the works related to OOD generalization into two categories: theory and methods, though some of them belong to both.
5.1 Theory of OOD generalization
Based on different definitions of the distributional changes, we review the corresponding theory in the following three categories.
Based on causality. Due to the close connection between distributional changes and the interventions discussed in the theory of causality [34, 36], the problem of OOD generalization is usually framed within causal learning. The theory states that a response is directly caused only by its parent variables and all interventions other than those on do not change the conditional distribution of . This theory inspires a popular learning principle, the invariance principle, which aims to discover a set of variables whose relationship to the response remains invariant in all observed environments [35, 18, 39]. Invariant risk minimization (IRM) [3] is then proposed to learn a feature extractor in an end-to-end way such that the classifier learned on the extracted features remains unchanged in each environment. The theory in [3] guarantees OOD generalization of IRM under some general assumptions, but only for linear regression tasks. Different from the failure analyses of IRM for classification tasks in [41, 20] in the PIIF setting, Ahuja et al. [1] first show that under the FIIF setting, linear classification is more difficult than linear regression, where the invariance principle itself is insufficient to ensure the success of OOD generalization, and claim that support overlap of invariant features is necessary. They then propose the learning principle of information bottleneck-based invariant risk minimization (IBIRM) for linear classification, which addresses the failures of IRM by adding the information bottleneck [47] to the learning. In this work, we closely investigate the conditions identified in [1] and first show that support overlap of invariant features is not necessary for the success of OOD generalization. We further show several failure cases of IBIRM and propose improved results.
Recently, some works have been proposed to tackle the challenge of OOD generalization in the nonlinear regime [28, 26]. Both use variational autoencoder (VAE)-based models [21, 38] to identify the latent variables from observations in a first stage. The inferred latent variables are then separated into two distinct parts, invariant (causal) and spurious (non-causal) features, based on different assumptions. Specifically, Lu et al. [27, 28] assume that the latent variables conditioned on some accessible side information, such as the environment index or class label, follow exponential family distributions, and Liu et al. [26] directly disentangle the latent variables into two parts during the inference stage and assume that their marginal distributions are independent of each other. These assumptions, however, are rather strong in general. Moreover, these solutions aim to capture latent variables such that the response given these variables is invariant across environments, which could still fail in the FIIF setting, where the invariance principle itself is insufficient for OOD generalization in classification tasks, as shown in [1]. In this work, we focus on linear classification only and present a new theory for a new method that addresses most of the OOD generalization failures in both PIIF and FIIF settings.
Based on robustness. Different from causality-based approaches, where different distributions are generated by the same causal graph and the goal is to discover causal features, robustness-based methods aim to protect the model against potential distributional shifts within an uncertainty set, which is usually constrained by an f-divergence [32] or the Wasserstein distance [43]. This line of work is theoretically addressed by distributionally robust optimization (DRO) under a minimax framework [23, 12]. Recently, some works have sought to connect causality and robustness [10]. Although these works are less relevant to ours, a well-defined measure of distribution divergence could possibly help to extract causal features effectively under the robustness framework. This would be an interesting avenue for future research.
Others. Some other works assume that the distributions (domains) are generated from a hyper-distribution and aim to minimize the average risk estimation error bound [9, 30, 11]. These works are often built on generalization theory under the independent and identically distributed (IID) assumption. The work in [53] makes no assumption on the distributional changes and only studies the learnability of OOD generalization in a general way. None of these theories covers the OOD generalization problem under a single training environment or domain.
5.2 Methods of OOD generalization
Based on invariance principle. Inspired by the invariance principle [35, 18], many methods design various losses to extract features that better satisfy the principle. IRMv1 [3] is the first objective to address this in an end-to-end way by adding a gradient penalty to the classifier. Following this work, Krueger et al. [22] suggest penalizing the variance of the risks, while Xie et al. [51] give the same objective but take the square root of the variance. Many other alternatives can also be found [19, 29, 6]. All of these methods aim to find an invariant predictor. Recently, Ahuja et al. [1] find that for classification problems, finding an invariant predictor is not enough to extract causal features, since the features can include spurious information that makes the predictor invariant across training environments, and they propose IBIRM to address this failure. An idea similar to IBIRM can also be found in [24], where a different loss function is proposed to achieve the same purpose. More recently, Wang et al. [49] propose ideas similar to ours but only tackle the situation where the invariant features have the same distribution in all environments. In this work, we further show that IBIRM can still fail in several cases because the model may rely only on spurious features to make the predictor invariant. We then propose the counterfactual supervision-based information bottleneck (CSIB) method to address such failures and show improved results over prior works.
Based on distribution matching. It is worth noting that many works focus on learning domain-invariant feature representations [13, 25, 56]. Most of these works are inspired by the seminal theory of domain adaptation [8, 7]. The goal of these methods is to learn a feature extractor such that the marginal distribution of or the conditional distribution of is invariant across different domains. This is different from the invariance principle, where the goal is to make (or ) invariant. We refer readers to [3, 55] for details on why these distribution matching-based methods often fail to address OOD generalization.
Others. Other related methods are varied, including data augmentation at the image level [52] or feature level [57], removing spurious correlations through stable learning [54], and exploiting the inductive bias of neural networks [15, 48]. Most of these methods are empirically motivated and are verified only on specific datasets. Recently, empirical studies [16, 50] observe that the real effects of many OOD generalization (domain generalization) methods are weak, which indicates that benchmark-based evaluation criteria may be inadequate for validating OOD generalization algorithms.
6 Conclusion, limitations and future work
In this paper, we focus on the OOD generalization problem for linear classification. We first revisit the fundamental assumptions and results of prior works and show that the support overlap condition on invariant features is not necessary for the success of OOD generalization, and we propose a weaker counterpart. Then, we show several failure cases of IBIRM (as well as ERM, IBERM, and IRM) and explain their intrinsic causes through theoretical analysis. Motivated by this, we further propose a new method, counterfactual supervision-based information bottleneck (CSIB), and theoretically prove its effectiveness under weaker assumptions. CSIB works even when accessing data from a single environment, and has theoretically consistent results for both binary and multiclass problems. Finally, we design several synthetic datasets based on our motivating examples for experimental verification. Empirical comparisons across all methods illustrate the effectiveness of CSIB.
Since we consider only the linear setting, with a linear representation and a linear classifier, nonlinear cases are not covered by our theoretical results, and CSIB may fail there. Therefore, as in prior works (IRM [3] and IBIRM [1]), the nonlinear challenge remains unsolved [41, 20]. We believe this is of great value to investigate in future work, since widely used data in the wild are nonlinearly generated. Another fruitful direction is to design a powerful algorithm for entropy minimization during the learning process of CSIB. Currently, we use the variance of the features in place of their entropy during optimization. However, variance and entropy are essentially different, and a truly effective entropy minimization is key to the success of CSIB. Another limitation of our method is that it requires additional supervision on the counterfactual examples during training, although this supervision is needed only once, for a single step.
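The gap between variance and entropy can be seen in a small example. The sketch below assumes a discrete feature taking the values -a and +a with equal probability and uses Shannon entropy in nats; the paper's setting may involve continuous features, so this is only an illustration, and the function names are hypothetical.

```python
import math

def variance(values, probs):
    """Variance of a discrete random variable."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def two_point_stats(a):
    """Variance and entropy of a feature equal to -a or +a with probability 1/2."""
    probs = [0.5, 0.5]
    return variance([-a, a], probs), entropy(probs)

for a in (0.1, 1.0, 10.0):
    var, ent = two_point_stats(a)
    print(f"a={a}: variance={var:.4f}, entropy={ent:.4f}")
```

Shrinking a drives the variance toward zero while the entropy stays at log 2, so a variance penalty can be made small without reducing the information carried by the feature.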
References
 [1] (2021) Invariance principle meets information bottleneck for out-of-distribution generalization. Neural Information Processing Systems 34. Cited by: §A.1.1, §A.1.2, §A.2.1, Counterfactual Supervision-based Information Bottleneck for Out-of-Distribution Generalization, §1, §1, §2, §2, §2, §2, §2, §2, §4, §5.1, §5.1, §5.2, §6, Theorem 1, Theorem 2.
 [2] (2020) Invariant risk minimization games. In International Conference on Machine Learning, pp. 145–155. Cited by: §1.
 [3] (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: Counterfactual Supervision-based Information Bottleneck for Out-of-Distribution Generalization, §1, §2, §2, §4, §5.1, §5.2, §5.2, §6.
 [4] (2021) Linear unit-tests for invariance discovery. arXiv preprint arXiv:2102.10867. Cited by: §A.1.2, §A.1.3, §4.1, §4.

 [5] (2018) Recognition in terra incognita. In European Conference on Computer Vision, pp. 456–473. Cited by: §1.
 [6] (2020) Generalization and invariances in the presence of unobserved confounding. arXiv preprint arXiv:2007.10653 11. Cited by: §5.2.
 [7] (2010) A theory of learning from different domains. Machine learning 79 (1), pp. 151–175. Cited by: §5.2.
 [8] (2006) Analysis of representations for domain adaptation. Advances in neural information processing systems 19. Cited by: §5.2.
 [9] (2011) Generalizing from several related classification tasks to a new unlabeled sample. Advances in neural information processing systems 24. Cited by: §5.1.
 [10] (2020) Invariance, causality and robustness. Statistical Science 35 (3), pp. 404–426. Cited by: §5.1.
 [11] (2019) A generalization error bound for multiclass domain generalization. arXiv preprint arXiv:1905.10392. Cited by: §5.1.
 [12] (2021) Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics 49 (3), pp. 1378–1406. Cited by: §5.1.

 [13] (2015) Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp. 1180–1189. Cited by: §5.2.
 [14] (2020) Shortcut learning in deep neural networks. Nature Machine Intelligence 2 (11), pp. 665–673. Cited by: §1.
 [15] (2019) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, Cited by: §1, §5.2.
 [16] (2020) In search of lost domain generalization. In International Conference on Learning Representations, Cited by: §5.2.
 [17] (2018) Annotation artifacts in natural language inference data. In NAACL-HLT (2), Cited by: §1.
 [18] (2018) Invariant causal prediction for nonlinear models. Journal of Causal Inference 6 (2). Cited by: §5.1, §5.2.
 [19] (2020) Domain extrapolation via regret minimization. arXiv preprint arXiv:2006.03908. Cited by: §5.2.

 [20] (2021) Does invariant risk minimization capture invariance?. In International Conference on Artificial Intelligence and Statistics, pp. 4069–4077. Cited by: §5.1, §6.
 [21] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §5.1.
 [22] (2021) Out-of-distribution generalization via risk extrapolation (REx). In International Conference on Machine Learning, pp. 5815–5826. Cited by: §1, §5.2.
 [23] (2018) Minimax statistical learning with wasserstein distances. Advances in Neural Information Processing Systems 31. Cited by: §5.1.
 [24] (2022) Invariant information bottleneck for domain generalization. In Association for the Advancement of Artificial Intelligence, Cited by: §5.2.
 [25] (2018) Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639. Cited by: §5.2.
 [26] (2021) Learning causal semantic representation for out-of-distribution prediction. Neural Information Processing Systems 34. Cited by: §5.1.
 [27] (2021) Nonlinear invariant risk minimization: a causal approach. arXiv preprint arXiv:2102.12353. Cited by: §5.1.
 [28] (2022) Invariant causal representation learning for out-of-distribution generalization. In International Conference on Learning Representations, Cited by: §5.1.
 [29] (2021) Domain generalization using causal matching. In International Conference on Machine Learning, pp. 7313–7324. Cited by: §5.2.
 [30] (2013) Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18. Cited by: §5.1.
 [31] (2021) Understanding the failure modes of out-of-distribution generalization. In International Conference on Learning Representations, Cited by: Counterfactual Supervision-based Information Bottleneck for Out-of-Distribution Generalization.
 [32] (2016) Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in neural information processing systems 29. Cited by: §5.1.

 [33] (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition Conference, pp. 427–436. Cited by: §1.
 [34] (2009) Causality. Cambridge university press. Cited by: §1, §5.1.

 [35] (2016) Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), pp. 947–1012. Cited by: §1, §5.1, §5.2.
 [36] (2017) Elements of causal inference: foundations and learning algorithms. The MIT Press. Cited by: §1, §5.1.
 [37] (2021) Gradient starvation: a learning proclivity in neural networks. Neural Information Processing Systems 34. Cited by: §1.
 [38] (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286. Cited by: §5.1.

 [39] (2018) Invariant models for causal transfer learning. The Journal of Machine Learning Research 19 (1), pp. 1309–1342. Cited by: §5.1.
 [40] (2018) The elephant in the room. arXiv preprint arXiv:1808.03305. Cited by: §1.
 [41] (2021) The risks of invariant risk minimization. In International Conference on Learning Representations, Cited by: Counterfactual Supervision-based Information Bottleneck for Out-of-Distribution Generalization, §5.1, §6.
 [42] (2014) Understanding machine learning: from theory to algorithms. Cambridge university press. Cited by: §2.
 [43] (2017) Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571. Cited by: §5.1.
 [44] (2018) The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19 (1), pp. 2822–2878. Cited by: Table 1.
 [45] (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
 [46] (2006) Elements of information theory. Wiley-Interscience. Cited by: §A.2.1.
 [47] (1999) The information bottleneck method. In Proc. 37th Annual Allerton Conference on Communications, Control and Computing, pp. 368–377. Cited by: §2, §5.1.
 [48] (2019) Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §5.2.
 [49] (2022) Provable domain generalization via invariant-feature subspace recovery. In International Conference on Machine Learning, Cited by: §5.2.
 [50] (2022) A fine-grained analysis on distribution shift. In International Conference on Learning Representations, Cited by: §5.2.
 [51] (2020) Risk variance penalization: from distributional robustness to causality. arXiv preprint arXiv:2006.07544 1. Cited by: §5.2.
 [52] (2021) A Fourier-based framework for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14383–14392. Cited by: §5.2.
 [53] (2021) Towards a theoretical framework of out-of-distribution generalization. In Neural Information Processing Systems, Cited by: §5.1.
 [54] (2021) Deep stable learning for out-of-distribution generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5372–5382. Cited by: §5.2.
 [55] (2019) On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pp. 7523–7532. Cited by: §5.2.
 [56] (2020) Domain generalization via entropy regularization. Advances in Neural Information Processing Systems 33, pp. 16096–16107. Cited by: §5.2.
 [57] (2021) Domain generalization with mixstyle. In International Conference on Learning Representations, Cited by: §5.2.
Appendix A Appendix
A.1 Experiment details
In this section, we provide more details on the experiments. The code to reproduce the experiments can be found at https://github.com/szubing/CSIB.
A.1.1 Optimization loss of IBERM
The objective function of IBERM is as follows:
(10) 
Since the entropy of the features is hard to estimate with a differentiable quantity that can be optimized by gradient descent, we follow [1] and use the variance of the features in place of the entropy for optimization. The total loss function is given by
(11) 
with a hyperparameter weighting the variance term.
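A variance-penalized empirical risk of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the variance term is taken to be the sum over feature dimensions of the batch variance, the weight is called `lam`, the risk is the mean binary cross-entropy, and all function names are hypothetical.

```python
import math

def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def feature_variance(features):
    """Sum over feature dimensions of the (population) variance across the batch."""
    n, d = len(features), len(features[0])
    total = 0.0
    for j in range(d):
        column = [f[j] for f in features]
        mean = sum(column) / n
        total += sum((c - mean) ** 2 for c in column) / n
    return total

def variance_penalized_loss(phi, w, xs, ys, lam):
    """Mean BCE of the linear model w . (phi x) plus lam times the feature variance.

    phi is a list of rows (feature_dim x input_dim); w has length feature_dim.
    """
    features = [[sum(p * x for p, x in zip(row, xvec)) for row in phi] for xvec in xs]
    logits = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
    risk = sum(bce_with_logit(z, y) for z, y in zip(logits, ys)) / len(ys)
    return risk + lam * feature_variance(features)

# Toy usage with assumed data: one feature that copies the first input coordinate.
phi = [[1.0, 0.0]]
w = [0.0]
xs = [[1.0, 5.0], [3.0, -2.0]]
ys = [1, 0]
print(variance_penalized_loss(phi, w, xs, ys, lam=0.5))
```

In a gradient-based implementation the same quantity would be computed on tensors so that both terms are differentiable with respect to the feature extractor and the classifier.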
A.1.2 Experimental setup
Model, hyperparameters, loss, and evaluation. In all experiments, we follow the protocol prescribed by [4, 1] for model and hyperparameter selection, training, and evaluation. Unless otherwise specified, for all experiments across the three Examples and five compared methods, the model is the same: a linear feature extractor followed by a linear classifier. We use the binary cross-entropy loss for classification. All hyperparameters, including the learning rate, the penalty weight in IRM, and the weight associated with the Var term in Eq. (11), are randomly searched and selected using 20 test samples for validation. The results reported in the main manuscript use 3 hyperparameter queries each and are averaged over 5 data seeds. Results obtained when searching over more hyperparameter values are reported in the supplementary experiments. The search spaces of all hyperparameters are the same as in [4, 1]. We report classification test errors, which lie between 0 and 1.
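The random search described above can be sketched as follows. This is one plausible reading of the protocol, not the code from the referenced repositories: the search ranges are placeholders, the callback `train_and_validate` is a hypothetical function that trains a model and returns its validation error, and averaging over seeds before selecting a configuration is an assumption.

```python
import random

def random_search(train_and_validate, n_queries, n_seeds, rng):
    """Sample n_queries random configurations and keep the one with the
    lowest validation error averaged over n_seeds data seeds."""
    best = None
    for _ in range(n_queries):
        config = {
            "lr": 10 ** rng.uniform(-4, -2),      # placeholder log-uniform range
            "penalty": 10 ** rng.uniform(-3, 0),  # placeholder log-uniform range
        }
        errors = [train_and_validate(config, seed) for seed in range(n_seeds)]
        mean_error = sum(errors) / len(errors)
        if best is None or mean_error < best[1]:
            best = (config, mean_error)
    return best

def dummy_train_and_validate(config, seed):
    """Stand-in for real training: error is smallest near lr = 1e-3."""
    return (config["lr"] - 1e-3) ** 2

best_config, best_error = random_search(
    dummy_train_and_validate, n_queries=3, n_seeds=5, rng=random.Random(0)
)
print(best_config, best_error)
```

Increasing `n_queries` explores the search space more thoroughly at a proportional cost in training runs.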
Compute description. Our computing resource is one NVIDIA GeForce GTX 1080 Ti GPU with 6 CPU cores of an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz.
Existing code and datasets used. In our experiments, we mainly rely on the following two GitHub repositories: InvarianceUnitTests (https://github.com/facebookresearch/InvarianceUnitTests) and IBIRM (https://github.com/ahujak/IBIRM).
A.1.3 Supplementary experiments
The purpose of the first supplementary experiment is to illustrate how the results change when we increase the number of hyperparameter queries used in hyperparameter selection. These results are shown in Tab. A1, where we increase the number of hyperparameter queries to 10 each. Overall, the results of CSIB in Tab. A1 are much better and fluctuate less than those in Tab. 2, and the conclusions remain almost the same as those summarized in Section 4.2. This further verifies the effectiveness of the CSIB method.