The effectiveness of machine learning algorithms based on empirical risk minimization (ERM) relies on the assumption that the testing and training data are drawn independently from the same distribution, which is known as the IID hypothesis. However, distributional shifts between testing and training data are usually inevitable due to data selection biases or unobserved confounders, both of which are widespread in real data. Under such circumstances, ERM-based algorithms usually generalize poorly because they greedily exploit correlations in the training data that are not stable under distributional shifts. Guaranteeing that a machine learning algorithm has out-of-distribution (OOD) generalization ability and stable performance under distributional shifts is therefore of paramount significance, especially in high-stakes applications such as medical diagnosis, criminal justice, and financial analysis (Kukar, 2003; Berk et al., 2018; Rudin and Ustun, 2018).
Two main branches of methods have been proposed to solve the OOD generalization problem, namely distributionally robust optimization (DRO) (Esfahani and Kuhn, 2018; Duchi and Namkoong, 2018; Sinha et al., 2018; Sagawa et al., 2019) and invariant learning (Arjovsky et al., 2019; Koyama and Yamaguchi, 2020; Chang et al., 2020). DRO methods optimize the worst-case performance over a set of distributions to ensure OOD generalization. While DRO is a powerful family of methods, it is often criticized for being over-pessimistic when the distribution set is large (Hu et al., 2018; Frogner et al., 2019). From another perspective, invariant learning methods exploit the causally invariant correlations (rather than varying spurious correlations) across multiple training environments, resulting in OOD-optimal predictors. However, the effectiveness of such methods relies heavily on the quality of the training environments, and the intrinsic role of environments in invariant learning remains vague in theory. More importantly, modern big data are frequently assembled by merging data from multiple sources without explicit source labels. The resulting unobserved heterogeneity renders these invariant learning methods inapplicable.
In this paper, we propose Heterogeneous Risk Minimization (HRM), an optimization framework that jointly learns the latent heterogeneity among the data and the invariant predictor, which leads to better generalization ability despite distributional shifts. More specifically, we theoretically characterize the roles of the environment labels in invariant learning, which motivates us to design two modules in the framework, corresponding to heterogeneity identification and invariant learning respectively. We provide theoretical justification for the mutual promotion of these two modules, which supports the joint optimization process. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of HRM in terms of average performance, stability, and worst-case performance under different settings of distributional shifts. We summarize our contributions as follows:
1. We propose the novel HRM framework for OOD generalization without environment labels, in which heterogeneity identification and invariant prediction are jointly optimized.
2. We theoretically characterize the role of environments in invariant learning from the perspective of heterogeneity, based on which we propose a novel clustering method for heterogeneity identification from heterogeneous data.
3. We theoretically justify the mutual promotion relationship between heterogeneity identification and invariant learning, which supports the joint optimization process in HRM.
2 Problem Formulation
2.1 OOD and Maximal Invariant Predictor
Following (Arjovsky et al., 2019; Chang et al., 2020), we consider a dataset , which is a mixture of data collected from multiple training environments , and are the -th data and label from environment respectively and is number of samples in environment . Environment labels are unavailable as in most real applications.
is a random variable on indices of training environments and is the distribution of data and label in environment .
The goal of this work is to find a predictor with good out-of-distribution generalization performance, which can be formalized as:
where is the risk of predictor on environment , and is the loss function. is the random variable on indices of all possible environments such that . Usually, for all , the data and label distribution can be quite different from that of training environments . Therefore, the problem in Equation 1 is referred to as the Out-of-Distribution (OOD) Generalization problem (Arjovsky et al., 2019).
Without any prior knowledge or structural assumptions, it is impossible to solve the OOD generalization problem, since one cannot characterize the unseen latent environments in . A commonly used assumption in the invariant learning literature (Rojas-Carulla et al., 2015; Gong et al., 2016; Arjovsky et al., 2019; Kuang et al., 2020; Chang et al., 2020) is as follows:
There exists random variable such that the following properties hold:
a. : for all , we have holds.
b. : .
This assumption indicates invariance and sufficiency for predicting the target using , which is known as invariant covariates or representations with stable relationships with across different environments .
In order to acquire the invariant predictor , a branch of work to find maximal invariant predictor (Chang et al., 2020; Koyama and Yamaguchi, 2020) has been proposed, where the invariance set and the corresponding maximal invariant predictor are defined as:
The invariance set with respect to is defined as:
where is the Shannon entropy of a random variable. The corresponding maximal invariant predictor (MIP) of is defined as:
where measures Shannon mutual information between two random variables.
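The invariance condition can be checked numerically on discrete toy data. The sketch below assumes that a feature belongs to the invariance set when conditioning on the environment index does not change the conditional entropy of the label, i.e. H[Y | feature] = H[Y | feature, E]; this reading of the definition, the toy arrays, and the helper names `cond_entropy` and `is_invariant` are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from collections import Counter


def cond_entropy(y, *conds):
    """Empirical conditional entropy H[Y | conds] in bits for discrete 1-D arrays."""
    n = len(y)
    keys = list(zip(*conds)) if conds else [()] * n
    joint = Counter(zip(keys, y))   # counts of (condition values, label)
    marg = Counter(keys)            # counts of condition values
    return -sum(c / n * np.log2(c / marg[k]) for (k, _), c in joint.items())


def is_invariant(feature, y, e, tol=1e-9):
    """Assumed invariance test: H[Y | feature] equals H[Y | feature, E]."""
    return abs(cond_entropy(y, feature) - cond_entropy(y, feature, e)) < tol


# Hypothetical toy data: two equally sized environments.
y = np.array([0, 1] * 4)
e = np.array([0] * 4 + [1] * 4)
a = y.copy()                     # relation to y is identical in both environments
b = np.where(e == 0, y, 1 - y)   # relation to y flips between environments

print(is_invariant(a, y, e), is_invariant(b, y, e))
```

In this toy case the feature `b` predicts the label perfectly inside each environment but carries no information once the environments are pooled, which is the kind of spurious correlation the invariance set is meant to exclude.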
Recently, some works suppose the availability of data from multiple environments with environment labels, wherein they can find MIP (Chang et al., 2020; Koyama and Yamaguchi, 2020). However, they rely on the underlying assumption that the invariance set of is exactly the invariance set of all possible unseen environments , which cannot be guaranteed as shown in Theorem 2.2.
Since Theorem 2.2 shows that , the learned predictor is invariant only to such limited environments and is not guaranteed to be invariant with respect to all possible environments .
Here we give a toy example in Table 1 to illustrate this. We consider a binary classification between cats and dogs, where each photo contains 3 features: an animal feature , a background feature and the photographer’s signature feature . Assume all possible testing environments and the training environments , then while . The reason is that only tells us that cannot be included in the invariance set, but it cannot exclude . But if and can be further divided into and respectively, the invariance set becomes .
This example shows that the manually labeled environments may not be sufficient to achieve MIP, not to mention the cases where environment labels are not available. This limitation necessitates the study of how to exploit the latent intrinsic heterogeneity in training data (like and in the above example) to form more refined environments for OOD generalization. The environments need to be uncovered carefully with the OOD generalization problem in mind since, as indicated by Theorem D.4, not all environments help to tighten the invariance set.
Given set of environments , denote the corresponding invariance set and the corresponding maximal invariant predictor . For one newly-added environment with distribution , if for , the invariance set constrained by is equal to .
|Class 0 (Cats)||Class 1 (Dogs)|
|Mixture: 90% data from and 10% data from|
|Mixture: 90% data from and 10% data from|
2.2 Problem of Heterogeneous Risk Minimization
Besides Assumption 2.1, we make another assumption on the existence of heterogeneity in training data as:
The heterogeneity among provided environments can be evaluated by the compactness of the corresponding invariance set as . Specifically, smaller leads to higher heterogeneity, since more variant features can be excluded. Based on the assumption, we come up with the problem of heterogeneity exploitation for OOD generalization.
Heterogeneous Risk Minimization.
Given heterogeneous dataset without environment labels, the task is to generate environments with minimal and learn invariant model under learned with good OOD performance.
Theorem D.4 together with Assumption 2.2 indicates that, to better constrain , the effective way is to generate environments with varying that can exclude variant features from . Under this problem setting, we encounter a circular dependency: first we need the variant to generate heterogeneous environments ; then we need to learn the invariant as well as the variant . Furthermore, there is positive feedback between these two steps. When acquiring with tighter , a more invariant predictor (i.e. a better approximation of MIP) can be found, which gives a clearer picture of the variant parts and therefore promotes the generation of . With this notion, we propose our framework for Heterogeneous Risk Minimization (HRM), which leverages the mutual promotion between the two steps and conducts joint optimization.
In this work, we temporarily focus on a simple but general setting, where in raw feature level and satisfy Assumption 2.1. Under this setting, our Heterogeneous Risk Minimization (HRM) framework contains two interactive parts: the frontend for heterogeneity identification and the backend for invariant prediction. The general framework is shown in Figure 1.
Given the pooled heterogeneous data, it starts with the heterogeneity identification module leveraging the learned variant representation to generate heterogeneous environments . Then the learned environments are used by the OOD prediction module to learn the MIP as well as the invariant prediction model . After that, we derive the variant to further boost the module , which is supported by Theorem D.4.
As for the 'convert' step, under our setting, we adopt feature selection in this work, through which a more variant feature can be attained when a more invariant feature is learned. Specifically, the invariant predictor is generated as , and the variant part correspondingly, where is the binary invariant feature selection mask. For instance, for Table 1, , the ground truth binary mask is . In this way, the better is learned, the better can be obtained. Note that in our algorithm we use soft selection with , which is more flexible and general.
The whole framework is jointly optimized, so that the mutual promotion between heterogeneity identification and invariant learning can be fully leveraged.
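The 'convert' step described above can be sketched in a few lines. This is a minimal illustration, assuming that the invariant part is the element-wise product of the mask with the raw features and the variant part uses the complementary mask; the function name `split_features` and the example values are hypothetical.

```python
import numpy as np


def split_features(x, mask):
    """Split raw features into an invariant part and a variant part.

    Assumption: invariant part = mask * x, variant part = (1 - mask) * x,
    applied element-wise to every row of x.
    """
    mask = np.asarray(mask, dtype=float)
    invariant = x * mask
    variant = x * (1.0 - mask)
    return invariant, variant


# Hypothetical example with three raw features where only the first is invariant.
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
invariant, variant = split_features(x, [1, 0, 0])
print(invariant)
print(variant)
```

The same function accepts a soft mask with entries in the unit interval, in which case each feature is shared between the two parts in proportion to its gate value.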
3.1 Implementation of
Here we introduce our invariant prediction module , which takes training data from multiple environments as input, and outputs the corresponding invariant predictor and the indices of invariant features given current environments . We combine feature selection with invariant learning under heterogeneous environments, which selects the features with stable/invariant correlations with the label across . Specifically, the feature selection component selects the most informative features with respect to the loss function, and the invariant learning component ensures that the selected features are invariant. Their combination selects the most informative invariant features.
For invariant learning, we follow the variance penalty regularizer proposed in (Koyama and Yamaguchi, 2020) and simplify it in feature selection scenarios. The objective function of with is:
However, as the optimization of hard feature selection with a binary mask suffers from high variance, we use soft feature selection with gates taking continuous values in . Specifically, following (Yamada et al., 2020), we approximate each element of by a clipped Gaussian random variable parameterized by as
where is drawn from . With this approximation, the objective function with soft feature selection can be written as:
is a random vector with independent variables for . Under the approximation in Equation 6, is simply and can be calculated as , where is the standard Gaussian CDF. We formulate our objective as a risk minimization problem:
Then we obtain and when we obtain as well as . Further in Section 4, we theoretically prove that the prediction module is able to learn the MIP with respect to given environments .
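To make the ingredients of this module concrete, the following is a rough sketch under several assumptions that are not taken from the text: a linear model with squared loss, gates of the form clip(mu + eps, 0, 1) with Gaussian noise eps, a sparsity term equal to the sum of Gaussian CDF values of mu / sigma, and a variance penalty computed as the trace of the covariance of per-environment gradients. All function names and the weights `lam` and `alpha` are hypothetical.

```python
import numpy as np
from math import erf, sqrt


def sample_gates(mu, sigma, rng):
    """Draw one soft mask: clip(mu + eps, 0, 1) with eps ~ N(0, sigma^2)."""
    eps = rng.normal(0.0, sigma, size=mu.shape)
    return np.clip(mu + eps, 0.0, 1.0)


def expected_active(mu, sigma):
    """Expected number of non-zero gates: sum over d of Phi(mu_d / sigma)."""
    return sum(0.5 * (1.0 + erf(m / (sigma * sqrt(2.0)))) for m in mu)


def env_loss_and_grad(x, y, theta, gate):
    """Squared loss of a linear model on gated features and its gradient in theta."""
    xg = x * gate
    resid = xg @ theta - y
    return np.mean(resid ** 2), 2.0 * xg.T @ resid / len(y)


def gradient_variance_penalty(envs, theta, gate):
    """Trace of the covariance of the per-environment gradients."""
    grads = np.stack([env_loss_and_grad(x, y, theta, gate)[1] for x, y in envs])
    return np.var(grads, axis=0).sum()


def soft_objective(envs, theta, mu, sigma, lam, alpha, rng):
    """One stochastic evaluation: mean risk + lam * penalty + alpha * sparsity."""
    gate = sample_gates(mu, sigma, rng)
    mean_loss = np.mean([env_loss_and_grad(x, y, theta, gate)[0] for x, y in envs])
    penalty = gradient_variance_penalty(envs, theta, gate)
    return mean_loss + lam * penalty + alpha * expected_active(mu, sigma)
```

In practice such an objective would be minimized over both `theta` and `mu` with a gradient-based optimizer and automatic differentiation; the sketch only evaluates the quantities involved.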
3.2 Implementation of
Notation. means the learned variant part . means -dimension simplex. means the function parameterized by .
The heterogeneity identification module takes a single dataset as input, and outputs a multi-environment dataset partition for invariant prediction. We implement it with a clustering algorithm. As indicated in Theorem D.4, the more diverse for our generated environments, the better the invariance set is. Therefore, we cluster the data points according to the relationship between and , for which we use as the cluster centre. Note that is initialized as in our joint optimization.
Specifically, we assume the -th cluster centre parameterized by to be a Gaussian around as :
For the given empirical data samples , the empirical distribution is modeled as where
The target of our heterogeneous clustering is to find a distribution in to fit the empirical distribution best. Therefore, the objective function of our heterogeneous clustering is:
The above objective can be further simplified to:
As for optimization, we use the EM algorithm to optimize the centre parameter and the mixture weight . After optimizing equation 13, for building , we assign each data point to environment
In this way, is generated by .
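The clustering step can be illustrated with a small EM routine for a mixture of regressions. The sketch assumes a linear form for each cluster centre, a fixed noise scale `sigma`, and a hard assignment of each point to its most responsible cluster; these choices, and the names `em_cluster`, `psi` and `true_env`, are assumptions made for illustration and may differ from the authors' implementation.

```python
import numpy as np


def em_cluster(psi, y, k, sigma=1.0, n_iter=50, seed=0):
    """EM for a mixture of k linear-Gaussian relationships between psi and y."""
    rng = np.random.default_rng(seed)
    n, d = psi.shape
    resp = rng.dirichlet(np.ones(k), size=n)   # random soft assignments to start
    thetas = np.zeros((k, d))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # M-step: weighted least squares per cluster, then mixture weights.
        for j in range(k):
            w = resp[:, j]
            gram = psi.T @ (w[:, None] * psi) + 1e-8 * np.eye(d)
            thetas[j] = np.linalg.solve(gram, psi.T @ (w * y))
        weights = resp.mean(axis=0)
        # E-step: responsibilities from Gaussian log-likelihoods around each centre.
        resid = y[:, None] - psi @ thetas.T
        log_p = np.log(weights + 1e-12) - 0.5 * (resid / sigma) ** 2
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), thetas, weights


# Hypothetical pooled data: the relation between psi and y flips between two sources.
rng = np.random.default_rng(1)
psi = rng.uniform(1.0, 3.0, size=(200, 1))
true_env = np.repeat([0, 1], 100)
slope = np.where(true_env == 0, 2.0, -2.0)
y = slope * psi[:, 0] + 0.1 * rng.normal(size=200)
labels, thetas, weights = em_cluster(psi, y, k=2)
```

Because the clusters are defined by the relationship between the variant part and the label rather than by the feature values alone, points with similar features can still land in different environments when their labels disagree.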
4 Theoretical Analysis
In this section, we theoretically analyze our proposed Heterogeneous Risk Minimization (HRM) method. We first analyze our proposed and , and then justify the existence of the positive feedback in our HRM.
Theoretical Demonstration of We theoretically interpret our heterogeneity identification module from the perspective of rate-distortion theory; the details are left to the appendix due to space limitations.
Justification of We prove that given training environments , our invariant prediction model can learn the maximal invariant predictor with respect to the corresponding invariance set .
Given , the learned is the maximal invariant predictor of .
Justification of the Positive Feedback The core of our HRM framework is the mechanism by which and mutually promote each other. Here we theoretically justify the existence of such positive feedback. In Assumption 2.1, we assume the invariance and sufficiency properties of the stable features and assume that the relationship between the unstable part and can change arbitrarily. Here we make a more specific assumption on the heterogeneity across environments with respect to and .
Assume the pooled training data is made up of heterogeneous data sources: . For any , we assume
where is invariant feature and the variant. represents mutual information in and represents the cross mutual information between and takes the form of and .
Intuitively, Assumption 4.1 assumes that invariant feature provides more information for predicting across environments than in one single environment, and correspondingly, the information provided by shrinks a lot across environments, which indicates that the relationship between variant feature and varies across environments. Based on this assumption, we first prove that the cluster centres are pulled apart as invariant feature is excluded from clustering.
Theorem D.6 indicates that the distance between cluster centres is larger when using variant features . It is therefore more likely that the desired heterogeneous environments are obtained, which explains why we use the learned variant part for clustering. Finally, we provide the optimality guarantee for our HRM.
Under Assumption 2.1 and 4.1, for the proposed and , we have the following conclusions: 1. Given environments such that , the learned by is the maximal invariant predictor of . 2. Given the maximal invariant predictor of , assume the pooled training data is made up of data from all environments in , there exists one split that achieves the minimum of the objective function and meanwhile the invariance set regularized is equal to .
Intuitively, Theorem 4.3 proves that given one of the and optimal, the other is optimal, which validates the existence of the global optimal point of our algorithm. The theoretical relationship between our and rate-distortion theory, as well as the proofs of the above theorems, are left to the appendix.
In this section, we validate the effectiveness of our method on simulation data and real-world data.
Baselines We compare our proposed HRM with the following methods:
Empirical Risk Minimization(ERM):
Distributionally Robust Optimization(DRO (Sinha et al., 2018)):
Environment Inference for Invariant Learning(EIIL (Creager et al., 2020)):
Invariant Risk Minimization(IRM (Arjovsky et al., 2019)) with environment labels:
Further, for ablation study, we also compare with HRM, which runs HRM for only one iteration without the feedback loop. Note that IRM is based on multiple training environments and we provide environment labels for IRM, while others do not need environment labels.
Evaluation Metrics To evaluate the prediction performance, we use defined as , defined as , which are mean and standard deviation error across and , which are mean error, standard deviation error and worst-case error across .
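The three summary statistics named above (mean error, standard deviation error and worst-case error across environments) can be computed from a list of per-environment errors as follows. This is a minimal sketch; the use of the sample standard deviation (`ddof=1`) and the dictionary keys are assumptions, since the exact definitions are not recoverable from the text here.

```python
import numpy as np


def ood_metrics(env_errors):
    """Summarize per-environment errors into mean, spread and worst case."""
    errs = np.asarray(env_errors, dtype=float)
    return {
        "mean_error": errs.mean(),
        "std_error": errs.std(ddof=1),   # assumption: sample standard deviation
        "max_error": errs.max(),
    }


# Hypothetical errors measured on three test environments.
metrics = ood_metrics([1.0, 2.0, 3.0])
print(metrics)
```

A method with a low mean but a high standard deviation or maximum is accurate on average yet unstable across environments, which is why all three values are reported together.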
Imbalanced Mixture It is a natural phenomenon that empirical data follow a power-law distribution, i.e. only a few environments/subgroups are common and the rest are rare (Shen et al., 2018; Sagawa et al., 2019, 2020). Therefore, we perform non-uniform sampling among different environments in the training set.
|Scenario 1: varying selection bias rate ()|
|Scenario 2: varying dimension ()|
5.1 Simulation Data
We design two mechanisms to simulate the varying correlations among covariates across environments, named by selection bias and anti-causal effect.
|Training environments||Testing environments|
|Training environments||Testing environments|
Selection Bias In this setting, the correlations between variant covariates and the target are perturbed through selection bias mechanism. According to Assumption 2.1, we assume and and that remains invariant across environments while changes arbitrarily. For simplicity, we select data points according to a certain variable set :
where , and denotes the probability of point to be selected. Intuitively, eventually controls the strength and direction of the spurious correlation between and (i.e. if , a data point whose is close to its is more likely to be selected). A larger value of means a stronger spurious correlation between and , and means positive correlation and vice versa. Therefore, here we use to define different environments.
In training, we generate data points, where points from environment with a predefined and points from with . In testing, we generate data points for 10 environments with .
is set to 1.0. We compare our HRM with ERM, DRO, EIIL and IRM for linear regression. We conduct extensive experiments with different settings on , and . In each setting, we carry out the procedure 10 times and report the average results. The results are shown in Table 4.
From the results, we have the following observations and analysis: ERM suffers from the distributional shifts in testing and yields poor performance in most of the settings. DRO surprisingly has the worst performance, which we think is due to the over-pessimism problem (Frogner et al., 2019). EIIL performs similarly to ERM, which indicates that its inferred environments cannot reveal the spurious correlations between and . IRM performs much better than the above baselines; however, as IRM depends on the available environment labels to work, it uses much more information than the other three methods. Compared to the baselines, our HRM achieves nearly perfect performance with respect to average performance and stability. In particular, the variance of losses across environments is close to 0, which reflects the effectiveness of our heterogeneous clustering as well as the invariant learning algorithm. Furthermore, our HRM does not need environment labels, which verifies that our clustering algorithm can mine the latent heterogeneity inside the data and further shows its advantage over IRM.
Besides, we visualize the differences between environments using Task2Vec (Achille et al., 2019) in Figure 2, where a larger value means the two environments are more heterogeneous. The pooled training data are a mixture of environments with and , the difference between which is shown in the yellow box. The red boxes show differences between environments learned by HRM and HRM. The large improvement between and verifies that our HRM can exploit heterogeneity inside data and confirms the existence of the positive feedback. Due to space limitations, results of varying as well as experimental details are left to the appendix.
Anti-causal Effect Inspired by (Arjovsky et al., 2019), we induce the spurious correlation by using an anti-causal relationship from the target to the variant covariates . In this experiment, we assume and first sample from a Gaussian mixture distribution characterized as and the target . Then the spurious correlations between and are generated by the anti-causal effect as
where means the Gaussian noise added to depends on which component the invariant covariates belong to. Intuitively, in different Gaussian components, the corresponding correlations between and vary due to the different values of . The larger is, the weaker the correlation between and . We use the mixture weight to define different environments, where different mixture weights represent different overall strengths of the effect on .
In this experiment, we set and build 10 environments with varying and the dimension of , the first three for training and the last seven for testing. We run the experiments 10 times and the averaged results are shown in Table 3. EIIL achieves the best training performance with respect to prediction errors on training environments , , , while its performance in testing is poor. ERM suffers from distributional shifts in testing. DRO pursues overly conservative robustness and performs much worse. IRM performs much better as it learns invariant representations with the help of environment labels. HRM achieves nearly uniformly good performance in training environments as well as the testing ones, which validates the effectiveness of our method and demonstrates its generalization ability.
5.2 Real-world Data
We test our method on three real-world tasks, including car insurance prediction, people income prediction and house price prediction.
Car Insurance Prediction In this task, we use a real-world dataset for car insurance prediction (Kaggle). It is a classification task to predict whether a person will buy car insurance based on related information, such as vehicle damage, annual premium, vehicle age, etc. (https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction). We impose the selection bias mechanism on the correlation between the outcome (i.e. the label indicating whether the person buys insurance) and the sex attribute to simulate multiple environments. Specifically, we simulate different strengths of the spurious correlation between sex and target in training, and reverse the direction of such correlation in testing ( in training and in testing). For IRM, in each setting, we divide the training data into three training environments with , and different overall correlation corresponds to different numbers of data in . We perform 5 experiments with varying and the results in both training and testing are shown in Figure 3(a).
People Income Prediction In this task we use the Adult dataset (Dua and Graff, 2017) to predict personal income levels as above or below $50,000 per year based on personal details. We split the dataset into 10 environments according to demographic attributes and . In training phase, all methods are trained on pooled data including 693 points from environment 1 and 200 from environment 2, and validated on 100 sampled from both. For IRM, the ground-truth environment labels are provided. In testing phase, we test all methods on the 10 environments and report the mis-classification rate on all environments in Figure 3(b).
House Price Prediction In this experiment, we use a real-world regression dataset (Kaggle) of house sales prices from King County, USA (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). The target variable is the transaction price of the house and each sample contains 17 predictive variables such as the built year of the house, number of bedrooms, and square footage of home, etc. We simulate different environments according to the built year of the house, since it is fairly reasonable to assume the correlations among covariates and the target may vary over time. Specifically, we split the dataset into 6 periods, where each period approximately covers a time span of two decades. All methods are trained on data from the first period () and tested on the other periods. For IRM, we further divide the training data into two environments where and respectively. Results are shown in Figure 3(c).
From the results of the three real-world tasks, we have the following observations and analysis: ERM achieves high accuracy in training while performing much worse in testing, indicating its inability to deal with OOD predictions. DRO’s performance is not satisfactory, sometimes even worse than ERM. One plausible reason is its over-pessimistic nature, which leads to overly conservative predictors. Comparatively, invariant learning methods perform better in testing. IRM performs better than ERM and DRO, which shows the usefulness of environment labels for OOD generalization and the possibility of learning an invariant predictor from multiple environments. EIIL performs inconsistently across different tasks, possibly due to the instability of its environment inference method, which we discuss in detail in the appendix. In all tasks and almost all testing environments (16/18), HRM achieves the best performance. HRM even outperforms IRM significantly in an unfair setting where we provide perfect environment labels for IRM. On one side, this shows the limitation of manually labeled environments. On the other side, it demonstrates that, by removing the dependence on environment labels, HRM can effectively uncover and fully leverage the intrinsic heterogeneity in training data for invariant learning.
Full experimental details are provided in the appendix.
In this paper, we propose the Heterogeneous Risk Minimization framework for the OOD generalization problem. Empirical studies validate the effectiveness of our method in terms of OOD prediction performance. We mainly focus on the raw variable level with the assumption of . This setting is able to cover a broad spectrum of applications, e.g. healthcare, finance, and marketing, where the raw variables are informative enough. To further extend the power of HRM, we will consider incorporating representation learning from in future work. Also, the effectiveness of HRM relies on the heterogeneity assumption, i.e. the training data should contain sufficient heterogeneity to uncover the commonality for invariant prediction. How to quantify the sufficient and necessary condition of heterogeneity is also an interesting problem for future work.
Appendix A Additional Simulation Results and Details
Selection Bias In this setting, the correlations among covariates are perturbed through the selection bias mechanism. According to Assumption 2.1, we assume and is independent of while the covariates in are dependent on each other. We assume and remains invariant across environments while can change arbitrarily.
Therefore, we generate training data points with the help of auxiliary variables as follows:
To induce model misspecification, we generate as:
where , and . As we assume that remains unchanged while can vary across environments, we design a data selection mechanism to induce this kind of distribution shifts. For simplicity, we select data points according to a certain variable set :
where and . Given a certain , a data point is selected if and only if .
Intuitively, eventually controls the strength and direction of the spurious correlation between and (i.e. if , a data point whose is close to its is more likely to be selected). A larger value of means a stronger spurious correlation between and , and means positive correlation and vice versa. Therefore, here we use to define different environments.
In training, we generate data points, where points are from environment with a predefined and points are from with . In testing, we generate data points for 10 environments with . is set to 1.0.
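The selection mechanism described above can be simulated by rejection sampling. The sketch below is only illustrative: the selection probability `abs(r) ** (-5 * abs(y - sign(r) * v))`, the constant 5, the requirement that `abs(r) > 1`, and the simple linear generation of the target are all assumptions introduced here, since the exact formulas are not recoverable from the text. The names `selection_prob` and `biased_sample` are hypothetical.

```python
import numpy as np


def selection_prob(y, v, r):
    """Assumed selection probability: high when v is close to sign(r) * y.

    Requires abs(r) > 1 so that the value stays in (0, 1].
    """
    return np.abs(r) ** (-5.0 * np.abs(y - np.sign(r) * v))


def biased_sample(n, r, rng, d_inv=3, batch=4096):
    """Rejection-sample n points whose variant feature v is spuriously tied to y."""
    kept, total = [], 0
    while total < n:
        phi = rng.normal(size=(batch, d_inv))              # invariant covariates
        y = phi.sum(axis=1) + 0.1 * rng.normal(size=batch)  # hypothetical target
        v = rng.normal(size=batch)                          # variant covariate
        keep = rng.uniform(size=batch) < selection_prob(y, v, r)
        kept.append(np.column_stack([phi[keep], v[keep], y[keep]]))
        total += int(keep.sum())
    data = np.concatenate(kept)[:n]
    return data[:, :d_inv], data[:, d_inv], data[:, -1]


rng = np.random.default_rng(0)
phi_pos, v_pos, y_pos = biased_sample(500, r=2.0, rng=rng)
phi_neg, v_neg, y_neg = biased_sample(500, r=-2.0, rng=rng)
print(np.corrcoef(v_pos, y_pos)[0, 1], np.corrcoef(v_neg, y_neg)[0, 1])
```

Before selection the variant covariate is independent of the target, so any correlation in the returned sample is induced purely by the selection step, and its sign follows the sign of `r`.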
Apart from the two scenarios in the main body, we also conduct scenarios 3 and 4 with varying and respectively.
|Scenario 3: varying ratio and sample size ()|
|Scenario 4: varying variant dimension ()|
Anti-Causal Effect Inspired by (Arjovsky et al., 2019), in this setting, we introduce the spurious correlation by using an anti-causal relationship from the target to the variant covariates .
We assume and . The data generation process is as follows: