1 Introduction
Many machine learning algorithms have proven very successful at prediction when the test data have the same distribution as the training data. In real scenarios, however, we cannot guarantee that unknown test data will follow the training distribution. For example, different geographies, schools, or hospitals may draw from different demographics, and the correlation structure among demographics may also vary (e.g., one ethnic group may be more or less disadvantaged in different geographies). A model may exploit subtle statistical relationships among predictors that are present in the training data to improve prediction, which makes its predictions unstable on test data drawn from outside the training distribution. Hence, learning a model that predicts stably across unknown test data is of paramount importance for both academic research and practical applications.
To address the stable/invariant prediction problem, many algorithms have recently been proposed, including domain generalization Muandet et al. (2013), causal transfer learning Rojas-Carulla et al. (2018), and invariant causal prediction Peters et al. (2016). These methods exploit the invariant or stable structure between the predictors and the response variable across multiple training datasets for stable prediction. However, they cannot handle test data whose distribution lies outside all training environments. Kuang et al. (2018, 2020) proposed to recover causation between the predictors and the response variable by global sample weighting, and to separate causal variables for stable prediction. However, these methods either assume that all predictors are binary or base their analysis on a linear model, which is impractical in real scenarios.
In the stable prediction problem Kuang et al. (2018), all predictors can be separated into two categories, causal variables and noncausal variables, according to whether or not they have a causal effect on the response variable. For example, ears, noses, and whiskers are causal variables for identifying whether an image contains a cat, while grass or other backgrounds are noncausal variables. The response variable is generated from the causal variables alone, so the noncausal variables are independent of the response variable conditional on the full set of causal variables. However, noncausal variables might be spuriously correlated with the causal variables, the response variable, or both because of sample selection bias in the data. For example, the variable “grass” would be spuriously correlated with the label “cat” and become a powerful predictor if we select many images of a cat on the grass as training data. Such spurious correlations between noncausal variables and the response variable vary across datasets with different distributions, leading to unstable prediction on unknown test data.
Hence, to address the stable prediction problem, one possible solution is to separate the causal and noncausal variables and to use only the causal variables for model training and prediction. In practice, however, analysts usually have no prior knowledge of which variables are causal and which are noncausal.
Variable (feature) selection plays a very important role in the machine learning field. Traditional correlation based feature selection methods use either correlation criteria Nie et al. (2010) or mutual information criteria Peng et al. (2005) without distinguishing spurious correlation, leading to unstable prediction on test data drawn from outside the training distribution. In the causality literature, causal discovery and causal estimation techniques can be adopted for causal variable selection. PC Spirtes et al. (2000), FCI Spirtes et al. (2000), and CPC Ramsey et al. (2012) are three of the most prominent causal discovery methods based on conditional independence (CI) tests, but their complexity grows exponentially with the number of variables. Moreover, the PC method needs to assume causal sufficiency, i.e., that all common causes of the observed predictors are observed. Athey et al. (2018) and Kuang et al. (2017) can approximately identify causal variables by estimating the causal effect of each variable, but they focus on binary predictors and require that all causal variables be observed.
Considering practical scenarios in which the causal sufficiency assumption is not met and some causal variables are unobserved or unmeasured, in this paper we propose a novel CI test based causal variable separation method for stable prediction. Fig. 2 illustrates the structural causal model (SCM) of our problem under the assumption that the set of causal variables and the set of noncausal variables are independent. We then provide a series of theorems proving that one can separate the causal variables with a single CI test per variable. Specifically, as shown in Fig. 2, if we know that a seed variable is one of the causal variables, then each causal variable is dependent on the seed variable conditional on the response variable, whereas each noncausal variable is independent of the seed variable conditional on the response variable. Building on these theoretical analyses, we present a CI test based causal variable separation method for stable prediction. We first apply our method to synthetic data, where it separates the causal variables with high precision, and the precisely separated causal variables bring stability to prediction across unknown test data. In real-world applications, we also demonstrate that our algorithm outperforms baseline algorithms on both the causal variable separation task and the stable prediction task.
Compared with previous CI based causal discovery methods Spirtes et al. (2000); Ramsey et al. (2012); Bühlmann et al. (2010); Yu et al. (2019), our method does not rely on the assumption of causal sufficiency and remains unaffected even if some causal variables are unobserved. Moreover, our algorithm separates the causal variables with a single CI test per variable, reducing the algorithmic complexity from exponential to linear in the number of variables. Compared with sample weighting based work on stable prediction Kuang et al. (2018, 2020), our method can be applied in continuous settings and separates the causal variables without assumptions on the regression model. Our work is similar to a recent paper Mastakouri et al. (2019), which also adopts CI tests for causal variable selection, but the problems addressed are totally different in the following ways: (i) Mastakouri et al. (2019) focused on detecting direct and indirect causes of a response variable under i.i.d. settings, while our algorithm is designed for separating causal and noncausal variables under biased settings with sample selection bias; (ii) Mastakouri et al. (2019) is tailored to the problem in which a cause variable of each candidate causal variable is known, while our algorithm assumes independence between causal and noncausal variables and takes a seed variable as prior knowledge. Moreover, we apply our method to address the agnostic distribution shift between training and unknown test data for stable prediction.
2 Stable Prediction Problem
Let $\mathcal{X}$ and $\mathcal{Y}$ denote the spaces of the observed predictors and the response variable, respectively. We define an environment $e$ to be a joint distribution $P^{e}_{XY}$ on $\mathcal{X} \times \mathcal{Y}$. In practice, the joint distribution can vary across environments: $P^{e}_{XY} \neq P^{e'}_{XY}$ for $e \neq e'$.
In this paper, we consider a setting where a researcher has a single dataset (data from one environment) and wishes to train a model that can then be applied to other environments. This type of problem might arise when a firm creates an algorithm that is then provided to other organizations to apply. For example, medical researchers might train a model and incorporate it in a software product that is used by a range of hospitals, or academics might build a prediction model that is applied by governments in different locations. The researcher may not have access to the end user’s data for confidentiality reasons. The problem can be formalized as a stable prediction problem Kuang et al. (2018) as follows:
Problem 1
(Stable Prediction). Given a dataset from one training environment, the task is to learn a predictive model that can predict stably across unknown test environments.
In this problem, we partition the observed predictors into causal variables and noncausal variables, under the following assumption Kuang et al. (2018):
Assumption 1
Thus, one can address the stable prediction problem by separating the causal variables and learning a stable predictive function of those variables alone. In practice, however, we have no prior knowledge of which variables are causal and which are noncausal. In this work, we focus on stable prediction via separating causal variables.
Assumption 2
Causal variables and noncausal variables are independent. Formally, .
Assumptions 1 and 2 imply that the noncausal variables play no role in the data generation process of the response variable (i.e., they are independent of the response conditional on the causal variables), but they might be spuriously correlated with the response variable, the causal variables, or both because of sample selection bias, as shown in Fig. 2. These spurious correlations might vary across environments. Hence, to make a stable prediction, one should ensure that the prediction depends only on the causal variables.
3 Methods
3.1 Background on Causal Graph
First, we revisit key concepts and theorems related to d-separation and CI in causal graphs.
Let $G = (V, E)$ represent a causal directed acyclic graph (DAG) with nodes $V$ and edges $E$, where a node denotes a variable and an edge represents the direct dependence or causal direction between two variables. In a DAG, $A \to B$ means that $A$ is a cause of $B$ and $B$ is an effect of $A$.
Definition 1 (d-separation Pearl (2009))
In a DAG $G$, a path $p$ is said to be d-separated by a set of nodes $Z$ if and only if (i) $p$ contains a chain $i \to m \to j$ or a fork $i \leftarrow m \to j$ such that the middle node $m$ is in $Z$, or (ii) $p$ contains a collider $i \to m \leftarrow j$ such that the middle node $m$ is not in $Z$ and such that no descendant of $m$ is in $Z$.
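Definition 2 (Conditional Independence)
Two distinct variables $X_i$ and $X_j$ are said to be conditionally independent given a subset of variables $Z$ (i.e., $X_i \perp X_j \mid Z$) if and only if $P(X_i, X_j \mid Z) = P(X_i \mid Z)\,P(X_j \mid Z)$. Otherwise, $X_i$ and $X_j$ are conditionally dependent given $Z$ (i.e., $X_i \not\perp X_j \mid Z$).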
The connection between d-separation and CI is established through the following lemma:
Lemma 1 (Probabilistic Implications of d-separation Geiger et al. (1990); Pearl (2009))
If variables $X_i$ and $X_j$ are d-separated by $Z$ in a DAG $G$, then $X_i$ is independent of $X_j$ conditional on $Z$ in every distribution compatible with the DAG $G$. Conversely, if $X_i$ and $X_j$ are not d-separated by $Z$ in a DAG $G$, then $X_i$ and $X_j$ are dependent conditional on $Z$ in at least one distribution compatible with $G$.
3.2 Causal Variables Separation
Based on Lemma 1, we propose an effective causal variable separation algorithm that combines the mechanisms of d-separation and causality, under the following assumption.
Assumption 3
We have prior knowledge on one causal variable. Formally, we know .
Under Assumption 3, the following theorem supports precisely separating the sets of causal and noncausal variables. The set of causal variables can then be used for stable prediction.
Theorem 1
Proof 1
Assumption 1 implies that the noncausal variables are not direct causes of the response, whereas the causal variables are. Hence, in our causal DAG, there is a direct edge from each causal variable to the response, but no edge from any noncausal variable points directly to the response. Assumption 2 guarantees that there is no causal link between any causal variable and any noncausal variable, although the causal structure among the causal variables (or among the noncausal variables) might be very complex and unknown. Considering that the sample selection bias is generated from the response and a subset of the noncausal variables, the causal DAG of our problem is as shown in Fig. 2.
From Fig. 2, the path between the seed causal variable and any noncausal variable can be represented as in Fig. (a), where the causal links between that noncausal variable and the noncausal variable that drives sample selection are unknown and could be very complex; the two coincide if sample selection is based on the response and that noncausal variable itself. Every such path passes through the response as a non-collider, so by the definition of d-separation the seed variable and the noncausal variable are d-separated by the response variable. Hence, by Lemma 1, every noncausal variable is independent of the seed variable conditional on the response.
On the other hand, the path between the seed causal variable and any other causal variable can be represented as in Fig. (b), where the causal links between the two are unknown. Similarly, by the definition of d-separation, the response variable is a collider on this path, so conditioning on it cannot d-separate the seed variable from the other causal variable. Therefore, by Lemma 1, every other causal variable is dependent on the seed variable conditional on the response.
Overall, we can separate causal and noncausal variables with a single CI test per variable: a variable belongs to the set of causal variables if it is dependent on the seed variable conditional on the response; otherwise, it is a noncausal variable.
Based on Theorem 1, we propose a causal variable separation algorithm that uses a single CI test per variable. The details of our algorithm are summarized in Algorithm 1. With the top-ranked separated causal variables, we can learn a predictive model for stable prediction.
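The one-test-per-variable procedure can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: it swaps in a partial-correlation test with a Fisher z-statistic (valid only under a linear-Gaussian assumption) in place of the bnlearn and RCIT tests used in the experiments, and the function names and the threshold `alpha` are hypothetical.

```python
import math
import numpy as np

def partial_corr_ci_test(x, seed, y):
    """Test whether x is independent of seed given y (linear-Gaussian assumption).

    Returns (abs_z, p_value); a small p-value indicates conditional dependence.
    """
    design = np.column_stack([np.ones_like(y), y])
    # Residuals of x and of the seed after linearly regressing each on the response y.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_s = seed - design @ np.linalg.lstsq(design, seed, rcond=None)[0]
    r = float(np.corrcoef(res_x, res_s)[0, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep the Fisher transform finite
    n = len(y)
    # Fisher z-statistic for a partial correlation with one conditioning variable.
    z = math.sqrt(n - 4) * 0.5 * math.log((1 + r) / (1 - r))
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return abs(z), p_value

def separate_causal_variables(X, seed, y, alpha=0.01):
    """Run one CI test per column of X against the seed variable, conditional on y."""
    results = [partial_corr_ci_test(X[:, j], seed, y) for j in range(X.shape[1])]
    stats = np.array([s for s, _ in results])
    p_values = np.array([p for _, p in results])
    ranking = np.argsort(-stats)   # most dependent on the seed (most likely causal) first
    is_causal = p_values < alpha   # CI rejected -> labelled causal
    return ranking, is_causal, p_values
```

The loop makes the linear cost visible: each candidate variable is tested exactly once, and no other causal variable needs to be known or observed for the test of a given column.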
Remark 1
From the proof of Theorem 1, we know that to identify whether a variable is causal or not, our algorithm only needs a single CI test between that variable and a known causal variable, conditional on the response variable, with no need to know the other causal variables or the common causes of the observed variables. We therefore conclude that (i) our algorithm is not affected by unobserved causal variables, although missing some causal variables would decrease the predictive performance of the model; and (ii) the causal sufficiency assumption is not necessary for our algorithm, but we do need to assume independence between causal and noncausal variables.
Complexity Analysis. Our algorithm requires only a single CI test per variable. Its cost therefore scales linearly with the dimension $p$ of the observed variables, with a constant factor equal to the complexity of a single CI test.
Discussions on assumptions. Assumption 1 states that the underlying predictive mechanism is invariant across environments, which is the basic assumption for causal variable identification and stable/invariant prediction Peters et al. (2016); Kuang et al. (2018). In Assumption 2, we assume independence between causal variables and noncausal variables, which is critical to our method. In practice, one might adopt disentangled representation learning Thomas et al. (2018) or orthogonal transform techniques Ahmed and Rao (2012) to make this assumption hold in the feature representation space. We leave this to future work. As for Assumption 3, we consider it reasonable and acceptable in real applications. For example, if we want to predict the crime rate, we may know that income is one causal variable. Moreover, one can identify a causal variable to serve as the seed variable by estimating its causal effect Athey et al. (2018); Kuang et al. (2017).
4 Experiments
4.1 Baselines
We implement the following variable selection methods as baselines: (i) correlation based methods, including minimal Redundancy Maximal Relevance (mRMR) Peng et al. (2005), Random Forest (RF) Breiman (2001), and LASSO Tibshirani (1996), which are affected by the spurious correlation between noncausal variables and the response variable and therefore select noncausal variables for prediction; (ii) causation based methods, including PC-simple Bühlmann et al. (2010) and the causal effect (CE) estimator Athey et al. (2018); Kuang et al. (2017), which need to assume that all causal variables are observed, while PC-simple additionally requires causal sufficiency and suffers from the curse of dimensionality; (iii) stable/invariant learning based methods, including invariant causal prediction (ICP) Peters et al. (2016) and the global balancing algorithm (GBA) Kuang et al. (2018, 2020), where ICP needs multiple training environments to reveal causation and GBA requires a very large amount of training data for global sample weighting.
Footnote 1: Previous CI based methods either need to observe all causal variables or assume causal sufficiency, and they also suffer from the curse of dimensionality. We therefore compare only with PC-simple, a prominent CI based method.
In our algorithm, we employ the causal effect estimator Kuang et al. (2017) to identify one causal variable as the seed, so that the seed required by Assumption 3 does not have to be supplied as prior knowledge. We then execute the CI test with the bnlearn method Scutari (2009), denoted Our+BNCI, and with the RCIT method Strobl et al. (2019), denoted Our+RCIT.
We do not compare with a recent causal variable selection method Mastakouri et al. (2019), since it requires the knowledge of a cause variable of each candidate causal variable, which is not applicable in our problem.
The ICP method cannot be used to rank variables; instead, it selects a subset of variables for prediction, and the size of that subset is determined by the algorithm itself. Hence, the experimental results of ICP reported in this paper are based on its own selected subset of variables.
Based on the variables selected by each algorithm, we fit a linear model for prediction and check its stability across unknown test data. (For simplicity, we use a linear model to evaluate the selected variables; other models can also be applied.)
4.2 Evaluation Metrics
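To evaluate the performance of causal variable separation/selection, we use precision@k and the ranking index of the unstable noncausal variable as evaluation metrics. Precision@k is the proportion of the top-k selected variables that belong to the true causal variable set:
$$\text{Precision@}k = \frac{\left|\{X_i : X_i \in \hat{\mathcal{C}},\ I(X_i) \le k\} \cap \mathcal{C}\right|}{\left|\{X_i : X_i \in \hat{\mathcal{C}},\ I(X_i) \le k\}\right|} \tag{1}$$
where $\hat{\mathcal{C}}$ and $\mathcal{C}$ refer to the set of selected causal variables and the set of true causal variables, respectively, and $I(X_i)$ is the ranking index of variable $X_i$ in the selected variables $\hat{\mathcal{C}}$.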
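As a small illustration of the two selection metrics, the sketch below computes precision@k and the ranking index from a ranked list of variable names. The function names and the example variable names are hypothetical and are not taken from the paper's code.

```python
def precision_at_k(ranked_variables, true_causal, k):
    """Fraction of the top-k ranked variables that are true causal variables."""
    top_k = ranked_variables[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for v in top_k if v in true_causal)
    return hits / len(top_k)

def ranking_index(ranked_variables, variable):
    """1-based position of `variable` in the ranked list (a larger index means ranked lower)."""
    return ranked_variables.index(variable) + 1
```

For a ranking `["c1", "n3", "c2", "n1"]` with true causal set `{"c1", "c2"}`, half of the top two variables are causal, and the noncausal variable `"n3"` sits at ranking index 2.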
Similar to Kuang et al. (2018), we also adopt Average_Error and Stability_Error to measure the performance of stable prediction with the following definition:
(2) 
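A minimal sketch of the two stability metrics, assuming they follow the definitions in Kuang et al. (2018): Average_Error is the mean of the per-environment RMSE over the test environments, and Stability_Error is the sample standard deviation of the per-environment RMSE. The function names are hypothetical.

```python
import math

def average_error(rmse_per_env):
    """Mean RMSE over the test environments (assumed definition of Average_Error)."""
    return sum(rmse_per_env) / len(rmse_per_env)

def stability_error(rmse_per_env):
    """Sample standard deviation of RMSE over environments (assumed definition of Stability_Error)."""
    mean = average_error(rmse_per_env)
    m = len(rmse_per_env)
    return math.sqrt(sum((r - mean) ** 2 for r in rmse_per_env) / (m - 1))
```

Under these definitions, a method with a low Average_Error but a high Stability_Error predicts well on some environments and poorly on others, which is the behavior the stable prediction problem aims to avoid.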
4.3 Experiments on Synthetic Data
4.3.1 Dataset
To generate the synthetic datasets, we set the sample size and the dimension $p$ of the observed variables. We first generate the observed variables. From Fig. 2 and Assumption 2, the causal variables and the noncausal variables should be independent of each other, but the causal variables may depend on one another, and likewise the noncausal variables. Hence, we generate them with the help of auxiliary variables drawn from independent Gaussian distributions as:
(3)  
(4) 
where the number of causal variables and the number of noncausal variables . and represent the and variable in and , respectively.
Then, we generate the response variable as:
(5) 
where , and . The is the indicator function and function returns the modulus after division of by .
From the generation of the response variable, we know that it is affected only by the causal variables and is independent of the noncausal variables. In real applications, however, some noncausal variables might be spuriously correlated with the response because of sample selection bias, as shown in Fig. 2, and this correlation might vary across datasets. To check the stability of the algorithms under that practical setting, we generate a set of environments in which the dependence of the response on the causal variables is stable, while the spurious correlation between the response and the noncausal variables differs from one environment to another. For simplicity, we set only one noncausal variable as the unstable noncausal variable, and change its spurious correlation with the response across environments.
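Specifically, we vary this spurious correlation via biased sample selection with a bias rate, where selection is based on the response and the unstable noncausal variable, as shown in Fig. 2. Each sample is kept with a probability that depends on the bias rate, the response, and the unstable noncausal variable.
Note that a positive bias rate corresponds to a positive spurious correlation between the response and the unstable noncausal variable, while a negative bias rate corresponds to a negative one. The larger the magnitude of the bias rate, the stronger the correlation. Different values of the bias rate correspond to different environments. All methods are trained in one environment with a fixed bias rate, but tested across environments with different bias rates.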
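The selection step can be sketched as below. The exact selection probability is an assumption here: the sketch borrows the rule used in Kuang et al. (2018), where a sample is kept with probability $|r|^{-5D}$ with $D = |Y - \mathrm{sign}(r)\,N_b|$, $r$ the bias rate with $|r| > 1$, and $N_b$ the unstable noncausal variable. The symbols, the factor 5, and the function name are illustrative and may differ from the setup used in this paper.

```python
import numpy as np

def biased_select(y, n_b, r, rng):
    """Return a boolean mask of kept samples under the assumed rule Pr = |r| ** (-5 * D)."""
    if abs(r) <= 1:
        raise ValueError("the assumed rule needs |r| > 1 so that Pr lies in (0, 1]")
    sign = 1.0 if r > 0 else -1.0
    d = np.abs(y - sign * n_b)           # disagreement between the response and sign(r) * n_b
    keep_prob = np.abs(r) ** (-5.0 * d)  # close agreement -> kept with high probability
    return rng.random(len(y)) < keep_prob
```

Even when the response and the noncausal variable are generated independently, the kept samples show a strong correlation whose sign follows the sign of `r`, which is how a single noncausal variable can be made unstable across environments.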
4.3.2 Results
Results on Causal Variable Separation/Selection. We report the results on causal variable selection from two aspects: the ranking of the causal variables, measured by precision@k in Tab. 4.3.1, and the ranking of the unstable noncausal variable in Tab. 4.3.1. The ranking of the causal variables determines the average error of prediction across environments: the closer precision@k is to 1, the better. The ranking of the unstable noncausal variable determines the stability error of prediction across environments: the lower it is ranked, the better. From Tab. 4.3.1 and 4.3.1, we conclude that: (i) Traditional correlation based variable selection methods, including mRMR, Random Forest, and LASSO, cannot precisely select the causal variables (lower precision@k) and rank the unstable noncausal variable near the top of the list. The main reason is that the spurious correlation is more significant than causation under sample selection bias. (ii) The performance of PC-simple is similar to that of the correlation based methods, since it is hard to find the optimal solution for PC-simple via naive random search; moreover, it relies on the causal sufficiency assumption and needs to observe all causal variables. (iii) The causation based methods, CE and GBA, perform better than the correlation based methods, with higher precision@k and a lower ranking of the unstable noncausal variable, because by revealing part of the causation among variables they can reduce spurious correlations in the training data. However, they still perform worse than our methods in high dimensional settings, since they need enough training data for good sample reweighting and need to observe all causal variables. (iv) Our methods achieve the best performance on both the separation/selection of causal variables (highest precision@k) and the ranking of the unstable noncausal variable.
| Method | p=10 Average_Error | p=10 Stability_Error | p=20 Average_Error | p=20 Stability_Error | p=40 Average_Error | p=40 Stability_Error | p=80 Average_Error | p=80 Stability_Error |
|---|---|---|---|---|---|---|---|---|
| mRMR | 1.058 | 0.548 | 1.145 | 0.599 | 1.179 | 0.625 | 1.177 | 0.619 |
| RF | 0.994 | 0.506 | 1.110 | 0.576 | 1.174 | 0.622 | 1.177 | 0.619 |
| LASSO | 0.994 | 0.506 | 1.055 | 0.541 | 1.170 | 0.618 | 1.177 | 0.619 |
| PC-simple | 1.039 | 0.536 | 1.100 | 0.570 | 1.175 | 0.622 | 1.178 | 0.619 |
| CE | 0.413 | 0.019 | 1.055 | 0.541 | 1.132 | 0.593 | 1.168 | 0.613 |
| ICP | 0.680 | 0.313 | 1.082 | 0.558 | 1.172 | 0.621 | 1.176 | 0.620 |
| GBA | 0.413 | 0.019 | 1.055 | 0.541 | 1.132 | 0.594 | 1.167 | 0.612 |
| Our+BNCI | 0.413 | 0.019 | 0.644 | 0.049 | 0.879 | 0.111 | 1.017 | 0.160 |
| Our+RCIT | 0.413 | 0.019 | 0.644 | 0.049 | 0.909 | 0.121 | 1.020 | 0.161 |
Results on Stable Prediction. With the variable ranking list from each algorithm, we select the top-k ranked variables to evaluate performance on stable prediction across unknown test environments, where k is set to the number of causal variables. Fig. 3 and Tab. 3 report the experimental results on stable prediction. From Fig. 3, we find that (i) our methods perform worse than the baselines when the bias rate of the test data is close to that of the training data. This is because the spurious correlation between the unstable noncausal variable and the response variable is then highly similar in the training and test data, and that correlation can be exploited to improve predictive performance; (ii) our methods perform much better than the baselines when the bias rate of the test data differs markedly from that of the training data, so that the spurious correlation is totally different between training and test data, leading to unstable prediction by the baselines; (iii) our methods achieve the most stable prediction across all test data, since our algorithm can precisely separate the causal variables and achieves the lowest ranking of the unstable noncausal variable, as reported in Tab. 4.3.1 and Tab. 4.3.1.
To clearly demonstrate the advantages of our algorithm on stable prediction, we report the detailed results under different synthetic settings in Tab. 3. From the results, we conclude that our algorithm can make stable predictions across unknown environments via causal variable separation.
4.4 Experiments on Real-World Data
Dataset. To evaluate the performance of our algorithm on real-world data, we apply it to a Parkinson’s telemonitoring dataset (https://archive.ics.uci.edu/ml/datasets/parkinsons+telemonitoring), which has been widely used for domain generalization Muandet et al. (2013); Blanchard et al. (2017) and other regression tasks Tsanas et al. (2009). This dataset consists of biomedical voice measurements from 42 patients with early-stage Parkinson’s disease recruited for a six-month trial of a telemonitoring device for remote symptom progression monitoring. For each patient, there are about 200 recordings, which were automatically captured in the patient’s home. The task is to predict the clinician’s motor UPDRS score of Parkinson’s disease symptoms from the patients’ features, including their age, gender, test time, and many other measures.
Experimental Settings. In our experiments, we set the motor UPDRS score as the response variable. To test the stability of all methods, we generate different environments by biased data separation based on different patients. Specifically, we separate the 42 patients into 4 groups: group 1 (G1) with recordings from 21 patients, and three other groups (G2, G3, and G4), each with recordings from 7 different patients, where the different groups correspond to different environments. Considering a practical setting where a researcher has a single dataset and wishes to train a model that can then be applied to other environments, we train all models with data from environment G1 and test them on all 4 groups.
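The patient-level split can be sketched as follows. The text does not state which patients fall into which group, so assigning consecutive patient ids to G1 through G4 is purely an assumption for illustration, as are the function name and the ids 1 to 42.

```python
def split_patients(patient_ids, sizes=(21, 7, 7, 7)):
    """Partition patient ids into consecutive disjoint groups G1..G4 of the given sizes."""
    if sum(sizes) != len(patient_ids):
        raise ValueError("group sizes must add up to the number of patients")
    groups, start = {}, 0
    for i, size in enumerate(sizes, start=1):
        groups[f"G{i}"] = list(patient_ids[start:start + size])
        start += size
    return groups
```

Splitting by patient rather than by recording keeps every recording of a patient inside one environment, so the test groups G2 to G4 contain only patients never seen during training on G1.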
Experimental Results. We report the RMSE obtained with the top-ranked variables in Figure 4. Fig. 4(a) shows that correlation based methods (LASSO, mRMR, and RF) outperform causation based methods (GBA and our method). This is because the training and test data have similar distributions in environment G1, so the spurious correlation between noncausal variables and the response variable contributes positively to prediction. Moreover, we find that the ICP method achieves good performance in environment G1, since it cannot differentiate the spurious correlation from only one training environment. Fig. 4(b), 4(c), and 4(d) demonstrate that causation based methods are better than correlation based methods when the test distribution lies outside the training one, and our method, especially Our+RCIT, achieves nearly the best performance. The main reason is that the spurious correlation in the training data can differ from that in the test data, while causation based methods can discover causal variables for more stable prediction across environments, and our method performs best on causal variable ranking and separation. In addition, we observe that in non-i.i.d. settings (where the test distribution differs from the training one), prediction performance might decrease seriously as more selected variables are included, since some selected variables could be spuriously correlated with the response and unstable across environments.
5 Conclusion
In this paper, we focus on the problem of stable prediction by leveraging a seed variable for causal variable separation. We argue that most traditional prediction and variable selection methods are correlation based, which results in unstable prediction across unknown environments. By assuming that the causal variables and noncausal variables are independent, we propose a causal variable separation algorithm that uses a single CI test per variable, and provide a series of theorems proving that our algorithm can precisely separate the causal variables. We also demonstrate that the precisely separated causal variables from our algorithm bring stable prediction across unknown test data. The experimental results on both synthetic and real-world datasets show that our algorithm outperforms the baselines on causal variable separation and stable prediction.
References
Ahmed and Rao (2012). Orthogonal transforms for digital signal processing. Springer Science & Business Media.
Athey et al. (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (4), pp. 597–623.
Bareinboim and Pearl (2012). Controlling selection bias in causal inference. In Artificial Intelligence and Statistics, pp. 100–108.
Blanchard et al. (2017). Domain generalization by marginal transfer learning. arXiv preprint arXiv:1711.07910.
Breiman (2001). Random forests. Machine Learning 45 (1), pp. 5–32.
Bühlmann et al. (2010). Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika 97 (2), pp. 261–278.
Geiger et al. (1990). Identifying independence in Bayesian networks. Networks 20 (5), pp. 507–534.
Kuang et al. (2018). Stable prediction across unknown environments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1617–1626.
Kuang et al. (2017). Estimating treatment effect in the wild via differentiated confounder balancing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 265–274.
Kuang et al. (2020). Stable prediction with model misspecification and agnostic distribution shift. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
Mastakouri et al. (2019). Selecting causal brain features with a single conditional independence test per feature. In Advances in Neural Information Processing Systems, pp. 12532–12543.
Muandet et al. (2013). Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18.
Nie et al. (2010). Efficient and robust feature selection via joint l2,1-norms minimization. In Advances in Neural Information Processing Systems, pp. 1813–1821.
Pearl (2009). Causality. Cambridge University Press.
Peng et al. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence (8), pp. 1226–1238.
Peters et al. (2016). Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), pp. 947–1012.
Ramsey et al. (2012). Adjacency-faithfulness and conservative causal inference. arXiv preprint arXiv:1206.6843.
Rojas-Carulla et al. (2018). Invariant models for causal transfer learning. The Journal of Machine Learning Research 19 (1), pp. 1309–1342.
Scutari (2009). Learning Bayesian networks with the bnlearn R package. arXiv preprint arXiv:0908.3817.
Spirtes et al. (2000). Causation, prediction, and search. MIT Press.
Strobl et al. (2019). Approximate kernel-based conditional independence tests for fast nonparametric causal discovery. Journal of Causal Inference 7 (1).
Thomas et al. (2018). Disentangling the independently controllable factors of variation by interacting with the world. arXiv preprint arXiv:1802.09484.
Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
Tsanas et al. (2009). Accurate telemonitoring of Parkinson’s disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering 57 (4), pp. 884–893.
Yu et al. (2019). Causality-based feature selection: methods and evaluations. arXiv preprint arXiv:1911.07147.