Causal discovery from natural phenomena is a fundamental problem across many domains, such as biology Sachs et al. (2005), economics Pearl (2009) and genetics Peters et al. (2017). Although randomized control experiments are the gold standard for inferring causal relationships Hernan and Robins (2010), it is impossible or very expensive in many fields such as patient treatment Hernán and Robins (2020) and genetics Peters et al. (2017). Rather than designing randomized experiments, causal discovery algorithms attempt to infer causality automatically from passive observational data.
Typically, the causal relationships among a set of variables could be represented as a directed acyclic graph (DAG). To estimate the graph from observational data, there are mainly two classes of approaches in the literatureTriantafillou and Tsamardinos (2016): constraint-based and score-based. Constraint-based approaches use conditional independence tests to check the existence of causal relationship between each pair of variables and try to find a graph that entails all the corresponding -separations Pearl (2010). In contrast, score-based approaches try to find a graph that maximizes the likelihood of the data given . Equivalently, they attempt to search the ”optimal” causal graph from the directed acyclic graph space (denoted as
) in a combinatorial optimization manner by minimizing a score function
The problem in Eq.(1) is a typical NP-hard problem because the number of DAGs increase super-exponentially in the number of graph nodes Chickering et al. (2004). Many methods have been proposed to solve this problem by designing various score functions Peters et al. (2014); Huang et al. (2018). Recently, Zheng et al. Zheng et al. (2018)
recast this combinatorial optimization problem into a continuous gradient-based optimization one which allows for nonlinear causal relationships parameterized by neural networks. Furthermore, Zhu et al.Zhu et al. (2020) proposed to use a reinforcement learning agent to automatically determine the search direction in this continuous optimization problem to find the DAG with the best score.
However, existing DAG searching methods, either based on combinatorial or continuous optimization, seldom consider settings with incomplete data that the observational data used for causal discovery have missing values. This phenomenon is ubiquitous in many real-world situations, which can significantly impair the final learning performance and even make the algorithms fail. For example, according to our evaluation, the algorithm in Zhu et al. (2020) will not work in such a setting. To deal with the challenge of data with missing values, in this paper, we propose an approach that is able to learn the causal relationship from incomplete data.
Back to the information processing procedure in our human brain, we can still obtain some basic information though the observational data has missing values. For example, if a patient is an elderly patient and does not present with a good level of fitness we might assume they are not doing any exercise. To imitate such a brain information processing procedure about incomplete data, we propose in this paper a Encoder that uses neural networks to encode the non-missing information into a feature representation. The derived representation is then integrated into a reinforcement learning (RL) framework for searching the best causal graph. According to Zhu et al. (2020), the reason that the RL agent works well lies in that the stochastic policy can be updated properly by using the rewards. Hence, the RL agent can determine automatically where to search.
The contributions of this paper are as follows:
an RL-based approach for learning causal graphs from incomplete data.
an ad-hoc encoder for extracting features from incomplete observational data for the sake of causal discovery. As a result, the whole RL framework could be optimized in an end-to-end manner while using incomplete data.
2 Related Work
Most of the existing DAG learning algorithms could be divided into two categories: score-based and constraint-based approaches. Score-based algorithms search the space of all possible structures to maximize a decomposable score using greedy, local, or some other search algorithms. A typical example is the Greedy-Equivalent-Search (GES) algorithm Chickering (2002); Nandy et al. (2018). Some popular scoring functions are the Bayesian Dirichlet scorede Campos and Ji (2010), the Akaike Information Criterion Sakamoto et al. (1986), the Bayesian Information Criterion (BIC) Schwarz and others (1978) or Minimum Description Length (MDL) score Chickering (2002)
. In contrast, constraint-based methods exploit the property of Bayesian networks that edges encode conditional dependencies. Typical examples include the well-known PC algorithmSpirtes et al. (2000) and the FCI algorithm Zhang (2008). In addition, there is a suite of hybrid algorithms that combine score-based and constraint-based methods. A prominent example is the Max-Min Hill-Climbing (MMHC) algorithm Tsamardinos et al. (2006). The main idea is find the skeleton using a constraint-based method and then orient the edges by a search-based method Kalisch and Bühlmann (2014).
As we have discussed in the previous section, since the number of DAGs increase super-exponentially in the number of graph nodes, causal discovery, as a combinatorial search problem, is NP-hard. As a result, traditional algorithms have mainly focused on small graphs and discrete variables. Recently, Zheng et al. Zheng et al. (2018) proposes a new continuous optimization approach called NOTEARS to transform the discrete search procedure into an equality constraint. This is a pioneer work to enable the recent neural network to be applied to solve DAG learning. For example, following NOTEARS, DAG-GNN Yu et al. (2019) uses a graph neural network to solve the DAG learning. GraN-DAG Lachapelle et al. (2019) extends it to deal with nonlinear relationships between variables using neural networks.
However, these methods all assume that the observational data has no missing values. This assumption may not be realistic in real applications. Observational data in many applications always has missing values Josse et al. (2018). For instance, patient observational data is missing since many clinician systems think filling the form is unrelated to the treatment of illness and they may not think highly of these data, therefore, they usually do not filled the form completely. This paper aims to release this assumption to discover the causal relationships from incomplete observational data.
Our approach falls into the framework of continuous optimization based causal discovery using reinforcement learning. In this section, we firstly introduce the problem formulation and notations. Details of the proposed actor that generates candidate causal graphs from observational data with missing values is then described. Finally, we explain how the critic helps to evaluate the generated graphs and thus provides us rewards for training the whole neural network. The overall framework of our causal discovery approach is illustrated in Figure 1.
3.1 Problem Formulation
Given an observational dataset , where each individual unit has - dimensional attributes. Each attribute is associated with a node in a -node DAG , and the observed value of is a sample of this DAG, totally samples. To infer causal relationships among these attributes, we want to find an adjacency matrix . Each value of the adjacency matric describes the causality between these two attributes. For example, means the attribute has an effect to attribute and its weight is .
Since the samples usually have missing values on the attributes, the observational data contains missing values. Missing values are toxic to the existing DAG learning methods which will make the existing optimization-based or neural network methods fail.
To cope with this challenge, we propose a reinforcement learning framework to discover the causal structure from observational data with missing values. The overall framework is illustrated in Figure 1. In particular, given a sample of incomplete observational data, an encoder-decoder (Actor) generates a robust feature and a graph. Then, the feature is input into a value network to compute the value and the graph input into a reward function to calculate the reward. The value and reward are utilized to calculate the loss to optimize the whole network.
3.2 Reinforcement Learning: Actor
In the RL-based causal discovery framework, the objective of the actor is to generate candidate causal graphs from a given observational data. In this paper, our proposed actor is an encoder-decoder neural network that is able to handle missing values in the data.
Encoder The encoder aims to extract robust features by using the incomplete data. The proposed encoder neural network consists of two parts: the imputation network (ImNet) and the feature extraction network (FeatNet). The objective of ImNet is to impute a complete data from the current incomplete data, while the FeatNet is to extract the features from the imputation data.
The imputation network contains three fully-connected layers. The neurons of each layer are the same to the attribute dimensionand the output is the same dimension as the input data. The first two layers use relu
as the activation function and the final layer uses thesigmod function.
Given the observational data with missing values, , we define a mask matrix where if the element is observed and otherwise. Inside the ImNet, we firstly concat with , and then generate the initial output by using fully-connected layers. The main idea is that we let the neural network to automatically learn a feature from both the and , and then convert the feature into an matrix of imputation data. Through this process, we obtain an initial output with the same dimension to . Formally,
After the initial output is generated, the final complete data is calculated by
That is, the final complete data is obtained by taking the partial observation and replacing each missing value with the corresponding value of .
In practice, we initialize the ImNet with pre-trained parameters for the sake of faster training. The pre-trained parameters are learned by a generative adversarial network architecture where our ImNet acts as the generator. We use the same discriminator as in Yoon et al. (2018). On one hand, the goal of the generator is to accurately impute missing data; on the other hand, the discriminator tries to distinguish between imputed and observed elements in the input data. They are trained iteratively in an adversarial schema.
Encoder: FeatNet In our encoder-decoder neural network for generating causal graphs, we adopt the Transformer structure Vaswani et al. (2017) for extracting robust features from the imputed data. Our empirical results indicate that the self-attention network is capable of extracting robust features from the imputed observational data. Overall, the input of the feature extracting network (FeatNet) is the imputed data matrix outputed from ImNet and its output is denoted as , where is the feature dimension of each attribute and is the feature of attribute node .
Decoder The decoder aims to generate graphs from the features. Inspired by Zhu et al. (2020), we use a single fully-connected layer as our decoder. The reason is that the one-layer network already contains enough ability to build graph Zhu et al. (2020). Moreover, a deep graph network will be difficult to train and the output may be indistinguishable Li et al. (2018). Formally, the decoder used in this paper is
where are neuron matrix that need to be trained. is the dimension of neurons in the decoder and is the dimension of each feature from the encoder. Then, we input
into a logistic sigmoid function
to get a probabilityof an edge emitting from to . Finally, to generate a binary adjacency matrix
, we sample according to a Bernoulli distribution with probability.
3.3 Reinforcement Learning: Critic
The job of the critic is to evaluate actions generated by the actor so that the actor can update its policy based on the evaluation score Konda and Tsitsiklis (2000). In this paper, the Critic has two objectives: generating a value score for encoded features (Value) and calculating a reward score for the decoded graphs (Reward).
Vnet We use a neural network to calculate the value score of the encoded feature. This value score is used to calculate a discounted reward as Loss in the following section. That is, the update of Actor need to consider both the feature quality and graph quality. In this paper, the value neural network (Vnet) consists of two fully-connected layers. The input is the features from encoder and the output is a number that evaluate the value of the encoded features.
where is the fully-connected layer, is the non-linear activation function and is the feature from the encoder neural network.
Reward function Given a candidate graph generated by the decoder neural network, the reward function evaluates how well the graph fits the observational data. In this paper, we follow Lachapelle et al. (2019); Zheng et al. (2018); Zhu et al. (2020) and use the following score function:
where is the least square loss that maximizes the likelihood for a Gaussian model used in Zheng et al. (2018). is the number of edges in .
Acyclic constrain: To make sure the generated graph is a DAG, we follow Zheng et al. (2018) and also introduce the following acyclic constraint to the adjacency matrix .
where calculate the trace, is the matrix exponential and is the Hadamard product.
Rewards The final reward is calculated by incorporating the above score function and additional Acyclic constraints. Formally, it is defined as
where is the indicator function, and are two penalty coefficients.
To better optimize the actor, we use a discounted reward as the final reward. This discounted reward serves as a loss function for optimizing the actor.
where is the regularized entropy generated graph from the decoder Bernoulli distribution.
Minimizing the loss by backward optimizing the neural network is equivalent to optimizing the actor in reinforcement learning. The best actor is achieved iff
The original actor-critic reinforcement learning is to use the score to compute the gradient and use the gradient to update its policy in an approximate gradient direction Konda and Tsitsiklis (2000). The actor is best when the gradient iff . This paper utilizes the graph of neural network to automatically calculate the gradient and use backward to optimize the actor. Therefore, iff , we obtain the best actor. ∎
4 Synthetic Incomplete Data Generation
To systematically demonstrate the effectiveness of the proposed method, we use the following model to generate synthetic datasets. Consider a causal graph with node and each node is a variable , we generate the value of by the following structural equation:
where is a linear or nonlinear function and the noises
are mutually independent generated from Gaussian or non-Gaussian distributions.
is a vector containing the elementssuch that there is an edge from to in the DAG . The generated data is a matrix which consists of a number of row vectors . Each row vector uses the same generation function in 11.
Specifically, the generative model in Eq.(11) belongs to standard linear-Gaussian model family Peters et al. (2017) if all are linear and are Gaussian distribution. When the functions are linear but the noise are non-Gaussian, we obtain the linear non-Gaussian models described in Hoyer et al. (2009); Shimizu et al. (2006). In this paper, all the variables are scalars and it is straightforward to extend to more complex cases with a proper defined generation function.
To simulate the missing values of observational data, inspired by Mayer et al. (2020), we randomly remove the attribute values for each sample with the missing probability of 20%. The goal of causal discovery in this paper is, given the data vectors with missing values, to infer as much as possible about the generating mechanism; in particular, we seek to infer the generating graph .
5.1 Baselines and Evaluation Metrics
Most of the existing causal discovery methods will totally fail in settings with missing values. To deal with this problem, a straightforward idea is to impute the missing values first and then run the state-of-the-art causal discovery methods. Several experiments were conducted to compare our method with this straightforward idea. We selected GAIN as the imputation method since it has shown robust performances for missing data imputation Yoon et al. (2018). Overall, baseline methods we used for experimental comparison include:
GAINYoon et al. (2018)+NOTEARS Zheng et al. (2018). Since NOTEARS is widely used and obtains robust causal discovery results, we compare with it on incomplete data. The NOTEARS uses the default parameters as the released code. The input data is a completed data with imputation data filled in the positions of missing value.
In the proposed approach, we use pre-trained parameters to initialize the ImNet as discussed in Section 3.2. The pre-trained parameters were generated from a generative adversarial architecture. Then, the input is the incomplete observational data and the whole neural networks were optimized together using the final loss from the Critic. We used a batch size of 64 and hidden dimension of the ImNet is same to the data dimension.
Three metrics were adopted to evaluate the estimated graphs:
False Discovery Rate (FDR): defined as the expected ratio of falsely discovered positive hypotheses to all those discovered. In the context of estimating the structure of a DAG, a positive hypothesis means that the estimated DAG has an same edge connection with the ground-truth DAG, and a negative hypothesis could be that the estimated DAG has an edge that does not exist in the true DAG. The FDR is the expected proportion of the falsely discovered connections to all those discovered Li and Wang (2009). For FDR, the lower the better.
True Positive Rate (TPR): defined as the expected proportion of the actual correctly discovered edges to all those discovered. For TPR, the higher the better.
Structural Hamming Distance (SHD): The structural Hamming distance Tsamardinos et al. (2006) is the number of edges that do not coincide in two partially directed acyclic graphs. Partially directed acyclic graph Peters and Bühlmann (2015) is a graph with no directed cycle. Specifically, there is no pair , such that there are directed paths from to and from to . In this paper, we used partially directed acyclic graphs. Since the SHD considers both false negatives and false positives, a lower SHD indicates a better estimate of the causal graph.
5.2 Linear Models with Gaussian and Non-Gaussian Noises
Using the data generation model in Section 4 and following the settings of Zhu et al. (2020), we generated the datasets to test the performance on linear models with Gaussian and non-Gaussian noise. In particular, we randomly remove the values of each sample in the datasets to simulate the missing value problem. For Gaussian noise, we use a standard Gaussian distribution. For the non-Gaussian noise, we follow ICA-LiNGAM Shimizu et al. (2006) to firstly generate samples from a Gaussian distribution and secondly use a power non-linearity to make them non-Gaussian. We generate samples as datasets and conduct a random permutation.
12 nodes: Firstly, we conduct comparison experiments on a graph with node number
. In this experiment, the input of each epoch is randomly selectedsamples from the whole datasets that generated before and the total epoch number is 20000. Similar to NOTEARS Zheng et al. (2018) and RL-BIC2 Zhu et al. (2020), we use to prune the estimated edges.
Table 1 shows the comparison results on linear non-Gaussian and linear-Gaussian data models with missing values. Our method obtains better performance than NOTEARS and RL+BIC2. The better results means that our novel encoder integrating into the reinforcement learning framework performs better than directly combining with imputation method. The reason is that both the imputation sub-network and feature sub-network could be optimized together to learn a better feature for the incomplete data. Then, the better feature could be input into the decoder to find a better graph. The low SHD demonstrates our method can robustly generate causal graph from the incomplete data on linear models.
30 nodes: Secondly, we conduct comparison experiments on graph node and the adjacency matrix is generated from Bernoulli distribution with parameter . According to Zheng et al. (2018); Yu et al. (2019); Zhu et al. (2020), this edge probability choice corresponds to the fact that large graphs usually have low edge degrees in practice. In this experiment, the input of each epoch is randomly selected samples from the whole datasets that generated before and the total epoch number is 40000. Since this edge probability choice corresponds to the fact that large graphs usually have low edge degrees in practice, following Zhu et al. (2020), we we add to each edge a common bias term initialized to .
Table 2 shows the comparison results. Directly combining imputation and causal discovery works not well when the number of graph node increase, RL+BIC2 only obtains 0.538 in TPR and 108 in SHD. NOTEARS obtains 89 in SHD which is worse than RL+BIC2 in solving large causal graph discovery. In contrast, our method obtains much better performance than both of these direct combination strategy, which improves at least 16.7% in TPR and 32 in SHD. Compared with NOTEARS, we improve 64.1% in TPR and 42 in SHD, which is a great performance gain.
5.3 Nonlinear Models with Quadratic Functions and Gaussian Process
Nonlinear model with quadratic functions. Furthermore, we evaluated the performance of our approach on data generated from non-linear models with quadratic functions. We used the same data generation process with Zhu et al. (2020) and generated samples, each has 10 attributes. To simulate missing data, we randomly remove the values for each sample with a missing probability of . In this situation, identifiability of the true causal graph is guaranteed Peters et al. (2014).
Experimental results are listed in Table 3. Compared with NOTEARS and RL+BIC2, our approach obtains better performance with a gain of 31% in TPR and 6 in SHD. With a significantly lower SHD, we argue that our approach can robustly generate causal graph from incomplete data of nonlinear model with quadratic functions.
|Quadratic function||Gaussian Process|
Nonlinear model with Gaussian process. We also compared our approach with other baselines on nonlinear models with Gaussian process. Using the simulation process in Zhu et al. (2020), we further generated missing values to get the observational data by randomly removing values from each sample with a missing probability of 20%.
As we can see from Table 3, our method achieves a performance gain of 43.2% in TPR and improves 19 in SHD. In addition, we can also conclude from the result that, by direct combination with data imputation, state-of-the-art causal discovery methods (NOTEARS and RL+BIC2) all perform poorly in nonlinear models when there are missing values. The reason is that their data imputation is a stand-alone stage which may be not capable to impute a suitable data for the causal graph discovery. Our method combines both data imputation and feature extraction into an encoder and the encoder integrate into a reinforcement learning framework to search the best DAG to fit the incomplete data. The experiments demonstrate that we can obtain much better performance than the direct combination of imputation and DAG learning methods. Moreover, the low SHD demonstrates our method can robustly generate causal graph from the incomplete data based on non-linear models.
5.4 Real Data
We also validate our proposed approach on a real-world dataset which is designed for reconstruction of causal graphs from physiologically relevant primary single cells Sachs et al. (2005). Data were collected after a series of inhibitory interventions and stimulatory cues. Specifically, researchers used a fixed stimulation process to stop cell reactions for , and then recorded and analyzed the effects of each condition on the intracellular signaling networks of human primary naive cell. This dataset is well accepted by the biological community and widely utilized in many domains Zhu et al. (2020). In our experiment, we consider the general perturbation that is to activate cells and induce proliferation and cytokine production. The observational data contains 853 samples and 11 attributes. According to Sachs et al. (2005), the ground truth causal graph has 17 edges evaluated manually with high accuracy. To simulate missing values, we randomly removed the value of each sample with a probability of 20%.
We run the same graph prune process as the above experiment setting during the training for both RL+BIC2 and our approach. The experiment results are listed in Table 4. As we can see from the table, our proposed approach obtains significantly better performance than its competitors. Noticeably, the TPR results of all methods are very low for this real dataset. A possible reason is that the ground truth graph is very sparse with only 17 edges. Since the SHD considers both false negatives and false positives, the lower SHD in our result demonstrates the advantage of our approach in handling real incomplete data further indicates its ability for robustly generating causal graphs from the incomplete real data.
In this paper, we proposed a causal discovery approach for learning causal graphs from observational data with missing values. In particular, we develop an ad-hoc encoder network for extracting a robust feature representation of the incomplete data, and then integrate the learned representation into a reinforcement learning framework to search for the best causal graph. The proposed approach uses deep neural networks to imitate the human brain information process on incomplete data. Experiments on both synthetic and real datasets demonstrate the effectiveness of our approach which can obtain as much as 43.2% performance gain compared to existing state-of-the-art approaches.
Large-sample learning of bayesian networks is np-hard.
Journal of Machine Learning Research5 (Oct), pp. 1287–1330. Cited by: §1.
- Optimal structure identification with greedy search. Journal of machine learning research 3 (Nov), pp. 507–554. Cited by: §2.
Properties of bayesian dirichlet scores to learn bayesian network structures.
Twenty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §2.
- Causal inference: what if. Boca Raton: Chapman & Hall/CRC. Cited by: §1.
- Causal inference. CRC Boca Raton, FL;. Cited by: §1.
- Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems, pp. 689–696. Cited by: §4.
- Generalized score functions for causal discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1551–1560. Cited by: §1.
- Introduction to the special section on missing data. Statistical Science 33 (2), pp. 139–141. Cited by: §2.
- Causal structure learning and inference: a selective review. Quality Technology & Quantitative Management 11 (1), pp. 3–21. Cited by: §2.
- Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §3.3, §3.3.
- Gradient-based neural dag learning. arXiv preprint arXiv:1906.02226. Cited by: §2, §3.3.
- Controlling the false discovery rate of the association/causality structure learned with the pc algorithm. Journal of Machine Learning Research 10 (Feb), pp. 475–514. Cited by: 1st item.
Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.2.
- MissDeepCausal: causal inference from incomplete data using deep latent variable models. arXiv preprint arXiv:2002.10837. Cited by: §4.
- High-dimensional consistency in score-based and hybrid structure learning. The Annals of Statistics 46 (6A), pp. 3151–3183. Cited by: §2.
- Causality. Cambridge university press. Cited by: §1.
- An introduction to causal inference. The international journal of biostatistics 6 (2). Cited by: §1.
- Structural intervention distance for evaluating causal graphs. Neural computation 27 (3), pp. 771–799. Cited by: 3rd item.
- Elements of causal inference: foundations and learning algorithms. MIT press. Cited by: §1, §4.
- Causal discovery with continuous additive noise models. The Journal of Machine Learning Research 15 (1), pp. 2009–2053. Cited by: §1, §5.3.
- Causal protein-signaling networks derived from multiparameter single-cell data. Science 308 (5721), pp. 523–529. Cited by: §1, §5.4.
- Akaike information criterion statistics. Dordrecht, The Netherlands: D. Reidel 81. Cited by: §2.
- Estimating the dimension of a model. The annals of statistics 6 (2), pp. 461–464. Cited by: §2.
- A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research 7 (Oct), pp. 2003–2030. Cited by: §4, §5.2.
- Causation, prediction, and search. MIT press. Cited by: §2.
- Score-based vs constraint-based causal learning in the presence of confounders.. In CFA@ UAI, pp. 59–67. Cited by: §1.
- The max-min hill-climbing bayesian network structure learning algorithm. Machine learning 65 (1), pp. 31–78. Cited by: §2, 3rd item.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.2.
- Gain: missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920. Cited by: §3.2, 1st item, 2nd item, §5.1.
- Dag-gnn: dag structure learning with graph neural networks. arXiv preprint arXiv:1904.10098. Cited by: §2, §5.2.
- On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence 172 (16-17), pp. 1873–1896. Cited by: §2.
- DAGs with no tears: continuous optimization for structure learning. In Advances in Neural Information Processing Systems, pp. 9472–9483. Cited by: §1, §2, §3.3, §3.3, 1st item, §5.2, §5.2.
- Causal discovery with reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §1, §1, §3.2, §3.3, 2nd item, §5.2, §5.2, §5.2, §5.3, §5.3, §5.4.