Causal structure learning has important applications in many areas such as genetics Koller and Friedman ; Peters et al. , biology Sachs et al. , and economics Pearl . An effective way to discover causal relations amongst variables is to conduct controlled experiments, which however is expensive and even ethically prohibited in certain scenarios. Learning causal structure from purely observational data has attracted much research attention in the past decades Pearl ; Spirtes et al. ; Shimizu et al. ; Janzing and Schölkopf .
Two major classes of structure learning methods from observational data are constraint- and score-based. Constraint-based methods, such as PC and FCI Spirtes et al. , first use conditional independence tests to learn the skeleton of the underlying casual graph and then orient the edges based on a series of orientation rules Meek ; Zhang . Score-based methods, like GES Chickering , assign a score to each causal graph according to some pre-defined score function Heckerman et al. ; Chickering and Heckerman ; Bouckaert  and then search over the space of causal directed acyclic graphs (DAGs) to find the one with the optimal score. However, finding the causal graph with optimal score is generally NP-hard, largely due to the combinatorial acyclicity constraint on the causal graph. Although the performance of several constraint- and score-based methods is guaranteed with infinite data and suitable assumptions Chickering and Heckerman ; Spirtes et al. , the inferred graphs are usually less satisfactory in practice. More recently, Zheng et al. 2019]
has also combined this approach with graph neural networks to model nonlinear causal relationships.
In this work, we propose a new gradient-based structure learning method which generalizes the recent gradient-based methods to a graph autoencoder framework that allows nonlinear structural equation models and vector-valued variables. We demonstrate that on synthetic datasets, our proposed method outperforms other gradient-based methods significantly, especially on relatively large causal graphs. We analyze the training time of our method and observe a near linear training time when scaling the graph size up to 100 nodes.
2 Gradient-Based Causal Structure Learning
This section introduces the recently developed gradient-based methods for causal structure learning.
Let be a causal DAG with vertices , where are vector-valued variables in . In this work, we focus on causal structure learning under additive noise models (ANMs) and adopt the same data generating procedure as in Hoyer et al. , that is,
where denotes the set of variables having a direct edge to in , is a vector-valued function, and is an additive noise. We assume that ’s are jointly independent, and also write and .
Zheng et al. 
are probably the first to transform the combinatorial optimization problem of score-based methods into a continuous one for linear structural equation model (SEM), which adopts the following data generation model
assuming both . Here is the coefficient vector and denotes the weighted adjacency matrix of the linear SEM. To ensure the acyclicity of , a smooth constraint on is proposed as
where denotes the matrix exponential of a matrix and denotes the Hadamard product. To learn the underlying causal structure, Zheng et al.  considers the regularized score function consisting of least squares loss and the norm, which, together with the above acyclicity constraint, leads to a continuous optimization problem:
where is the sample size and denotes the -th observed sample of .
The above problem can then be solved by standard numeric optimization methods such as the augmented Lagrangian method Bertsekas . This approach is then named Non-combinatorial Optimization via Trace Exponential and Augmented lagRangian for Structure learning (NOTEARS) by the authors. Though numeric optimization methods may not produce exact acyclic graphs, i.e., can be very small (e.g., ) but not exactly zero, post-processing like thresholding can be used to meet the hard acyclicity constraint.
To generalize the above model to nonlinear cases, Yu et al.  proposed DAG-GNN, a generative model given below
are point-wise (nonlinear) functions. Causal structure learning is then formulated under the framework of variational autoencoders with multilayer perceptrons (MLPs) to model the causal relations, and the objective is to maximize the evidence lower bound under the acyclicity constraint. Notice that hereis regarded as the latent variable and its dimension can be selected to be lower than , the number of variables.
3 Graph Autoencoder for Causal Structure Learning
We present an alternative generalization of NOTEARS to handle nonlinear causal relations. We consider the case with each variable being scalar-valued (i.e., ) in Section 3.1 and then the the more general case () in Section 3.2.
3.1 Generalization of NOTEARS to Nonlinear Cases
We can rewrite problem (1) as
where denotes the data generating model w.r.t. the -th observed sample of and denotes the parameters associated with . For NOTEARS, we would have for linear SEM.
One way to extend in Eq. (3) to handle nonlinear relationships is to set
where is a nonlinear function and in this work we use an MLP for this function. This generalization is a special case of causal additive model studied in Bühlmann et al. . While it is possible to learn different MLPs for each variable, we use shared weights across all MLPs, similar to Yu et al. .
3.2 Connection with Graph Autoencoder
Eq. (6), together with Eq. (3), form a GAE (similar to Cen et al. ) trained with reconstruction error where and are respectively the variable-wise encoder and decoder, and the message passing operation is applied at the latent representation
. In our case, the message passing operation is a linear transformation, similar to the graph convolutional layer used in Kipf and Welling .
The GAE framework can easily include the vector-valued variable case. We choose two variable-wise MLPs and where
is a hyperparameter indicating the dimension of latent representation. One may setif is believed to have a lower intrinsic dimension. The final optimization problem is hence to minimize the reconstruction error of the GAE (with penalty):
where is the reconstructed output and and are the MLP weights associated with and , respectively.
It is interesting to compare the proposed generalization Eq. (5) or Eq. (6) with Eq. (2) used by DAG-GNN. The linear SEM can be written as with and being the vectors concatenating all the observational and noise variables, respectively. We consider as additive noises and Eq. (5) or Eq. (6) can be viewed as generalized causal relations taking as inputs for the -th observational variable. In contrast, DAG-GNN further writes linear SEM as and considers it as a generative model which takes random noises as input. The experiments conducted show that our alternative generalization performs better, in both efficiency and efficacy, than DAG-GNN on similar datasets used by Yu et al. .
3.3 Augmented Lagrangian Method
The optimization problem given in Eq. (7) can be solved using the augmented Lagrangian method. The augmented Lagrangian is given by
where , is the Lagrange multiplier, and is the penalty parameter. We then have the following update rules
In this section, we conduct experiments on synthetic datasets to demonstrate the effectiveness of our proposed method. We compare our method against two recent gradient-based methods, NOTEARS Zheng et al.  and DAG-GNN Yu et al. .
We use the same experimental setup as in Yu et al. . In particular, we generate a random DAG using the Erdős–Rényi model with expected node degree , then assign uniformly random edge weights to construct the weighted adjacency matrix . We generate by sampling from ANM with some function elaborated soon. The noise follows standard matrix normal. We report the structural Hamming distance (SHD) and true positive rate (TPR) for each of the method, averaged over four seeds. With sample size , we conduct experiments on four different graph sizes . We consider scalar-valued variables () and vector-valued case () in Sections 4.1 and 4.2, respectively.
For the proposed GAE, we use a 3-layer MLP with 16 ReLU units for both encoder and decoder. For NOTEARSZheng et al.  and DAG-GNN Yu et al. , we use the default hyperparameters found in the authors’ code.
4.1 Scalar-Valued Case
Following the setup in Yu et al. , our first dataset uses the data generating procedure below:
which is a generalized linear model. The results of SHD and TPR are reported in Figure (a)a. One can see that GAE outperforms NOTEARS and DAG-GNN, with SHD close to zero for graphs of 100 nodes. We also observe that NOTEARS has better performance than DAG-GNN when , which indicates that DAG-GNN may not scale well on this dataset.
We next consider a more complicated data generating model, where the nonlinearity occurs after the linear combination of the variables:
As shown in Figure (b)b, our method has better performance than both NOTEARS and DAG-GNN in terms of SHD and TPR across all graph sizes. We conjecture that with higher nonlinearity, the use of proposed GAE framework results in a much better performance than NOTEARS and DAG-GNN. For both nonlinear data generating procedure Eq. (9) and (10), it is surprising that a linear model such as NOTEARS is on par with DAG-GNN or even has better performance in some cases. We hypothesize that the formulation of DAG-GNN results in the lack of causal interpretability on the adjacency matrix learned.
4.2 Vector-Valued Case
To demonstrate the effectiveness of our method on vector-valued variables, we construct a dataset where each dimension of comes from a randomly scaled and perturbed sample. Given a graph adjacency matrix , we first construct from the nonlinear SEM given in Eq. (9). We then generate for the -th dimension , where and
are scalars randomly generated from standard normal distribution andis a standard normal vector. The resulting sample is . Our setup is similar to Yu et al.  but we use a nonlinear SEM in our experiment, which constitutes a more difficult vector-valued dataset.
In particular, we choose and the latent dimension of GAE . The SHD and TPR are reported in Figure 3, which shows that GAE has better SHD and TPR than NOTEARS and DAG-GNN especially when the graph size is large. It is also observed that DAG-GNN slightly outperforms NOTEARS for and .
4.3 Training Time
Scalability is one of the important aspect in causal structure learning Sridhar et al. . To compare the efficiency and scalability of different gradient-based methods, we compute the average training time of GAE and DAG-GNN over all experiments conducted in Section 4.1 and 4.2
. The experiments were computed on NVIDIA V100 Tensor Core GPU with 16GB of memory hosted on AWS GPU cloud instances. We do not include the average training time of NOTEARS as it uses only CPU instances.
As shown in Figure 4, the average training time of GAE is less than 2 minutes even for graphs of 100 nodes. On the other hand, DAG-GNN training can be much more time-consuming: the training procedure takes and minutes for and
, respectively. We also observe that the training time of GAE seems to scale linearly when increasing the graph size to 100. The training time is fast as deep learning is known to be highly parallelizable on GPUBen-Nun and Hoefler , which yields a promising direction in gradient-based methods for causal structure learning.
The formulation of continuous constrained approach by Zheng et al.  enables the application of gradient-based methods on causal structure learning. In this work, we propose a new gradient-based method that generalize the recent gradient-based methods to a GAE framework that allows nonlinear SEM and vector-valued datasets. On synthetic datasets, we demonstrate that our proposed method outperforms other state-of-the-art methods significantly especially on large causal graphs. We also investigate the scalability and efficiency of our method, and observe a near linear training time when scaling the graph size up to 100 nodes. Future works include benchmarking our proposed method on different graph types and real world datasets, as well as testing it on synthetic dataset with larger graph size (up to 500 nodes) to verify if the training time still scales linearly.
- Ben-Nun and Hoefler  Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR), 52(4):65:1–65:43, August 2019.
- Bertsekas  Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
- Bouckaert  Remco R Bouckaert. Probabilistic network construction using the minimum description length principle. In European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, 1993.
- Bühlmann et al.  Peter Bühlmann, Jonas Peters, and Jan Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
- Cen et al.  Keting Cen, Huawei Shen, Jinhua Gao, Qi Cao, Bingbing Xu, and Xueqi Cheng. Node classification framework. arXiv preprints arXiv:1906.08745, Jun 2019.
David M Chickering.
Optimal structure identification with greedy search.
Journal of Machine Learning Research, 3:507–554, March 2003.
Chickering and Heckerman 
David M Chickering and David Heckerman.
Efficient approximations for the marginal likelihood of bayesian networks with hidden variables.Machine Learning, 29(2):181–212, Nov 1997.
- Heckerman et al.  David Heckerman, Dan Geiger, and David M Chickering. Learning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995.
- Hoyer et al.  Patrik O. Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, 2009.
- Janzing and Schölkopf  Dominik Janzing and Bernhard Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, Oct 2010.
- Kingma and Ba  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2014.
- Kipf and Welling  Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
- Koller and Friedman  Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. MIT Press, 2009.
Causal inference and causal explanation with background knowledge.
Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.
- Pearl  Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.
- Pearl  Judea Pearl. The seven tools of causal inference, with reflections on machine learning. Commun. ACM, 62(3):54–60, February 2019.
- Peters et al.  Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT press, 2017.
- Sachs et al.  Karen Sachs, Omar Perez, Dana Pe’er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
- Shimizu et al.  Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
- Spirtes et al.  Peter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper, and Thomas Richardson. Causation, prediction, and search. MIT press, 2000.
- Sridhar et al.  Dhanya Sridhar, Jay Pujara, and Lise Getoor. Scalable probabilistic causal structure discovery. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
- Yu et al.  Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, 2019.
- Zhang  Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873–1896, 2008.
- Zhang and Hyvärinen  Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009.
- Zheng et al.  Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31, 2018.