1 Introduction
Causal structure learning has important applications in many areas such as genetics Koller and Friedman [2009]; Peters et al. [2017], biology Sachs et al. [2005], and economics Pearl [2019]. An effective way to discover causal relations amongst variables is to conduct controlled experiments, which however is expensive and even ethically prohibited in certain scenarios. Learning causal structure from purely observational data has attracted much research attention in the past decades Pearl [2009]; Spirtes et al. [2000]; Shimizu et al. [2006]; Janzing and Schölkopf [2010].
Two major classes of structure learning methods from observational data are constraint and scorebased. Constraintbased methods, such as PC and FCI Spirtes et al. [2000], first use conditional independence tests to learn the skeleton of the underlying casual graph and then orient the edges based on a series of orientation rules Meek [1995]; Zhang [2008]. Scorebased methods, like GES Chickering [2003], assign a score to each causal graph according to some predefined score function Heckerman et al. [1995]; Chickering and Heckerman [1997]; Bouckaert [1993] and then search over the space of causal directed acyclic graphs (DAGs) to find the one with the optimal score. However, finding the causal graph with optimal score is generally NPhard, largely due to the combinatorial acyclicity constraint on the causal graph. Although the performance of several constraint and scorebased methods is guaranteed with infinite data and suitable assumptions Chickering and Heckerman [1997]; Spirtes et al. [2000], the inferred graphs are usually less satisfactory in practice. More recently, Zheng et al. [2018]
proposed a smooth characterization on the acyclicity constraint, thus transforming the combinatorial optimization problem into a continuous one, provided with a proper loss function. Subsequent work
Yu et al. [2019]has also combined this approach with graph neural networks to model nonlinear causal relationships.
In this work, we propose a new gradientbased structure learning method which generalizes the recent gradientbased methods to a graph autoencoder framework that allows nonlinear structural equation models and vectorvalued variables. We demonstrate that on synthetic datasets, our proposed method outperforms other gradientbased methods significantly, especially on relatively large causal graphs. We analyze the training time of our method and observe a near linear training time when scaling the graph size up to 100 nodes.
2 GradientBased Causal Structure Learning
This section introduces the recently developed gradientbased methods for causal structure learning.
Let be a causal DAG with vertices , where are vectorvalued variables in . In this work, we focus on causal structure learning under additive noise models (ANMs) and adopt the same data generating procedure as in Hoyer et al. [2009], that is,
where denotes the set of variables having a direct edge to in , is a vectorvalued function, and is an additive noise. We assume that ’s are jointly independent, and also write and .
Zheng et al. [2018]
are probably the first to transform the combinatorial optimization problem of scorebased methods into a continuous one for linear structural equation model (SEM), which adopts the following data generation model
assuming both . Here is the coefficient vector and denotes the weighted adjacency matrix of the linear SEM. To ensure the acyclicity of , a smooth constraint on is proposed as
where denotes the matrix exponential of a matrix and denotes the Hadamard product. To learn the underlying causal structure, Zheng et al. [2018] considers the regularized score function consisting of least squares loss and the norm, which, together with the above acyclicity constraint, leads to a continuous optimization problem:
(1)  
subject to 
where is the sample size and denotes the th observed sample of .
The above problem can then be solved by standard numeric optimization methods such as the augmented Lagrangian method Bertsekas [1999]. This approach is then named Noncombinatorial Optimization via Trace Exponential and Augmented lagRangian for Structure learning (NOTEARS) by the authors. Though numeric optimization methods may not produce exact acyclic graphs, i.e., can be very small (e.g., ) but not exactly zero, postprocessing like thresholding can be used to meet the hard acyclicity constraint.
To generalize the above model to nonlinear cases, Yu et al. [2019] proposed DAGGNN, a generative model given below
(2) 
where and
are pointwise (nonlinear) functions. Causal structure learning is then formulated under the framework of variational autoencoders with multilayer perceptrons (MLPs) to model the causal relations, and the objective is to maximize the evidence lower bound under the acyclicity constraint. Notice that here
is regarded as the latent variable and its dimension can be selected to be lower than , the number of variables.3 Graph Autoencoder for Causal Structure Learning
We present an alternative generalization of NOTEARS to handle nonlinear causal relations. We consider the case with each variable being scalarvalued (i.e., ) in Section 3.1 and then the the more general case () in Section 3.2.
3.1 Generalization of NOTEARS to Nonlinear Cases
We can rewrite problem (1) as
(3)  
subject to 
where denotes the data generating model w.r.t. the th observed sample of and denotes the parameters associated with . For NOTEARS, we would have for linear SEM.
One way to extend in Eq. (3) to handle nonlinear relationships is to set
(4) 
where is a nonlinear function and in this work we use an MLP for this function. This generalization is a special case of causal additive model studied in Bühlmann et al. [2014]. While it is possible to learn different MLPs for each variable, we use shared weights across all MLPs, similar to Yu et al. [2019].
3.2 Connection with Graph Autoencoder
We now draw a connection of Eq. (5) with graph autoencoder (GAE), as outlined in Figure 1. We can rewrite Eq. (5) as
(6)  
Eq. (6), together with Eq. (3), form a GAE (similar to Cen et al. [2019]) trained with reconstruction error where and are respectively the variablewise encoder and decoder, and the message passing operation is applied at the latent representation
. In our case, the message passing operation is a linear transformation
, similar to the graph convolutional layer used in Kipf and Welling [2017].The GAE framework can easily include the vectorvalued variable case. We choose two variablewise MLPs and where
is a hyperparameter indicating the dimension of latent representation. One may set
if is believed to have a lower intrinsic dimension. The final optimization problem is hence to minimize the reconstruction error of the GAE (with penalty):(7)  
subject to 
where is the reconstructed output and and are the MLP weights associated with and , respectively.
It is interesting to compare the proposed generalization Eq. (5) or Eq. (6) with Eq. (2) used by DAGGNN. The linear SEM can be written as with and being the vectors concatenating all the observational and noise variables, respectively. We consider as additive noises and Eq. (5) or Eq. (6) can be viewed as generalized causal relations taking as inputs for the th observational variable. In contrast, DAGGNN further writes linear SEM as and considers it as a generative model which takes random noises as input. The experiments conducted show that our alternative generalization performs better, in both efficiency and efficacy, than DAGGNN on similar datasets used by Yu et al. [2019].
3.3 Augmented Lagrangian Method
The optimization problem given in Eq. (7) can be solved using the augmented Lagrangian method. The augmented Lagrangian is given by
where , is the Lagrange multiplier, and is the penalty parameter. We then have the following update rules
(8)  
where and are tuning hyperparameters. Problem (8) is firstorder differentiable and we apply gradient descent method implemented in Tensorflow with Autograd and Adam optimizer Kingma and Ba [2014].
4 Experiments
In this section, we conduct experiments on synthetic datasets to demonstrate the effectiveness of our proposed method. We compare our method against two recent gradientbased methods, NOTEARS Zheng et al. [2018] and DAGGNN Yu et al. [2019].
We use the same experimental setup as in Yu et al. [2019]. In particular, we generate a random DAG using the Erdős–Rényi model with expected node degree , then assign uniformly random edge weights to construct the weighted adjacency matrix . We generate by sampling from ANM with some function elaborated soon. The noise follows standard matrix normal. We report the structural Hamming distance (SHD) and true positive rate (TPR) for each of the method, averaged over four seeds. With sample size , we conduct experiments on four different graph sizes . We consider scalarvalued variables () and vectorvalued case () in Sections 4.1 and 4.2, respectively.
For the proposed GAE, we use a 3layer MLP with 16 ReLU units for both encoder and decoder. For NOTEARS
Zheng et al. [2018] and DAGGNN Yu et al. [2019], we use the default hyperparameters found in the authors’ code.4.1 ScalarValued Case
Following the setup in Yu et al. [2019], our first dataset uses the data generating procedure below:
(9) 
which is a generalized linear model. The results of SHD and TPR are reported in Figure (a)a. One can see that GAE outperforms NOTEARS and DAGGNN, with SHD close to zero for graphs of 100 nodes. We also observe that NOTEARS has better performance than DAGGNN when , which indicates that DAGGNN may not scale well on this dataset.
We next consider a more complicated data generating model, where the nonlinearity occurs after the linear combination of the variables:
(10) 
As shown in Figure (b)b, our method has better performance than both NOTEARS and DAGGNN in terms of SHD and TPR across all graph sizes. We conjecture that with higher nonlinearity, the use of proposed GAE framework results in a much better performance than NOTEARS and DAGGNN. For both nonlinear data generating procedure Eq. (9) and (10), it is surprising that a linear model such as NOTEARS is on par with DAGGNN or even has better performance in some cases. We hypothesize that the formulation of DAGGNN results in the lack of causal interpretability on the adjacency matrix learned.
4.2 VectorValued Case
To demonstrate the effectiveness of our method on vectorvalued variables, we construct a dataset where each dimension of comes from a randomly scaled and perturbed sample. Given a graph adjacency matrix , we first construct from the nonlinear SEM given in Eq. (9). We then generate for the th dimension , where and
are scalars randomly generated from standard normal distribution and
is a standard normal vector. The resulting sample is . Our setup is similar to Yu et al. [2019] but we use a nonlinear SEM in our experiment, which constitutes a more difficult vectorvalued dataset.In particular, we choose and the latent dimension of GAE . The SHD and TPR are reported in Figure 3, which shows that GAE has better SHD and TPR than NOTEARS and DAGGNN especially when the graph size is large. It is also observed that DAGGNN slightly outperforms NOTEARS for and .
4.3 Training Time
Scalability is one of the important aspect in causal structure learning Sridhar et al. [2018]. To compare the efficiency and scalability of different gradientbased methods, we compute the average training time of GAE and DAGGNN over all experiments conducted in Section 4.1 and 4.2
. The experiments were computed on NVIDIA V100 Tensor Core GPU with 16GB of memory hosted on AWS GPU cloud instances. We do not include the average training time of NOTEARS as it uses only CPU instances.
As shown in Figure 4, the average training time of GAE is less than 2 minutes even for graphs of 100 nodes. On the other hand, DAGGNN training can be much more timeconsuming: the training procedure takes and minutes for and
, respectively. We also observe that the training time of GAE seems to scale linearly when increasing the graph size to 100. The training time is fast as deep learning is known to be highly parallelizable on GPU
BenNun and Hoefler [2019], which yields a promising direction in gradientbased methods for causal structure learning.5 Conclusion
The formulation of continuous constrained approach by Zheng et al. [2018] enables the application of gradientbased methods on causal structure learning. In this work, we propose a new gradientbased method that generalize the recent gradientbased methods to a GAE framework that allows nonlinear SEM and vectorvalued datasets. On synthetic datasets, we demonstrate that our proposed method outperforms other stateoftheart methods significantly especially on large causal graphs. We also investigate the scalability and efficiency of our method, and observe a near linear training time when scaling the graph size up to 100 nodes. Future works include benchmarking our proposed method on different graph types and real world datasets, as well as testing it on synthetic dataset with larger graph size (up to 500 nodes) to verify if the training time still scales linearly.
References
 BenNun and Hoefler [2019] Tal BenNun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An indepth concurrency analysis. ACM Computing Surveys (CSUR), 52(4):65:1–65:43, August 2019.
 Bertsekas [1999] Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
 Bouckaert [1993] Remco R Bouckaert. Probabilistic network construction using the minimum description length principle. In European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, 1993.
 Bühlmann et al. [2014] Peter Bühlmann, Jonas Peters, and Jan Ernest. CAM: Causal additive models, highdimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
 Cen et al. [2019] Keting Cen, Huawei Shen, Jinhua Gao, Qi Cao, Bingbing Xu, and Xueqi Cheng. Node classification framework. arXiv preprints arXiv:1906.08745, Jun 2019.

Chickering [2003]
David M Chickering.
Optimal structure identification with greedy search.
Journal of Machine Learning Research
, 3:507–554, March 2003. 
Chickering and Heckerman [1997]
David M Chickering and David Heckerman.
Efficient approximations for the marginal likelihood of bayesian networks with hidden variables.
Machine Learning, 29(2):181–212, Nov 1997.  Heckerman et al. [1995] David Heckerman, Dan Geiger, and David M Chickering. Learning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995.
 Hoyer et al. [2009] Patrik O. Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, 2009.
 Janzing and Schölkopf [2010] Dominik Janzing and Bernhard Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, Oct 2010.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2014.
 Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semisupervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 Koller and Friedman [2009] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques  Adaptive Computation and Machine Learning. MIT Press, 2009.

Meek [1995]
Christopher Meek.
Causal inference and causal explanation with background knowledge.
In
Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence
, 1995.  Pearl [2009] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.
 Pearl [2019] Judea Pearl. The seven tools of causal inference, with reflections on machine learning. Commun. ACM, 62(3):54–60, February 2019.
 Peters et al. [2017] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT press, 2017.
 Sachs et al. [2005] Karen Sachs, Omar Perez, Dana Pe’er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal proteinsignaling networks derived from multiparameter singlecell data. Science, 308(5721):523–529, 2005.
 Shimizu et al. [2006] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear nongaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
 Spirtes et al. [2000] Peter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper, and Thomas Richardson. Causation, prediction, and search. MIT press, 2000.
 Sridhar et al. [2018] Dhanya Sridhar, Jay Pujara, and Lise Getoor. Scalable probabilistic causal structure discovery. In Proceedings of the TwentySeventh International Joint Conference on Artificial Intelligence, 2018.
 Yu et al. [2019] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAGGNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, 2019.
 Zhang [2008] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(1617):1873–1896, 2008.
 Zhang and Hyvärinen [2009] Kun Zhang and Aapo Hyvärinen. On the identifiability of the postnonlinear causal model. In Proceedings of the TwentyFifth Conference on Uncertainty in Artificial Intelligence, 2009.
 Zheng et al. [2018] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31, 2018.