I Introduction
Recently there have been many efforts to imbue deep-learning models with the ability to perform causal inference. This has been motivated primarily by the inability of traditional correlative models to make predictions on interventional and counterfactual questions Spirtes et al. (2000); Pearl (2000), as well as by the explainability of causal graphical models. These efforts have largely run in parallel to the developing trend of exploiting the nonlocal properties of graph neural networks Wang et al. (2017) to generate powerful and efficient representations of high-dimensional data.
In this note we dichotomize the task of causal inference as a two-step process, illustrated in Figure 1. The first step involves inferring the graphical structure of a causal model associated with a given observational data set as a directed acyclic graph (DAG). Inferring the structure of causal DAGs from observational data has a long history, and many techniques have been proposed, including constraint-based Spirtes et al. (2000); Pearl (2000); Zhang (2008); Meek (1995) and score-based methods Bouckaert (1993); Chickering (2002); Chickering and Heckerman (2013); Heckerman et al. (1995), recently developed masked-gradient methods Zheng et al. (2018, 2019); Yu et al. (2019); Ng et al. (2019a, b); Fang et al. (2020); Ng et al. (2020), as well as hybrid methods Lachapelle et al. (2019). Notable novel alternatives also include methods based on reinforcement learning Zhu and Chen (2019), adversarial networks Kalainathan et al. (2018), and restricted Boltzmann machines Sokolovska et al. (2020). Since the task of causal structure discovery is merely a means to an end for this work, we (rather arbitrarily) adopt the masked-gradient approach due to its parsimonious integration with the neural-network-based architectures for SEM-learning that are the subject of this note.¹

¹ Codebase: http://github.com/q1park/spacetime

For the second step of causal inference, we develop a novel autoencoding architecture that applies generative moment-matching neural networks Zhao et al. (2017); Ren et al. (2016) to the edges of the learned causal graph, in order to estimate the functional dependence of the causally related observables as a structural equation model (SEM). Since their inception, generative moment-matching networks have been used for various tasks Bouchacourt (2017); Gao and Huang (2018); Briol et al. (2019); Lotfollahi et al. (2019) related to the estimation of joint and conditional probability distributions, but to our knowledge this is their first application to an explicit causal graph structure. Our aim is to develop a fully unsupervised formalism that starts from purely observational tabular data and ends with a robust automated sampling procedure that generates an accurate functional estimate of conditional probability distributions for the associated SEM. Existing techniques for Bayesian sampling on the latent space of generative models are also numerous, including Monte Carlo and gradient-optimization based methods Ahn et al. (2012); Hanson (2001); Park et al. (2018).
Much of this work has been inspired by several recent efforts to develop generative models that encode causal structure. For example, in Kocaoglu et al. (2017) the authors develop specific conditional adversarial loss functions for learning multistep causal relations. Their goals are similar to those described in this note, with a focus on linear relations within high-dimensional image vectors. In Yang et al. (2020) the authors use supervised learning to endow the latent space distributions of a variational autoencoder with a causal graphical structure, with the aim of intervening on this latent space to control specific properties of their feature maps. In this note we perform experiments on simple low-dimensional feature maps, and examine the performance of our autoencoder in generating accurate conditional probability distributions from complex nonlinear multistep causal structures. These causal structures are assumed to exist as relations among dimensions in the latent representation of the data. Thus in principle, the methods described here should also be applicable to more complex feature maps such as those generated by image and language data. However, experimentation on these high-dimensional data types is beyond the scope of this note.
In Section II we give a brief review of causal graphs and describe a vectorized formulation for structural equation models that is suited for deep-learning applications. In Section III we give the results of our experiments on causal structure learning using existing masked-gradient methods. We then describe our algorithm for SEM-learning and provide results on its performance. In Section IV we conclude with a discussion of possible applications and future directions for this work.
II Background
II.1 Causal Graphs
The identification of a causal effect between two variables is equivalent to measuring the response of some endogenous variable $y$ with respect to a controlled change in some exogenous variable $x$. If all of the variables are controlled, then the causal effect can be directly inferred via the conditional probability distribution $p(y|x)$. Inferring causal effects from uncontrolled observational data is challenging due to the existence of confounding variables, which generate spurious correlations whose effects on the conditional probability may be statistically indistinguishable from true causal effects. This is illustrated diagrammatically in Figure 2. Here we adopt the formalism of Pearl, in which the effect of a controlled change in a variable $x$ is represented on a causal graph by mutilating all of the arrows going into node $x$, as shown in Figure 3. The result is referred to as the intervened² conditional probability distribution $p(y|\slashed{x})$.

² For notational simplicity we use slashes to indicate graph-mutilated variables in conditional probabilities, rather than Pearl's original notation of $p(y|do(x))$.
There exists a rich literature describing the necessary and sufficient conditions for statistical distinguishability between causal and correlative effects, as well as methods for estimating causal responses when these conditions are met Spirtes et al. (2000); Pearl (2000). Although the necessary conditions are beyond the scope of this brief review, the sufficient conditions amount to a requirement that the subset of measured confounding variables be sufficiently complete so as to provide adequate control over the causal effects. In particular, the requirement of sufficient completeness can be succinctly dichotomized into two cases, known as the backdoor and frontdoor criteria. The backdoor criterion can be used to estimate the causal response on a pair of nodes $(x, y)$, given an observation of a set of confounding variables $z$, as shown in Figure 4. The intervened conditional probability can then be computed via the backdoor adjustment formula given in Equation 1.
$$p(y|\slashed{x}) = \sum_{z} p(y|x,z)\, p(z) \qquad (1)$$
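For discrete variables, Equation 1 reduces to a contraction of empirical frequency tables. The following numpy sketch shows the computation; the function name and tensor layout are our own illustration rather than part of the formalism above.

```python
import numpy as np

def backdoor_adjustment(p_y_given_xz, p_z):
    """Estimate p(y | do(x)) via the backdoor formula of Equation 1:
    p(y | do(x)) = sum_z p(y | x, z) p(z).

    p_y_given_xz: array of shape (Y, X, Z) holding p(y | x, z)
    p_z:          array of shape (Z,) holding p(z)
    Returns an array of shape (Y, X) holding p(y | do(x)).
    """
    return np.einsum('yxz,z->yx', p_y_given_xz, p_z)
```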
The frontdoor criterion can be used to estimate the causal response on a pair of nodes $(x, y)$ in situations where there exists a chain of causal influences $x \rightarrow z \rightarrow y$, as shown in Figure 4. The intervened conditional probability can then be computed via the frontdoor adjustment formula given in Equation 2.
$$p(y|\slashed{x}) = \sum_{z} p(z|x) \sum_{x'} p(y|x',z)\, p(x') \qquad (2)$$
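The frontdoor formula admits an equally compact implementation for discrete variables; as before, the names and array conventions below are illustrative.

```python
import numpy as np

def frontdoor_adjustment(p_z_given_x, p_y_given_xz, p_x):
    """Estimate p(y | do(x)) via the frontdoor formula of Equation 2:
    p(y | do(x)) = sum_z p(z | x) sum_x' p(y | x', z) p(x').

    p_z_given_x:  shape (Z, X), p(z | x)
    p_y_given_xz: shape (Y, X, Z), p(y | x', z)
    p_x:          shape (X,), p(x')
    Returns shape (Y, X) holding p(y | do(x)).
    """
    inner = np.einsum('yxz,x->yz', p_y_given_xz, p_x)   # sum over x'
    return np.einsum('zx,yz->yx', p_z_given_x, inner)   # sum over z
```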
II.2 Structural Equation Models
Structural equation models (SEMs) are a functional extension of causal graphical models in which the value of each node variable $x^i$ is determined as a function of its parent node variables and noise $n^i$. Here we adopt a notation where each node in a causal graph with $N$ nodes is specified by a spacetime index $i = 1, \ldots, N$, and Einstein summation is assumed. The set of parent (child) nodes corresponding to $x^i$ is given by $P(x^i)$ ($C(x^i)$), as illustrated in Figure 5. The generic form for an SEM can then be expressed as shown in Equation 3.
$$x^i = f^i\!\left(P(x^i),\, n^i\right) \qquad (3)$$
If the contribution from noise is assumed to be additive, then each node variable can be expressed simply as a polynomial (or other) expansion in its parent nodes, as shown in Equation 4. The leading-order term in this expansion describes a linearized SEM, which is typically expressed in terms of a weighted graph adjacency matrix $A^i_{\ j}$ in the form shown in Equation 5.
$$x^i = n^i + A^i_{\ j}\, x^j + A^i_{\ jk}\, x^j x^k + \cdots \qquad (4)$$
$$x^i = A^i_{\ j}\, x^j + n^i \qquad (5)$$
The linear SEM of Equation 5 has the unique property that its exact solution describes a generative model that predicts each variable from pure noise, as shown in Equation 6. The inverse operator can be expressed in closed form as a degree-$(N{-}1)$ polynomial in terms of Cayley–Hamilton coefficients $c_k$, which describe the propagation of ancestral noise through the causal graph. Thus each node variable $x^i$ can be expressed as a linear combination of its own noise $n^i$ and the noise of its ancestors, as shown in Equation 7.
$$x^i = \left[(1 - A)^{-1}\right]^i_{\ j}\, n^j \qquad (6)$$
$$x^i = \sum_{k=0}^{N-1} c_k \left(A^k\right)^i_{\ j}\, n^j \qquad (7)$$
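For concreteness, the closed-form solution of Equations 6 and 7 can be verified numerically: because the adjacency matrix of a DAG is nilpotent in a topological ordering, the inverse operator truncates to a finite polynomial in $A$. The adjacency matrix below is a hypothetical example of our own, with the convention that $A[i, j] \neq 0$ means node $j$ is a parent of node $i$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4

# Hypothetical weighted adjacency for a 4-node DAG (strictly lower
# triangular, i.e. nodes are already topologically sorted).
A = np.array([[0.0,  0.0, 0.0, 0.0],
              [2.0,  0.0, 0.0, 0.0],
              [0.0, -1.5, 0.0, 0.0],
              [0.5,  0.0, 1.0, 0.0]])

n = rng.normal(size=(1000, N))   # exogenous noise samples

# Exact generative solution of the linear SEM, Equation 6: x = (1 - A)^{-1} n.
x = n @ np.linalg.inv(np.eye(N) - A).T

# Equivalent finite polynomial of Equation 7: for a DAG, A^N = 0, so the
# inverse is the sum of noise propagated along all ancestral paths.
x_poly = sum(n @ np.linalg.matrix_power(A, k).T for k in range(N))
assert np.allclose(x, x_poly)
```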
The weighted adjacency matrix serves the dual purpose of masking each node variable from its non-parent nodes through its zero entries, while the nonzero entries define the strength of linear correlations between each pair of nodes in the causal graph. Unfortunately, there is no standardized generalization to nonlinear SEMs. One natural possibility is to define a separate weighted adjacency matrix for each order in a functional expansion like the polynomial example in Equation 4. While this interpretation nicely generalizes the linear approximation, its computational complexity is unbounded, and there have been various other suggested interpretations for the adjacency matrix weights, related to the mutual information between parent-child node variables Fang et al. (2020).
In this note we develop an alternative formalism for describing nonlinear SEMs that is agnostic to the interpretation of the weights in the adjacency matrix. We thus define a causal mask matrix $M$, which is just the unweighted adjacency matrix, as shown in Equation 8, where $\odot$ refers to an elementwise multiplication.
$$M^i_{\ j} = \begin{cases} 1 & A^i_{\ j} \neq 0 \\ 0 & A^i_{\ j} = 0 \end{cases}, \qquad A = A \odot M \qquad (8)$$
We then define a procedure for extracting the data for the parents of each node in the following way. We first lift each node variable $x^j$ into an auxiliary dimension $\mu$ via $x^{j\mu} \equiv \delta^{j\mu} x^\mu$ (no sum). Index contraction of the spacetime index with the mask matrix then produces a vector for each node whose index in the auxiliary dimension contains its parent-node data, as shown in Equation 9. This vectorized parental masking procedure is suitable for expressing functions of sets of parent nodes in a generalized SEM as $x^i = f^i\!\left(v_P^{i},\, n^i\right)$.
$$v_P^{i\mu} = M^i_{\ j}\, x^{j\mu} = M^i_{\ \mu}\, x^{\mu} \qquad (9)$$
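The lifting and contraction of Equation 9 amount to a single masked outer product. Below is a minimal numpy sketch, with an illustrative 3-node chain of our own choosing.

```python
import numpy as np

# Hypothetical weighted adjacency for a 3-node chain x0 -> x1 -> x2.
A = np.array([[0.0,  0.0, 0.0],
              [1.3,  0.0, 0.0],
              [0.0, -0.7, 0.0]])

# Causal mask of Equation 8: unit entries wherever A is nonzero.
M = (A != 0).astype(float)

def parent_vectors(x, M):
    """Vectorized parental masking of Equation 9.

    x: batch of node values, shape (batch, N)
    M: causal mask, shape (N, N)
    Returns shape (batch, N, N): entry [b, i, mu] equals x[b, mu] when
    node mu is a parent of node i, and 0 otherwise.
    """
    N = M.shape[0]
    # Lift x into the auxiliary dimension with a Kronecker delta, then
    # contract the spacetime index against the mask:
    # v[b, i, mu] = M[i, j] delta[j, mu] x[b, mu] = M[i, mu] x[b, mu].
    lifted = x[:, None, :] * np.eye(N)[None, :, :]
    return np.einsum('ij,bjm->bim', M, lifted)

x = np.array([[1.0, 2.0, 3.0]])
print(parent_vectors(x, M)[0])   # row i holds the parent data of node i
```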
III Experiments
III.1 Causal Structure Learning
The algorithms for SEM-learning described in this note rely on first inferring the correct causal graph structure for a given data set. Fortunately, the last two years have seen exciting progress in applications of neural networks to the problem of causal graph structure-learning, particularly in the area of masked-gradient methods Zheng et al. (2018); Yu et al. (2019); Ng et al. (2019a); Fang et al. (2020); Ng et al. (2020). These methods center around an identity for acyclic weighted adjacency matrices, first derived in Zheng et al. (2018) and shown in Equation 10. This identity enables a reformulation of acyclic graph-learning as a continuous optimization problem. Here again $\odot$ denotes elementwise multiplication.
$$\mathrm{tr}\, e^{A \odot A} - N = 0 \quad \Longleftrightarrow \quad A \text{ is acyclic} \qquad (10)$$
The graph-learning network can then be constructed using an encoder/decoder framework with an objective function that attempts to minimize some reconstruction loss, subject to an acyclicity constraint $h(A) = 0$, where $h(A)$ is a function of the weighted adjacency matrix given in Equation 11.
$$h(A) = \mathrm{tr}\, e^{A \odot A} - N \qquad (11)$$
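Equation 11 is straightforward to evaluate numerically; the following sketch checks the constraint on a 2-node DAG and on a 2-cycle.

```python
import numpy as np
from scipy.linalg import expm

def h(A):
    """Acyclicity measure of Equations 10-11: h(A) = tr(exp(A * A)) - N.
    Zero exactly when the weighted adjacency matrix A describes a DAG."""
    N = A.shape[0]
    return np.trace(expm(A * A)) - N   # elementwise square keeps entries >= 0

dag = np.array([[0., 1.], [0., 0.]])   # acyclic: h ~ 0
cyc = np.array([[0., 1.], [1., 0.]])   # 2-cycle: h > 0
print(h(dag), h(cyc))
```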
The original formulation of this continuous optimization, referred to as NOTEARS Zheng et al. (2018), uses a reconstruction loss inspired directly by the form of the linear SEM in Equation 5. As illustrated in the first line of Table 1, the encoder is just the identity function, while the decoder is an MLP that takes as input a weighted masked latent-space vector $A^T z$.
          Encoder                                    Decoder
NOTEARS:  $z = x$                                    $\hat{x} = f(A^T z)$
GNN:      $z = f_2\!\left((1 - A^T)\, f_1(x)\right)$   $\hat{x} = f_4\!\left((1 - A^T)^{-1} f_3(z)\right)$
GAE:      $z = f_1(x)$                               $\hat{x} = f_2(A^T z)$
In this note we focus our tests on two nonlinear generalizations of the NOTEARS algorithm, referred to as GNN and GAE. The encoder/decoder architectures are given in Table 1, where $f_1$, $f_2$, $f_3$, and $f_4$ refer to generic MLP-based function-learners. Both the GNN and GAE frameworks generalize the well-known closed-form solution for linear SEMs. However, the salient difference between them is the presence of a residual connection in GNN, represented by the identity term in the second line of Table 1. The reconstruction loss function for GNN is given by the usual evidence lower bound (ELBO) for variational autoencoders, while the reconstruction loss for GAE is simply the mean-squared error (MSE). The above optimization can be implemented using the method of Lagrange multipliers, with the Lagrangian defined in Equation 12.

$$\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \lambda\, h(A) + \frac{c}{2}\, h(A)^2 \qquad (12)$$
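A schematic training loop for Equation 12 might look as follows. The `model.A` and `model.recon_loss` interface, the learning rate, and the dual-update schedule are assumptions of our own; published implementations typically solve the inner problem to convergence before each dual update.

```python
import torch

def train_masked_gradient(model, X, epochs=100, rho=1.0, lam=0.0):
    """Sketch of an augmented-Lagrangian loop for Equation 12. `model` is
    assumed to be an nn.Module exposing a weighted adjacency `model.A`
    and a reconstruction loss `model.recon_loss(X)` (hypothetical names)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    def h(A):   # acyclicity constraint of Equation 11
        return torch.trace(torch.matrix_exp(A * A)) - A.shape[0]

    for _ in range(epochs):
        opt.zero_grad()
        hA = h(model.A)
        loss = model.recon_loss(X) + lam * hA + 0.5 * rho * hA ** 2
        loss.backward()
        opt.step()
        # Dual updates: tighten the acyclicity constraint over time.
        with torch.no_grad():
            lam = lam + rho * h(model.A).item()
            rho = min(rho * 1.1, 1e8)
    return model
```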
Following the work in Yu et al. (2019); Ng et al. (2019a), we perform tests on four different toy data sets generated by structural equation models of increasing nonlinear complexity, as shown in Equations 13-16.
$$\text{linear:}\qquad x = A^T x + n \qquad (13)$$
$$\text{nonlinear 1:}\qquad x = A^T \cos(x + 1) + n \qquad (14)$$
$$\text{nonlinear 2:}\qquad x = 2\sin\!\left(A^T (x + 0.5)\right) + A^T (x + 0.5) + n \qquad (15)$$
nonlinear 3:  (16)
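As an illustration, the first three data sets can be generated by ancestral sampling using the functional forms of Equations 13-15; our adjacency convention below (row $i$ lists the parents of node $i$, nodes topologically sorted) is an arbitrary choice of our own.

```python
import numpy as np

def sample_sem(A, n_samples, kind='linear', rng=np.random.default_rng(0)):
    """Draw samples from the toy SEMs of Equations 13-15 by ancestral
    sampling (assumes A is strictly lower triangular, i.e. nodes are
    already topologically sorted)."""
    N = A.shape[0]
    X = np.zeros((n_samples, N))
    noise = rng.normal(size=(n_samples, N))
    for i in range(N):                       # parents of node i precede it
        pa = X @ A[i]                        # weighted sum of parent values
        if kind == 'linear':                 # Eq. 13
            X[:, i] = pa + noise[:, i]
        elif kind == 'nonlinear1':           # Eq. 14
            X[:, i] = (np.cos(X + 1) @ A[i]) + noise[:, i]
        elif kind == 'nonlinear2':           # Eq. 15
            arg = pa + 0.5 * A[i].sum()      # row of A^T (x + 0.5)
            X[:, i] = 2.0 * np.sin(arg) + arg + noise[:, i]
    return X
```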
In the original papers, both GNN and GAE were tested using randomly generated Erdős–Rényi graphs. For large graphs, the authors of GNN reported structural Hamming distance (SHD) errors that grow with the number of nodes and that differ between the nonlinear 1 and nonlinear 2 data sets. Impressively, the performance of the GAE algorithm exhibits an SHD scaling that is roughly independent of the number of nodes in the graph for the Erdős–Rényi case, which we have verified in our own experiments. The primary reason for the difference in performance on large graphs is the presence of the residual connection in GNN, which enables an extremely accurate reconstruction of the data despite an incorrect causal graph structure.
In this note we perform tests on the GNN and GAE algorithms using the two graph structures shown in Figure 6, referred to as Graph A and Graph B. These two graph structures form the baseline cases for our structural equation model tests described in the next section, and represent configurations with an increasing number of confounding variables. The results of our structure-learning experiments, shown in Figure 7, indicate that the explicit presence of numerous confounding variables presents a significant obstacle to the recovery of correct causal structures relative to the Erdős–Rényi case, even for simple graphs with very few nodes.
III.2 Structural Equation Modeling
The network architecture for SEM-learning proposed in this note is illustrated in Figure 8, and can be factorized into two components. The first component is just a generic variational autoencoder that encodes each node feature $x^i$ into its latent representation $z^i$ before decoding it back to the target representation $\hat{x}^i$. The second component introduces a "causal block" that performs ancestral sampling on the latent representation and produces a latent representation for each child node that is a function of only its parent nodes.

For SEM-learning on a graph with $N$ nodes, the causal block is correspondingly composed of $N$ neural networks, as illustrated diagrammatically in Figure 9. A restriction of the functional dependence of each node to only its parent nodes is crucial for the automated generation of intervened conditional probability distributions. This is achieved simply through the use of the causal mask $M$ in the causal block, as well as the absence of any residual connection except for those nodes which have no parents. This includes the nodes chosen for intervention, as well as the nodes with no parents, since the latter can be viewed as being intervened on by the environment. Ancestral sampling of an intervened distribution can then be performed simply by generating data for the intervened node from a random-normal distribution, and cycling the data through the causal block enough times for the intervention to propagate to its child nodes, as illustrated in Figure 8. A functional expression for the causal block can be written as a sum of three terms, as shown in Equation 17. The first term describes the contribution from noise and is computed via the usual reparameterization trick Kingma and Welling (2013) from neural-network-generated variances. The second term provides a residual connection only for node variables that have no parents. We thus define a delta function whose argument, given a specified node, is the number of parents belonging to that node, normalized as shown in Equation 18.
$$\tilde{z}^i = n^i + \delta\!\left(|P(z^i)|\right) z^i + f^i\!\left(v_P^{i}\right) \qquad (17)$$
$$\delta(m) = \begin{cases} 1 & m = 0 \\ 0 & m \neq 0 \end{cases} \qquad (18)$$
The third and final term is generated by the set of $N$ neural networks $f^i$, whose input is the vector $v_P^{i}$ containing the latent representation of $x^i$'s parent-node data, constructed according to Equation 9. The loss function used is a combination of the joint Zhao et al. (2017) and conditional Ren et al. (2016) maximum mean discrepancies (MMD and CMMD), with relative coefficients as shown in Equation 19. The set of networks thus together forms a generative conditional moment-matching graph neural network.
$$\mathcal{L} = \alpha\, \mathcal{L}_{\mathrm{MMD}} + \beta\, \mathcal{L}_{\mathrm{CMMD}} \qquad (19)$$
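For reference, a standard (biased, V-statistic) estimator of the squared MMD with a mixture-of-RBF kernel can be written as follows; the bandwidth set is an arbitrary choice of our own.

```python
import torch

def mmd_rbf(x, y, bandwidths=(1.0, 2.0, 4.0)):
    """Biased (V-statistic) estimator of the squared maximum mean
    discrepancy between sample sets x and y, each of shape (n, d),
    using a mixture-of-RBF kernel."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in bandwidths)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```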
To measure the performance of interventional sampling, we perform tests using an MLP-based encoder and decoder, each consisting of a single hidden layer with 16 neurons. The causal block is composed of $N$ neural networks, each with input dimension $N$ and output dimension 1, and each consisting of a single hidden layer containing 64 neurons. For the loss function we choose (rather arbitrarily) fixed values of the coefficients $\alpha$ and $\beta$, and each trial is run on 8000 data points. The performance metric used is the relative entropy (KL divergence) between the conditional probability distributions generated by the intervened and unintervened ground-truth SEMs. We then compare it with the relative entropy between the intervened SEM and the one predicted by the causal autoencoder at different standard deviations away from the distribution means, as illustrated in Figure 10. The autoencoder predictions for these results have been smoothed using a kernel density estimator with a normal-reference bandwidth.
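This relative-entropy metric can be estimated by smoothing both sample sets with a kernel density estimator and integrating on a grid, along the following lines; the grid size and regularization constant are numerical choices of our own.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_kl(samples_p, samples_q, grid_pts=512):
    """KL divergence D(p || q) between two 1-d sample sets, each smoothed
    with a Gaussian KDE (scipy's default bandwidth is Scott's rule, a
    normal-reference-style choice)."""
    p, q = gaussian_kde(samples_p), gaussian_kde(samples_q)
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    xs = np.linspace(lo, hi, grid_pts)
    px, qx = p(xs) + 1e-12, q(xs) + 1e-12
    px /= np.trapz(px, xs)                  # renormalize on the grid
    qx /= np.trapz(qx, xs)
    return np.trapz(px * np.log(px / qx), xs)
```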
IV Discussion
The results of our experiments indicate that the proposed framework for simulating structural equation models is capable of capturing complex nonlinear relationships among variables in a way that is amenable to multistep counterfactual interventions. Importantly, the generated probability distributions appear faithful to the ground-truth intervened SEMs, even when the intervened variables are fixed to values outside the range contained in the training data distributions. This capability implies a predictive ability that is manifestly beyond what is possible through analytical calculations via the backdoor and frontdoor adjustment formulas, which can only be applied to intervened variables that take on values for which observable data exists.
With 8000 data points in each of the training sets, the maximum and minimum values for each node variable typically fall within a few standard deviations of the distribution mean. From Figures 11 and 12, we can observe that the linearly correlated data sets remain faithful to the ground truth well beyond this range. On the other hand, the data sets with strong nonlinear components vary in their predictive performance beyond the range of the training data, but remain reliably closer to the ground truth than the unintervened distributions. This is unsurprising upon closer inspection of the predicted conditional (intervened) probabilities, which demonstrate a clear tendency for our generative model to perform simple linear extrapolations of the distributions in regimes outside those contained in the training data.
Although the experiments performed in this note were restricted to the case of scalar-valued node variables, we expect that a very simple extension of these methods could make them applicable to complex high-dimensional image and language data. For example, in CausalVAE Yang et al. (2020) the authors use supervised learning to encode specific image labels into single dimensions of the latent space. In one example, they use the CelebA data set of facial images to encode causal relationships between features such as age and beard, thus allowing them to intervene on the latent space to produce images of unnaturally young bearded faces. Augmenting this procedure with the causal block described in this note would in principle enable synthetic generation of image populations with features that accurately represent conditional probabilities under multiple steps of causal influence, for example an accurate distribution of hair colors if the graph structure contained a multistep path terminating in a hair-color feature. Unfortunately, a detailed exploration of these high-dimensional data types is beyond the scope of this note.
Another potential application of these methods could be for use with model-based reinforcement learning. In Dasgupta et al. (2019) the authors performed several experiments in a model-free RL framework in which they trained agents to make causal predictions in simple one-step-querying scenarios. In these experiments, the agents were directed to sample points from joint and conditional probability distributions of SEM-generated data, as well as from the corresponding distributions of arbitrarily mutilated SEM graphs. These experiments showed evidence that their agents learned to exploit interventional and counterfactual reasoning to accumulate significantly higher rewards compared to the relevant baselines.
In Nair et al. (2019) the authors expand on the previous work by successfully training RL agents to perform causal reasoning in a more complex multistep relational scenario, with the ability to generalize to unseen causal structures that were held out during training. Their experiments involved two separate RL agents: one which used supervised learning to generate a causal graph model from ground-truth graphs, and another which was directed to take "goal-oriented" actions based on the models learned by the first agent. The authors strongly hypothesized that the impressive level of generalizability displayed by their algorithm was a direct result of the explicit model-based approach. We find the possibility of performing such experiments using graphical models learned via the fully unsupervised approach described in this note to be both intriguing and plausibly practical as a future area of exploration.
V Acknowledgements
We thank Vincent Tang, Jiheum Park, Ignavier Ng, Jungwoo Lee, and Tim Lou for useful discussions.
References
 Spirtes et al. [2000] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. The MIT Press, 3 edition, 2000. ISBN 9780387979793.
 Pearl [2000] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 1 edition, 2000. ISBN 0521773628.
 Wang et al. [2017] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. CoRR, abs/1711.07971, 2017. URL http://arxiv.org/abs/1711.07971.
 Zhang [2008] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. 2008.

 Meek [1995] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pages 403–410, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1558603859.
 Bouckaert [1993] Remco R. Bouckaert. Probabilistic network construction using the minimum description length principle. In Michael Clarke, Rudolf Kruse, and Serafín Moral, editors, Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pages 41–48, Berlin, Heidelberg, 1993. Springer Berlin Heidelberg. ISBN 9783540481300.
 Chickering [2002] David Maxwell Chickering. Optimal structure identification with greedy search. J. Mach. Learn. Res., 3:507–554, 2002.
 Chickering and Heckerman [2013] David Maxwell Chickering and David Heckerman. Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. CoRR, abs/1302.3567, 2013. URL http://arxiv.org/abs/1302.3567.

 Heckerman et al. [1995] David Heckerman, Dan Geiger, and David Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 09 1995. doi: 10.1007/BF00994016.
 Zheng et al. [2018] Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with no tears: Continuous optimization for structure learning, 2018.
 Zheng et al. [2019] Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. Learning sparse nonparametric DAGs, 2019.
 Yu et al. [2019] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. CoRR, abs/1904.10098, 2019. URL http://arxiv.org/abs/1904.10098.
 Ng et al. [2019a] Ignavier Ng, Shengyu Zhu, Zhitang Chen, and Zhuangyan Fang. A graph autoencoder approach to causal structure learning, 2019a.
 Ng et al. [2019b] Ignavier Ng, Zhuangyan Fang, Shengyu Zhu, Zhitang Chen, and Jun Wang. Masked gradient-based causal structure learning, 2019b.
 Fang et al. [2020] Zhuangyan Fang, Shengyu Zhu, Jiji Zhang, Yue Liu, Zhitang Chen, and Yangbo He. Low rank directed acyclic graphs and causal structure learning, 2020.
 Ng et al. [2020] Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear DAGs, 2020.
 Lachapelle et al. [2019] Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural DAG learning. CoRR, abs/1906.02226, 2019. URL http://arxiv.org/abs/1906.02226.
 Zhu and Chen [2019] Shengyu Zhu and Zhitang Chen. Causal discovery with reinforcement learning. CoRR, abs/1906.04477, 2019. URL http://arxiv.org/abs/1906.04477.
 Kalainathan et al. [2018] Diviyan Kalainathan, Olivier Goudet, Isabelle Guyon, David Lopez-Paz, and Michèle Sebag. Structural agnostic modeling: Adversarial learning of causal graphs, 2018.
 Sokolovska et al. [2020] Nataliya Sokolovska, Olga Permiakova, Sofia K. Forslund, and Jean-Daniel Zucker. Using unlabeled data to discover bivariate causality with deep restricted Boltzmann machines. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 17:358–364, 2020.
 Zhao et al. [2017] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. CoRR, abs/1706.02262, 2017. URL http://arxiv.org/abs/1706.02262.
 Ren et al. [2016] Yong Ren, Jialian Li, Yucen Luo, and Jun Zhu. Conditional generative moment-matching networks. CoRR, abs/1606.04218, 2016. URL http://arxiv.org/abs/1606.04218.
 Bouchacourt [2017] Diane Bouchacourt. Task-oriented learning of structured probability distributions. PhD thesis, University of Oxford, 2017.
 Gao and Huang [2018] Hongchang Gao and Heng Huang. Joint generative moment-matching network for learning structural latent code. pages 2121–2127, 07 2018. doi: 10.24963/ijcai.2018/293.
 Briol et al. [2019] Francois-Xavier Briol, Alessandro Barp, Andrew B. Duncan, and Mark Girolami. Statistical inference for generative models with maximum mean discrepancy, 2019.
 Lotfollahi et al. [2019] Mohammad Lotfollahi, Mohsen Naghipourfar, Fabian J. Theis, and F. Alexander Wolf. Conditional out-of-sample generation for unpaired data using trVAE, 2019.
 Ahn et al. [2012] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring, 2012.
 Hanson [2001] Kenneth M. Hanson. Markov chain Monte Carlo posterior sampling with the Hamiltonian method. volume 4322 of Society of PhotoOptical Instrumentation Engineers (SPIE) Conference Series, pages 456–467, July 2001. doi: 10.1117/12.431119.
 Park et al. [2018] Chanwoo Park, Jae-Myung Kim, Seok Hyeon Ha, and Jungwoo Lee. Sampling-based Bayesian inference with gradient uncertainty. CoRR, abs/1812.03285, 2018. URL http://arxiv.org/abs/1812.03285.
 Kocaoglu et al. [2017] Murat Kocaoglu, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. CoRR, abs/1709.02023, 2017. URL http://arxiv.org/abs/1709.02023.

 Yang et al. [2020] Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. CausalVAE: Disentangled representation learning via neural structural causal models, 2020.
 Kingma and Welling [2013] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2013.
 Dasgupta et al. [2019] Ishita Dasgupta, Jane X. Wang, Silvia Chiappa, Jovana Mitrovic, Pedro A. Ortega, David Raposo, Edward Hughes, Peter W. Battaglia, Matthew Botvinick, and Zeb Kurth-Nelson. Causal reasoning from meta-reinforcement learning. CoRR, abs/1901.08162, 2019. URL http://arxiv.org/abs/1901.08162.
 Nair et al. [2019] Suraj Nair, Yuke Zhu, Silvio Savarese, and Li Fei-Fei. Causal induction from visual observations for goal directed tasks, 2019. URL http://arxiv.org/abs/1910.01751.