1 Introduction
Discovering and understanding causal mechanisms underlying natural phenomena are important to many disciplines of sciences. An effective approach to studying causal relationships is to conduct controlled randomized experiments, which however is too expensive or even impossible in certain fields such as social sciences Bollen (1989) and bioinformatics OpgenRhein and Strimmer (2007). Causal discovery methods that infer causal relationships from passively observable data hence are attractive in these scenarios and have been an important research topic in the past decades Pearl (2009); Spirtes et al. (2000); Peters et al. (2017).
A major class of such causal discovery methods are scorebased, which assign a score , typically computed with the observed data, to each graph and then search over the space of all directed acyclic graphs (DAGs) for the best scoring:
(1) 
While there have been well defined score functions such as the BIC/MDL score Schwarz et al. (1978); Chickering (2002) and the BGe score Geiger and Heckerman (1994) for linearGaussian models, problem (1) is generally NPhard to solve Chickering (1996); Chickering et al. (2004), largely due to its nonconvex and combinatorial nature; the acyclicity constraint is a combinatorial constraint with the number of DAGs increasing superexponentially in the number of graph nodes. To tackle this problem, most existing approaches rely on local heuristics to enforce the acyclicity. For example, the greedy equivalence search (GES) Chickering (2002) enforces acyclicity one edge at a time, explicitly checking for the acyclicity constraint each time an edge is added. GES is known to find the global minimizer with infinite samples under suitable assumptions Chickering (2002), but this is not guaranteed for finite data. There are hybrid methods that use constraintbased approaches to reduce the search space before applying scorebased methods, e.g., Tsamardinos et al. (2006). However, this methodology lacks a principled way of choosing a problemspecific combination of score functions and search strategies. Alternatively, Zheng et al. (2018) introduced a smooth characterization for the acyclicity constraint, and problem (1
) can be formulated as a continuous optimization problem w.r.t. the weighted adjacency matrix by picking a proper loss function, e.g., the least squares loss. It is import to note that the loss function must be carefully chosen in order to apply numeric optimization methods. Unfortunately, several effective score functions, e.g., the generalized score function proposed in
Huang et al. (2018) for possibly nonlinear causal relationships, cannot be easily represented in closedform loss functions and thus can not be used with this approach.We propose to use reinforcement learning (RL) to search for the DAG with the best scoring. The key insight is that an RL agent with stochastic policy can determine automatically where to search given the uncertainty information of the learned policy, which gets updated promptly by the stream of reward signals. To apply RL to causal discovery, we use an encoderdecoder model to generate directed causal graphs from the observed data. The graphs are then used to compute rewards, which include predefined score functions as well as penalty terms to enforce acyclicity. We resort to the actorcritic algorithm and stochastic methods to optimize the parameters of the neural networks, and our output is the graph that achieves the best reward, among all graphs generated in the training process. Experiments on both synthetic and real data show that our approach has a much improved search ability without sacrificing any flexibility in choosing score functions. In particular, the proposed approach using BIC as score function outperforms GES with the same score function on linear nonGaussian acyclic model (LiNGAM) dataset, and also outperforms NOTEARS proposed in
Zheng et al. (2018) when the causal relationships are nonlinear.2 Related Works
Other causal discovery methods
Constraintbased methods use conditional independence tests to find the causal skeleton and then determine the orientations of the edges up to the Markov equivalence class, which usually contains DAGs that can be structurally diverse and may still have many unoriented edges. Examples include Sun et al. (2007); Zhang et al. (2012) using kernelbased conditional independence criteria and the wellknown PC algorithm Spirtes et al. (2000). This class of methods involve a multiple testing problem where the tests are usually conducted independently. The testing results may have conflicts and handling them is not an easy task, though there do exist certain works, e.g., Hyttinen et al. (2014). These methods are not robust as small errors in building the graph skeleton can result in large errors in the inferred Markov equivalence class. Another class of causal discovery methods are based on functional models, with assumptions on data distribution and/or functional classes. Examples include LiNGAM Shimizu et al. (2006, 2011), the postnonlinear causal model Zhang and Hyvärinen (2009), and the nonlinear additive noise model Peters et al. (2017).
Neural combinatorial optimizer and reinforcement learning
Recent advances in sequencetosequence learning Sutskever et al. (2014) have motivated the use of neural networks for optimization in various domains Vinyals et al. (2015); Zoph and Le (2016); Chen et al. (2017). A particular example is the traveling salesman problem (TSP) which was revisited in the work of pointer networks Vinyals et al. (2015)
. The authors proposed a recurrent neural network (RNN) with nonparametric softmaxes trained in a supervised manner to predict the sequence of visited cities. Realizing that getting highquality labeled data is expensive and may be infeasible for new problems,
Bello et al. (2016) use the RL paradigm to tackle the combinatorial problems due to their relatively simple reward mechanisms. It was shown that an RL agent can have a better generalization even when the optimal solutions are used as labeled data in the previous supervised approach. Besides Bello et al. (2016), there are many other successful RL applications in recent years, e.g., AlphaGo Silver et al. (2017), where the typical goal is to learn a good policy. An exception is the work Zoph and Le (2016)which used RL to search for neural architectures. While we use a similar idea as the RL paradigm can naturally include the search task, our work is different in the actor and reward designs: our actor is an encoderdecoder scheme based on the factorization of probability of graph adjacency matrix where each decomposed term is conditioned only on the input sample, while the reward is specific to causal discovery by incorporating both score function and the acyclicity constraint.
3 Model Definition
We assume the following model for data generating procedure, as in Hoyer et al. (2009). Each variable is associated with a node in a node DAG , and the observed value of is obtained as a function of its parents in the graph plus an independent additive noise , i.e.,
where denotes the set of variables so that there is an edge from to in the graph, and the noises are assumed to be jointly independent. We do not restrict specific forms for the functions and noises , but knowing such information can be beneficial to infer the underlying causal relationships. If all the functions are linear and all the noises
are Gaussian distributed, this model yields the standard linearGaussian model family that has been studied in
Spirtes et al. (2000); Geiger and Heckerman (1994); Bollen (1989); Peters et al. (2017). When the functions are linear but the noises are nonGaussian, one can obtain the LiNGAM described in Shimizu et al. (2006, 2011) and the underlying DAG can be uniquely identified under favorable conditions.In this paper, we consider that all the variables are scalars; extending to more complex cases is straightforward with a properly defined score function. The observed data
, consisting of a number of vectors
, are then sampled independently according to the above model on an unknown DAG, with fixed functions and fixed distributions for . The objective of causal discovery is to use the observed data, which give the empirical version of the joint distribution of
, to infer the causal relationships.4 Neural Network Architecture for Graph Generation
Given a dataset where each , we want to infer the causal graph that best describes the data generating procedure. Motivated by recent advances in neural combinatorial optimization, particularly the pointer networks Bello et al. (2016); Vinyals et al. (2015), we use neural networks to infer causal graphs from the observed data. Since using a single observation may incur much variation, we draw random samples (with replacement) from and reshape them as where is the vector concatenating all the th entries of the vectors in . In an analogy to the TSP, this represents a sequence of cities lying in an dim space. We are concerned with generating a graph adjacency matrix where denotes the th row of the adjacency matrix so that the corresponding graph is a DAG and achieves the best score.
We aim to learn a stochastic policy that, given the input , assigns higher probabilities to DAGs with better scores, so as to guide the search for the optimal score. In a similar fashion to previous works Sutskever et al. (2014); Vinyals et al. (2015)
that make use of the factorization based on the chain rule, our neural network architecture factorizes the probability as
where denotes the set of vectors with and is the empty set. Different from previous tasks such as machine translation and the TSP, the th row vector can be generated independent of previous outputs and we hence remove the conditionals in the factorization. In this paper, we use this decomposed factorization and existing encoderdecoder models to generate graphs:
Encoder
We use the same attention based encoder in the Transformer structure proposed by Vaswani et al. (2017). Denote the output of the encoder by , corresponding to each input .
Decoder
The encoder outputs are passed into the RNN based decoder in the pointer networks, which sequentially outputs a vector that will be used to generate for
. Note that we do not need the followed pointing scheme in the pointer networks. Alternatively, we can use the decoder structure from Transformer which is the multihead attention module followed by a feedforward neural network (FNN) applied to
and the weights of the FNN are shared across all . The output consists of vectors in , which we also write as . To generate a binary adjacency matrix , we pass each entry in the vectors , denoted as, into a logistic sigmoid function
and then sample according to Bernoulli distribution with probability
. To avoid selfloop, we simply mask the th entry in the adjacency matrix.Empirically, we find that the proposed approach with attention based decoder tends to explore more different graphs. As such, the attention based decoder will be combined with some fastcomputing score functions, while the RNN based decoder is used if the computation takes a longer time.
5 Reinforcement Learning for Search
In this section, we propose to use RL as the search strategy for finding the DAG with the best score, outlined in Figure 1. As one will see, the proposed method improves the search ability over traditional scorebased methods and also allows for flexible score functions under the acyclicity constraint.
5.1 Score Function, Acyclicity, and Reward
We may use existing score functions to construct the reward that will be maximized by an RL algorithm, e.g., the BIC score. Though the generalized score function proposed in Huang et al. (2018) is applicable, it needs too much computation for an RL agent. For a nonlinear causal relationship , Hoyer et al. (2009)
proposed to do a nonlinear regression of
on , calculate the corresponding residual , and then test whether is independent of to decide whether to accept the causal relationship. We can follow the same procedure, use some independence statistics, e.g., Hilbert Schmidt Independence Criterion (HSIC) Gretton et al. (2005), to quantify the independence for each causal relationship of a given graph, and sum all the HSICs as our score. However, we find it not easy to handle isolate variables, i.e., nodes that have no parents.Noticing that the noises are jointly independent, we further construct a new score function by summing all the HSICs between pairs of residuals. To make it robust to data scales, we use the normalized version of HSIC De Lozzo and Marrel (2016). The score function is given by
Assuming linear relationship and zeromean
, a linear regression results in an estimate
, which would be given infinite data. Since is independent of , we have and so that is exactly the same as . Thus, the proposed score function becomes zero with infinite data and is small with finite data for the true causal graph. We conjecture that this result can be extended to nonlinear relationships under certain conditions, but leave a careful investigation as future work. Also, the proposed score function being zero is only a necessary condition for finding the true graph. The inferred graph with this score function may contain spurious edges and pruning can be done either in a greedy way or by majority vote, which will be discussed later. In Section 6, the proposed score function is shown to achieve a better result than other causal discovery methods on a real dataset.A remaining issue is the acyclicity constraint. Other than GES which explicitly checks for the acyclic constraint each time an edge is added, we can directly add penalty terms w.r.t. acyclicity to the score function. Here we use the result from Zheng et al. (2018) which introduced a smooth characterization for the acyclicity constraint. In particular, a directed graph with binary adjacency matrix is a DAG if and only if
where is the number of graph nodes and is the matrix exponential of . It was argued that , which is nonnegative, also quantifies some ‘DAGness’ of a directed graph. Here DAGness means how far the graph is from being a DAG. Since can be small for certain cyclic graphs, we add another penalty term, the indicator function w.r.t. acyclicity, to induce exact DAGs.
Our final reward incorporates both the score function and the acyclicity constraint, which is
(2) 
where denotes the indicator function and are two penalty parameters. Clearly, the larger and are, the more likely a generated graph with a high reward is acyclic. Empirically, we find that both the score function and the penalty terms help search for the optimal score under the acyclicity constraint. Hence we need to choose proper and so that the scores and the penalty values have similar orders. We also find that the parameter choices are quite robust, and one way to pick these values is through randomly sampled cyclic and acyclic graphs. For small graphs, only using suffices; for large graphs, the DAGness term can be much helpful to generate DAGs.
5.2 ActorCritic Algorithm
We believe that exploitation and exploration in the RL paradigm provide an appropriate way of guiding the search. Let denote the parameters of the neural networks for graph generation. Our training objective is the expected reward defined as
(3) 
During training, the input is constructed by randomly drawing samples from the observed dataset , as described in Section 4.
We resort to policy gradient methods and stochastic methods to optimize the parameters. The gradient of (3) can be obtained by the wellknown REINFORCE algorithm Williams (1992); Sutton et al. (2000). We draw samples as a batch to estimate the gradient which is then used to train the neural networks through stochastic optimization methods like Adam Kingma and Ba (2014)
. Using a parametric baseline to estimate the reward can also help training. For the present work, our critic is a simple 2layer FNN with ReLU units, with input being the encoder outputs
. The critic is trained with Adam on a mean squared error between its predictions and the true rewards.Training Acceleration
Training an RL agent typically requires many iterations. In the present work, computing the rewards for given graphs are much more timeconsuming than training neural networks. Therefore, we record the computed rewards according to different graph structures. Moreover, BIC and the HSIC based score function can be decomposed according to single and pairwise causal relationships, respectively. We also record these results to avoid repeated computations.
5.3 Final Output
Since we are concerned with finding a DAG with the best score rather than a policy, we record all the graphs generated during the training process and output the one with the best reward. In practice, the graph may contain spurious edges and further processing is needed.
To this end, we can prune the estimated edges in a greedy way, according to either the regression performance or the score function. For an inferred causal relationship, we remove a parent variable and calculate the performance of the resulting graph, with all other causal relationships unchanged. If the performance does not degrade or degrade within a predefined tolerance, we accept pruning and continue this process with the pruned causal relationship. For linear models, pruning can be simply done by thresholding the estimated coefficients. Note that pruning a DAG cannot violate the acyclicity constraint. Besides, we observe that the top few graphs, ranked by their rewards, are usually structurally similar. Thus, we can conduct a majority vote to select common edges that are potentially more important to achieving high rewards, possibly followed by an acyclicity check.
Related to the above pruning process is to add to the reward an penalty w.r.t. the adjacency matrix. While the RL paradigm can easily incorporate such a penalty term, its weight becomes critical to the final result and is more sensitive than the acyclicity penalty parameters in Eq. (2). We will consider the penalty approach in a future work and only use pruning or majority vote in this work.
6 Experimental Results
We report empirical results on both synthetic and real data to compare our approach against GES, the PC algorithm, ICALiNGAM^{1}^{1}1 https://sites.google.com/site/sshimizu06/lingam Shimizu et al. (2006), and NOTEARS^{2}^{2}2 https://github.com/xunzheng/notears Zheng et al. (2018). NOTEARS recovers the causal graph by estimating the weighted adjacency matrix with the least squares loss and the smooth characterization for acyclicity constraint, followed by thresholding on the estimated weights. Since the fast greedy search (FGS) implementation^{3}^{3}3Available through the pycausal package at https://github.com/bd2kccd/pycausal Ramsey et al. (2017) for GES has been reported to outperform the PC algorithm and other techniques such as hillclimbing and MMHC Han et al. (2016); Zheng et al. (2018), we will only report the results of FGS, NOTEARS, and ICALiNGAM.
Our approach is implemented based on an existing implementation of neural combinatorial optimizer^{4}^{4}4 https://github.com/MichelDeudon/neuralcombinatorialoptimizationrltensorflow. We modify the decoder as described in Section 4 and leave other hyperparameters unchanged. We pick for constructing the input sample and as batch size at each training iteration. We set the maximum number of iterations to be . The RNN based decoder is combined with the proposed HSIC based score function while the attention based decoder uses the average BIC as score function (under Gaussianity assumption), denoted as RLHSIC and RLBIC, respectively. Here average BIC refers to the original BIC divided by the number of samples used for computing. Experiments are run with an Intel Xeon 2.30GHz CPU and an NVIDIA Tesla K80 GPU.
We evaluate the learned graphs using three metrics: false discovery rate (FDR), true positive rate (TPR), and structural hamming distance (SHD)^{5}^{5}5 Total number of edge additions, deletions, and reversals to convert the estimated graph into the true DAG.. Since FGS may output unoriented edges, we follow Zheng et al. (2018) to treat FGS favorably by treating undirected edges as true positives as long as the true graph has a directed edge in place of the undirected edge.
6.1 Linear Model with Gaussian and NonGaussian Noise
Given number of variables , we generate a lower triangular matrix as the graph adjacency matrix, in which the lower entries are sampled independently from . We assign edge weights independently from to obtain a weight matrix , and then sample from both Gaussian and nonGaussian noise models. The nonGaussian model is the same as the one used for ICALiNGAM Shimizu et al. (2006)
, which generate samples from Gaussian distributions and subsequently pass them through a power nonlinearity to make them nonGaussian. We pick unit variances for all noises in both models and generate
samples as our dataset. A permutation of variables is then performed.We first consider a small graph with nodes. We pick and for the penalty parameters in RLHSIC and RLBIC. Figure 2 shows the learning process of the proposed methods on an LiNGAM dataset. We use a threshold to prune the estimated edges so that an edge is deleted if the absolute value of its estimated weight is below the threshold. In this example, the pruned DAGs turn out to be exactly the same as the underlying causal graph. RLHSIC and RLBIC respectively generate and different graphs when training ends, which are much smaller than the total number (more than 3 million) of DAGs.
Data model  Criterion  RLHSIC  RLBIC  NOTEARS  FGS  ICALiNGAM 

LiNGAM  FDR  
TPR  
SHD  
LinearGaussian  FDR  
TPR  
SHD 
We next report the results of other causal discovery methods on LiNGAM and linearGaussian data models in Table 1. ICALiNGAM recovers all the true causal graphs for LiNGAM data but performs poorly on linearGaussian data. On the contrary, FGS performs well on the linearGaussian data but poorly on the LiNGAM data. This is not surprising: ICALiNGAM assumes nonGaussian noise, whereas FGS assumes linearGaussian models and uses BIC as score function. However, RLBIC recovers all the true causal graphs on both data models in this experiment, implying that BIC also works well for LiNGAM data models but the search method in FGS lacks the ability to recover the true causal graphs. This is verified by NOTEARS that optimizes the least squares loss and achieves a good performance on both data models. We also notice that RLHSIC performs poorly on linearGaussian data, possibly due to the nonidentifiability of linearGaussian models; on all generated datasets, the true causal graphs have an average score , whereas the output graphs of RLHSIC have an average score . As to training time, RLHSIC and RLBIC are generally finished within an hour, while all other methods spend less than one minute. We believe that the longer training time is worthy in obtaining better causal discovery results. Further optimization of our implementation to accelerate training is also possible.
Finally, we test the proposed method on larger graphs with nodes, where we set and . On LiNGAM datasets, RLBIC has FDR, TPR and SHD being , and , slightly worse than NOTEARS with , and .
6.2 Nonlinear Models with Quadratic Functions
We now consider nonlinear causal relationships, and for simplicity, we focus on quadratic functions in this paper. We generate a lower triangular matrix in a similar way to the first experiment. For a causal relationship with parents at the th node, we expand to contain both first and secondorder feature terms. The coefficient for each term is then either or sampled from , with equal probability. If a parent variable does not appear in any term that has a nonzero coefficient, then we remove the corresponding edge in the causal graph. The rest follows the same as in first experiment and here we use the nonGaussian noise model.
For RLHSIC and RLBIC, we can use quadratic regression for a given causal relationship and calculate the scores as previously described. For pruning, we simply apply thresholding, with threshold as , to the estimated coefficients of both first and secondorder terms. Note that if the coefficient of a secondorder term, e.g., , is nonzero after thresholding, then there exist two directed edges that are from to and from to , respectively. In this experiment, we only consider NOTEARS and ICALiNGAM, as FGS does not perform well with nonGaussian noise. For comparison, we apply the same pruning method to the outputs of NOTEARS and ICALiNGAM, denoted by NOTEARS2 and ICALiNGAM2, respectively. Notice that NOTEARS cannot be easily modified to utilize the knowledge that the causal relationships are quadratic functions, due to the acyclicity constraint on the graph.
We report the results with node graphs in Table 2. RLBIC outperforms all other methods in this experiment, thanks to the flexible score selection and improved search ability from the RL paradigm.
RLHSIC  RLBIC  NOTEARS  NOTEARS2  ICALiNGAM  ICALiNGAM2  

FDR  
TPR  
SHD 
6.3 Sociology Dataset
Following Shimizu et al. (2011), we analyze a real dataset taken from the General Social Survey sociological data repository^{6}^{6}6 http://www.norc.org/GSS+Website. The data consist of six observed variables, as shown in Figure 3. The sample size is 1380. The causal relationships based on domain knowledge Duncan et al. (1972) are given in Figure 3(a), which indicate existence of some latent confounders.
Since data variables have different scales in this dataset, we normalize the data before applying RLBIC. RLHSIC is robust to data scales as we use the normalized HSIC in the score function. For both methods, we set and . To reduce possible effect from confounders, our final output is selected by majority vote on the top 5 graphs, ranked according to their rewards. The estimated causal graphs are shown in Figures 3(c) and 3(d). Compared with previous best result in Figure 3(b) obtained by DirectLiNGAM with adaptive LASSO Shimizu et al. (2011), RLHSIC discovers an additional edge that is reasonable to domain knowledge. RLBIC has three correct edges that are also discovered by DirectLiNGAM. All the three methods induce a reversed edge. NOTEARS with normalized data results in too many spurious edges, whereas with original data, it discovers two correct edges and none reversed edges. The estimates obtained by ICALiNGAM, PC and GES can be found in Shimizu et al. (2011), which either have less correct directed edges or several unoriented edges.
Both RLHSIC and RLBIC used a linear regression in score computations. We also tried random forest and FNN for regression, but did not get better results. This is possibly because the relationships are in different function forms and/or the model assumptions are somewhat violated in the real data.
7 Conclusion and Future Works
We have proposed to use RL to search for the DAG with the optimal score. Our reward is designed to incorporate a predefined score function and two penalty terms to enforce acyclicity. We use the actorcritic algorithm as our RL algorithm, where the actor is constructed based on recent encoderdecoder models. Experiments are conducted on both synthetic and real data to show the advantages of our method over other causal discovery methods, including FGS, LiNGAM, and NOTEARS.
There are some limitations with the present work. The first limitation is the scalability of the proposed method to large graphs. Though we have shown the effectiveness of our method on node graphs, it becomes more difficult for graphs with more than nodes. A future work is to incorporate prior knowledge or use constraintbased methods to reduce the search space. Another major limitation is the training time, which is much longer than other methods like LiNGAM and FGS. We find that computing scores for given graphs are much more timeconsuming than training neural networks. Designing a powerful and efficient score function is key to the proposed RL based approach.
References
 Bello et al. (2016) I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
 Bollen (1989) K. A. Bollen. Structural Equations with Latent Variables. Wiley, 1989.

Chen et al. (2017)
Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap,
M. Botvinick, and N. de Freitas.
Learning to learn without gradient descent by gradient descent.
In
International Conference on Machine Learning
, 2017. 
Chickering (1996)
D. M. Chickering.
Learning Bayesian networks is NPcomplete.
In Learning from Data, pages 121–130. Springer, 1996.  Chickering (2002) D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
 Chickering et al. (2004) D. M. Chickering, D. Heckerman, and C. Meek. Largesample learning of Bayesian networks is NPhard. Journal of Machine Learning Research, 5(Oct):1287–1330, 2004.
 De Lozzo and Marrel (2016) M. De Lozzo and A. Marrel. New improvements in the use of dependence measures for sensitivity analysis and screening. Journal of Statistical Computation and Simulation, 86(15):3038–3058, 2016.
 Duncan et al. (1972) O. D. Duncan, D. L. Featherman, and B. Duncan. Socioeconomic background and achievement. Seminar Press, 1972.

Geiger and Heckerman (1994)
D. Geiger and D. Heckerman.
Learning Gaussian networks.
In
Conference on Uncertainty in Artificial Intelligence
, 1994.  Gretton et al. (2005) A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with HilbertSchmidt norms. In International Conference on Algorithmic Learning Theory, 2005.
 Han et al. (2016) S. W. Han, G. Chen, M.S. Cheon, and H. Zhong. Estimation of directed acyclic graphs through twostage adaptive lasso for gene network inference. Journal of the American Statistical Association, 111(515):1004–1019, 2016.
 Hoyer et al. (2009) P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21. 2009.
 Huang et al. (2018) B. Huang, K. Zhang, Y. Lin, B. Schölkopf, and C. Glymour. Generalized score functions for causal discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
 Hyttinen et al. (2014) A. Hyttinen, F. Eberhardt, and M. Järvisalo. Constraintbased causal discovery: Conflict resolution with answer set programming. In Conference on Uncertainty in Artificial Intelligence, 2014.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 OpgenRhein and Strimmer (2007) R. OpgenRhein and K. Strimmer. From correlation to causation networks: a simple approximate learning algorithm and its application to highdimensional plant gene expression data. BMC systems biology, 1(1):37, 2007.
 Pearl (2009) J. Pearl. Causality. Cambridge University Press, 2009.
 Peters et al. (2017) J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference  Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.

Ramsey et al. (2017)
J. Ramsey, M. Glymour, R. SanchezRomero, and C. Glymour.
A million variables and more: the fast greedy equivalence search
algorithm for learning highdimensional graphical causal models, with an
application to functional magnetic resonance images.
International Journal of Data Science and Analytics
, 3(2):121–129, 2017.  Schwarz et al. (1978) G. Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
 Shimizu et al. (2006) S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear nonGaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
 Shimizu et al. (2011) S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen. Directlingam: A direct method for learning a linear nonGaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.
 Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Spirtes et al. (2000) P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT press, Cambridge, MA, USA, 2nd edition, 2000.
 Sun et al. (2007) X. Sun, D. Janzing, B. Schölkopf, and K. Fukumizu. A kernelbased causal learning algorithm. In International Conference on Machine Learning, 2007.
 Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014.
 Sutton et al. (2000) R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.
 Tsamardinos et al. (2006) I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The maxmin hillclimbing bayesian network structure learning algorithm. Machine learning, 65(1):31–78, 2006.
 Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
 Vinyals et al. (2015) O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, 2015.
 Williams (1992) R. J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning, 8(34):229–256, 1992.
 Zhang and Hyvärinen (2009) K. Zhang and A. Hyvärinen. On the identifiability of the postnonlinear causal model. In Conference on Uncertainty in Artificial Intelligence, 2009.
 Zhang et al. (2012) K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernelbased conditional independence test and application in causal discovery. In Conference on Uncertainty in Artificial Intelligence, 2012.
 Zheng et al. (2018) X. Zheng, B. Aragam, P. Ravikumar, and E. P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, 2018.
 Zoph and Le (2016) B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.