1 Introduction
Learning DAGs from data is an important and classical problem in machine learning, with a diverse array of applications in causal inference
(spirtes2000), fairness and accountability (kusner2017), medicine (heckerman1992), and finance (sanford2012). In addition to their undirected counterparts, DAG models offer a parsimonious, interpretable representation of a joint distribution that is useful in practice. Unfortunately, existing methods for learning DAGs typically rely on specific model assumptions (e.g. linear or additive) and specialized algorithms (e.g. constraintbased or greedy optimization) that are not broadly applicable to different data. As a result, the burden is on the user to choose amongst many possible models and algorithms, which requires significant expertise. Thus, there is a need for a general framework for learning different DAG models—subsuming, for example, linear, parametric, and nonparametric—that does not require specialized algorithms. Ideally, the problem could be formulated as a conventional optimization problem that can be tackled with general purpose solvers, much like the current stateoftheart for undirected graphical models
(e.g. suggala2017expxorcist; yang2015; liu2009; hsieh2013; banerjee2008).In this paper, we develop such a general algorithmic framework for scorebased learning of DAG models. This framework is flexible enough to learn general nonparametric dependence while also easily adapting to parametric and semiparametric models, including nonlinear models. The framework is based on a recent algebraic characterization of acyclicity due to zheng2018dags that recasts the scorebased optimization problem as a continuous problem, instead of the traditional combinatorial approach. This allows generic optimization routines to be used in minimizing the score, providing a clean conceptual formulation of the problem that can be approached using any of the wellknown algorithms from the optimization literature. This existing work relies heavily on a linear parametrization in terms of a weighted adjacency matrix of the problem, which however can be seen to be a stringent restriction on the class of models, and one of the key technical contributions of the current work is extending this to general nonparametric problems, where no such parametrization in terms of a weighted adjacency matrix exists.
Contributions
Our main contributions can be summarized as follows:

We develop a generic optimization problem that can be applied to nonlinear and nonparametric SEM and discuss various special cases including additive models and index models. We show how this optimization problem can be solved to stationarity with generic solvers, eliminating the necessity for specialized algorithms.

We extend the existing smooth characterization of acyclicity from (zheng2018dags) to general nonparametric models, and apply this to several popular examples for modeling nonlinear dependencies (Section 3).

We consider in detail two classes of nonparametric estimators defined through 1) Neural networks and 2) Orthogonal basis expansions, and study their properties (Section
4). 
We run extensive empirical evaluations on a variety of nonparametric and semiparametric models against recent stateoftheart methods in order to demonstrate the effectiveness and generality of our framework (Section 5).
As with all scorebased approaches to learning DAGs, ours relies on a nonconvex optimization problem. Despite this, we show that offtheshelf solvers return stationary points that outperform other stateoftheart methods. Finally, the algorithm itself can be implemented in standard machine learning libraries such as PyTorch, which we hope will help the community to extend and generalize our approach to richer models moving forward.
Related work
The problem of learning nonlinear and nonparametric DAGs from data has generated significant interest in recent years, including additive models (buhlmann2014cam; voorman2014graph; ernest2016), additive noise models (hoyer2009; peters2014causal; blobaum2018cause; mooij2016distinguishing), postnonlinear models zhang2009identifiability; zhang2016estimation and general nonlinear SEM (monti2019causal; goudet2018learning; kalainathan2018SAM; sgouritsa2015inference). Recently, yu2019dag proposed to use graph neural networks for nonlinear measurement models and huang2018generalized proposed a generalized score function for general SEM. The latter work is based on recent work in kernelbased measures of dependence (gretton2005measuring; fukumizu2008kernel; zhang2012kernel)
. Another line of work uses quantile scoring
(tagasovska2018nonparametric). Also of relevance is the literature on nonparametric variable selection (bertin2008selection; lafferty2008rodeo; miller2010local; rosasco2013nonparametric; gregorova2018structured) and approaches based on neural networks (feng2017sparse; ye2018variable; abid2019concrete). The main distinction between our work and this previous work is that our framework is not tailored to a specific model class, as our focus is on a generic formulation of an optimization problem that can be solved with generic solvers. This also distinguishes this paper from concurrent work by lachapelle2019gradientthat focuses on neural networkbased nonlinearities in the local conditional probabilities. As such, we hope that this work is able to spur future work using more sophisticated nonparametric estimators and optimization schemes.
Notation
Norms will always be explicitly subscripted to avoid confusion: is the
norm on vectors,
is the norm on functions, is the norm on matrices, and is the matrix Frobenius norm. For functions and a matrix , we adopt the convention that is the vector whose th element is , where is the th row of .2 Background
Our approach is based on (acyclic) structural equation models as follows. Let be a random vector and a DAG with . We assume that there exist functions ^{2}^{2}2The reason for writing instead of is to simplify notation by ensuring each is defined on the same space. and such that
(1) 
Here, denotes the parents of in . Formally, the second line in (1) means that for any , the function is constant for all . Thus, encodes the conditional independence structure of . The functions , which are typically known, allow for possible nonadditive errors such as in generalized linear models (GLMs). The model (1) is quite general and includes additive noise models, linear and generalized linear models, and additive models as special cases (Section 3.3).
When the model (1) holds, the graph is not necessarily uniquely defined: A wellknown example is when
is jointly normally distributed, in which case the
are linear functions, and where it can be shown that the graph is not uniquely specified (kagan1973). Fortunately, it is known that this case is somewhat exceptional: Assuming additive noise, as long as the errors are nonGaussian (kagan1973) or the functions are nonlinear (hoyer2009; zhang2009; peters2014causal), then the graph is generally identifiable. We refer the reader to (peters2014causal) for details. In the sequel, we will assume that the graph is uniquely defined from the model (1), and this dependence will be emphasized by writing . Similarly, any collection of functions defines a graph in the obvious way.In this setting, the DAG learning problem can be stated as follows: Given a data matrix consisting of i.i.d. observations of the model (1), we seek to learn the DAG that encodes the dependency between the variables in . Our approach is to learn such that
using a scorebased approach. Given a loss function
such as least squares or the negative loglikelihood, we consider the following program:(2) 
There are two important challenges in this formulation: 1) How to enforce the acyclicity constraint that , and 2) How to enforce sparsity in the learned DAG ? Previous work using linear and generalized linear models rely on a parametric representation of via a weighted adjacency matrix , which is no longer welldefined in the model (1). Our key strategy will be to apply the trace exponential regularizer developed by zheng2018dags to a suitable surrogate of defined for general nonparametric models.
It is instructive at this point to highlight the main distinction between our approach and existing approaches. A common approach is to assume the are easily parametrized (e.g. linearity) (zheng2018dags; aragam2015; gu2018; park2017; park2018learning; chen2018causal; ghoshal2017ident). In this case, one can easily encode the structure of via, e.g. a weighted adjacency matrix, and learning reduces to a parametric estimation problem. Nonparametric extensions of this approach include additive models (buhlmann2014cam; voorman2014graph), where the graph structure is easily deduced from the additive structure of the . An alternative approach relies on exploiting the conditional independence structure of , such as the postnonlinear model (zhang2009identifiability; yu2019dag), the additive noise model (peters2014causal), and kernelbased measures of conditional independence (huang2018generalized). Our framework can be viewed as a significant generalization of the former approach: We use partial derivatives to measure dependence in the general nonparametric model (1
), and do not explicitly require any of the machinery of nonparametric conditional independence (although we note in some places this machinery is implicit). This allows us to use nonparametric estimators such as multilayer perceptrons and basis expansions, for which these derivatives are easily computed. As a result, the scorebased learning problem is reduced to an optimization problem that can be tackled using existing techniques, making our approach easily accessible.
3 Characterizing acyclicity in nonparametric SEM
In this section, we discuss how to extend the trace exponential regularizer from zheng2018dags beyond the linear setting, and then discuss several special cases.
3.1 Linear SEM and the trace exponential regularizer
We begin by briefly reviewing zheng2018dags in the linear case, i.e. and for some . This defines a matrix that precisely encodes the graph , i.e. there is an edge in if and only if . In this case, we can formulate the entire problem in terms of : If , then optimizing is equivalent to optimizing over linear functions. Define the function , where . Then zheng2018dags show that (2) is equivalent to
(3) 
The key insight from zheng2018dags is replacing the combinatorial constraint with the continuous constraint . Our goal is to define a suitable surrogate of for general nonparametric models, so that the same continuous program can be used to optimize (2).
3.2 A notion of nonparametric acyclicity
Unfortunately, for general models of the form (1), there is no , and hence the trace exponential formulation seems to break down. To remedy this, we use partial derivatives to measure the dependence of on the th variable, an idea that dates back to at least rosasco2013nonparametric. First, we need to make precise the spaces we are working on: Let denote the usual Sobolev space of squareintegrable functions whose derivatives are also square integrable (for background on Sobolev spaces see tsybakov2009introduction). Assume hereafter that and denote the partial derivative with respect to by . It is then easy to show that is independent of if and only if , where is the usual norm. This observation implies that the following matrix precisely encodes the dependency structure amongst the :
(4) 
Thus the program (2) is equivalent to
(5) 
This implies an equivalent continuous formulation of the program (2). Moreover, when the functions are all linear, is the same as the weighted adjacency matrix defined in Section 3.1. Thus, (5) is a genuine generalization of the linear case, and reduces to (3) in this case.
3.3 Special cases
In addition to applying to general nonparametric models of the form (2) and linear models, the program (5) applies to a variety of parametric and semiparametric models including additive noise models, generalized linear models, additive models, and index models.
Additive noise models
Generalized linear models
A traditional GLM assumes that for some known link functions and . For example, for
we can use logistic regression with link functions
. This is easily generalized to nonparametric mean functions by setting(7) 
Clearly, (6) is a special case of (7). Furthermore, for linear mean functions, if and only if , recovering the parametric approach in (zheng2018dags).
Polynomial regression
In polynomial regression, we assume that is a polynomial in . More generally, given a known dictionary of functions , we require that . Then it is easy to check that if and only if whenever depends on .
Additive models
In an additive model (hastie1987generalized; ravikumar2009sparse), we assume that for some . Then it is straightforward to show that if and only if . In other words, if and only if .
Index models
The multiple index model (alquier2013; yuan2011) assumes for some and . As long as is sufficiently large, these functions are universal approximators (diaconis1984nonlinear). When , this is known as a singleindex model. As long as the functions () are linearly independent, it is straightforward to show that if and only if for each . In other words, if and only if .
4 Optimization
In general, the program (5) is infinitedimensional, so in this section we discuss different ways to reduce this to a tractable, finitedimensional optimization problem. One of the advantages of encoding dependence via is that it provides a plugandplay framework for plugging in various nonparametric estimators whose derivatives can be computed. We will illustrate two examples using multilayer perceptrons and orthogonal basis expansions, however, we emphasize that it is straightforward to implement other differentiable models for the . These flexible nonparametric estimators will help reduce (5) to a straightforward optimization problem, as we discuss at the end of this section.
The basic recipe is the following:

Choose a model family for the conditional expectations (e.g. general nonparametric, additive, index, etc.);

Choose a suitable family of approximations (e.g. neural networks, orthogonal series, etc.);

Translate the loss function and constraint into parametric forms and using the approximating family;

Solve the resulting finitedimensional problem.
Step 3 above is the key step that enables transforming (5) into a tractable optimization problem. By approximating the with a flexible family of functions parametrized by , we can replace the infinitedimensional quantity with the simpler . As is standard in the literature on nonparametric estimation, the dimension of is allowed to depend on , although this dependence will be suppressed.
4.1 Multilayer perceptrons
We first consider the use of neural networks to approximate the , as in an ANM (6) or GLM (7). Consider a multilayer perceptron (MLP) with hidden layers and a single activation , given by
By increasing the capacity of the MLP (e.g. increasing the number of layers or the number of hidden units in each layer), we can approximate any arbitrarily well.
First, we must determine under what conditions is independent of —this is important both for enforcing acyclicity and sparsity. It is not hard to see that if the th column of consists of all zeros (i.e. for all ), then will be independent of . In fact, we have the following proposition, which implies that this constraint precisely identifies the set of MLPs that are independent of : Consider the function class of all MLPs that are independent of and the function class of all MLPs such that the th column of consists of all zeros. Then .
Proof.
For completeness, note that
and  
We omit the bias terms in each layer as it does not affect the statement.
We will show that and .
(1) : for any , we have , where for all . Hence the linear function is independent of . Therefore,
is also independent of , which means .
(2) : for any , we have and is independent of . We will show that by constructing a matrix , such that
(8) 
and for all .
Let be the vector such that and for all . Since and differ only on the th dimension, and is independent of , we have
(9) 
Now define be the matrix such that and for all . Then we have the following observation: for each entry ,
(10) 
Hence,
(11) 
Therefore, by (9)
By definition of , we know that . Thus, and we have completed the proof. ∎
This important proposition provides a rigorous way to enforce that an MLP approximation depends only on a few coordinates. Indeed, it is clear that constraining for each will remove the dependence on , however, there is a concern that we could lose the expressivity of multiple hidden layers in doing so. Fortunately, this proposition implies that there is in fact no loss of expressivity or approximating power. Furthermore, it follows that if .
Let denote the parameters for the th MLP and . Define . The problem (2) is now reduced to
(12) 
4.2 Basis expansions
As an alternative to neural networks, we also consider the use of orthogonal basis expansions (schwartz1967estimation; wahba1981data; hall1987cross; efromovich2008nonparametric). While many techniques are valid here, we adopt an approach based on ravikumar2009sparse. Let be an orthonormal basis of such that for each . Then any can be written uniquely
(13) 
As long as the coefficients decay sufficiently fast, can be wellapproximated by the finite series . Similar claims are true for onedimensional Sobolev functions, which applies to both additive (i.e. for ) and index (i.e. for ) models.
We illustrate here an application with additive models and onedimensional expansions. It is straightforward to extend these ideas to more general models using a tensor product basis, though this quickly becomes computationally infeasible. For more on highdimensional orthogonal series, see
lee2016spectral. Thus,(14) 
Given integers and assuming is sufficiently smooth, we have (efromovich2008nonparametric), so that the overall approximation error is on the order . Furthermore, for all . Since we are discarding terms for , in practice it suffices to check that for , or .
Letting denote the parameters for all , it thus suffices to define for the purposes of checking acyclicity. Let be the matrix . To estimate the coefficients , we solve
(15) 
This is similar to ravikumar2009sparse with the addition of an explicit constraint. The first regularization term in (15) can be interpreted as a functional norm.
4.3 Solving the continuous program
Having converted and to their finitedimensional counterparts, we are now ready to solve (5) by applying standard optimization techniques. We emphasize that the hard work went into formulating the programs (12) and (15) as generic problems for which offtheshelf solvers can be used. Importantly, since in both (12) and (15) the term is differentiable , the optimization program is an penalized smooth minimization under a differentiable equality constraint. As in zheng2018dags, the standard machinery of augmented Lagrangian can be applied, resulting in a series of unconstrained problems:
(16) 
where is the penalty parameter and is the dual variable.
A number of optimization algorithms can be applied to the above unconstrained penalized smooth minimization problem. A natural choice is the LBFGSB algorithm (byrd1995limited), which can be directly applied by casting (16) into a boxconstrained form:
(17) 
where is a vector of all ones. We note that as in zheng2018dags, (16) is a nonconvex program, and at best can be solved to stationarity. Our experiments indicate that this nonetheless leads to competitive and often superior performance in practice.
5 Experiments
We study the empirical performance of two instances of the general framework: MLP (4.1) and Sobolev basis expansion (4.2), denoted as and respectively. For
we use an MLP with one hidden layer with 10 hidden units and sigmoid activation function. For
we use Sobolev basis , for .Baselines
For comparison, the following methods are chosen as baselines:

Fast greedy equivalence search (ramsey2017million) is based on greedy search and assumes linear dependency between variables. The implementation is available at https://github.com/bd2kccd/pycausal.

Greedy equivalence search with generalized scores (huang2018generalized) is also based on greedy search, but uses generalized scores without assuming a particular model class. The implementation is available at https://github.com/BiweiHuang/GeneralizedScoreFunctionsforCausalDiscovery/.

DAGGNN (yu2019dag) learns a (noisy) nonlinear transformation of a linear SEM using neural networks. The implementation is available at https://github.com/fishmoon1234/DAGGNN.

NOTEARS (zheng2018dags) learns a linear SEM using continuous optimization. The implementation is available at https://github.com/xunzheng/notears.
To summarize, and are specialized at linear models, whereas and targets general nonlinear dependencies. Comparisons with other scorebased methods (KGV score (bach2003learning), Spearman correlation (sokolova2014causal)) and constraintbased methods (PC (spirtes2000), MMMB (aliferis2010local)) can be found from previous works such as huang2018generalized, hence we omit in this paper. For all experiments, default parameter settings are used.
Simulation
The ground truth DAG is generated from two random graph models: ErdosRenyi (ER) and scalefree (SF). We use ER2 to denote an ER graph with edges, likewise for SF. Given the ground truth DAG, we simulate the SEM for all in topological order. To evaluate the performance under different data generation mechanisms, we consider four models for the :

Additive GP: , where each is a draw from Gaussian process with RBF kernel with lengthscale one.

Index model: , where , , , and each is drawn uniformly from range .

MLP: is a randomly initialized MLP with one hidden layer of size 100 and sigmoid activation.

GP: is a draw from Gaussian process with RBF kernel with lengthscale one.
In all settings, is standard Gaussian noise.
Metrics
We evaluate the estimated DAG structure using the following common metrics: false discovery rate (FDR), true positive rate (TPR), false positive rate (FPR), and structural Hamming distance (SHD). Note that both and return a CPDAG that may contain undirected edges, in which case we evaluate them favorably by assuming correct orientation for undirected edges whenever possible.
5.1 Structure learning
In this experiment we examine the structure recovery of different methods by comparing the DAG estimates against the ground truth. We simulate {ER1, ER2, ER4, SF1, SF2, SF4} graphs with nodes. For each graph,
data samples are generated. The above process is repeated 10 times and we report the mean and standard deviations of the results. For
and , are used for respectively.Figure 1 shows the resulting SHD in various settings; the complete set of results for the remaining metrics are deferred to the supplement. Overall, the proposed method attains the best SHD (lower the better) across a wide range of settings, particularly when the data generating mechanism is an MLP or an index model. One can also observe that the performance of stays stable for different graph types with varying density and degree distribution, as it does not make explicit assumptions on the topological properties of the graph such as density or degree distribution. Not surprisingly, performs well when the underlying SEM is additive GP. On the other hand, when the ground truth is not an additive model, the performance of degrades as expected. Finally, we observe that outperforms and on GP, which is a nonparametric setting in which a kernelbased dependency measure can excel, however, we note that the kernelbased approach accompanies an time complexity, compared to linear dependency on in and . For instance, the average runtime of on ER2 with , is over 90 minutes, whereas takes less than five minutes on average. This is consistent with the observation from Table 1, which contains runtime comparison of different algorithms on ER2 graph with samples. Also, with by properly tuning the regularization parameter, the performance of
for each individual setting can be improved considerably, for example in the GP setting. Since such hyperparameter tuning is not the main focus of this paper, we fix a reasonable
for all settings (see Appendix A for more discussion).92.12 22.51  62.90 16.83  0.55 0.43  10.95 4.52  498.32 43.72  1547.42 109.83  
282.64 67.46  321.88 57.33  0.59 0.17  43.15 12.43  706.35 64.49  6379.98 359.67 
5.2 Sensitivity to number of hidden units
We also investigated the effect of number of hidden units in the estimate. It is wellknown that as the size of the hidden layer increases, the functions representable by an MLP become more flexible. On the other hand, larger networks require more samples to estimate the parameters. Indeed, Figure 2 confirms this intuition. We plot the SHD with varying number of hidden units ranging from zero (linear function) to 100 units, using and samples generated from the additive GP model on SF2 graph with
nodes. One can first observe a sharp phase transition between zero and very few hidden units, which suggests the power of nonlinearity. Moreover, as the number of hidden units increases to 20, the performance for both
and steadily improves, in which case the increased flexibility brings benefit. However, as we further increase the number of hidden units, while SHD for remains similar, the SHD for deteriorates, hinting at the lack of samples to take advantage of the increased flexibility.5.3 Real data
Finally, we evaluated on a real dataset from sachs2005causal that is commonly used as a benchmark as it comes with a consensus network that is accepted by the biological community. The dataset consists of continuous measurements of expression levels of proteins and phospholipids in human immune system cells for cell types. We report an SHD of 16 with 13 edges estimated by . In comparison, NOTEARS predicts 16 edges with SHD of 22 and predicts 18 edges that attains SHD of 19. (Due to the large number of samples, we could not run on this dataset.) Among the 13 edges predicted by , 7 edges agree with the consensus network: raf mek, mek erk, PLCg PIP2, PIP3 PLCg, PIP3 PIP2, PKC mek, PKC jnk; and 3 edges are predicted but in a reversed direction: raf PKC, akt erk, p38 PKC. Among the true positives, 3 edges are not found by other methods: mek erk, PIP3 PLCg, PKC mek.
6 Discussion
We present a framework for scorebased learning of sparse directed acyclic graphical models that subsumes many popular parametric, semiparametric, and nonparametric models as special cases. The key technical device is a notion of nonparametric acyclicity that leverages partial derivatives in the algebraic characterization of DAGs. With a suitable choice of the approximation family, the estimation problem becomes a finitedimensional differentiable program that can be solved by standard optimization algorithms. An interesting direction for future work is to study the nonconvex landscape of (16) more carefully in order to provide rigourous guarantees on the optimality of the solutions found by our approach.
References
Appendix A Additional results
We show FDR, TPR, FPR results in Figure 3, 4, 5 respectively. As in Figure 1, each row is a random graph model, each column is a type of SEM, and left and right block correspond to and . Overall has low FDR/FPR and high TPR, and same for on additive GP. Also observe that in most settings has low FDR as well as low TPR, which is a consequence of only predicting a small number of edges.
Comments on hyperparameter tuning
The experiments presented in this paper were conducted under a fixed (and therefore suboptimal) value of and weight threshold across all graph types, sparsity levels, and SEM types, despite the fact that each configuration may prefer different regularization strengths. Indeed, we observe substantially improved performance by choosing different values of hyperparameters in some settings. As our focus is not on attaining the best possible accuracy in all settings by carefully tuning the hyperparameters, we omit these results in the main text and only include here as a supplement. For instance, for ER4 graph with variables and samples, when the SEM is additive GP and MLP, setting and threshold = 0.5 gives results summarized in Table 2.
SEM  Method  SHD  FDR  TPR  FPR  Predicted # 

AdditiveGP  124.3 6.65  0.30 0.07  0.35 0.04  0.04 0.01  81.70 10.49  
121.3 5.02  0.36 0.05  0.28 0.03  0.04 0.00  69.30 5.01  
MLP  88.40 11.29  0.18 0.08  0.57 0.06  0.03 0.02  111.70 15.97  
121.60 11.95  0.33 0.09  0.33 0.06  0.04 0.01  77.10 7.13 
Comments
There are no comments yet.