Learning Sparse Nonparametric DAGs

09/29/2019 ∙ by Xun Zheng, et al. ∙ Carnegie Mellon University The University of Chicago Booth School of Business 18

We develop a framework for learning sparse nonparametric directed acyclic graphs (DAGs) from data. Our approach is based on a recent algebraic characterization of DAGs that led to the first fully continuous optimization for score-based learning of DAG models parametrized by a linear structural equation model (SEM). We extend this algebraic characterization to nonparametric SEM by leveraging nonparametric sparsity based on partial derivatives, resulting in a continuous optimization problem that can be applied to a variety of nonparametric and semiparametric models including GLMs, additive noise models, and index models as special cases. We also explore the use of neural networks and orthogonal basis expansions to model nonlinearities for general nonparametric models. Extensive empirical study confirms the necessity of nonlinear dependency and the advantage of continuous optimization for score-based learning.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Learning DAGs from data is an important and classical problem in machine learning, with a diverse array of applications in causal inference

(spirtes2000), fairness and accountability (kusner2017), medicine (heckerman1992), and finance (sanford2012)

. In addition to their undirected counterparts, DAG models offer a parsimonious, interpretable representation of a joint distribution that is useful in practice. Unfortunately, existing methods for learning DAGs typically rely on specific model assumptions (e.g. linear or additive) and specialized algorithms (e.g. constraint-based or greedy optimization) that are not broadly applicable to different data. As a result, the burden is on the user to choose amongst many possible models and algorithms, which requires significant expertise. Thus, there is a need for a general framework for learning different DAG models—subsuming, for example, linear, parametric, and nonparametric—that does not require specialized algorithms. Ideally, the problem could be formulated as a conventional optimization problem that can be tackled with general purpose solvers, much like the current state-of-the-art for undirected graphical models

(e.g. suggala2017expxorcist; yang2015; liu2009; hsieh2013; banerjee2008).

In this paper, we develop such a general algorithmic framework for score-based learning of DAG models. This framework is flexible enough to learn general nonparametric dependence while also easily adapting to parametric and semiparametric models, including nonlinear models. The framework is based on a recent algebraic characterization of acyclicity due to zheng2018dags that recasts the score-based optimization problem as a continuous problem, instead of the traditional combinatorial approach. This allows generic optimization routines to be used in minimizing the score, providing a clean conceptual formulation of the problem that can be approached using any of the well-known algorithms from the optimization literature. This existing work relies heavily on a linear parametrization in terms of a weighted adjacency matrix of the problem, which however can be seen to be a stringent restriction on the class of models, and one of the key technical contributions of the current work is extending this to general nonparametric problems, where no such parametrization in terms of a weighted adjacency matrix exists.


Our main contributions can be summarized as follows:

  • We develop a generic optimization problem that can be applied to nonlinear and nonparametric SEM and discuss various special cases including additive models and index models. We show how this optimization problem can be solved to stationarity with generic solvers, eliminating the necessity for specialized algorithms.

  • We extend the existing smooth characterization of acyclicity from (zheng2018dags) to general nonparametric models, and apply this to several popular examples for modeling nonlinear dependencies (Section 3).

  • We consider in detail two classes of nonparametric estimators defined through 1) Neural networks and 2) Orthogonal basis expansions, and study their properties (Section 


  • We run extensive empirical evaluations on a variety of nonparametric and semiparametric models against recent state-of-the-art methods in order to demonstrate the effectiveness and generality of our framework (Section 5).

As with all score-based approaches to learning DAGs, ours relies on a nonconvex optimization problem. Despite this, we show that off-the-shelf solvers return stationary points that outperform other state-of-the-art methods. Finally, the algorithm itself can be implemented in standard machine learning libraries such as PyTorch, which we hope will help the community to extend and generalize our approach to richer models moving forward.

Related work

The problem of learning nonlinear and nonparametric DAGs from data has generated significant interest in recent years, including additive models (buhlmann2014cam; voorman2014graph; ernest2016), additive noise models (hoyer2009; peters2014causal; blobaum2018cause; mooij2016distinguishing), post-nonlinear models zhang2009identifiability; zhang2016estimation and general nonlinear SEM (monti2019causal; goudet2018learning; kalainathan2018SAM; sgouritsa2015inference). Recently, yu2019dag proposed to use graph neural networks for nonlinear measurement models and huang2018generalized proposed a generalized score function for general SEM. The latter work is based on recent work in kernel-based measures of dependence (gretton2005measuring; fukumizu2008kernel; zhang2012kernel)

. Another line of work uses quantile scoring

(tagasovska2018nonparametric). Also of relevance is the literature on nonparametric variable selection (bertin2008selection; lafferty2008rodeo; miller2010local; rosasco2013nonparametric; gregorova2018structured) and approaches based on neural networks (feng2017sparse; ye2018variable; abid2019concrete). The main distinction between our work and this previous work is that our framework is not tailored to a specific model class, as our focus is on a generic formulation of an optimization problem that can be solved with generic solvers. This also distinguishes this paper from concurrent work by lachapelle2019gradient

that focuses on neural network-based nonlinearities in the local conditional probabilities. As such, we hope that this work is able to spur future work using more sophisticated nonparametric estimators and optimization schemes.


Norms will always be explicitly subscripted to avoid confusion: is the

-norm on vectors,

is the -norm on functions, is the -norm on matrices, and is the matrix Frobenius norm. For functions and a matrix , we adopt the convention that is the vector whose th element is , where is the th row of .

2 Background

Our approach is based on (acyclic) structural equation models as follows. Let be a random vector and a DAG with . We assume that there exist functions 222The reason for writing instead of is to simplify notation by ensuring each is defined on the same space. and such that


Here, denotes the parents of in . Formally, the second line in (1) means that for any , the function is constant for all . Thus, encodes the conditional independence structure of . The functions , which are typically known, allow for possible non-additive errors such as in generalized linear models (GLMs). The model (1) is quite general and includes additive noise models, linear and generalized linear models, and additive models as special cases (Section 3.3).

When the model (1) holds, the graph is not necessarily uniquely defined: A well-known example is when

is jointly normally distributed, in which case the

are linear functions, and where it can be shown that the graph is not uniquely specified (kagan1973). Fortunately, it is known that this case is somewhat exceptional: Assuming additive noise, as long as the errors are non-Gaussian (kagan1973) or the functions are nonlinear (hoyer2009; zhang2009; peters2014causal), then the graph is generally identifiable. We refer the reader to (peters2014causal) for details. In the sequel, we will assume that the graph is uniquely defined from the model (1), and this dependence will be emphasized by writing . Similarly, any collection of functions defines a graph in the obvious way.

In this setting, the DAG learning problem can be stated as follows: Given a data matrix consisting of i.i.d. observations of the model (1), we seek to learn the DAG that encodes the dependency between the variables in . Our approach is to learn such that

using a score-based approach. Given a loss function

such as least squares or the negative log-likelihood, we consider the following program:


There are two important challenges in this formulation: 1) How to enforce the acyclicity constraint that , and 2) How to enforce sparsity in the learned DAG ? Previous work using linear and generalized linear models rely on a parametric representation of via a weighted adjacency matrix , which is no longer well-defined in the model (1). Our key strategy will be to apply the trace exponential regularizer developed by zheng2018dags to a suitable surrogate of defined for general nonparametric models.

It is instructive at this point to highlight the main distinction between our approach and existing approaches. A common approach is to assume the are easily parametrized (e.g. linearity) (zheng2018dags; aragam2015; gu2018; park2017; park2018learning; chen2018causal; ghoshal2017ident). In this case, one can easily encode the structure of via, e.g. a weighted adjacency matrix, and learning reduces to a parametric estimation problem. Nonparametric extensions of this approach include additive models (buhlmann2014cam; voorman2014graph), where the graph structure is easily deduced from the additive structure of the . An alternative approach relies on exploiting the conditional independence structure of , such as the post-nonlinear model (zhang2009identifiability; yu2019dag), the additive noise model (peters2014causal), and kernel-based measures of conditional independence (huang2018generalized). Our framework can be viewed as a significant generalization of the former approach: We use partial derivatives to measure dependence in the general nonparametric model (1

), and do not explicitly require any of the machinery of nonparametric conditional independence (although we note in some places this machinery is implicit). This allows us to use nonparametric estimators such as multilayer perceptrons and basis expansions, for which these derivatives are easily computed. As a result, the score-based learning problem is reduced to an optimization problem that can be tackled using existing techniques, making our approach easily accessible.

3 Characterizing acyclicity in nonparametric SEM

In this section, we discuss how to extend the trace exponential regularizer from zheng2018dags beyond the linear setting, and then discuss several special cases.

3.1 Linear SEM and the trace exponential regularizer

We begin by briefly reviewing zheng2018dags in the linear case, i.e. and for some . This defines a matrix that precisely encodes the graph , i.e. there is an edge in if and only if . In this case, we can formulate the entire problem in terms of : If , then optimizing is equivalent to optimizing over linear functions. Define the function , where . Then zheng2018dags show that (2) is equivalent to


The key insight from zheng2018dags is replacing the combinatorial constraint with the continuous constraint . Our goal is to define a suitable surrogate of for general nonparametric models, so that the same continuous program can be used to optimize (2).

3.2 A notion of nonparametric acyclicity

Unfortunately, for general models of the form (1), there is no , and hence the trace exponential formulation seems to break down. To remedy this, we use partial derivatives to measure the dependence of on the th variable, an idea that dates back to at least rosasco2013nonparametric. First, we need to make precise the spaces we are working on: Let denote the usual Sobolev space of square-integrable functions whose derivatives are also square integrable (for background on Sobolev spaces see tsybakov2009introduction). Assume hereafter that and denote the partial derivative with respect to by . It is then easy to show that is independent of if and only if , where is the usual -norm. This observation implies that the following matrix precisely encodes the dependency structure amongst the :


Thus the program (2) is equivalent to


This implies an equivalent continuous formulation of the program (2). Moreover, when the functions are all linear, is the same as the weighted adjacency matrix defined in Section 3.1. Thus, (5) is a genuine generalization of the linear case, and reduces to (3) in this case.

3.3 Special cases

In addition to applying to general nonparametric models of the form (2) and linear models, the program (5) applies to a variety of parametric and semiparametric models including additive noise models, generalized linear models, additive models, and index models.

Additive noise models

The nonparametric additive noise model (ANM) (hoyer2009; peters2014causal) assumes that


Clearly this is a special case of (1) with . In contrast to the remaining examples below, without additional assumptions, it is not possible to simplify the condition for in (4).

Generalized linear models

A traditional GLM assumes that for some known link functions and . For example, for

we can use logistic regression with link functions

. This is easily generalized to nonparametric mean functions by setting


Clearly, (6) is a special case of (7). Furthermore, for linear mean functions, if and only if , recovering the parametric approach in (zheng2018dags).

Polynomial regression

In polynomial regression, we assume that is a polynomial in . More generally, given a known dictionary of functions , we require that . Then it is easy to check that if and only if whenever depends on .

Additive models

In an additive model (hastie1987generalized; ravikumar2009sparse), we assume that for some . Then it is straightforward to show that if and only if . In other words, if and only if .

Index models

The multiple index model (alquier2013; yuan2011) assumes for some and . As long as is sufficiently large, these functions are universal approximators (diaconis1984nonlinear). When , this is known as a single-index model. As long as the functions () are linearly independent, it is straightforward to show that if and only if for each . In other words, if and only if .

Among these examples, both polynomial regression and GLMs with linear mean function are nonlinear but finite-dimensional, and hence the problem (5) is straightforward to solve (see Section 4.3).

4 Optimization

In general, the program (5) is infinite-dimensional, so in this section we discuss different ways to reduce this to a tractable, finite-dimensional optimization problem. One of the advantages of encoding dependence via is that it provides a plug-and-play framework for plugging in various nonparametric estimators whose derivatives can be computed. We will illustrate two examples using multilayer perceptrons and orthogonal basis expansions, however, we emphasize that it is straightforward to implement other differentiable models for the . These flexible nonparametric estimators will help reduce (5) to a straightforward optimization problem, as we discuss at the end of this section.

The basic recipe is the following:

  1. Choose a model family for the conditional expectations (e.g. general nonparametric, additive, index, etc.);

  2. Choose a suitable family of approximations (e.g. neural networks, orthogonal series, etc.);

  3. Translate the loss function and constraint into parametric forms and using the approximating family;

  4. Solve the resulting finite-dimensional problem.

Step 3 above is the key step that enables transforming (5) into a tractable optimization problem. By approximating the with a flexible family of functions parametrized by , we can replace the infinite-dimensional quantity with the simpler . As is standard in the literature on nonparametric estimation, the dimension of is allowed to depend on , although this dependence will be suppressed.

4.1 Multilayer perceptrons

We first consider the use of neural networks to approximate the , as in an ANM (6) or GLM (7). Consider a multilayer perceptron (MLP) with hidden layers and a single activation , given by

By increasing the capacity of the MLP (e.g. increasing the number of layers or the number of hidden units in each layer), we can approximate any arbitrarily well.

First, we must determine under what conditions is independent of —this is important both for enforcing acyclicity and sparsity. It is not hard to see that if the th column of consists of all zeros (i.e. for all ), then will be independent of . In fact, we have the following proposition, which implies that this constraint precisely identifies the set of MLPs that are independent of : Consider the function class of all MLPs that are independent of and the function class of all MLPs such that the th column of consists of all zeros. Then .


For completeness, note that


We omit the bias terms in each layer as it does not affect the statement.

We will show that and .

(1) : for any , we have , where for all . Hence the linear function is independent of . Therefore,

is also independent of , which means .

(2) : for any , we have and is independent of . We will show that by constructing a matrix , such that


and for all .

Let be the vector such that and for all . Since and differ only on the th dimension, and is independent of , we have


Now define be the matrix such that and for all . Then we have the following observation: for each entry ,




Therefore, by (9)

By definition of , we know that . Thus, and we have completed the proof. ∎

This important proposition provides a rigorous way to enforce that an MLP approximation depends only on a few coordinates. Indeed, it is clear that constraining for each will remove the dependence on , however, there is a concern that we could lose the expressivity of multiple hidden layers in doing so. Fortunately, this proposition implies that there is in fact no loss of expressivity or approximating power. Furthermore, it follows that if .

Let denote the parameters for the th MLP and . Define . The problem (2) is now reduced to


4.2 Basis expansions

As an alternative to neural networks, we also consider the use of orthogonal basis expansions (schwartz1967estimation; wahba1981data; hall1987cross; efromovich2008nonparametric). While many techniques are valid here, we adopt an approach based on ravikumar2009sparse. Let be an orthonormal basis of such that for each . Then any can be written uniquely


As long as the coefficients decay sufficiently fast, can be well-approximated by the finite series . Similar claims are true for one-dimensional Sobolev functions, which applies to both additive (i.e. for ) and index (i.e. for ) models.

We illustrate here an application with additive models and one-dimensional expansions. It is straightforward to extend these ideas to more general models using a tensor product basis, though this quickly becomes computationally infeasible. For more on high-dimensional orthogonal series, see

lee2016spectral. Thus,


Given integers and assuming is sufficiently smooth, we have (efromovich2008nonparametric), so that the overall approximation error is on the order . Furthermore, for all . Since we are discarding terms for , in practice it suffices to check that for , or .

Letting denote the parameters for all , it thus suffices to define for the purposes of checking acyclicity. Let be the matrix . To estimate the coefficients , we solve


This is similar to ravikumar2009sparse with the addition of an explicit constraint. The first regularization term in (15) can be interpreted as a functional -norm.

4.3 Solving the continuous program

Having converted and to their finite-dimensional counterparts, we are now ready to solve (5) by applying standard optimization techniques. We emphasize that the hard work went into formulating the programs (12) and (15) as generic problems for which off-the-shelf solvers can be used. Importantly, since in both (12) and (15) the term is differentiable , the optimization program is an -penalized smooth minimization under a differentiable equality constraint. As in zheng2018dags, the standard machinery of augmented Lagrangian can be applied, resulting in a series of unconstrained problems:


where is the penalty parameter and is the dual variable.

A number of optimization algorithms can be applied to the above unconstrained -penalized smooth minimization problem. A natural choice is the L-BFGS-B algorithm (byrd1995limited), which can be directly applied by casting (16) into a box-constrained form:


where is a vector of all ones. We note that as in zheng2018dags, (16) is a nonconvex program, and at best can be solved to stationarity. Our experiments indicate that this nonetheless leads to competitive and often superior performance in practice.

5 Experiments

We study the empirical performance of two instances of the general framework: MLP (4.1) and Sobolev basis expansion (4.2), denoted as and respectively. For

we use an MLP with one hidden layer with 10 hidden units and sigmoid activation function. For

we use Sobolev basis , for .


For comparison, the following methods are chosen as baselines:

To summarize, and are specialized at linear models, whereas and targets general nonlinear dependencies. Comparisons with other score-based methods (KGV score (bach2003learning), Spearman correlation (sokolova2014causal)) and constraint-based methods (PC (spirtes2000), MM-MB (aliferis2010local)) can be found from previous works such as huang2018generalized, hence we omit in this paper. For all experiments, default parameter settings are used.


The ground truth DAG is generated from two random graph models: Erdos-Renyi (ER) and scale-free (SF). We use ER2 to denote an ER graph with edges, likewise for SF. Given the ground truth DAG, we simulate the SEM for all in topological order. To evaluate the performance under different data generation mechanisms, we consider four models for the :

  • Additive GP: , where each is a draw from Gaussian process with RBF kernel with length-scale one.

  • Index model: , where , , , and each is drawn uniformly from range .

  • MLP: is a randomly initialized MLP with one hidden layer of size 100 and sigmoid activation.

  • GP: is a draw from Gaussian process with RBF kernel with length-scale one.

In all settings, is standard Gaussian noise.


We evaluate the estimated DAG structure using the following common metrics: false discovery rate (FDR), true positive rate (TPR), false positive rate (FPR), and structural Hamming distance (SHD). Note that both and return a CPDAG that may contain undirected edges, in which case we evaluate them favorably by assuming correct orientation for undirected edges whenever possible.

5.1 Structure learning

Figure 1: Structure recovery measured by SHD (lower is better) to ground truth. Left: . Right: . Rows: random graph model (Erdos-Renyi and scale free). Columns: different types of SEM. performs well on a wide range of settings, while shows good accuracy on additive models.

In this experiment we examine the structure recovery of different methods by comparing the DAG estimates against the ground truth. We simulate {ER1, ER2, ER4, SF1, SF2, SF4} graphs with nodes. For each graph,

data samples are generated. The above process is repeated 10 times and we report the mean and standard deviations of the results. For

and , are used for respectively.

Figure 1 shows the resulting SHD in various settings; the complete set of results for the remaining metrics are deferred to the supplement. Overall, the proposed method attains the best SHD (lower the better) across a wide range of settings, particularly when the data generating mechanism is an MLP or an index model. One can also observe that the performance of stays stable for different graph types with varying density and degree distribution, as it does not make explicit assumptions on the topological properties of the graph such as density or degree distribution. Not surprisingly, performs well when the underlying SEM is additive GP. On the other hand, when the ground truth is not an additive model, the performance of degrades as expected. Finally, we observe that outperforms and on GP, which is a nonparametric setting in which a kernel-based dependency measure can excel, however, we note that the kernel-based approach accompanies an time complexity, compared to linear dependency on in and . For instance, the average runtime of on ER2 with , is over 90 minutes, whereas takes less than five minutes on average. This is consistent with the observation from Table 1, which contains runtime comparison of different algorithms on ER2 graph with samples. Also, with by properly tuning the regularization parameter, the performance of

for each individual setting can be improved considerably, for example in the GP setting. Since such hyperparameter tuning is not the main focus of this paper, we fix a reasonable

for all settings (see Appendix A for more discussion).

92.12 22.51 62.90 16.83 0.55 0.43 10.95 4.52 498.32 43.72 1547.42 109.83
282.64 67.46 321.88 57.33 0.59 0.17 43.15 12.43 706.35 64.49 6379.98 359.67
Table 1: Runtime (in seconds) of various algorithms on ER2 graph with samples.

5.2 Sensitivity to number of hidden units

Figure 2: SHD (lower is better) with varying hidden layer size in .

We also investigated the effect of number of hidden units in the estimate. It is well-known that as the size of the hidden layer increases, the functions representable by an MLP become more flexible. On the other hand, larger networks require more samples to estimate the parameters. Indeed, Figure 2 confirms this intuition. We plot the SHD with varying number of hidden units ranging from zero (linear function) to 100 units, using and samples generated from the additive GP model on SF2 graph with

nodes. One can first observe a sharp phase transition between zero and very few hidden units, which suggests the power of nonlinearity. Moreover, as the number of hidden units increases to 20, the performance for both

and steadily improves, in which case the increased flexibility brings benefit. However, as we further increase the number of hidden units, while SHD for remains similar, the SHD for deteriorates, hinting at the lack of samples to take advantage of the increased flexibility.

5.3 Real data

Finally, we evaluated on a real dataset from sachs2005causal that is commonly used as a benchmark as it comes with a consensus network that is accepted by the biological community. The dataset consists of continuous measurements of expression levels of proteins and phospholipids in human immune system cells for cell types. We report an SHD of 16 with 13 edges estimated by . In comparison, NOTEARS predicts 16 edges with SHD of 22 and predicts 18 edges that attains SHD of 19. (Due to the large number of samples, we could not run on this dataset.) Among the 13 edges predicted by , 7 edges agree with the consensus network: raf mek, mek erk, PLCg PIP2, PIP3 PLCg, PIP3 PIP2, PKC mek, PKC jnk; and 3 edges are predicted but in a reversed direction: raf PKC, akt erk, p38 PKC. Among the true positives, 3 edges are not found by other methods: mek erk, PIP3 PLCg, PKC mek.

6 Discussion

We present a framework for score-based learning of sparse directed acyclic graphical models that subsumes many popular parametric, semiparametric, and nonparametric models as special cases. The key technical device is a notion of nonparametric acyclicity that leverages partial derivatives in the algebraic characterization of DAGs. With a suitable choice of the approximation family, the estimation problem becomes a finite-dimensional differentiable program that can be solved by standard optimization algorithms. An interesting direction for future work is to study the nonconvex landscape of (16) more carefully in order to provide rigourous guarantees on the optimality of the solutions found by our approach.


Appendix A Additional results

We show FDR, TPR, FPR results in Figure 3, 4, 5 respectively. As in Figure 1, each row is a random graph model, each column is a type of SEM, and left and right block correspond to and . Overall has low FDR/FPR and high TPR, and same for on additive GP. Also observe that in most settings has low FDR as well as low TPR, which is a consequence of only predicting a small number of edges.

Comments on hyperparameter tuning

The experiments presented in this paper were conducted under a fixed (and therefore suboptimal) value of and weight threshold across all graph types, sparsity levels, and SEM types, despite the fact that each configuration may prefer different regularization strengths. Indeed, we observe substantially improved performance by choosing different values of hyperparameters in some settings. As our focus is not on attaining the best possible accuracy in all settings by carefully tuning the hyperparameters, we omit these results in the main text and only include here as a supplement. For instance, for ER4 graph with variables and samples, when the SEM is additive GP and MLP, setting and threshold = 0.5 gives results summarized in Table 2.

SEM Method SHD FDR TPR FPR Predicted #
Additive-GP 124.3 6.65 0.30 0.07 0.35 0.04 0.04 0.01 81.70 10.49
121.3 5.02 0.36 0.05 0.28 0.03 0.04 0.00 69.30 5.01
MLP 88.40 11.29 0.18 0.08 0.57 0.06 0.03 0.02 111.70 15.97
121.60 11.95 0.33 0.09 0.33 0.06 0.04 0.01 77.10 7.13
Table 2: ER4, , with and threshold = 0.5.
Figure 3: Structure recovery measured by FDR (lower is better) to ground truth.
Figure 4: Structure recovery measured by TPR (higher is better) to ground truth.
Figure 5: Structure recovery measured by FPR (lower is better) to ground truth.