
On the Sparse DAG Structure Learning Based on Adaptive Lasso

Learning the underlying causal structure, represented by Directed Acyclic Graphs (DAGs), of the events of interest from fully observational data is a crucial part of causal reasoning, but it is challenging due to the combinatorial and large search space. A recent flurry of developments recast this combinatorial problem into a continuous optimization problem by leveraging an algebraic equality characterization of acyclicity. However, these methods suffer from a fixed-threshold step after optimization, which is not a flexible or systematic way to rule out cycle-inducing edges or false-discovery edges with small values caused by numerical precision. In this paper, we develop a data-driven DAG structure learning method without a predefined threshold, called adaptive NOTEARS [30], achieved by applying adaptive penalty levels to each parameter in the regularization term. We show that adaptive NOTEARS enjoys the oracle properties under some specific conditions. Furthermore, simulation results validate the effectiveness of our method, without setting any gap of edge weights around zero.


1 Introduction

Causal structure learning has become a burgeoning topic in the machine learning community Xu et al. (2020); Vowels et al. (2021); Glymour et al. (2019), with the goal of discovering cause-effect relationships in multivariable systems. Meanwhile, some hard problems, e.g., generalization, have arisen in previous machine learning studies and suffer from the lack of causal knowledge. Recently, researchers in this community have begun to think about the links that could and should be established between these two research fields Schölkopf (2022).

Bayesian Networks (BNs), proposed by Judea Pearl Pearl (1988), have become increasingly popular as a fundamental tool in causal reasoning. A BN is a kind of probabilistic graphical model that leverages both probability theory and graph theory. The benefit of this tool is that it provides a mechanism such that, under some assumptions, we can answer the queries "how?", "why?" and "what if?" Pearl and Mackenzie (2018). These three ladders form a progressive structure which is consistent with human reasoning. By spanning all three ladders, we can determine the existence of a causal relationship. Thus, by making a causal assumption, which means that A → B denotes A is a direct cause of B, causal relationships can be described by a BN, and causal structure learning is converted to learning the graphical structure of a BN, which is usually a Directed Acyclic Graph (DAG).

In general, there are mainly two classes of graphical structure learning methods: constraint-based algorithms and score-based algorithms. For constraint-based algorithms, under the causal faithfulness and causal sufficiency assumptions Peters et al. (2017), the output is a set of DAGs (an equivalence class) obtained by utilizing a sequence of conditional independence tests. Furthermore, constraint-based algorithms can be divided into two groups in terms of whether they are global or local, such as SGS Spirtes et al. (1989), PC Spirtes and Glymour (1991), and GS Margaritis and Thrun (1999). For score-based methods, the main idea is to establish a score function as the objective function and then find its maximum over the solution space. Common score functions include the BGe score Geiger and Heckerman (1994), the BDe score Heckerman et al. (1995), MDL Bouckaert (1993) and BIC Geiger and Heckerman (1994). Based on the ideas of order search and greedy search, some approaches that optimize the score functions over the DAG space have been proposed, such as OBS Teyssier and Koller (2012) and FGES Ramsey et al. (2017).

However, since the search space of DAGs is discrete, learning a DAG is still an NP-hard problem Chickering et al. (2004). Furthermore, from a computational point of view, many standard numerical algorithms cannot be utilized here because of the acyclicity constraint, and the search for the optimal DAG in the DAG space scales super-exponentially with the number of variables.

To address the aforementioned issue, Zheng et al. (2018) proposed a new approach, named NOTEARS, which replaces the traditional combinatorial constraint with a smooth function whose value is constrained to be zero. In this way, the combinatorial optimization problem is converted into a continuous one, such that a wide range of numerical methods can be applied. As NOTEARS only considers the linear model, some works Kalainathan et al. (2018); Zheng et al. (2020) extend it to non-linear or non-parametric settings by leveraging neural networks or orthogonal series. Further related works include GOLEM Ng et al. (2020), a likelihood-based method, which suggests using the likelihood as the learning objective with soft constraints.

It is worth noting that NOTEARS and most of its follow-up works Zheng et al. (2020); Zhu et al. (2020); Lachapelle et al. (2020); Yu et al. (2019); Ng et al. (2022, 2020) utilize hard thresholding to post-process the initial estimators. The fixed-thresholding strategy is a direct method to cut down false positive estimates caused by numerical error. From Figure 1, we can observe that without the thresholding step, even with a large sample size, the false positive edge (the top right one) always exists and cannot be ruled out by regularization. What is more, as the penalty level grows, the bias of the estimates becomes larger and larger, ending up with an all-zero matrix. This implies that, in NOTEARS, purely increasing the penalty term is not enough to fix this small-false-positive issue, and extra procedures are needed to obtain a sparse solution.

Figure 1: Visualization of the true (left-hand side of the legend) and estimated (right-hand side of the legend) weighted adjacency matrices of a 2-node DAG with n=1000 and varying penalty levels. All the solutions are obtained by NOTEARS without thresholding.

On the other hand, the hard-threshold strategy is not a flexible or systematic way to reduce the number of false discoveries. Firstly, choosing a fixed, sub-optimal value for thresholding cannot adapt to different types of graphs. Secondly, although the hard-thresholding strategy diminishes the number of false positive edges, small coefficients in the true model may also be deleted indiscriminately (see Section 4.2).

In this work, we investigate whether the fixed-threshold strategy performs well on estimating coefficients near zero. Motivated by this, we develop a structure learning method, based on the adaptive lasso, that requires no fixed threshold. As we will show, our main contributions are as follows.

  • We propose a purely data-driven method that does not need a predefined threshold and illustrate the specific algorithm.

  • We study the oracle properties our method possesses, including sparsity and asymptotic normality, which means that our method has the ability to select the right subset model and converges to the true model at the optimal rate as the sample size increases.

  • We consider using a larger and more suitable validation set in cross-validation for our goal of finding the optimal subset model.

  • We demonstrate the effectiveness of our method through simulation experiments with weights generated from a broader range, e.g., Gaussian-distributed weights with mean zero.

The rest of this paper is organised as follows: in Section 2, we review some basic concepts of the DAG model and the specific algorithm of NOTEARS, as well as the limitation of its fixed-threshold strategy. Then, based on the idea of the adaptive Lasso (Zou, 2006b), we introduce our method and study its asymptotic behaviour as the sample size goes to infinity in Section 3. In Section 4, to verify our theoretical study and the effectiveness of our method, we compare our method with NOTEARS and existing state-of-the-art methods through simulation experiments. Finally, we conclude our work in Section 5.

2 Background

In this section, after briefly reviewing some fundamental concepts of linear DAG model and score function in Section 2.1, we detail the NOTEARS method and corresponding optimization algorithm in Section 2.2. We then introduce how to extend NOTEARS to generalized linear models in Section 2.3.

2.1 Linear DAG Model

A DAG can be encoded as $\mathcal{G} = (V, E)$, where $V$ refers to the set of nodes and $E$ is the set of directed edges. The nodes of a DAG represent the variables under consideration, and the directed edges correspond to the causal relations between parent and child nodes. For example, $X_i \rightarrow X_j$ represents that $X_i$ is a direct cause of $X_j$, where $X_i$ and $X_j$ are the parent and the child node, respectively. Figure 2 illustrates three possible three-node DAGs with two edges.

Figure 2: There is no cycle in any of these three graphs: in the left graph, two nodes are both parents of the third; in the middle graph, one node is the parent of the other two; in the right graph, the middle node is the parent of one node and the child of the other.

The task of the DAG learning problem is to seek a DAG from the (discrete) space of DAGs on $d$ nodes that fits the observational data $\mathbf{X} \in \mathbb{R}^{n \times d}$ best, where $\mathbf{X}$ consists of $n$ $d$-dimensional vectors. We assume these vectors are observations of independently and identically distributed (iid) random vectors with joint distribution $P(X)$, where $X = (X_1, \dots, X_d)$. Importantly, this joint distribution can be decomposed as follows:

$$P(X_1, \dots, X_d) = \prod_{j=1}^{d} P\big(X_j \mid \mathrm{pa}(X_j)\big) \qquad (1)$$

where $\mathrm{pa}(X_j)$ represents the parental set of $X_j$. Then, we model each conditional distribution via structural equation models (SEMs), denoted as $X_j = f_j\big(\mathrm{pa}(X_j), N_j\big)$, where $N_j$ refers to the noise and unobserved variables.

In this paper, we focus on linear DAG additive noise models (ANMs). By introducing the concept of a weighted adjacency matrix $W \in \mathbb{R}^{d \times d}$, we can establish the SEM as $X = W^{T} X + N$, where $N$ is a $d$-dimensional random noise vector, denoted as $N = (N_1, \dots, N_d)$. Each $N_j$ is the noise term added to $X_j$. In this paper, we assume the $N_j$'s are iid, but they do not necessarily follow a Gaussian distribution.
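For concreteness, here is a minimal sketch of sampling from this linear SEM; the function name, the row-vector convention, and the Gaussian default for the noise are illustrative choices on our part, not the paper's code:

```python
import numpy as np

def sample_linear_sem(W, n, noise_scale=1.0, rng=None):
    """Draw n iid samples from X = W^T X + N, i.e. X = (I - W^T)^{-1} N."""
    rng = np.random.default_rng(rng)
    d = W.shape[0]
    N = noise_scale * rng.standard_normal((n, d))   # iid noise (Gaussian here for simplicity)
    # Row-vector convention: each sample x satisfies x = x W + n_x, so x = n_x (I - W)^{-1};
    # I - W is invertible because an acyclic W can be permuted to strict triangular form.
    return N @ np.linalg.inv(np.eye(d) - W)
```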

The information about the structure of the DAG is completely contained in $W$, which is defined in the following way: $w_{ij} \neq 0$ if and only if there is a directed edge $X_i \rightarrow X_j$ in $\mathcal{G}$, and $w_{ij} = 0$ otherwise.

Hence, our target is to derive $W$ by minimizing the least-squares loss function $\ell(W; \mathbf{X}) = \frac{1}{2n}\|\mathbf{X} - \mathbf{X}W\|_F^2$ over the search space of DAGs. Since we want to find a sparse solution, a regularization term is added. As a consequence, the score function is

$$F(W) = \frac{1}{2n}\|\mathbf{X} - \mathbf{X}W\|_F^2 + \lambda \|W\|_1 \qquad (2)$$

2.2 The NOTEARS Method with Fixed-threshold

Zheng et al. (2018) proposed the NOTEARS method, which seeks a function $h(W)$ satisfying four desiderata: i) $h(W) = 0$ is a necessary and sufficient condition for $W$ to represent a DAG; ii) its value reflects the level of acyclicity of the graph; iii) it is smooth; iv) its first derivative exists and is easy to compute. Based on Harary and Manvel (1971), which uses matrix algebra to investigate the number of cycles in a graph, NOTEARS verifies that $h(W) = \mathrm{tr}\big(e^{W \circ W}\big) - d$ is a proper function to characterize acyclicity, with derivative $\nabla h(W) = \big(e^{W \circ W}\big)^{T} \circ 2W$.
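As a small numerical sketch (for illustration, not the authors' implementation), the constraint and its gradient can be evaluated with SciPy's matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

def h_and_grad(W):
    """h(W) = tr(exp(W∘W)) - d and its gradient ∇h(W) = exp(W∘W)^T ∘ 2W."""
    d = W.shape[0]
    E = expm(W * W)            # matrix exponential of the Hadamard square
    h = np.trace(E) - d        # equals 0 exactly when the weighted graph is acyclic
    grad = E.T * 2.0 * W
    return h, grad
```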

After determining the smooth equality constraint, the problem can be written as the following optimization program:

$$\min_{W \in \mathbb{R}^{d \times d}} F(W) \quad \text{subject to} \quad h(W) = 0 \qquad (3)$$

In order to solve program (3), the constrained optimization is converted to a sequence of unconstrained problems by the augmented Lagrangian method Bertsekas (1997). The procedure is shown as follows.

$$W^{k+1} = \arg\min_{W} L^{\rho}(W, \alpha^{k}) := F(W) + \frac{\rho}{2}\, h(W)^2 + \alpha^{k}\, h(W), \qquad \alpha^{k+1} = \alpha^{k} + \rho\, h(W^{k+1}) \qquad (4)$$

For each iteration, the first step can be solved by the L-BFGS proximal quasi-Newton method Byrd et al. (1995); Zhong et al. (2014) with derivative

$$\nabla_{W}\Big(\frac{1}{2n}\|\mathbf{X} - \mathbf{X}W\|_F^2 + \frac{\rho}{2}\, h(W)^2 + \alpha\, h(W)\Big) = -\frac{1}{n}\mathbf{X}^{T}(\mathbf{X} - \mathbf{X}W) + \big(\rho\, h(W) + \alpha\big)\big(e^{W \circ W}\big)^{T} \circ 2W \qquad (5)$$

To overcome the problem caused by the non-smooth point of the $\ell_1$ penalty at zero, the original weighted adjacency matrix is cast as the difference between two matrices with non-negative elements, $W = W^{+} - W^{-}$ Zheng et al. (2019).

After obtaining the initial estimate, in order to reduce the number of false positive edges, small coefficients are zeroed out if their absolute values are less than the pre-set threshold. On the other hand, this strategy leads to more missing edges with small weights and is hard to adapt to different scenarios. For example, if the weights of a DAG are generated from a uniform distribution centered at zero, the small weights falling below the threshold have a high risk of being forced to zero in the post-processing step. Since the simulation experiments in previous work commonly set a gap around zero, for instance, in the simulation part of Zheng et al. (2018) the edge weights are sampled uniformly from intervals bounded away from zero in proportion to the weight scale, this side effect is not obvious.

The specific algorithm of NOTEARS with fixed threshold is presented as follows.

  
  Step 0. Take $\lambda \ge 0$, an initial value $(W_0, \alpha_0)$, $\rho_0 > 0$, and some $c \in (0, 1)$. Set the threshold $\omega > 0$ and $k \leftarrow 0$.
  Step k (k ≥ 1). Find the smallest non-negative integer $m$ such that $h(W_k) < c \cdot h(W_{k-1})$ with $\rho = \rho_{k-1} \cdot 10^{m}$,
where $W_k = \arg\min_{W} L^{\rho}(W, \alpha_{k-1})$.
Compute $\alpha_k = \alpha_{k-1} + \rho\, h(W_k)$.
  
  Set $\hat{W} = W_k$ if converged.
  Final post-processing step. Set $[\hat{W}]_{ij} = 0$ whenever $|[\hat{W}]_{ij}| < \omega$, then return $\hat{W}$.
Algorithm 1 NOTEARS with fixed threshold.
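To make the dual-ascent structure of Algorithm 1 concrete, the following is a compact sketch under simplifying assumptions: it reuses the h_and_grad helper above, omits the W = W⁺ − W⁻ split and the ℓ1 handling (so only the smooth part of each subproblem is minimized), and simply multiplies ρ by 10 per outer iteration rather than searching for the smallest sufficient power. All names are ours, and this is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def notears_smooth_sketch(X, rho=1.0, alpha=0.0, h_tol=1e-8, rho_max=1e16, max_iter=100):
    n, d = X.shape

    def smooth_lagrangian(w_flat):
        W = w_flat.reshape(d, d)
        R = X - X @ W
        loss = 0.5 / n * (R ** 2).sum()                  # least-squares loss
        h, gh = h_and_grad(W)
        obj = loss + 0.5 * rho * h ** 2 + alpha * h       # smooth part of the augmented Lagrangian
        grad = -X.T @ R / n + (rho * h + alpha) * gh
        return obj, grad.ravel()

    W = np.zeros((d, d))
    for _ in range(max_iter):
        sol = minimize(smooth_lagrangian, W.ravel(), jac=True, method="L-BFGS-B")
        W = sol.x.reshape(d, d)
        h, _ = h_and_grad(W)
        if h < h_tol or rho > rho_max:
            break
        alpha += rho * h          # dual ascent update of the multiplier
        rho *= 10.0               # tighten the acyclicity penalty
    return W
```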

2.3 Extension to Generalized Linear Models

Zheng et al. (2020) extend the idea of NOTEARS to generalized linear models (GLMs). Traditionally, a GLM is formulated as follows:

$$\mathbb{E}\big[X_j \mid X_{\mathrm{pa}(X_j)}\big] = g_j^{-1}\big(\theta_j^{T} X\big) \qquad (6)$$

where $g_j$ is a known link function used in GLMs and $\theta_j$ collects the coefficients of the linear predictor of $X_j$.

To extend the DAG constraint beyond the linear setting, the key point is to find a sensible way to re-define the weighted adjacency matrix, since there is no explicit form of $W$ in the GLM setting anymore.

Based on the idea of Rosasco et al. (2013), which encodes the importance of each variable via the corresponding partial derivatives, it is then easy to show Zheng et al. (2020) that the dependence among variables can be precisely measured by the $L^{2}$-norm of the corresponding partial derivatives. This means the weighted adjacency matrix in GLMs can be defined as follows:

$$[W(f)]_{kj} = \big\| \partial_k f_j \big\|_{L^{2}} \qquad (7)$$

Thus, for linear mean functions, $[W(f)]_{kj}$ is equivalent to $|\theta_{jk}|$, and we can simplify the DAG constraint by plugging in these coefficient magnitudes.

To construct the score function for GLMs, we need a more suitable loss function. In this paper, we focus on one specific model among GLMs, the logistic regression model, whose mean function is the sigmoid $\sigma(\eta) = 1/(1 + e^{-\eta})$. For logistic regression, the least-squares loss is not appropriate anymore, since it results in a non-convex objective. Here we introduce the log loss, which is commonly used as the loss function for logistic regression, defined as follows:

$$\ell(\Theta; \mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{d} \Big( \log\big(1 + e^{\eta_{ij}}\big) - x_{ij}\, \eta_{ij} \Big) \qquad (8)$$

where $\eta_{ij} = \theta_j^{T} x_i$ and $\Theta = (\theta_1, \dots, \theta_d)$.
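A short sketch of this log loss for the logistic model, assuming binary samples stacked row-wise in X and linear predictors eta = X @ W (the variable names are illustrative):

```python
import numpy as np

def log_loss(W, X):
    """Average negative Bernoulli log-likelihood with a sigmoid mean, written stably."""
    n = X.shape[0]
    eta = X @ W                                  # linear predictor for every node
    return (np.logaddexp(0.0, eta) - X * eta).sum() / n
```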

3 The proposed method

In this section, we extend the adaptive LASSO method to causal structure learning with the hard DAG constraint. We call our method adaptive NOTEARS. After detailing our method in Section 3.1, we study the asymptotic behaviour of adaptive NOTEARS theoretically in Section 3.2. Then, in Section 3.3, we develop an algorithm to implement this method. Finally, we illustrate the way we deal with the hyper-parameters in our model in Section 3.4.

3.1 Method

From Figure 1, we find that when the penalty level is relatively low, the falsely learned edges cannot be zeroed out. On the other hand, assigning a larger weight to the regularization term may lead to an even worse result. This means that the Lasso puts much more penalty on large coefficients than on false positive ones, which is not consistent with our requirement.

Inspired by this, to improve criterion (3) for variable selection, one intuition is to apply adaptive penalty levels to different coefficients, as proposed by Zou (2006b). We consider modifying (3) as

$$\min_{W}\; \frac{1}{2n}\|\mathbf{X} - \mathbf{X}W\|_F^2 + \lambda \sum_{i,j} \hat{w}_{ij}\, |w_{ij}| \quad \text{subject to} \quad h(W) = 0 \qquad (9)$$

where the $\hat{w}_{ij}$'s are pre-specified values. Since we expect the minor false positive edges to be shrunk to zero while the true edges are retained, we would like the corresponding $\hat{w}_{ij}$'s to be large for the former and small for the latter, so that the minor false edges are heavily penalized and the true positive edges are lightly penalized. More detail about the choice of $\hat{w}_{ij}$ is given in Section 3.2. In this way, no extra predefined threshold is needed anymore.
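A minimal sketch of the weighted penalty in (9): each edge carries its own level w_hat[i, j] instead of a single global one (the function and variable names are ours):

```python
import numpy as np

def adaptive_l1(W, w_hat, lam):
    """lam * sum_ij w_hat[i,j] * |W[i,j]|: a large w_hat shrinks an edge harder."""
    return lam * np.sum(w_hat * np.abs(W))
```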

3.2 Asymptotic Oracle Properties

In this section, we show that, with a proper choice of the penalty weights and under certain conditions, our method possesses the oracle properties as the sample size goes to infinity.

Similar to procedure (4), via the augmented Lagrangian method, the above equality-constrained program (ECP) (9) can be solved by a dual ascent process. The first step of each iteration can be regarded as finding a local minimizer of the Lagrangian with a fixed Lagrange multiplier $\alpha$, that is,

$$\min_{W}\; \frac{1}{2n}\|\mathbf{X} - \mathbf{X}W\|_F^2 + \lambda \sum_{i,j} \hat{w}_{ij}\, |w_{ij}| + \frac{\rho}{2}\, h(W)^2 + \alpha\, h(W) \qquad (10)$$

Let $W^{*}$ denote the underlying true graph weights. We then define $\mathcal{A} = \{(i, j) : w^{*}_{ij} \neq 0\}$ and $\mathcal{A}^{c} = \{(i, j) : w^{*}_{ij} = 0\}$,

which means that $\mathcal{A}$ contains the indices of the terms whose true parameters are nonzero, and $\mathcal{A}^{c}$ contains the indices of the terms that do not exist in the underlying true model.

Here, we first assume two conditions:

  1. $X = W^{*T} X + N$, where the noise terms $N_1, \dots, N_d$ are independently and identically distributed with mean $0$ and variance $\sigma^{2}$.

  2. $\frac{1}{n}\mathbf{X}^{T}\mathbf{X}$ is finite and converges to a positive definite matrix $C$.

Under these two assumptions, the oracle properties of adaptive NOTEARS are described in the following theorems. Rigorous proofs are given in the Appendix.

Theorem 1 (Local Minimizer) Under the above two conditions, if the $\hat{w}_{ij}$'s associated with the nonzero coefficients in $W^{*}$ depend on the sample size $n$ and the corresponding penalty terms vanish at a suitable rate, then there exists a local minimizer of (10) that is root-$n$-consistent for $W^{*}$.

Let $\hat{W}^{(n)}$ denote the local minimizer in Theorem 1, which satisfies $\|\hat{W}^{(n)} - W^{*}\| = O_p(n^{-1/2})$. Consistency is a nice start, and this motivates us to study more of its asymptotic properties as $n$ goes to infinity.

Before introducing the oracle properties, we need some additional notation, because we now want to analyze the estimators of the parameters that are nonzero in the underlying true model. We write $W^{*}_{\mathcal{A}}$ for the vector of true nonzero coefficients and $\hat{W}^{(n)}_{\mathcal{A}}$ for the corresponding entries of the estimator. In condition 1), we already assumed that the noise terms have mean $0$ and variance $\sigma^{2}$. Furthermore, without loss of generality, we assume that the indices in $\mathcal{A}$ come first. Let $C_{11}$ denote the submatrix of $C$ corresponding to the indices in $\mathcal{A}$, where $C_{11}$ is a $|\mathcal{A}| \times |\mathcal{A}|$ matrix.

Theorem 2 (Oracle Properties) Under the two conditions, if the $\hat{w}_{ij}$'s associated with the zero coefficients in $W^{*}$ depend on the sample size $n$ and the corresponding penalty terms grow sufficiently fast as $n \to \infty$, then the local minimizer in Theorem 1 must satisfy the following:

  • (Sparsity) $\hat{W}^{(n)}_{\mathcal{A}^{c}} = 0$ with probability tending to 1.

  • (Asymptotic normality) $\sqrt{n}\,\big(\hat{W}^{(n)}_{\mathcal{A}} - W^{*}_{\mathcal{A}}\big) \xrightarrow{d} N\big(0,\, \sigma^{2} C_{11}^{-1}\big)$ as $n \to \infty$.

Firstly, for the zero-coefficient part, Theorem 2 shows that, given penalty weights that grow sufficiently fast for the irrelevant edges, adaptive NOTEARS can consistently remove all irrelevant variables with probability tending to 1. Secondly, for the nonzero estimators, magnifying the difference by $\sqrt{n}$, we can see that the estimation error is asymptotically normally distributed.

Then, there is only one question left: how to determine the $\hat{w}_{ij}$'s such that the weights associated with the nonzero coefficients vanish at the required rate while those associated with the zero coefficients grow fast enough. Here, following Zou (2006a), we can compute the weights from the ordinary least squares (OLS) estimates raised to some specific power $\gamma > 0$ so that the above two requirements are met:

$$\hat{w}_{ij} = \frac{1}{|\hat{\beta}_{ij}|^{\gamma}} \qquad (11)$$

where $\hat{\beta}_{ij}$ denotes the unpenalized (OLS) estimate of $w_{ij}$.
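A minimal sketch of (11), assuming an initial unpenalized estimate is available (e.g., from NOTEARS without thresholding); the small eps guarding against exact zeros and the default gamma are our illustrative choices, not values from the paper:

```python
import numpy as np

def adaptive_weights(beta_hat, gamma=1.0, eps=1e-8):
    """Penalty weights w_hat[i,j] = 1 / |beta_hat[i,j]|^gamma."""
    return 1.0 / (np.abs(beta_hat) ** gamma + eps)
```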

3.3 Algorithm

Basically, the algorithm can be divided into two parts: the first part finds the initial solution used in (11), and the second part plugs the adaptive penalty parameters into (10) and solves it. Both equality-constrained programs (ECPs) are solved by the augmented Lagrangian method, i.e., the Lagrangian handles the hard DAG constraint $h(W) = 0$. The first part follows the same procedure as the original NOTEARS without the thresholding step, yielding a matrix $\hat{W}$ with elements $\hat{\beta}_{ij}$. Let $W = \tilde{W} \circ \hat{W}$, where $\circ$ is the Hadamard product; thus, for $\gamma = 1$, the adaptive penalty satisfies $\sum_{i,j} \hat{w}_{ij} |w_{ij}| = \|\tilde{W}\|_1$. In matrix notation, $\tilde{W} = W \oslash \hat{W}$, where $\oslash$ refers to Hadamard division. Then, ECP (9) can be transformed as

$$\min_{\tilde{W}}\; \frac{1}{2n}\big\|\mathbf{X} - \mathbf{X}(\tilde{W} \circ \hat{W})\big\|_F^2 + \lambda \|\tilde{W}\|_1 \quad \text{subject to} \quad h(\tilde{W} \circ \hat{W}) = 0 \qquad (12)$$

where the final estimate of $W$ is recovered as $\tilde{W} \circ \hat{W}$.
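A small sketch of this change of variables, assuming gamma = 1 and an initial estimate W_hat with nonzero entries, so that the adaptive penalty on W becomes an ordinary L1 penalty on W_tilde (a sketch, not the authors' code):

```python
import numpy as np

def to_tilde(W, W_hat):
    """W_tilde = W ⊘ W_hat (Hadamard division)."""
    return W / W_hat

def from_tilde(W_tilde, W_hat):
    """W = W_tilde ∘ W_hat (Hadamard product) recovers the estimate on the original scale."""
    return W_tilde * W_hat

# With gamma = 1:  sum_ij |W[i,j]| / |W_hat[i,j]|  ==  sum_ij |W_tilde[i,j]|  ==  ||W_tilde||_1
```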

Then, similar to (4), via the augmented Lagrangian method, the above ECP (12) can be solved by a dual ascent process:

$$\tilde{W}^{k+1} = \arg\min_{\tilde{W}}\; \frac{1}{2n}\big\|\mathbf{X} - \mathbf{X}(\tilde{W} \circ \hat{W})\big\|_F^2 + \lambda \|\tilde{W}\|_1 + \frac{\rho}{2}\, h(\tilde{W} \circ \hat{W})^2 + \alpha^{k}\, h(\tilde{W} \circ \hat{W}), \qquad \alpha^{k+1} = \alpha^{k} + \rho\, h(\tilde{W}^{k+1} \circ \hat{W}) \qquad (13)$$

To complete the above optimization, we need to derive the "new" form of the gradient for the first step in (13); here we also reformulate $\tilde{W}$ as the difference between two matrices with non-negative elements. For the loss term, writing $W = \tilde{W} \circ \hat{W}$, the chain rule gives $\nabla_{\tilde{W}} \ell = \big(-\tfrac{1}{n}\mathbf{X}^{T}(\mathbf{X} - \mathbf{X}W)\big) \circ \hat{W}$. For the acyclicity term $h(\tilde{W} \circ \hat{W})$, we have $\nabla_{\tilde{W}} h = \big((e^{W \circ W})^{T} \circ 2W\big) \circ \hat{W}$.

To sum up, the algorithm proceeds as follows.

  
  Step 0. Take $\lambda \ge 0$, an initial value $(W_0, \alpha_0)$, $\rho_0 > 0$, and some $c \in (0, 1)$. Set $\gamma > 0$, $k \leftarrow 0$, $t \leftarrow 0$.
  OLS loop k (k ≥ 1). Using the complete data set $\mathbf{X}$, find the smallest non-negative integer $m$ such that $h(W_k) < c \cdot h(W_{k-1})$ with $\rho = \rho_{k-1} \cdot 10^{m}$,
where $W_k = \arg\min_{W} L^{\rho}(W, \alpha_{k-1})$ without the $\ell_1$ penalty.
Compute $\alpha_k = \alpha_{k-1} + \rho\, h(W_k)$.
  
  Set $\hat{W} = W_k$ if converged. Then, define $\hat{\beta}_{ij} = [\hat{W}]_{ij}$ and $\hat{w}_{ij} = 1 / |\hat{\beta}_{ij}|^{\gamma}$.
  
  Adaptive lasso loop t (t ≥ 1). Using the complete data set $\mathbf{X}$, for all $(i, j)$, find the smallest non-negative integer $m$ such that $h(\tilde{W}_t \circ \hat{W}) < c \cdot h(\tilde{W}_{t-1} \circ \hat{W})$ with $\rho = \rho_{t-1} \cdot 10^{m}$,
where $\tilde{W}_t = \arg\min_{\tilde{W}} L^{\rho}(\tilde{W}, \alpha_{t-1})$.
Compute $\alpha_t = \alpha_{t-1} + \rho\, h(\tilde{W}_t \circ \hat{W})$.
  Set $\tilde{W} = \tilde{W}_t$ if converged.
  Return the adaptive estimate $\tilde{W} \circ \hat{W}$.
Algorithm 2 NOTEARS with adaptive lasso.

3.4 Hyper-parameter

There is a remaining issue of choosing the tuning parameter $\lambda$. We found that the result of our algorithm is sensitive to the value of this hyper-parameter, so it is essential to choose a good one.

From NOTEARS with adaptive lasso, we obtain a set of candidate models $\{M_1, \dots, M_N\}$, one for each candidate value of $\lambda$. Similar to Feng and Yu (2013), our goal is to choose the optimal model that contains as many of the true variables as possible. The CV procedure we use is as follows:

  
  Step 0. Divide the data into a validation set $\mathbf{X}_v$ and a training set $\mathbf{X}_t$.
  For model $M_n$ (n = 1, 2, …, N). Using the training set $\mathbf{X}_t$, compute the solution $\hat{W}_n$ of model $M_n$.
  
  Evaluate the prediction performance of $\hat{W}_n$ on the validation set $\mathbf{X}_v$ by the loss function $\ell(\hat{W}_n; \mathbf{X}_v)$.
Algorithm 3 Cross-validation.

Another key modification is that we need a larger validation set to find the best model, especially when the candidate model set is large Feng and Yu (2013). Inspired by this, instead of using the traditional K-fold cross-validation method, whose validation set contains only $1/K$ of the data, we suggest swapping the proportions of the validation set and the training set.
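A sketch of this swapped split, using scikit-learn's KFold only to produce index partitions (an assumption on our part; the paper does not prescribe a particular library):

```python
import numpy as np
from sklearn.model_selection import KFold

def swapped_kfold(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs with the usual K-fold roles swapped:
    the model is fitted on the 1/K part and validated on the (K-1)/K part."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for large_idx, small_idx in kf.split(np.arange(n)):
        yield small_idx, large_idx
```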

4 Experiments

We conduct experiments with large sample sizes to verify the asymptotic oracle properties (Section 3.2). To validate the effectiveness of our proposed method, we compare it against NOTEARS with a fixed threshold, greedy equivalence search (GES) Chickering (2002); Ramsey et al. (2017) and the PC algorithm Spirtes et al. (2000). For NOTEARS, the threshold value is 0.1, and the penalty level is set so that the result is guaranteed to be a DAG while retaining relatively good performance.

The basic set-up of our experiments is as follows. For each experiment, the ground-truth DAG is generated from the Erdős–Rényi (ER) random graph model. We use ER2 to denote an ER graph with $2d$ expected edges, and similarly for ER1 and ER4. Based on the ground-truth DAG, we generate the weights of existing edges from 1) a normal distribution with mean $0$, or 2) a uniform distribution over a symmetric interval around zero. We then simulate the samples via $X = W^{T}X + N$ with iid noise $N$. Here, we mainly consider the linear case and extend to one type of GLM, the logistic regression model.
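A sketch of this simulation setup: an ER graph with roughly k·d expected edges, random edge weights from either distribution, and samples drawn via the sample_linear_sem sketch above (the parameter defaults are illustrative, not the paper's exact settings):

```python
import numpy as np

def simulate_er_dag(d, k=2, weight_dist="gauss", scale=2.0, rng=None):
    """Random ER-k DAG (about k*d expected edges) with random edge weights."""
    rng = np.random.default_rng(rng)
    p = min(1.0, 2.0 * k / (d - 1))                           # edge probability giving ~k*d edges
    B = np.triu((rng.random((d, d)) < p).astype(float), 1)    # upper-triangular support => acyclic
    perm = rng.permutation(d)
    B = B[np.ix_(perm, perm)]                                 # hide the causal order
    if weight_dist == "gauss":
        W = B * rng.normal(0.0, scale, size=(d, d))           # N(0, scale^2) weights
    else:
        W = B * rng.uniform(-scale, scale, size=(d, d))       # Uniform(-scale, scale) weights
    return W

# X = sample_linear_sem(W, n=1000) then gives the observational data matrix.
```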

The metrics we use to evaluate the estimated DAGs include structural Hamming distance (SHD), true positive rate (TPR) and false discovery rate (FDR).
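A simplified sketch of these three metrics on binary adjacency patterns; a reversed edge is counted once in the SHD, and conventions differ slightly across papers, so this is illustrative rather than the exact evaluation code:

```python
import numpy as np

def dag_metrics(B_true, B_est):
    true, est = (B_true != 0), (B_est != 0)
    tp = np.sum(true & est)                      # correctly oriented edges
    fp = np.sum(~true & est)                     # predicted edges not in the truth (incl. reversed)
    reversed_ = np.sum(~true & est & true.T)     # predicted with flipped orientation
    extra = np.sum(~true & est & ~true.T)        # predicted edges absent in both directions
    missing = np.sum(true & ~est & ~est.T)       # true edges absent in both directions
    tpr = tp / max(int(true.sum()), 1)
    fdr = fp / max(int(est.sum()), 1)
    shd = int(extra + missing + reversed_)
    return {"SHD": shd, "TPR": float(tpr), "FDR": float(fdr)}
```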

4.1 Numerical Results: Gaussian Weights

We first perform the structure learning study in the Gaussian-weight case, where the weights of existing edges are generated from a normal distribution with mean 0. In our experiments, we generate {ER1, ER2, ER4} graphs with different numbers of nodes $d$, 15 graphs for each type, and then simulate samples with iid Gaussian noise. The results are shown in Figure 3. For better visualization, only the mean values of the metrics are reported here.

Consistent with previous work on GES, it shows a significant advantage in the ER1 case, regardless of which metric we evaluate. But its performance deteriorates rapidly when the level of sparsity decreases. Also, as we can see from Figure 3, for ER4 with 50 nodes, the GES implementation that we use cannot produce an output, and the reason may need further investigation. Not surprisingly, the performances of NOTEARS and adaptive NOTEARS show a positive relationship. This is natural since adaptive NOTEARS extracts its penalty weights from conventional NOTEARS, which means the performance of adaptive NOTEARS is sensitive to that of the conventional one. Nonetheless, our approach performs uniformly better than NOTEARS across different graph settings.

Figure 3: Numerical results in terms of SHD, TPR and FDR on ER graphs, with sample size n = 1000 and Gaussian noise. For TPR, higher is better; for the other two metrics, lower is better. Columns: random graph types; ER-k = Erdős–Rényi graphs with kd expected edges. Since each sub-graph has four lines, for better visualization no standard error is reported here.

4.2 Coefficients in the Neighborhood of 0

Compared to the original weight-generating interval of NOTEARS, there are two points we aim to improve with our method. The first concerns the coefficients near 0, which cannot be recovered by the fixed-threshold post-processing step. The second concerns the upper and lower bounds of the interval. We expect our method to extend NOTEARS to a wider range of weights with better estimates. Hence, we investigate the values of the missing edges in both methods' estimates under more general settings. We generated 200 DAG structures from the ER1 model with 10 nodes, and simulated the weights from several uniform and normal distributions (see Figure 4). For each experiment, 1000 samples were simulated. Given the different methodologies of the two methods, one with a hard threshold of 0.1 and the other with adaptive penalty levels, our method, the latter, is more flexible in handling the wider ranges. Indeed, Figure 4 confirms this intuition. We plot the histogram of missing edges. One can first observe that in all cases our method loses fewer edges than NOTEARS with hard thresholding, especially in the area around zero. Moreover, as the range becomes wider or the variance grows larger, the gap between the two methods becomes more and more significant.

Figure 4: The histogram of missing edges. The blue bars refer to NOTEARS with hard threshold 0.1; the yellow bars refer to NOTEARS with adaptive penalty levels. In the first row, the coefficients of the ground-truth graph are generated from uniform distributions (left to right); in the second row, the true coefficients are drawn from normal distributions with different variances.

4.3 High-dimensional Cases

To investigate the performance of our method for learning high-dimensional graphical models, we increase the number of nodes to 100 and compare the results to those of fixed-threshold NOTEARS. Here, to illustrate the validity of the evaluation criteria, we draw error bars with sample standard deviations. The results are shown in Figure 5. We find that, under nearly the same value of SHD, the true positive rate of adaptive NOTEARS is significantly higher than that of NOTEARS with a fixed threshold. Besides, in terms of FDR, for which lower is better, adaptive NOTEARS has lower values for all types of graphs. In general, adaptive NOTEARS consistently outperforms the original fixed-threshold version even in high-dimensional settings.

Figure 5: Numerical results in terms of SHD, TPR and FDR on ER2 graphs with Gauss(0,4) weights, with sample size n=1000 and iid Gaussian noise.

4.4 Generalized Linear Models

For GLMs, we concentrate on logistic regression for dichotomous data, with the corresponding logit link function. As we introduced in Section 2.3, for linear mean functions, the elements of $W$ in the DAG constraint can be substituted by the coefficients of the linear mean functions. We compare the performance of fixed-threshold NOTEARS to that of adaptive NOTEARS with randomly generated edge weights.

Figure 6: Numerical results for logistic regression in terms of SHD, TPR and FDR on ER graphs, with sample size n = 1000. For TPR, higher is better; for the other two metrics, lower is better. Columns: random graph types; ER-k = Erdős–Rényi graphs with kd expected edges. Since each sub-graph has four lines, for better visualization no standard error is reported here.

Figure 6 shows the structure recovery results for the logistic model. By comparison, we can find that the true positive rate of adaptive NOTEARS is significantly higher than that of NOTEARS with a fixed threshold, while in terms of the other two metrics, i.e., SHD and FDR, both methods remain at a similar level. In general, adaptive NOTEARS consistently outperforms the original NOTEARS with the 0.1 threshold.

4.5 Real Data

Finally, consistent with Zheng et al. (2018), we compare our method to NOTEARS on the same real dataset, which records the expression levels of 11 phosphoproteins and phospholipids in human cells. This dataset is commonly used in the literature on Bayesian network structure inference, as it provides a network that is widely accepted by the biological community. Based on d = 11 and n = 911 observational samples, the ground-truth structure given by Sachs et al. contains 17 edges. In our experiments, NOTEARS with the 0.1 threshold estimates 22 edges with an SHD of 20, compared to 20 estimated edges for adaptive NOTEARS with an SHD of 19.

5 Discussion

Based on the methodology of NOTEARS, we present a method for score-based learning of DAG models that does not need a hard-thresholding post-processing step. The key technical design is to apply an adaptive penalty level to each edge rather than a common one. Notably, we prove that our method enjoys the oracle properties with adaptive penalties under some mild conditions. With a suitable choice of the hyper-parameter in the second step, handled by a modified K-fold cross-validation, our method can zero out edges automatically in each iteration. Our simulation experiments show the effectiveness of our method, and it is worth emphasizing that the adaptive strategy is generally more flexible than the hard-threshold strategy since it is a purely data-driven procedure, which gives our method the ability to adapt to different types of graphs.

References

  • Bertsekas (1997) Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.
  • Bouckaert (1993) Remco R Bouckaert. Probabilistic network construction using the minimum description length principle. In European conference on symbolic and quantitative approaches to reasoning and uncertainty, pages 41–48. Springer, 1993.
  • Byrd et al. (1995) Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on scientific computing, 16(5):1190–1208, 1995.
  • Chickering (2002) David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002.
  • Chickering et al. (2004) Max Chickering, David Heckerman, and Chris Meek. Large-sample learning of bayesian networks is np-hard. Journal of Machine Learning Research, 5:1287–1330, 2004.
  • Choi et al. (2010) Nam Hee Choi, William Li, and Ji Zhu. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association, 105(489):354–364, 2010.
  • Feng and Yu (2013) Yang Feng and Yi Yu. The restricted consistency property of leave--out cross-validation for high-dimensional variable selection, 2013. URL https://arxiv.org/abs/1308.5390.
  • Geiger and Heckerman (1994) Dan Geiger and David Heckerman. Learning gaussian networks. In Uncertainty Proceedings 1994, pages 235–243. Elsevier, 1994.
  • Glymour et al. (2019) Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10:524, 2019.
  • Harary and Manvel (1971) Frank Harary and Bennet Manvel. On the number of cycles in a graph. Matematickỳ časopis, 21(1):55–63, 1971.
  • Heckerman et al. (1995) David Heckerman, Dan Geiger, and David M Chickering. Learning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995.
  • Kalainathan et al. (2018) Diviyan Kalainathan, Olivier Goudet, Isabelle Guyon, David Lopez-Paz, and Michèle Sebag. Structural agnostic modeling: Adversarial learning of causal graphs. arXiv preprint arXiv:1803.04929, 2018.
  • Lachapelle et al. (2020) Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural dag learning. In International Conference on Learning Representations, 2020.
  • Margaritis and Thrun (1999) Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. Advances in neural information processing systems, 12, 1999.
  • Ng et al. (2020) Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and dag constraints for learning linear dags. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.
  • Ng et al. (2022) Ignavier Ng, Shengyu Zhu, Zhuangyan Fang, Haoyang Li, Zhitang Chen, and Jun Wang. Masked gradient-based causal structure learning. In SIAM International Conference on Data Mining, pages 424–432, 2022.
  • Pearl (1988) Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan kaufmann, 1988.
  • Pearl and Mackenzie (2018) Judea Pearl and Dana Mackenzie. The book of why: the new science of cause and effect. Basic books, 2018.
  • Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • Ramsey et al. (2017) Joseph Ramsey, Madelyn Glymour, Ruben Sanchez-Romero, and Clark Glymour. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International Journal of Data Science and Analytics, 3(2):121–129, 2017.
  • Rosasco et al. (2013) Lorenzo Andrea Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, and Alessandro Verri. Nonparametric sparsity and regularization. 2013.
  • Schölkopf (2022) Bernhard Schölkopf. Causality for machine learning. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 765–804. 2022.
  • Spirtes and Glymour (1991) Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social science computer review, 9(1):62–72, 1991.
  • Spirtes et al. (1989) Peter Spirtes, Clark Glymour, and Richard Scheines. Causality from probability. 1989.
  • Spirtes et al. (2000) Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000.
  • Teyssier and Koller (2012) Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning bayesian networks. arXiv preprint arXiv:1207.1429, 2012.
  • Vowels et al. (2021) Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. Targeted vae: Variational and targeted learning for causal inference. In 2021 IEEE International Conference on Smart Data Services (SMDS), pages 132–141. IEEE, 2021.
  • Xu et al. (2020) Guandong Xu, Tri Dung Duong, Qian Li, Shaowu Liu, and Xianzhi Wang. Causality learning: a new perspective for interpretable machine learning. arXiv preprint arXiv:2006.16789, 2020.
  • Yu et al. (2019) Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, 2019.
  • Zheng et al. (2018) Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Zheng et al. (2019) Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. Learning sparse nonparametric dags, 2019. URL https://arxiv.org/abs/1909.13189.
  • Zheng et al. (2020) Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric Xing. Learning sparse nonparametric dags. In International Conference on Artificial Intelligence and Statistics, pages 3414–3425. PMLR, 2020.
  • Zhong et al. (2014) Kai Zhong, Ian En-Hsu Yen, Inderjit S Dhillon, and Pradeep K Ravikumar. Proximal quasi-newton for computationally intensive l1-regularized m-estimators. Advances in Neural Information Processing Systems, 27, 2014.
  • Zhu et al. (2020) Shengyu Zhu, Ignavier Ng, and Zhitang Chen. Causal discovery with reinforcement learning. In International Conference on Learning Representations, 2020.
  • Zou (2006a) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006a. ISSN 01621459. URL http://www.jstor.org/stable/27639762.
  • Zou (2006b) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006b.

Appendix A Proof of Theorems

A.1 Proof of Theorem 1

Let , , , , where is a constant.

where .

where we define

For ,

For ,

For ,

where .

For , since , , we can derive

.

Thus,

We can find that, under the stated rate conditions, term (ii) dominates terms (i) and (iii), from which the desired bound follows.

Thus, we can conclude that there exists a local minimizer of (10) satisfying the stated rate, following the argument of Choi et al. [2010].

A.2 Proof of Theorem 2

Firstly, for the sparsity property, it is sufficient to prove that

(14)

with probability tending to 1.

Then,

where we can find

Therefore, if the penalty weights satisfy the stated rate condition, i.e., their speed of approaching 0 is slower than that of the competing term, then the sign of the derivative is dominated by the penalty term.

Secondly, for the asymptotic normality property, we concentrate on the set $\mathcal{A}$ and use the corresponding sub-adjacency matrix of the graph.

Denote