# Penalized Likelihood Methods for Estimation of Sparse High Dimensional Directed Acyclic Graphs

Directed acyclic graphs (DAGs) are commonly used to represent causal relationships among random variables in graphical models. Applications of these models arise in the study of physical, as well as biological systems, where directed edges between nodes represent the influence of components of the system on each other. The general problem of estimating DAGs from observed data is computationally NP-hard; moreover, two directed graphs may be observationally equivalent. When the nodes exhibit a natural ordering, the problem of estimating directed graphs reduces to the problem of estimating the structure of the network. In this paper, we propose a penalized likelihood approach that directly estimates the adjacency matrix of DAGs. Both lasso and adaptive lasso penalties are considered and an efficient algorithm is proposed for estimation of high dimensional DAGs. We study variable selection consistency of the two penalties when the number of variables grows to infinity with the sample size. We show that although lasso can only consistently estimate the true network under stringent assumptions, adaptive lasso achieves this task under mild regularity conditions. The performance of the proposed methods is compared to alternative methods in simulated, as well as real, data examples.


## 1 Introduction

Graphical models provide efficient tools for the study of statistical models through a compact representation of the joint probability distribution of the underlying random variables. The nodes of the graph represent the random variables, while the edges capture the relationships among them. Both directed and undirected edges are used to represent interactions among random variables. However, there is a conceptual difference between these two types of edges: while undirected edges are used to represent similarity or correlation, directed edges are usually interpreted as causal relationships. The study of directed edges is therefore directly related to the theory of causality, and of main interest in many applications. A special class of directed graphical models (also known as Bayesian networks) is based on directed acyclic graphs (dags), where all the edges of the graph are directed and there are no directed cycles present in the graph. dags are used in graphical models and belief networks and have been the focus of research in the computer science literature (see Pearl (2000)). Important applications involving dags also arise in the study of biological systems, as many cellular mechanisms are known to include causal relationships. Cell signalling pathways and gene regulatory networks are two examples, where causal relationships play an important role (Markowetz and Spang, 2007).

The problem of estimating dags is an NP-hard problem, and estimation of the direction of edges may not be possible due to observational equivalence (see section 2). Most of the earlier methods for estimating dags correspond to greedy search algorithms that search through the space of possible dags. A number of methods are available for estimating the structure of dags for a small to moderate number of nodes. The max-min hill climbing algorithm (Tsamardinos et al., 2006), and the PC-Algorithm (Spirtes et al., 2000) are two such examples. However, the space of possible dags grows super-exponentially with the number of variables (nodes), and estimation of dags using these methods, especially in a small n, large p setting, becomes impractical. Bayesian methods of estimating dags (e.g. Heckerman et al., 1995) are also computationally very intensive and therefore not particularly appropriate for large graphs. Kalisch and Bühlmann (2007) recently proposed an implementation of the PC-Algorithm with polynomial complexity that can be used for estimation of high dimensional sparse dags. However, when the variables exhibit a natural ordering, estimation of a dag is reduced to estimating its structure or skeleton. Applications with natural ordering of variables include estimation of causal relationships from temporal observations, settings where additional experimental data can determine the ordering of variables, and estimation of transcriptional regulatory networks from gene expression data. Examples of these applications are presented in section 6.

The structure of the graph can be determined from conditional independence relations among random variables. For undirected graphs, this is equivalent to learning the structure of the conditional independence graph (cig), which in the case of Gaussian random variables, is determined by zeros in the inverse covariance matrix (also known as precision or concentration matrix). Different penalization methods have been recently proposed to obtain sparse estimates of the concentration matrix. Meinshausen and Bühlmann (2006) considered an approximation to the problem of sparse inverse covariance estimation using the lasso penalty. They showed, under a set of assumptions, that their proposed method correctly determines the neighborhood of each node. Banerjee et al. (2008) and Friedman et al. (2008b) explored different aspects of the problem of estimating the concentration matrix using the lasso penalty, while Yuan and Lin (2007) and Fan et al. (2007) considered other choices for the penalty. Rothman et al. (2008) proved consistency in Frobenius norm, as well as in matrix norm, of the ℓ1-regularized estimate of the concentration matrix in the high dimensional setting, while Lam and Fan (2008) extended their result and considered estimation of matrices related to the precision matrix, including the Cholesky factor of the inverse covariance matrix, using general penalties. Penalization of the Cholesky factor of the inverse covariance matrix has also been considered by Huang et al. (2006), who used the lasso penalty in order to obtain a sparse estimate of the inverse covariance matrix. This method is based on the regression interpretation of the Cholesky factorization model and therefore requires the variables to be ordered a priori.

In this paper, we consider the problem of estimating the skeleton of dags, where the variables exhibit a natural ordering. We use graph theoretic properties of dags and reformulate the likelihood as a function of the adjacency matrix of the graph. We then exploit the ordering of variables to propose an efficient algorithm for estimation of the structure of dags, which offers considerable improvement in terms of computational complexity. Both lasso and adaptive lasso penalties are considered and variable selection consistency of the estimators is established in the high dimensional setting, where the number of nodes grows with the sample size. In particular, we show that although lasso is only variable selection consistent under stringent conditions, adaptive lasso can consistently estimate the true dag under the usual regularity assumptions. We also present a data dependent choice of the tuning parameter that controls the probability of errors. Theoretical as well as empirical evidence shows that when the underlying causal mechanism in the network is linear, the proposed method can also be applied to non-Gaussian observations. Finally, additional simulations indicate that although the proposed method is derived based on the ordering of variables, the method is not sensitive to random permutations of the order of variables in high dimensional sparse settings.

## 2 Representation of Directed Acyclic Graphs

Consider a graph G = (V, E), where V corresponds to the set of nodes with p elements and E ⊂ V × V to the edge set. The nodes of the graph represent random variables X_1, …, X_p and the edges capture associations amongst them. An edge (j, i) is called directed if (j, i) ∈ E but (i, j) ∉ E, and undirected if both (j, i) ∈ E and (i, j) ∈ E. The main focus of this paper is a special class of graphs where E consists of only directed edges and does not include directed cycles. We denote by pa_i the set of parents of node i, and for j ∈ pa_i, we write j → i. The skeleton of a dag is the undirected graph that is obtained by replacing the directed edges of the dag with undirected ones. Finally, throughout this paper, we represent E using the adjacency matrix A of the graph; i.e. a p × p matrix whose (i, j)th entry indicates whether there is an edge (and possibly its weight) between nodes j and i.

The estimation of dags is a challenging problem due to the so-called observational equivalence of dags with respect to the same probability distribution. More specifically, regardless of the sample size, it may not be possible to infer the direction of causation among random variables from observational data. As an illustration of the observational equivalence, consider the simple dag in the right panel of Figure 1. Reversing the direction of all edges of the graph results in a new dag, which is the same as the original graph, except for changes in the node labels, and is therefore isomorphic to the original one. It is therefore natural to estimate the equivalence class of dags corresponding to the same probability distribution starting with the skeleton of the network.

The second challenge in estimating dags is that conditional independence among random variables may not reveal the skeleton. The notion of conditional independence in dags is either represented using the concept of d-separation (Pearl, 2000) or the moral graph of a dag (Lauritzen, 1996). The moral graph is obtained by removing the directions of the graph and “marrying” the parents of each node. Therefore estimation of the conditional independence structure reveals the structure of the moral graph of the dag, which includes additional edges between parents of each node. This can also be illustrated using the simple graph in the left panel of Figure 1. Suppose X_1, X_2, X_3 are normally distributed with covariance matrix Σ. The only zero elements of the inverse covariance matrix correspond to pairs of nodes that are not connected in the moral graph of the dag; in particular, parents of a common child are connected in the moral graph, and the corresponding entries of the inverse covariance matrix are therefore nonzero.

### 2.1 The Latent Variable Model

The causal effect of random variables in a directed acyclic graph can be explained using structural equation models, where each variable is modeled as a (nonlinear) function of its parents. The general form of these models is given by (Pearl, 2000):

 X_i = f_i(pa_i, Z_i), \qquad i = 1, \ldots, p    (2.1)

The random variables Z_i are the latent variables representing the unexplained variation in each node. To model the association among nodes of a dag, we consider a simplification of (2.1) with f_i being linear. More specifically, let ρ_ij represent the effect of node j on node i for j ∈ pa_i; then

 X_i = \sum_{j \in pa_i} \rho_{ij} X_j + Z_i, \qquad i = 1, \ldots, p    (2.2)

In the special case where the random variables are Gaussian, equations (2.1) and (2.2) are equivalent in the sense that the ρ_ij are the coefficients of the linear regression model of X_i on its parents X_j, j ∈ pa_i. It is known that in the normal case the latent variables Z_i are Gaussian and independent of the parents of node i.

Consider the simple dag in the left panel of Figure 1; denoting the influence matrix of the graph by Λ, (2.2) can be written in compact form as X = ΛZ, where for the simple example above, we have

 \Lambda = \begin{pmatrix} 1 & 0 & 0 \\ \rho_{12} & 1 & 0 \\ \rho_{12}\rho_{23} & \rho_{23} & 1 \end{pmatrix}

Let the latent variables Z_i be independent with mean 0 and variance σ². Then E(X) = 0 and Σ ≡ Var(X) = σ²ΛΛᵀ, where Λᵀ denotes the transpose of the matrix Λ.
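The relation Σ = σ²ΛΛᵀ can be checked numerically for the three-node chain above. A minimal sketch, where the edge weights ρ12 = 0.5 and ρ23 = 0.8 are illustrative values (not taken from the paper) and σ = 1:

```python
import numpy as np

# Influence matrix of the 3-node chain dag, with illustrative weights.
rho12, rho23 = 0.5, 0.8
Lam = np.array([
    [1.0,           0.0,   0.0],
    [rho12,         1.0,   0.0],
    [rho12 * rho23, rho23, 1.0],
])

# With independent latent variables of unit variance (sigma = 1),
# X = Lam @ Z implies Sigma = Lam @ Lam.T.
Sigma = Lam @ Lam.T
print(Sigma)
```

Note that every variance and covariance of X is an explicit polynomial in the path weights, e.g. Cov(X_1, X_3) = ρ12 ρ23.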

The following result from Shojaie and Michailidis (2009a) establishes relationships between the influence matrix Λ and the adjacency matrix A of the graph. The second part of the lemma establishes a compact relationship between Λ and A in the case of dags, which is explored in section 3 to directly formulate the problem of estimating the skeleton of a dag.

###### Lemma 2.1.

For any graph G with adjacency matrix A,

1. Λ = I + ∑_{r=1}^{p−1} A^r, where Λ is the influence matrix of G.

2. If G is a dag, I − A has full rank and Λ = (I − A)^{−1}.

###### Remark 2.2.

Part (ii) of Lemma 2.1 and the fact that Σ = σ²ΛΛᵀ imply that for any dag, if the entries of A are bounded, then Σ is full rank. More specifically, let λ_i(Σ) denote the ith eigenvalue of the matrix Σ. Then λ_min(Σ) is bounded away from zero (or equivalently λ_max(Σ^{−1}) is bounded). Similarly, since the entries of Λ are bounded whenever the entries of A are, full rankness of Λ implies that λ_max(Σ) is bounded (or equivalently λ_min(Σ^{−1}) is bounded away from zero). This result also applies to all subnetworks of a dag.

The properties of the proposed latent variable model established in Lemma 2.1 are independent of the choice of the probability distribution of the latent variables. In fact, since the latent variables in (2.2) are assumed independent, given the entries of the adjacency matrix, the distribution of each random variable in the graph only depends on the values of its parents. Therefore, regardless of the choice of the probability distribution, the joint distribution of the random variables is compatible with the dag (see for example Pearl (2000), p. 16). In section 5, we illustrate this result using data generated according to non-Gaussian distributions.

## 3 Penalized Estimation of dags

### 3.1 Problem Formulation

Consider the latent variable model of section 2.1 and denote by 𝒳 the n × p data matrix. We assume, without loss of generality, that the X_i's are centered and scaled, so that E(X_i) = 0 and Var(X_i) = 1. Note that the results in section 2.1 were established independent of the choice of the probability distribution. As mentioned before, under the normality assumption, the latent variable model is equivalent to the general structural equation model. Although we focus on Gaussian random variables in the remainder of this paper, the estimation procedure proposed in this section can be applied to a variety of other distributions, if one is willing to assume the linear structure in (2.2).
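The centering-and-scaling convention above is easy to enforce in practice. A minimal sketch with a hypothetical data matrix (the data here are random placeholders, not from the paper's examples):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 5))  # hypothetical n x p data

# Center each column and scale to unit (biased) standard deviation,
# matching the standardization assumed in section 3.1.
X = X - X.mean(axis=0)
X = X / X.std(axis=0)

# The empirical covariance matrix then has unit diagonal.
S = X.T @ X / X.shape[0]
```

After this step S is a correlation matrix, which is the form used in the row-wise regressions of section 3.2.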

Denote by Ω = Σ^{−1} the precision matrix of a p-vector of Gaussian random variables and consider a general penalty function J(·). The penalized likelihood function is then given by

 \hat{\Omega} = \operatorname*{argmin}_{\Omega \succ 0} \{ -\log\det(\Omega) + \operatorname{tr}(\Omega S) + \lambda J(\Omega) \}    (3.1)

where S denotes the empirical covariance matrix and λ is the tuning parameter controlling the size of the penalty. Applications in biological and social networks often involve sparse networks. It is therefore desirable to find a sparse solution for (3.1). This becomes more important in the small n, large p setting, where the unpenalized solution includes many additional edges. The lasso penalty of Tibshirani (1996) and the adaptive lasso penalty proposed by Zou (2006) are singular at zero and therefore result in sparse solutions. We consider these two penalties in order to find a sparse estimate of the adjacency matrix. Other choices of the penalty function are briefly discussed in the conclusions section.

Using the latent variable model of section 2.1, and the relationship between the covariance matrix and the adjacency matrix of dags established in Lemma 2.1, the problem of estimating the adjacency matrix of the graph can be directly formulated as an optimization problem based on A. As noted in Lemma 2.1, if the underlying graph is a dag and the ordering of the variables is known, then A is a lower triangular matrix with zeros on the diagonal. Let 𝒜 denote the class of such matrices. Then, using the facts that Ω = σ^{−2}(I − A)ᵀ(I − A) and det(I − A) = 1 (so that log det(Ω) does not depend on A), A can be estimated as the solution of the following optimization problem

 \hat{A} = \operatorname*{argmin}_{A \in \mathcal{A}} \{ \operatorname{tr}[(I - A)^{\sf T}(I - A)S] + \lambda J(A) \}    (3.2)

In this paper, we consider the general weighted lasso problem, where

 J(A) = \sum_{i,j = 1, \ldots, p;\; j < i} w_{ij} |A_{ij}|    (3.3)

Lasso and adaptive lasso problems are special cases of this general penalty. In the case of lasso, w_ij = 1 for all i, j. The original weights in adaptive lasso, proposed by Zou (2006), are obtained by setting w_ij = |Ã_ij|^{−γ}, for some initial estimate Ã of the adjacency matrix and some power γ > 0. To facilitate the study of asymptotic properties of adaptive lasso estimates we consider the following modification of the original weights

 w_{ij} = 1 \vee |\tilde{A}_{ij}|^{-\gamma}    (3.4)

where the initial estimates Ã_ij are obtained from the regular lasso estimates.
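The modified weights in (3.4) can be sketched directly: 1 ∨ x denotes max(1, x), and entries with a zero initial lasso estimate receive infinite weight, which effectively removes the corresponding edge from the adaptive fit. The function name and the example matrix below are hypothetical:

```python
import numpy as np

def adaptive_weights(A_init, gamma=1.0):
    """Adaptive lasso weights w_ij = max(1, |A_init_ij|^(-gamma)), as in (3.4).

    Entries with a zero initial lasso estimate get infinite weight,
    so the corresponding edge is effectively excluded.
    """
    with np.errstate(divide="ignore"):  # |0|^(-gamma) -> inf is intentional
        w = np.abs(A_init) ** (-gamma)
    return np.maximum(1.0, w)

# Hypothetical initial lasso estimate for a 2-node graph with one edge.
A_init = np.array([[0.0, 0.0],
                   [0.5, 0.0]])
W = adaptive_weights(A_init)
```

Here the single estimated edge (weight 0.5) receives penalty weight |0.5|^{−1} = 2, while absent edges receive infinite weight.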

The objective function for both lasso and adaptive lasso problems is convex. However, since the penalty is non-differentiable, these problems can be reformulated using the positive and negative parts A⁺ and A⁻ (so that A = A⁺ − A⁻). To that end, let W be the matrix of weights for adaptive lasso, or the matrix of ones for the lasso estimation problem. Problem (3.2) can then be formulated as:

 \min_{A^+, A^- \succeq 0} \operatorname{tr}\{ S(I - A^+ + A^-)^{\sf T}(I - A^+ + A^-) + \lambda (A^+ + A^-) W + \Delta (A^+ + A^-) 1l^+ \}    (3.5)

where ⪰ is interpreted componentwise, Δ is a large positive number and 1l⁺ is the indicator matrix for lower triangular elements of a matrix, including the diagonal elements. The last term of the objective function prevents the upper triangular and diagonal elements of the matrices A⁺ and A⁻ from being nonzero.

Problem (3.5) is a quadratic optimization problem with non-negativity constraints and can be solved using standard interior point algorithms. However, such algorithms do not scale well with dimension and are only applicable if p ranges in the hundreds. In section 3.2, we present an alternative formulation of the problem, which leads to considerably more efficient algorithms.

### 3.2 Optimization Algorithm

Consider again the problem of estimating the adjacency matrix of dags with either lasso or adaptive lasso penalties. Denoting the ith row of the matrix A as A_i, we can write (3.2) as:

 \hat{A} = \operatorname*{argmin}_{A \in \mathcal{A}} \Big\{ \sum_{i=1}^{p} A_i^{\sf T} S A_i - 2 A_i S_i^{\sf T} + \lambda W_i^{\sf T} |A_i| \Big\}    (3.6)

It can be seen that the objective function in (3.6) is separable and therefore it suffices to solve the optimization problem over each row of the matrix A. Denote by \underline{i} the set of indices up to i, i.e. \underline{i} = \{1, \ldots, i\}. Then, taking advantage of the lower triangular structure of A, solving (3.6) is equivalent to solving the following optimization problems:

 \hat{A}_{i,\underline{i-1}} = \operatorname*{argmin}_{\theta \in \mathbb{R}^{i-1}} \Big\{ \theta^{\sf T} S_{\underline{i-1},\underline{i-1}} \theta - 2 S_{i,\underline{i-1}} \theta + \lambda \sum_{j=1}^{i-1} w_{ij} |\theta_j| \Big\}, \qquad i = 2, \ldots, p    (3.7)

Using the facts that S = n^{−1} 𝒳ᵀ𝒳 and S_{i,\underline{i−1}} = n^{−1} 𝒳_{\underline{n},i}ᵀ 𝒳_{\underline{n},\underline{i−1}}, the problem in (3.7) can be reformulated as the following ℓ1-regularized least squares problems

 \hat{A}_{i,\underline{i-1}} = \operatorname*{argmin}_{\theta \in \mathbb{R}^{i-1}} \Big\{ n^{-1} \| \mathcal{X}_{\underline{n},\underline{i-1}} \theta - \mathcal{X}_{\underline{n},i} \|_2^2 + \lambda_i \sum_{j=1}^{i-1} w_{ij} |\theta_j| \Big\}, \qquad i = 2, \ldots, p    (3.8)

The formulation in (3.8) indicates that the ith row of the matrix A includes the coefficients of projecting X_i on the preceding variables X_1, …, X_{i−1}, which is in agreement with the discussion in section 2.1. It also reveals a connection between covariance selection methods and the neighborhood selection approach of Meinshausen and Bühlmann (2006); namely, when the underlying graph is a dag, the approximate solution of the neighborhood selection problem is exact, if the regression model is fitted on the set of parents of each node instead of all other nodes in the graph.

Using (3.8), the problem of estimating dags can be solved very efficiently. In fact, it suffices to solve p − 1 lasso problems for estimation of least squares coefficients, with dimensions ranging from 1 to p − 1. To solve these problems, we use the efficient pathwise coordinate optimization algorithm of Friedman et al. (2008a), implemented in the R-package glmnet. The proposed algorithm is summarized in Algorithm 1.
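The row-wise scheme above can be sketched compactly. The code below is a minimal numpy illustration, not the authors' glmnet-based implementation: it uses a plain coordinate-descent ("shooting") lasso solver for each row of (3.8), assumes standardized columns, and the chain example at the end (weights 0.8, λ = 0.2) is purely illustrative:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent (shooting) for the lasso objective of (3.8):
    (1/n) * ||X @ theta - y||_2^2 + lam * sum_j |theta_j|."""
    n, p = X.shape
    theta = np.zeros(p)
    resid = y.copy()                      # residual y - X @ theta
    col_norm = (X ** 2).sum(axis=0) / n   # (1/n) * X_j' X_j, = 1 if standardized
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * theta[j]   # partial residual excluding feature j
            rho = X[:, j] @ resid / n
            # Soft-thresholding update (threshold lam/2 for this scaling).
            theta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_norm[j]
            resid -= X[:, j] * theta[j]
    return theta

def estimate_dag(X, lam):
    """Row-wise estimation of the strictly lower triangular adjacency matrix:
    lasso-regress each X_i on its predecessors X_1, ..., X_{i-1}, as in (3.8)."""
    n, p = X.shape
    A = np.zeros((p, p))
    for i in range(1, p):
        A[i, :i] = lasso_cd(X[:, :i], X[:, i], lam)
    return A

# Illustrative example: data generated from the chain 1 -> 2 -> 3.
rng = np.random.default_rng(1)
n = 500
Z = rng.normal(size=(n, 3))
Xd = np.empty((n, 3))
Xd[:, 0] = Z[:, 0]
Xd[:, 1] = 0.8 * Xd[:, 0] + Z[:, 1]
Xd[:, 2] = 0.8 * Xd[:, 1] + Z[:, 2]
Xd = (Xd - Xd.mean(axis=0)) / Xd.std(axis=0)   # center and scale
A_hat = estimate_dag(Xd, lam=0.2)
```

With a moderate penalty the estimate recovers the chain: the entries for 1 → 2 and 2 → 3 are large, while the spurious 1 → 3 entry is shrunk toward zero. The p − 1 row problems share no variables across rows, which is what makes the parallelization mentioned in section 3.3 possible.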

### 3.3 Analysis of Computational Complexity

In this section, we provide a comparison of the computational complexity of the algorithm proposed in section 3.2 and the PC-Algorithm. As mentioned in the introduction, the space of all possible dags is super-exponential in the number of nodes and hence it is not surprising that the PC-Algorithm, without any restriction on the space of dags, has exponential complexity. Kalisch and Bühlmann (2007) propose an efficient implementation of the PC-Algorithm for sparse dags; its complexity, when the maximal neighborhood size q is small, is bounded with high probability by a polynomial in p of degree q. Although this is a considerable improvement over other methods of estimating dags, in many applications it can become fairly expensive. For example, gene regulatory networks and signaling pathways exhibit a “hub” structure, which leads to large values of q. The graphical lasso algorithm, proposed by Friedman et al. (2008b), uses an iterative algorithm for estimation of the inverse covariance matrix and has computational complexity O(p³).

The reformulation of the dag estimation problem in (3.8) requires solving p − 1 lasso regression problems. The cost of solving a lasso problem comprised of k covariates and n observations using the pathwise coordinate optimization (shooting) algorithm of Friedman et al. (2008a) is of order nk; hence, the total cost of estimating the adjacency matrix of the graph is O(np²), which is the same as the cost of calculating the (full) empirical covariance matrix S. Moreover, the formulation in (3.8) includes a set of non-overlapping sub-problems. Therefore, for problems with a very large number of nodes and/or observations, the performance of the algorithm can be further improved by parallelizing the estimation of these sub-problems. The adaptive lasso version of the problem is similarly solved using the modification of the regular lasso problem proposed in Zou (2006), which results in the same computational cost as the regular lasso problem.

Figure 2 compares the CPU time required for estimation of dags using both the PC-Algorithm and our proposed algorithm for a range of values of p and n. To control the complexity of the PC-Algorithm, the average neighborhood size is set to 5, and the significance level for the PC-Algorithm, as well as the tuning parameter for lasso and adaptive lasso penalties, are set according to the optimal values discussed in Section 5. The time reported for the PC-Algorithm is the CPU time required for estimation of the skeleton of the graph. The plot demonstrates the higher order of complexity of the PC-Algorithm, as well as the dependency of the algorithm on the sample size.

## 4 Asymptotic Properties

### 4.1 Preliminaries

We next establish theoretical properties for both lasso, as well as adaptive lasso estimators of the adjacency matrix of dags. Asymptotic properties of lasso-type estimates with fixed design matrices have been studied by a number of researchers (see e.g. Knight and Fu, 2000; Zou, 2006; Huang et al., 2008). Lasso estimates with random design matrices have been also considered by Meinshausen and Bühlmann (2006). On the other hand, Rothman et al. (2008) and Lam and Fan (2008) among others have studied asymptotic properties of estimates of covariance and precision matrices.

As discussed in section 3.1, the problem of estimating the adjacency matrix of a dag is equivalent to solving the non-overlapping penalized least squares problems described in (3.7). In order to study the asymptotic properties of the proposed estimators, we focus on the asymptotic consistency of network estimation, i.e. the probability of correctly estimating the network structure, in terms of type I and type II errors. We allow the total number of nodes in the graph to grow as an arbitrary polynomial function of the sample size, while assuming that the true underlying network is sparse. The assumptions required for establishing asymptotic properties of lasso and adaptive lasso estimates of dags are presented in section 4.2. We then study variable selection consistency of both lasso and adaptive lasso estimates in section 4.3. The choice of the penalty parameter is studied in section 4.4. Technical proofs are given in the Appendix.

### 4.2 Assumptions

Let X_1, …, X_p be a collection of zero-mean Gaussian random variables with covariance matrix Σ. Denote by 𝒳 the n × p matrix of observations, and by S = n^{−1}𝒳ᵀ𝒳 the empirical covariance matrix. To simplify the notation, denote by θ^i the entries of the ith row of the adjacency matrix to the left of the diagonal. Further, let θ̂^{i,pa_i} be the estimate for the ith row, with values outside the set of indices pa_i set to zero; i.e., θ̂^{i,pa_i}_j = 0 for j ∉ pa_i. Consider the following assumptions:

1. (A-0) For some , as , and there exists a such that as .

2. (A-1) There exists such that for all and all , .

3. (A-2) There exists and some (with defined above) such that for all and for every , , where is the partial correlation between and after removing the effect of the remaining variables.

4. (A-3) There exists such that for all and every and for every , .

5. (A-4) There exists such that for all and for every , .

Assumption (A-3) limits the magnitude of the effects that each node in the network receives from its parents. This is less restrictive than the neighborhood selection criterion, where the effects over all neighboring nodes are assumed to be bounded. In fact, empirical data indicate that the average number of upstream regulators per gene in regulatory networks is less than 2 (Leclerc, 2008). Thus, the parent set of each node is small, but each hub node can affect many downstream nodes.

Assumption (A-4) is referred to as neighborhood stability and is equivalent to the irrepresentability assumption proposed by Huang et al. (2008). It has been shown that the lasso estimates are not in general variable selection consistent if this assumption is violated. Huang et al. (2008) considered adaptive lasso estimates with general initial weights and showed their variable selection consistency under a weaker form of irrepresentability assumption, referred to as adaptive irrepresentability. We will show that when the initial weights for adaptive lasso are derived from the regular lasso estimates (as in (3.4)), the assumption of neighborhood stability, as well as the less stringent assumption (A-3), are not required for establishing variable selection consistency of adaptive lasso. This relaxation in assumptions required for variable selection consistency is a result of the consistency of regular lasso estimates, as well as the special structure of dags. Moreover, the results of this section can be extended to adaptive lasso estimates of the precision matrix, as well as regression models with fixed and random design matrices, under additional mild assumptions.

### 4.3 Asymptotic Consistency of dag Estimation

Our first result studies the variable selection consistency of the lasso penalty.

###### Theorem 4.1 (Variable Selection Consistency of Lasso).

Suppose that (A-1)-(A-4) hold and for some and . Then there exist constants for the lasso estimation problem such that, for all , as n → ∞:

1. (i) Estimation of the direction of influence:

 \operatorname{pr}\{\operatorname{sign}(\hat{\theta}^{i,pa_i}_j) = \operatorname{sign}(\theta^{i,pa_i}_j) \ \text{for all}\ j \in pa_i\} = 1 - O\{\exp(-c^{(i)} n^{\zeta})\}
2. (ii) Control of type I error:

 \operatorname{pr}\{\hat{\theta}^{i}_j = 0 \ \text{for all}\ j \notin pa_i\} = 1 - O\{\exp(-c^{(ii)} n^{\zeta})\}

3. (iii) Control of type II error:

 \operatorname{pr}\{\hat{\theta}^{i}_j \neq 0 \ \text{for all}\ j \in pa_i\} = 1 - O\{\exp(-c^{(iii)} n^{\zeta})\}

4. (iv) Let Ê be the lasso estimate of the set of edges E in the network. Then

 \operatorname{pr}(\hat{E} = E) = 1 - O\{\exp(-c^{(iv)} n^{\zeta})\}.
###### Proof.

The proof of this theorem follows from arguments similar to those presented in Meinshausen and Bühlmann (2006), with minor modifications and replacing conditional independence in undirected graphs with d-separation in dags. ∎

The next result establishes similar properties for adaptive lasso estimates, without the assumption of neighborhood stability. The proof of Theorem 4.3 makes use of consistency of a class of sparse estimates of the Cholesky factors of covariance matrices, established in Theorem 9 of Lam and Fan (2008). For completeness, we restate a simplified version of the theorem for our lasso problem, in which the eigenvalues of the covariance matrix are bounded (see Remark 2.2). Throughout this section, we denote by s the total number of nonzero elements of the true adjacency matrix of the dag.

###### Theorem 4.2 (Lam and Fan (2008)).

If (s log p)/n = o(1) and λ ≍ {(log p)/n}^{1/2}, then ‖Â − A‖_F² = O_P{(s log p)/n}.

It can be seen from Theorem 4.2 that lasso estimates are consistent as long as (s log p)/n → 0. To take advantage of this result, we replace (A-0) with the following assumption

1. (A-0*) For some , as . Also, as , where as .

Assumption (A-0*) further restricts the number of parents of each node and also enforces a restriction on the total number of nonzero elements of the adjacency matrix; in particular, it implies a higher degree of sparsity than (A-0). Therefore, although the consistency of adaptive lasso in Theorem 4.3 is established without making any further assumptions on the structure of the network (compared to Theorem 4.1), it is achieved at the price of requiring a higher degree of sparsity in the network. We now state the main result regarding variable selection consistency of adaptive lasso. Note that the theorem only requires assumptions (A-0*), (A-1) and (A-2); assumptions (A-3) and (A-4) are no longer required.

###### Theorem 4.3 (Variable Selection Consistency of Adaptive Lasso).

Consider the adaptive lasso estimation problem, where the initial weights are calculated using the regular lasso estimates of the adjacency matrix of the graph in (3.7). Suppose (A-0*) and (A-1)-(A-2) hold and for some and . Also suppose that the initial lasso estimates are found using a penalty parameter that satisfies the requirement of Theorem 4.2. Then there exist constants such that, for all , as n → ∞, (i)-(iv) in Theorem 4.1 hold.

###### Proof.

A proof is given in the Appendix. ∎

### 4.4 Choice of the Tuning Parameter

Both lasso, as well as adaptive lasso estimates of the adjacency matrix, depend on the choice of the tuning parameter λ. Different methods have been proposed for selecting the value of the tuning parameter, including cross validation (Rothman et al., 2008; Fan et al., 2007) and the Bayesian information criterion (bic) (Yuan and Lin, 2007). In both of these methods, the value of λ is chosen to minimize the (penalized) likelihood function. However, choices of λ that result in the optimal classification error do not guarantee a small error for the network reconstruction. We propose next a choice of λ for dags. Consider the following choice of the tuning parameter for the general weighted lasso problem with weights w_ij. Let Z*_q denote the (1 − q)th quantile of the standard normal distribution, and define

 \lambda_i(\alpha) = 2 n^{-1/2} Z^*_{\frac{\alpha}{2p(i-1)}}    (4.1)
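Equation (4.1) is straightforward to compute with the standard normal quantile function; Python's stdlib `statistics.NormalDist` suffices. A sketch (the function name and the numeric inputs are illustrative):

```python
from statistics import NormalDist

def lambda_i(alpha, n, p, i):
    """Error-based tuning parameter of (4.1) for the i-th row problem (i >= 2):
    lambda_i(alpha) = 2 * n^{-1/2} * Z*_q with q = alpha / (2 * p * (i - 1)),
    where Z*_q is the (1 - q) quantile of the standard normal distribution."""
    q = alpha / (2 * p * (i - 1))
    z_star = NormalDist().inv_cdf(1 - q)
    return 2 * n ** (-0.5) * z_star

# Illustrative values: alpha = 0.1, n = 100 observations, p = 50 nodes.
lam2 = lambda_i(0.1, n=100, p=50, i=2)
```

Note that λ_i grows (slowly) with i, reflecting the larger number of candidate parents in later row problems; the dependence on p builds in a Bonferroni-style correction over all comparisons.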

The following theorem states that such a choice controls the probability of falsely joining two distinct ancestral sets, defined next.

###### Definition 4.4.

For every node i ∈ V, the ancestral set AN_i of node i consists of all nodes j such that j is an ancestor of i, or i is an ancestor of j, or i and j have a common ancestor.

###### Theorem 4.5 (Choice of the Tuning Parameter).

Under the assumptions of Theorems 4.1 and 4.3 above (for lasso and adaptive lasso, respectively), for any α ∈ (0, 1), the solution of the general weighted lasso estimation problem with tuning parameter determined in (4.1) satisfies

 \operatorname{pr}(\exists\, i \in V : \widehat{AN}_i \nsubseteq AN_i) \leq \alpha
###### Proof.

A proof is given in the Appendix. ∎

Note that, as in Meinshausen and Bühlmann (2006), Theorem 4.5 holds for all values of p and n. However, the theorem does not provide any guarantee on false positive or false negative probabilities for individual edges in the graph. We also need to determine the optimal choice of penalty parameter for the first phase of the adaptive lasso, where the weights are estimated using lasso. Since the goal of the first phase is to achieve prediction consistency, cross validation can be used to determine the optimal choice of the penalty. On the other hand, it is easy to see that the error-based proposal in (4.1) satisfies the requirement of Theorem 4.2 and can therefore be used to define the initial penalty parameter. It is, however, recommended to use a higher value of the significance level α when estimating the initial weights, in order to prevent an over-sparse solution.

## 5 Performance Analysis

### 5.1 Preliminaries

In this section, we consider examples of estimating dags with varying numbers of edges from randomly generated data. To randomly generate data from dags, one needs to generate lower-triangular adjacency matrices A with sparse nonzero elements. We use the random dag generator in the R-package pcalg (Kalisch and Bühlmann (2007)), which also controls the neighborhood size. The sparsity levels in dags of different sizes are set according to the theoretical bounds in section 4, as well as the recommendations of Kalisch and Bühlmann (2007) for the neighborhood size. More specifically, in simulations throughout this section, we use a maximum neighborhood size of 5, while limiting the total number of true edges to be equal to the sample size n.
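A minimal sketch of such a generator (not the pcalg implementation; the edge-weight range and all names are illustrative) produces a strictly lower-triangular adjacency matrix whose ith row has a bounded number of nonzero entries, so that the node labels 1, …, p form a topological ordering:

```python
import numpy as np

def random_dag_adjacency(p, max_neighbors=5, rng=None):
    """Random strictly lower-triangular adjacency matrix: node i draws at
    most max_neighbors parents uniformly from nodes 1..i-1, with edge
    weights bounded away from zero (illustrative generator, not pcalg)."""
    rng = np.random.default_rng(rng)
    A = np.zeros((p, p))
    for i in range(1, p):
        k = rng.integers(0, min(max_neighbors, i) + 1)   # number of parents
        parents = rng.choice(i, size=k, replace=False)
        A[i, parents] = rng.uniform(0.5, 1.0, size=k) * rng.choice([-1, 1], size=k)
    return A
```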

Different measures of structural difference can be used to evaluate the performance of estimators. The Structural Hamming Distance (shd) between the structures of the estimated and true dags represents the number of edges that are not in common between the two graphs, and is equal to the sum of false positive and false negative edges in the estimated graph. The main drawback of this measure is its dependence on the number of nodes, as well as the sparsity of the network. The second measure of goodness of estimation considered here is the Matthews Correlation Coefficient (mcc). mcc is commonly used to assess the performance of binary classification methods and is defined as

 \textsc{mcc} = \frac{(\textsc{tp}\times\textsc{tn}) - (\textsc{fp}\times\textsc{fn})}{\{(\textsc{tp}+\textsc{fp})(\textsc{tp}+\textsc{fn})(\textsc{tn}+\textsc{fp})(\textsc{tn}+\textsc{fn})\}^{1/2}} \qquad (5.1)

where tp, tn, fp and fn correspond to true positive, true negative, false positive and false negative, respectively. The value of mcc ranges from −1 to 1, with larger values corresponding to better fits (−1 and 1 represent the worst and best fits, respectively). Finally, in order to compare the performance of different estimation methods with the theoretical bounds established in section 4.3, we also report the values of false positive rates.
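Both measures can be computed directly from the estimated and true adjacency matrices; the helper below is ours and assumes that edges are the nonzero sub-diagonal entries, so that candidate edges are the p(p−1)/2 positions below the diagonal.

```python
import numpy as np

def structure_metrics(A_hat, A_true):
    """SHD and MCC of an estimated adjacency structure against the truth.
    SHD = FP + FN; MCC follows (5.1), with 0 returned for a degenerate
    denominator."""
    est = (np.tril(A_hat, -1) != 0)
    tru = (np.tril(A_true, -1) != 0)
    mask = np.tril(np.ones_like(est, dtype=bool), -1)   # candidate edge slots
    tp = np.sum(est & tru)
    tn = np.sum(~est & ~tru & mask)
    fp = np.sum(est & ~tru)
    fn = np.sum(~est & tru)
    shd = fp + fn
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = ((tp * tn - fp * fn) / denom) if denom > 0 else 0.0
    return shd, mcc
```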

The performance of both the PC-Algorithm and our proposed estimators based on the choice of tuning parameter in (4.1) varies with the value of the significance level α. In the following experiments, we first investigate the appropriate choice of α for each estimator. We then compare the performance of the estimators under an optimal choice of the tuning parameter, using both numerical measures of performance and gray-scale images of the estimated dags alongside the true structure. Gray-scale images are obtained by calculating the proportion of times that a specific edge is present in the simulations. To offset the effect of numerical instability, we consider an edge present if its estimated proportion exceeds a small threshold.
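The gray-scale images above reduce to a simple per-edge frequency over simulation replicates; a sketch (the helper name is ours):

```python
import numpy as np

def edge_frequencies(estimates):
    """Proportion of simulation runs in which each edge is present, given a
    list of estimated adjacency matrices; values near 1 render dark in a
    gray-scale image, values near 0 render light."""
    stacked = np.stack([(A != 0).astype(float) for A in estimates])
    return stacked.mean(axis=0)
```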

### 5.2 Estimation of dags from Normally Distributed Observations

We begin with an example that illustrates the differences between estimation of dags and of conditional independence graphs (cig). The first two images in Figure 3 represent a randomly generated dag of size 50, along with the gray-scale image of the average precision matrix estimated based on 100 observations using the graphical lasso algorithm (implemented in the R package glasso). To control the probability of falsely connecting two components of the graph, the value of the tuning parameter for glasso is set based on the error-based proposal of Banerjee et al. (2008) (Theorem 2). It can be seen that the cig has many more edges (considerably more false positives than lasso and adaptive lasso) and does not reveal the true structure of the underlying dag. Thus, although methods for estimating cigs are computationally efficient, they should not be used in applications like estimation of gene regulatory networks, where the underlying graph is directed.

In simulations throughout this section, the sample size n is fixed, and estimators are evaluated for an increasing number of nodes p. Figure 4 shows the mean and standard deviation of Hamming distances for estimates based on the PC-Algorithm (pcalg), as well as the proposed lasso (lasso) and adaptive lasso (Alasso) methods, for different values of the tuning parameter and different network sizes. For all values of α and p, the adaptive lasso estimate gives the best results, and the proposed penalized likelihood methods outperform the PC-Algorithm. This difference becomes more significant as the size of the network increases.

As mentioned in section 2, it is not always possible to estimate the direction of the edges of a dag and, therefore, the estimate from the PC-Algorithm may include undirected edges. Since our penalized likelihood methods assume knowledge of the ordering of the variables and estimate the structure of the network, in the simulations considered here we only estimate the skeleton (structure) of the network using the PC-Algorithm. We then use the ordering of the variables to determine the direction of the edges. The performance of the PC-Algorithm for estimation of the partially completed dag (pcdag) may therefore be worse than the results reported here.

In the simulation results reported here, observations are generated according to the linear structural equation model (2.2) with standard normal latent variables. Additional simulation studies with different parameter values indicate that changes in the variance of the latent variables do not have a considerable effect on the performance of the proposed models. On the other hand, as the magnitude of the nonzero entries of the adjacency matrix decreases, the performance of the proposed methods (as well as the pcalg algorithm) deteriorates, but the findings of the above comparison remain unchanged.
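Sampling from the linear structural equation model can be sketched as follows, assuming the model X = AX + Z with strictly lower-triangular A and standard normal latent variables Z, so that X = (I − A)^{-1} Z (our reading of model (2.2); the function name is illustrative):

```python
import numpy as np

def sample_sem(A, n, rng=None):
    """Draw n observations from the linear SEM X = A X + Z with standard
    normal latent variables Z; A must be strictly lower triangular so that
    I - A is invertible and the model is recursive."""
    rng = np.random.default_rng(rng)
    p = A.shape[0]
    Z = rng.standard_normal((n, p))
    return Z @ np.linalg.inv(np.eye(p) - A).T   # rows are samples of X
```

For a single edge A[1, 0] = ρ, the second variable is X_2 = ρ X_1 + Z_2, so its variance is ρ² + 1, which gives a quick sanity check on the generator.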

The above simulation results suggest that the optimal performance of the PC-Algorithm is achieved at a small value of the significance level α. The performance of the lasso and adaptive lasso methods is less sensitive to the choice of α; however, a moderate value appears to deliver more reliable estimates. Similar results were also observed in simulations with other choices of n and p. In addition, our extended simulations indicate that the performance of adaptive lasso does not vary significantly with the choice of the power γ in the adaptive weights, and therefore we only present results for a single value of γ.

Figure 3 presents images of the estimated and true dags created based on the above considerations for the chosen tuning parameters. Similar results are observed for other settings and are excluded to conserve space. Plots in Figure 5 compare the performance of the three methods with the optimal settings of tuning parameters over a range of values of p. It can be seen that the values of mcc confirm the above findings based on shd. However, false positive and true positive rates only focus on one aspect of estimation at a time and do not provide a clear distinction between the methods. These plots also suggest that estimates from the pcalg may vary significantly. This variation is most visible in the false positive rates, where standard deviations for the pcalg estimates are up to 10 times larger than those of the lasso and Alasso estimates.

As mentioned in section 2, the representation of conditional independence in dags adopted in our proposed algorithm is not restricted to normally distributed random variables. Moreover, if the underlying structural equations are linear, the method proposed in this paper can correctly estimate the underlying dag. In order to assess the sensitivity of the estimates to the underlying distribution, we performed two simulation studies with non-normal observations. In both simulations, observations were generated according to a linear structural model. However, in the first simulation, each latent variable was generated from a mixture of a standard normal and a t-distribution with 3 degrees of freedom, while in the second simulation, a t-distribution with 4 degrees of freedom was used. The performance of the proposed algorithm for non-normal observations was similar to the case of Gaussian observations, with Alasso providing the best estimates, and the performance of penalized methods improving as the dimension and sparsity increase.

### 5.3 Sensitivity to Perturbations in the Ordering of the Variables

Algorithm 3.2 assumes a known ordering of the variables. The superior performance of the proposed penalized likelihood methods in comparison to the PC-Algorithm may be explained by the fact that the additional information about the order of the variables significantly simplifies the problem of estimating dags. Therefore, when such additional information is available, estimates from the PC-Algorithm suffer from a natural disadvantage. However, as the underlying network becomes sparser, it includes fewer complex structures, and the ordering of the variables is expected to play a less significant role.

Next, we study the performance of the proposed penalized likelihood methods, as well as the PC-Algorithm, in problems where the ordering of the variables is unknown. To this end, we generate normally distributed observations from the latent variable model of section 2.1. We then randomly permute the order of the variables in the observation matrix and use the permuted matrix to estimate the original dag. Figure 6 illustrates the performance of the three methods for the choices of α described in section 5.2. It can be seen that for small, dense networks, the PC-Algorithm outperforms the proposed methods. This is expected, since the change in the order of the variables causes the algorithm to include unnecessary moral edges, while failing to recognize some of the existing associations. On the other hand, as the size of the network, and correspondingly its degree of sparsity, increases, the local structures become simpler, and the ordering of the variables becomes less crucial. Thus, the performance of the penalized likelihood algorithms improves relative to that of the PC-Algorithm. In the high dimensional sparse case, where the computational cost of the PC-Algorithm also becomes more significant, both lasso and adaptive lasso methods outperform the PC-Algorithm.

## 6 Real Data Application

### 6.1 Analysis of Cell Signalling Pathway Data

Sachs et al. (2003) carried out a set of flow cytometry experiments on signaling networks of human immune system cells. The ordering of the connections between pathway components was established based on perturbations of cells using molecular interventions, and we consider the ordering to be known a priori. The data set includes measurements on a set of signaling proteins across a number of samples.

Friedman et al. (2008b) analyzed this data set using the glasso algorithm. They estimated the graph for a range of values of the penalty and reported moderate agreement (in terms of false positive and false negative rates) between one of the estimates and the findings of Sachs et al. (2003). True and estimated signaling networks using the PC-Algorithm, as well as the lasso and adaptive lasso algorithms, along with performance measures, are given in Figure 7. The estimated network using the pcalg includes a number of undirected edges. As in the simulation studies, we only estimate the structure of the network using the pcalg and determine the direction of the edges by enforcing the ordering of nodes in the dag. It can be seen that the adaptive lasso and lasso provide estimates that are closest to the true structure.

### 6.2 Transcription Regulatory Network of E-coli

Transcriptional regulatory networks play an important role in controlling gene expression in cells, and incorporating the underlying regulatory network results in more efficient estimation and inference (Shojaie and Michailidis, 2009a, b). Kao et al. (2004) proposed to use Network Component Analysis to infer the transcriptional regulatory network of Escherichia coli (E-coli). They also provide whole-genome expression data over time, as well as information about the known regulatory network of E-coli.

In this application, the set of transcription factors (TFs) is known a priori, and the goal is to find connections among transcription factors and regulated genes through analysis of whole-genome transcriptomic data. Therefore, the algorithm proposed in this paper can be used by exploiting the natural hierarchy of TFs and genes. Kao et al. (2004) provide gene expression data for 7 transcription factors and 40 genes regulated by these TFs. Figure 8 presents the known regulatory network of E-coli, along with the networks estimated using the three different methods, as well as different measures of performance. The relatively poor performance of the algorithms in this example can be partially attributed to the small sample size. However, it is also known that no single source of transcriptomic data is expected to successfully reveal regulatory networks, and methods that combine different sources of data are considered more efficient. It can be seen that the PC-Algorithm can only detect one of the true regulatory connections, while the proposed algorithm, with both lasso and adaptive lasso penalties, offers significant improvements over the results of the PC-Algorithm, mostly due to a significant drop in the proportion of false negatives relative to the PC-Algorithm. Although the estimates based on lasso and adaptive lasso penalties are very similar, the choice of the best estimate depends on the performance evaluation metric.

## 7 Conclusion

We proposed efficient penalized likelihood methods for estimation of the structure of dags when the variables inherit a natural ordering. Both lasso and adaptive lasso penalties were considered in this paper. However, the proposed algorithm can also be used for estimation of the adjacency matrix of dags under other choices of penalty, as long as the penalty function is applied to the individual elements of the adjacency matrix.

There are a number of biological applications where the ordering of the variables is known a priori. Estimation of transcriptional regulatory networks from gene expression data and reconstruction of causal networks from temporal observations, based on the concept of Granger causality (Granger, 1969), are areas of potential application for the proposed algorithm. Our simulation studies also indicate that the correct ordering of the variables is less crucial for estimation of high dimensional sparse dags. Thus, even in high dimensional sparse applications where the ordering among variables is not known, the methods proposed in this paper may be an efficient alternative to search-based methods for estimating dags.

## Appendix: Technical Proofs

Recall from section 4.2 that θ^i denotes the entries of the ith row of the adjacency matrix to the left of the diagonal. Also, denote by θ̂^{i,I} the estimate of the ith row with values outside the set of indices I set to zero, and let θ̂^{i,I}_j be the jth component of θ̂^{i,I}.

The following lemma is a consequence of the Karush-Kuhn-Tucker conditions for the general weighted lasso problem and is used in the proofs of Theorems 4.3 and 4.5.

###### Lemma 7.1.

Let θ̂^{i,I} be the general weighted lasso estimate of θ^i, i.e.

 \hat{\theta}^{i,I} = \operatorname*{argmin}_{\theta:\,\theta_k = 0\ \forall k \notin I}\Big\{ n^{-1}\|X_i - X\theta\|_2^2 + \lambda \sum_{k=1}^{p} |\theta_k|\, w_{ik} \Big\} \qquad (7.1)

Let

 G_j(\theta) = -2 n^{-1} X_j^{\sf T}(X_i - X\theta)

and let w_i = (w_{i1}, …, w_{ip}) be the vector of weights in the adaptive lasso estimation problem. Then a vector θ̂ with θ̂_k = 0 for all k ∉ I is a solution of (7.1) if and only if G_j(θ̂) = −sign(θ̂_j) λ w_{ij} if θ̂_j ≠ 0, and |G_j(θ̂)| ≤ λ w_{ij} if θ̂_j = 0. Moreover, if the solution is not unique and |G_j(θ̂)| < λ w_{ij} for some solution θ̂, then θ̂_j = 0 for all solutions of (7.1).

###### Proof.

The proof of the lemma is identical to the proof of Lemma A.1 in Meinshausen and Bühlmann (2006) (except for the inclusion of the general weights w_{ik}) and is therefore omitted. ∎

###### Proof of Theorem 4.3.

To prove (i), note that by Bonferroni’s inequality, and the fact that the number of variables grows polynomially in n, it suffices to show that there exists some constant c^{(i)} > 0 such that, for every j ∈ pa_i, sign(θ̂^{i,pa_i}_j) = sign(θ_{ij}) with probability 1 − O{exp(−c^{(i)} n^ζ)}.

Let θ̂^{i,pa_i}(β) be the estimate of θ^i in (7.1), with the jth component fixed at a constant value β,

 \hat{\theta}^{i,pa_i}(\beta) = \operatorname*{argmin}_{\theta \in \Theta_\beta}\Big\{ n^{-1}\|X_i - X\theta\|_2^2 + \lambda \sum_{k=1}^{p} |\theta_k|\, w_{ik} \Big\} \qquad (7.2)

where Θ_β = {θ ∈ ℝ^p : θ_k = 0 ∀ k ∉ pa_i, θ_j = β}. Note that for β = θ̂^{i,pa_i}_j, θ̂^{i,pa_i}(β) is identical to θ̂^{i,pa_i}. Thus, if sign(θ̂^{i,pa_i}_j) ≠ sign(θ_{ij}), there would exist some β with sign(β) ≠ sign(θ_{ij}) such that θ̂^{i,pa_i}(β) is a solution to (7.2). Since sign(θ_{ij}) ≠ 0, it suffices to show that for all β with sign(β) ≠ sign(θ_{ij}), with high probability, θ̂^{i,pa_i}(β) cannot be a solution to (7.2). Without loss of generality, we consider the case where θ_{ij} > 0 (the case θ_{ij} < 0 can be shown similarly). Then if β ≤ 0, from Lemma 7.1, θ̂^{i,pa_i}(β) can be a solution to (7.2) only if G_j(θ̂^{i,pa_i}(β)) ≥ −λ w_{ij}. Hence, it suffices to show that, for some c^{(i)} > 0 and all β ≤ 0,

 \operatorname{pr}\Big[\sup_{\beta \le 0}\big\{G_j(\hat{\theta}^{i}(\beta)) < -\lambda w_{ij}\big\}\Big] = 1 - O\{\exp(-c^{(i)} n^{\zeta})\} \quad \text{as } n \to \infty \qquad (7.3)

Define,

 R_i(\beta) = X_i - X\hat{\theta}^{i}(\beta) \qquad (7.4)

For every j ∈ pa_i we can write,

 X_j = \sum_{k \in pa_i \setminus \{j\}} \theta^{j,pa_i \setminus \{j\}}_k X_k + Z_j \qquad (7.5)

where Z_j is independent of {X_k, k ∈ pa_i ∖ {j}}. Then by (7.5),

 G_j(\hat{\theta}^{i}(\beta)) = -2 n^{-1} Z_j^{\sf T} R_i(\beta) - \sum_{k \in pa_i \setminus \{j\}} \theta^{j,pa_i \setminus \{j\}}_k\, 2 n^{-1} X_k^{\sf T} R_i(\beta)

By Lemma 7.1, it follows that for all k, |2 n^{-1} X_k^{\sf T} R_i(β)| ≤ λ w_{ik}; thus,

 G_j(\hat{\theta}^{i}(\beta)) \le -2 n^{-1} Z_j^{\sf T} R_i(\beta) + \lambda \sum_{k \in pa_i \setminus \{j\}} |\theta^{j,pa_i \setminus \{j\}}_k|\, w_{ik} \qquad (7.6)

Using (7.6) and the boundedness of the coefficients θ^{j,pa_i ∖ {j}}, it suffices to show that

 \operatorname{pr}\Big[\sup_{\beta \le 0}\big\{-2 n^{-1} Z_j^{\sf T} R_i(\beta)\big\} < -\lambda \sum_{k \in pa_i} w_{ik}\Big] = 1 - O\{\exp(-c^{(i)} n^{\zeta})\} \quad \text{as } n \to \infty, \qquad (7.7)

or equivalently,

 \operatorname{pr}\Big[\inf_{\beta \le 0}\big\{2 n^{-1} Z_j^{\sf T} R_i(\beta)\big\} \le \lambda \sum_{k \in pa_i} w_{ik}\Big] = O\{\exp(-c^{(i)} n^{\zeta})\} \quad \text{as } n \to \infty. \qquad (7.8)

It is shown in Lemma A.2 of Meinshausen and Bühlmann (2006) that for any q > 0, there exists c^{(i)} > 0 such that

 \operatorname{pr}\Big[\inf_{\beta \le 0}\big\{2 n^{-1} Z_j^{\sf T} R_i(\beta)\big\} \le q\lambda\Big] = O\{\exp(-c^{(i)} n^{\zeta})\} \quad \text{as } n \to \infty. \qquad (7.9)

However, by definition the weights satisfy Σ_{k ∈ pa_i} w_{ik} ≤ q for some q > 0, and therefore λ Σ_{k ∈ pa_i} w_{ik} ≤ qλ, which implies that (i) follows from (7.9).

To prove (ii), note that the event {p̂a_i ⊄ pa_i} is equivalent to the event that there exists a node j ∈ \underline{i−1} ∖ pa_i such that θ̂_{ij} ≠ 0, i.e.

 \operatorname{pr}(\widehat{pa}_i \subseteq pa_i) = 1 - \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : \hat{\theta}_{ij} \neq 0) \qquad (7.10)

By Lemma 7.1, and using the definition of the weights,

 \begin{aligned} \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : \hat{\theta}_{ij} \neq 0) &= \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : |G_j(\hat{\theta}^{i,pa_i})| \ge w_{ij}\lambda) \\ &\le \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : |G_j(\hat{\theta}^{i,pa_i})| \ge q\lambda \text{ and } w_{ij}\lambda \le q\lambda \text{ for some } q \ge 1) \\ &\le \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : w_{ij} \le q \text{ for some } q \ge 1) \end{aligned}

Since w_{ij} = |θ̃_{ij}|^{-γ}, with θ̃^i the lasso estimate of the ith row of the adjacency matrix from (3.8), using Lemma 7.1,

 \begin{aligned} \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : w_{ij} \le q \text{ for some } q \ge 1) &= \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : |\tilde{\theta}_{ij}| \ge q^{-1/\gamma} \text{ for some } q \ge 1) \\ &\le \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : |\tilde{\theta}_{ij}| \ge q' \text{ for some } q' > 0) \\ &\le \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : \tilde{\theta}_{ij} \neq 0) \\ &= \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : |G_j(\tilde{\theta}^{i,pa_i})| \ge \lambda_0) \end{aligned}

We can assume without loss of generality that θ̃^{i,pa_i} is the almost surely unique solution of (7.1) with I = pa_i. Let

 E = \Big\{ \max_{j \in \underline{i-1} \setminus pa_i} |G_j(\tilde{\theta}^{i,pa_i})| < \lambda_0 \Big\}.

Then, conditional on the event E, it follows from the first part of Lemma 7.1 that θ̃^{i,pa_i} is also a solution of the unrestricted weighted lasso problem (7.1) with I = \underline{i−1}. Since |G_j(θ̃^{i,pa_i})| < λ_0 on E, it follows from the second part of Lemma 7.1 that θ̃_{ij} = 0 for all j ∈ \underline{i−1} ∖ pa_i and all solutions. Hence

 \operatorname{pr}(\exists\, j \in \underline{i-1} \setminus pa_i : \tilde{\theta}_{ij} \neq 0) \le 1 - \operatorname{pr}(E) = \operatorname{pr}\Big( \max_{j \in \underline{i-1} \setminus pa_i} |G_j(\tilde{\theta}^{i,pa_i})| \ge \lambda_0 \Big) \qquad (7.11)

where,

 G_j(\tilde{\theta}^{i,pa_i}) = -2 n^{-1} X_j^{\sf T}(X_i - X\tilde{\theta}^{i,pa_i}) \qquad (7.12)

Since p = O(n^κ) for some κ > 0, Bonferroni’s inequality implies that to verify (7.10) it suffices to show that there exists a constant c^{(ii)} > 0 such that for all j ∈ \underline{i−1} ∖ pa_i,

 \operatorname{pr}(|G_j(\tilde{\theta}^{i,pa_i})| \ge \lambda_0) = O\{\exp(-c^{(ii)} n^{\zeta})\} \quad \text{as } n \to \infty. \qquad (7.13)

where X_i = Σ_{k ∈ pa_i} θ_{ik} X_k + Z_i and Z_i is independent of {X_k, k ∈ pa_i}. Similarly, with Z_j satisfying the same requirements as Z_i, we get