# High-Dimensional Joint Estimation of Multiple Directed Gaussian Graphical Models

We consider the problem of jointly estimating multiple related directed acyclic graph (DAG) models based on high-dimensional data from each graph. This problem is motivated by the task of learning gene regulatory networks based on gene expression data from different tissues, developmental stages or disease states. We prove that under certain regularity conditions, the proposed ℓ_0-penalized maximum likelihood estimator converges in Frobenius norm to the adjacency matrices consistent with the data-generating distributions and has the correct sparsity. In particular, we show that this joint estimation procedure leads to a faster convergence rate than estimating each DAG model separately. As a corollary we also obtain high-dimensional consistency results for causal inference from a mix of observational and interventional data. For practical purposes, we propose jointGES consisting of Greedy Equivalence Search (GES) to estimate the union of all DAG models followed by variable selection using lasso to obtain the different DAGs, and we analyze its consistency guarantees. The proposed method is illustrated through an analysis of simulated data as well as epithelial ovarian cancer gene expression data.

## Authors

• 28 publications
• 26 publications
• 37 publications
• ### Penalized Likelihood Methods for Estimation of Sparse High Dimensional Directed Acyclic Graphs

Directed acyclic graphs (DAGs) are commonly used to represent causal rel...
11/28/2009 ∙ by Ali Shojaie, et al. ∙ 0

• ### Direct Estimation of Differences in Causal Graphs

We consider the problem of estimating the differences between two causal...
02/15/2018 ∙ by Yuhao Wang, et al. ∙ 0

• ### Targeted Fused Ridge Estimation of Inverse Covariance Matrices from Multiple High-Dimensional Data Classes

We consider the problem of jointly estimating multiple precision matrice...
09/26/2015 ∙ by Anders Ellern Bilgrau, et al. ∙ 0

• ### Node-Based Learning of Multiple Gaussian Graphical Models

We consider the problem of estimating high-dimensional Gaussian graphica...
03/21/2013 ∙ by Karthik Mohan, et al. ∙ 0

• ### Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks

We present a procedure for effective estimation of entropy and mutual in...
11/21/2008 ∙ by Jean Hausser, et al. ∙ 0

• ### High-dimensional regression over disease subgroups

We consider high-dimensional regression over subgroups of observations. ...
11/03/2016 ∙ by Frank Dondelinger, et al. ∙ 0

• ### Learning complex dependency structure of gene regulatory networks from high dimensional micro-array data with Gaussian Bayesian networks

Gene expression datasets consist of thousand of genes with relatively sm...
06/28/2021 ∙ by Catharina Elisabeth Graafland, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Directed acyclic graph (DAG) models, also known as Bayesian networks, are widely used to model causal relationships in complex systems across various fields such as computational biology, epidemiology, sociology, and environmental management

[1, 11, 30, 35, 39]. In these applications we often encounter high-dimensional datasets where the number of variables or nodes greatly exceeds the number of observations. While the problem of structure identification for undirected graphical models in the high-dimensional setting is quite well understood [34, 25, 10, 4, 44], such results are just starting to become available for directed graphical models. The difficulty in identifying DAG models can be attributed to the fact that searching over the space of DAGs is NP-complete in general [5].

Methods for structure identification in directed graphical models can be divided into two categories and hybrids of these categories. Constraint-based methods, such as the prominent PC algorithm, first learn an undirected graph from conditional independence relations and in a second step orient some of the edges [14, 39]. Score-based methods, on the other hand, posit a scoring criterion for each DAG model, usually a penalized likelihood score, and then search for the network with the highest score given the observations. An example is the celebrated Greedy Equivalence Search (GES) algorithm, which can be used to greedily optimize the -penalized likelihood such as the Bayesian Information Criterion (BIC) [6]. High-dimensional consistency guarantees were recently obtained for the PC algorithm [19] and for score-based methods [23, 28, 43].

Existing methods have focused on estimating a single directed graphical model. However, in many applications we have access to data from related classes, such as gene expression data from different tissues, cell types or states [24, 37], different developmental stages [2], different disease states [41], or from different perturbations such as knock-out experiments [8]. In all these applications one would expect that the underlying regulatory networks are similar to each other, since they stem from the same species, individual or cell type, but also have important differences that drive differentiation, development or a certain disease. This raises an important statistical question, namely how to jointly estimate related directed graphical models in order to effectively make use of the available data.

Various methods have been proposed for jointly estimating undirected Gaussian graphical models. To preserve the common structure, Guo et al. [15] suggested to use a hierarchical penalty and Danaher et al. [7] suggested the use of a generalized fused lasso or group lasso penalty. While both approaches achieve the same convergence rate as the individual estimators, Cai et al. [3] were able to improve the asymptotic convergence rate of joint estimation using a weighted constrained minimization approach. Bayesian methods have been proposed for this problem [32] as well. Related works also include [27], where it is assumed that the networks differ only locally in a few nodes and [21, 38], where the assumption is that the networks are ordered and related by continuously changing edge weights.

In this paper, we propose a framework based on -penalized maximum likelihood estimation for jointly estimating related directed Gaussian graphical models. We show that the joint -penalized maximum likelihood estimator (MLE) achieves a faster asymptotic convergence rate as compared to the individual estimators. In addition, by viewing interventional data as data coming from a related network, we show that the interventional BIC scoring function proposed in [16] can be obtained as a special case of the joint -penalized maximum likelihood approach presented here. Our theoretical consistency guarantees also explain the empirical findings of [16], namely that estimating a DAG model from interventional data usually leads to better recovery rates as compared to estimating a DAG model from the same amount of purely observational data. These theoretical results are based on the global optimum of -penalized maximum likelihood estimation. To overcome the computational bottleneck of this optimization problem we propose a greedy approach (jointGES) for solving this problem by extending GES [6] to the joint estimation setting. We analyze its properties from a theoretical point of view and test its performance on synthetic data and gene expression data from epithelial ovarian cancer.

The remainder of this paper is structured as follows. In Section 2, we review some relevant background related to DAG models and introduce notation for the joint DAG estimation problem studied in this paper. In Section 3, we present the joint -penalized maximum likelihood estimator and jointGES, an adaptation of GES for solving this optimization problem. Section 4 establishes results regarding the statistical consistency of the -penalized MLE and jointGES whereas Section 5 presents the implications for learning DAG models from a mix of observational and interventional data. In Section 6, we illustrate the performance of our proposal in a simulation study and an application to the analysis of gene expression data. We conclude with a short discussion in Section 7. The proofs of supporting results are contained in the Appendix.

## 2 Preliminaries

In Section 2.1

we introduce DAGs, their relation to linear structural equation models, and discuss statistical features enjoyed by random vectors following these models. In Section

2.2 we briefly review existing approaches for estimating a single directed graphical model from observational data. Finally, Section 2.3 describes a setting where multiple related directed graphical models exist.

### 2.1 Directed acyclic graphs and linear structural equation models

Let denote a DAG with vertices and directed edges , where denotes the cardinality of . Let be the adjacency matrix specifying the edge weights of the underlying DAG , i.e., if and only if . Also, let denote a

-dimensional multivariate Gaussian random variable with zero mean and diagonal covariance matrix

. In this work, we assume that the observed random vector is generated according to the following linear structural equation model (SEM).

 X=ATX+ϵ. (1)

Hence

follows a multivariate Gaussian distribution with zero mean and covariance matrix

, where the inverse covariance (or precision) matrix is given by

 Θ=(I−A)Ω−1(I−A)T. (2)

Let denote the parents of node in , then it follows from (1) that the distribution of can be factorized as

 P(X)=p∏j=1P(Xj|XPaj(G)).

In addition, such a factorization of according to is equivalent to the Markov assumption with respect to  [22, Theorem 3.27]. Formally, given and an arbitrary subset of nodes , then

 jis\ d\text{-}separated\ fromk|S% in G⇒Xjto0.0pt$⊥$⊥Xk|XS in P. (3)

If the implication (3) holds bidirectionally, then is said to be faithful [39] with respect to . Note that there exist DAGs and that encode the same d-separations and hence the same conditional independence relations. Such DAGs are said to belong to the same Markov equivalence class.

A consequence of the acyclicity of is that there exists at least one permutation of such that for all . Putting it differently, if the rows and columns of are reordered according to , then the resulting matrix is strictly upper triangular. Hence, if such a permutation is known a priori, one can obtain the SEM parameters from according to the following steps [cf. (2)]. First, we reorder according to . Then, we perform on the reordered an upper-triangular-plus-diagonal Cholesky decomposition to obtain . Finally, we revert the ordering by permuting the rows and columns of and according to and obtain the sought . For an arbitrary permutation and a given , we denote by the Cholesky decomposition parameters obtained from the procedure just described. Alternatively, one can obtain by solving linear regressions [cf. (1)]. More precisely, we can obtain each column of by regressing only on those such that for all . Once

is obtained, one can estimate the variance of

in (1) to get . In the rest of the paper, we denote by and the true parameters of the data-generating SEM and the associated precision matrix, respectively. Moreover, we denote by the SEM parameters obtained from the described procedure when the true precision matrix is used. Notice that if is any permutation consistent with the true underlying DAG . The DAG corresponding to the non-zero entries of is known as the minimal I-MAP (independence map) with respect to . The minimal I-MAP with the fewest number of edges is called minimal-edge I-MAP [43]. If is faithful with respect to a DAG , then is a minimal-edge I-MAP of  [33, 43].

Furthermore, it has been shown in [31] that all DAGs in a Markov equivalence class share the same skeleton – i.e., the set of edges when directions are ignored – and v-structures. A v-structure is a triplet such that but and are not connected in either direction. This motivates the representation of a Markov equivalence class as a completely partially directed acyclic graph (CPDAG), which is a graph containing both directed and undirected edges. A directed edge means that all DAGs in the Markov equivalence class share the same direction for this edge whereas an undirected edge means that both directions for that specific edge are present within the class. In the same way, one can represent a subset of a Markov equivalence class via a partially directed acyclic graph (PDAG), where the directions of the edges are only determined by the graphs within the subset. In particular, some undirected edges in a CPDAG would become directed edges in a PDAG representing a subset of the class. Notice that both DAGs and CPDAGs are special cases of PDAGs, where the former represents a single graph and the latter represents the whole equivalence class.

### 2.2 ℓ0-penalized maximum likelihood estimation for a single DAG model

We denote by our observed data, where each row of represents a realization of the random vector . We say that we are in the low-dimensional setting if asymptotically remains a constant as . By contrast, whenever as , we say that we are in the high-dimensional setting. Assuming faithfulness, Chickering [6] shows that GES achieves a consistent estimator in the low-dimensional setting by optimizing the following objective – also known as the Bayesian information criterion (BIC) –

 (^A,^Ω):=argmaxA∈A,Ω∈D+ℓn(^X;A,Ω)−λ2∥A∥0, (4)

where , denotes the set of all valid adjacency matrices associated with DAGs, is the set of non-negative diagonal matrices, and is the likelihood function

 ℓn(^X;A,Ω) :=−trace(^XT^Xn⋅(I−A)Ω−1(I−A)T)+logdet((I−A)Ω−1(I−A)T). (5)

In the high-dimensional setting, van de Geer and Bühlmann [43] give consistency guarantees for the global optimum of (4) when the collection is further constrained to contain only adjacency matrices with at most incoming edges for each node, where . More precisely, they show that there exists some parameter such that the optimum in (4) converges in Frobenius norm to for increasing and , where is a permutation consistent with , i.e.,

 ∥^A−A0^π∥2F+∥^Ω−Ω0^π∥2F=O(λ2|G0|). (6)

Notice, however, that (6) does not guarantee statistical consistency since need not be a permutation consistent with the true underlying DAG. Moreover, (6) does not hold for any permutation consistent with , but [43] shows the existence of at least one such permutation. In addition, it is shown in [43] that the number of non-zero elements in , , and are all of the same order of magnitude, i.e., .

### 2.3 Collection of DAGs

Consider the setting where not all the observed data comes from the same DAG, but rather from a collection of DAGs that share the same node set . In addition, we assume that all DAGs in a collection are consistent with some permutation . This precludes a scenario where and for some . This is a reasonable assumption in, e.g., the analysis of gene expression data, where regulatory links may appear or disappear, but they in general do not change direction.

Denote by a set of SEMs on the DAGs and by the data generated from each SEM, where we observe realizations for each DAG . In this way, each row of the data matrix corresponds to a realization of the random vector defined as

 X(k)=A(k)TX(k)+ϵ(k)withϵ(k)∼N(0,Ω(k)).

Collections of DAGs arise for example naturally when considering data from perfect (also known as hard) interventions [9]. Consider a non-intervened DAG with SEM parameters [cf. (1)]. Then a perfect intervention on a subset of nodes gives rise to the interventional distribution

 XIk=AIkTXIk+ϵIk% withϵIk∼N(0,ΩIk),

where if and otherwise, and the diagonal matrix satisfies if  [16, 17]. We denote the DAG given by the non-zero entries of by .

In accordance with the notation introduced in Section 2.1, we denote by and the true data-generating DAG and SEM parameters for class , respectively, and by a permutation that consistent with for all classes . Moreover, we denote by and the DAG and SEM parameters obtained from the Cholesky decomposition of the true precision matrix when permuted by . We denote by the true covariance matrix of the SEM for class , i.e., the inverse of . Finally, we define as the union of all – i.e., an edge appears in if it appears in any – and as the union of . For interventional data, we use to denote the true SEM parameters after intervening on targets .

## 3 Joint estimation of multiple DAGs

We first present a penalized maximum likelihood estimator that is the natural extension of (4) for the case where a collection of DAGs is being estimated. Since this involves minimizing , we then discuss a greedy approach that alleviates the computational complexity of this estimator.

### 3.1 Joint ℓ0-penalized maximum likelihood estimator

With denoting a pre-specified sparsity level and indicating the proportion of observed data from DAG , we propose the following estimator:

 {^π,{(^A(k),^Ω(k))}Kk=1} :=argmaxπ,{(A(k),Ω(k))}Kk=1K∑k=1wkℓnk(^X(k);A(k),Ω(k))−λ2∥∥∥K∑k=1|A(k)|∥∥∥0 (7) subject toA(k)∈Aπ,∥A(k)∥∞,0≤d,Ω(k)∈D+∀k,

where is the set of all adjacency matrices consistent with permutation and the matrix norm computes the maximum -norm across the rows of the argument matrix. The optimization problem in (7) seeks to maximize a weighted log-likelihood of the observations (where more weight is given to SEMs with more realizations) penalized by the support of the union of all estimated DAGs. To see why this is true, notice that counts the number of entries for which for at least one graph . This penalization on the union of estimated DAGs promotes overlap in the supports of the different . Regarding the constraints in (7), the first constraint imposes that all estimated DAGs are consistent with the same permutation , which is itself an optimization variable. This constraint is in accordance with our assumption in Section 2.3 and drastically reduces the search space of DAGs. The second constraint ensures that the maximum in-degree in all graphs is at most , and the last constraint imposes the natural requirement that all noise covariances are diagonal and non-negative.

Notice that (7) is a natural extension of (4). Indeed, for the case the objective in (7) immediately boils down to that in (4). Moreover, when there is only one graph and can be selected freely, the constraint is effectively identical to , i.e., the constraint in (4). Finally, observe that in (7) we have included the additional maximum in-degree constraint required in the high-dimensional setting [cf. discussion after (5)].

### 3.2 JointGES: Joint greedy equivalence search

The norms as well as the optimization over all permutations render the problem of (7) non-convex, thus, hard to solve efficiently. In this section, we present a greedy approach to find a computationally tractable approximation to a solution to (7). The algorithm, which we term JointGES, is succinctly presented in Algorithm 1 and consists of two steps.

In the first step of Algorithm 1 we recover , our estimate of the union of all the DAGs to be inferred. We do this by finding an approximate solution to (8) via the implementation of GES [6]. The objective (scoring function) in (8) consists of two terms. The first term is given by the sum of the log-likelihoods of the achievable residues when regressing the th column of , denominated as , on for each node and DAG . In [43], van de Geer and Bühlmann further show that if we keep the underlying DAG fixed, the maximum likelihood estimator proposed in (4) is equivalent to optimizing . Thus, the first term in (8) corresponds to the first term in the objective of (7). The second term penalizes the size of the parent set of each node in the graph to be recovered, effectively penalizing the number of edges in the graph. In this way, the scoring function in (8) promotes a sparse in the same way that the objective of (7) promotes the union of all recovered graphs to have a sparse support. Additionally, it is immediate to see that the scoring function in (8) is decomposable [6], a key feature that enables the implementation of GES to find an approximate solution. Once we have obtained the union of all sought DAGs from step 1, in the second step of our algorithm we estimate the DAGs by searching over the subDAGs of . More precisely, for each node we estimate its parents in by regressing on using lasso, where the support of corresponds to the set of parents of in .

To summarize, Algorithm 1 recovers DAGs by first estimating the union of all these DAGs using standard GES and then inferring the specific weight adjacency matrices

via a lasso regression, while ensuring consistency with the previously estimated

.

## 4 Consistency guarantees

The main goal of this section is to provide theoretical guarantees on the consistency of the solution to Problem (7) in the high-dimensional setting. Our main result is presented in Theorem 4.9. In Section 4.3 we present a laxer statement of consistency based on milder conditions.

### 4.1 Statistical consistency of the joint ℓ0-penalized MLE

A series of conditions must hold for our main result to be valid. We begin by stating these conditions followed by the formal consistency result in Theorem 4.9. The rationale behind these conditions and their implications are discussed after the theorem in Section 4.2.

###### Condition 4.1.

All DAGs are minimal-edge I-MAPs.

###### Condition 4.2.

There exists a constant that bounds the variance of all the observed processes, i.e., .

###### Condition 4.3.

The smallest eigenvalues of all

are non-zero, i.e. .

###### Condition 4.4.

There exists some constant such that, for all , the maximum allowable in-degree is bounded as .

###### Condition 4.5.

For all and there exist some constants and such that

###### Condition 4.6.

The number of DAGs satisfies and the amount of data associated with each DAG is comparable in the sense that .

###### Condition 4.7.

There exists some constant such that for any permutation .

###### Condition 4.8.

There exist constants and such that , , and

 ∑i,j1{∣∣[A(k)0π]i,j∣∣>√logp/nη0(√p/|Gunion0|∨1)}≥(1−η1)|G(k)0π|, (9)

for all permutations and graphs , where denotes the indicator function and is as in Condition 4.7.

With the above conditions in place, the following result can be shown.

###### Theorem 4.9.

If Conditions 4.1-4.8 hold and is chosen such that , then there exists a constant   , that depends on

, such that with probability

the solution to (7) satisfies

 K∑k=1wk∥^A(k)−A(k)0^π∥2F+K∑k=1wk∥^Ω(k)−Ω(k)0^π∥2F=O(λ2|Gunion0|). (10)

Furthermore, denoting by the union of the graphs associated with the recovered adjacency matrices , we have that

 |^G|≍|Gunion% 0^π|≍|Gunion0|. (11)

The proof of Theorem 4.9 is given in Appendix A.2. To intuitively grasp the result in the above theorem, assume that the number of edges in is proportional to the number of nodes so that for increasing as long as . Hence, under these conditions, (10) guarantees that for the recovered permutation , the estimated adjacency matrix converges to in Frobenius norm for all . This not only implies that both adjacency matrices have similar structure, but also that the edge weights are similar. Moreover, from (11) it follows that the number of edges in the estimated graph , i.e., is similar to the number of edges in the union of all minimal I-MAPs with permutation , i.e., . More importantly, is also similar to the number of edges in the true union graph . Despite these guarantees, it should be noted that the permutation need not coincide with the permutation of the true graphs to be recovered.

We now assess the benefits of performing joint estimation of the DAGs as opposed to estimating them separately. To do so, we compare the guarantees in Theorem 4.9 to those developed in [43] for separate estimation. More precisely, the application of the consistency bound reviewed in (6) yields that for the separate estimation of DAGs, when we are in the setting where all DAGs are highly overlapping (cf. Condition 4.7), by choosing such that , one can guarantee that

 K∑k=1wk∥^A(k)−A(k)0^π(k)∥F+K∑k=1wk∥^Ω(k)−Ω(k)0^π(k)∥2F=O(Kλ2|Gunion0|), (12)

where it should be noted that in the separate estimation the recovered permutation can vary with . A direct comparison of (10) and (12) reveals that performing joint estimation improves the accuracy by a factor of from to . Hence, for joint estimation the accuracy scales with the total number of samples , showing that our procedure yields maximal gain from each observation, even if the data is generated from different DAGs. Moreover, the result in (10) holds under slightly milder conditions than those needed for (12) to hold since Condition 4.8 is a relaxed version of the beta-min condition in [43]. A more detailed discussion about the conditions of Theorem 4.9 is given next.

### 4.2 Conditions for Theorem 4.9

It has been shown in [33] that if a data-generating distribution is faithful with respect to , then must be a minimal-edge I-MAP. By enforcing the latter for every true graph, Condition 4.1 imposes a milder requirement compared to the well-established faithfulness assumption [39]. Conditions 4.2-4.4 ensure that we avoid overfitting and provide bounds for the noise variances. These are direct adaptations from Conditions 3.1-3.3 in  [43]. Condition 4.5 is required to bound the difference between the sample variances of our observations and the true variances, and is related to Condition 3.4 in [43] but adapted to our joint inference setting. Notice that Condition 4.5 is trivially satisfied when . Condition 4.6 follows from the bounds for sample variances shown in [3]. Intuitively, we are imposing the natural restriction that the number of DAGs is small compared to the number of nodes in each DAG and the total number of observations . Moreover, given that our objective is to draw estimation power from the joint inference of multiple graphs, we require that each DAG is associated with a non-vanishing fraction of the total observations.

Condition 4.7 enforces that, for every permutation , the number of edges in the union of all recovered graphs is proportional to the weighted sum of the edges in every graph as . In particular, this requires the individual graphs to be highly overlapping. To see why this is the case, notice that is upper bounded by the maximum number of edges across graphs . Consequently, Condition 4.7 enforces the number of edges in the union of graphs to be proportional to the number of edges in the single graph with most edges, thus requiring a high level of overlap. Imposing high overlap for all permutations might seem too restrictive in some settings. Nonetheless, Condition 4.7 can sometimes be derived from apparently less restrictive conditions as the following example illustrates.

Consider the more relaxed bound , which is equivalent to requiring Condition 4.7 to hold but only for permutations consistent with the true graph . In the following example, we show that this might be sufficient for Condition 4.7 to hold. Suppose that consists of two connected components and respectively defined on the subsets of nodes and . Moreover, assume that the subDAGs of over (denoted by ) are identical for all . Putting it differently, the differences between the DAGs are limited to the second connected component. In addition, assume that for all possible permutations of nodes we have that . Then, for any permutation , where we denote by (respectively ) the restriction of to the node set (respectively ), we have that

 |Gunion0π|=|G′union0π1|+|G′′union0π2|≤K∑k=1wk|G′(k)0π1|+K∑k=1wk|G′(k)0|≤2K∑k=1wk|G(k)0π|,

from where Condition 4.7 is satisfied for . This example shows that learning the structure of large components that are common across the different DAGs is not affected by the changes in the smaller components of these DAGs. Despite the previous example, Condition 4.7 might still be too restrictive in some settings. In this respect, a more relaxed requirement is stated in Section 4.3 under Condition 4.7’.

Condition 4.8 requires that, for every permutation and every graph , the value of at least a fixed proportion of the edges in is above the ‘noise level’, i.e., the lower bound within the indicator function in (9). Intuitively, if the true weight of many edges is close to zero then correct inference of the graphs would be impossible since the true edges would be mistaken with spurious ones. Thus, it is expected that the weights of a sufficiently large fraction of the edges have to be sufficiently large. Condition 4.8 is the right formalization of this intuition. Moreover, notice that a straightforward replication of the beta-min condition introduced in [43] would have required the ‘noise level’ to scale with , instead of the smaller scaling of required in (9). In this sense, Condition 4.8 is a relaxed version of the extension of the beta-min condition to the setting of joint graph estimation.

### 4.3 Consistency under milder conditions

As previously discussed, in some settings Condition 4.7 might be too restrictive. Hence, in this section we present a consistency statement akin to Theorem 4.9 that holds for a milder version of Condition 4.7. More precisely, consider the alternative condition stated next.

Condition 4.7’. Let be some function of that scales as a constant for permutations consistent with and scales as for all other permutations such that for all .

Observe that for permutations consistent with the true union graph , Condition 4.7’ boils down to the previously discussed Condition 4.7. However, for all other permutations, need not be a constant and is allowed to grow with . Intuitively, for all permutations not consistent with we are not requiring a high level of overlap among all the graphs . Nonetheless, since we do require to be ‘sparser’ than the extreme case in which all graphs are disjoint.

In order to account for the fact that depends on the permutation in Condition 4.7’, we have to modify Condition 4.8 accordingly, resulting in the following alternative statement.

Condition 4.8’. Let , then there exist constants and such that , , and

 ∑i,j1{∣∣[A(k)0π]i,j∣∣>√Cmaxlogp/nη0(√p/|Gunion0|∨1)}≥(1−η1)|G(k)0π|,

for all permutations and graphs , where denotes the indicator function.

The following consistency result holds for the alternative set of conditions.

###### Theorem 4.10.

Under Conditions 4.1-4.6, 4.7’, and 4.8’ and with such that , then there exists a constant    that depends on such that with probability  ,  the solution to (7) satisfies that, at least for one ,

 ∥^A(k)−A(k)0^π∥2F+∥^Ω(k)−Ω(k)0^π∥2F=O(λ2|G(k)0|). (13)

Furthermore, denoting by the graph associated with for the satisfying (13), we have that

 |^G(k)|≍|G(k)0^π|≍|G(k)0|. (14)

The proof is given in Appendix A.3. Condition 4.7’ is milder than Condition 4.7 and this relaxation entails a corresponding loss in the guarantees of recovery: Comparing (13) and (14) with (10) and (11) immediately reveals that what could be guaranteed for the ensemble of graphs in Theorem 4.9 can only be guaranteed for a single graph in Theorem 4.10, thereby explaining the trade-off in relaxing the conditions.

On the other hand, the result in Theorem 4.10 still draws inference power from the joint estimation of multiple graphs since neither (13) nor (14) can be shown using existing results for separate estimation. To be more precise, as discussed in Section 4.2, when performing separate estimation, theoretical guarantees are based on the assumption that at least a fixed proportion of the edge weights are above the ‘noise level’, which scales as . However, Condition 4.8’ requires the noise level to scale with which, given the fact that , is not large enough to achieve the guarantee needed for separate estimation. In addition, the convergence rate of in (13) is still faster than the corresponding convergence rate of associated with separate estimation [cf. discussion after (12)]. We close Section 4 with the following remark on the consistency guarantees of jointGES.

###### Remark 4.11 (Consistency of jointGES).

In the low-dimensional setting, by choosing and assuming faithfulness, it follows from [6] that the first step in Algorithm 1 is guaranteed to recover the true Markov equivalence class of , denoted by , in the limit of large data. This allows us to recover by successively considering all DAGs in as inputs to the second step of Algorithm 1, and then selecting the DAG whose output from step 2 is the sparsest. For the high-dimensional setting, as even the global optimum of (7) is not guaranteed to recover the true (cf. Theorems 4.9 and 4.10), jointGES is not consistent in general. Recently, Maathuis et al. [28] showed consistency of GES for single DAG estimation under more restrictive assumptions than the ones here considered. Although of potential interest, further strengthening the presented conditions to guarantee consistency of jointGES in the high-dimensional setting is not pursued in the current paper.

## 5 Extension to interventions

We show how our proposed method for joint estimation can be extended to learn DAGs from interventional data. It is natural to consider learning from interventional data as a special case of joint estimation since the DAGs associated with interventions are different but closely related. In this section we mimic some of the developments of Sections 3 and 4 but specialized for the case of interventional data. More precisely, we first propose an optimization problem akin to (7) and then state the consistency guarantees in the high-dimensional setting of the associated optimal solution.

Recall from Section 2.3 that the true adjacency matrix of the SEM associated with an intervention in the nodes is identical to the true adjacency of the non-intervened model except that for all . In this way, our assumption that there exists a common permutation consistent with all DAGs under consideration (cf. Section 2.3) is automatically satisfied for interventional data. Additionally, assuming that we observe samples from different models corresponding to the respective intervention of the nodes in , the knowledge of the intervened nodes can be incorporated into our optimization problem as follows [cf. (7)].

 {^π,^A,^Ω,{(^AIk,^ΩIk)}Kk=1}= argmaxπ,A,Ω,{(AIk,ΩIk)}Kk=1 K∑k=1wkℓnk(^XIk;AIk,ΩIk)−λ2∥A∥0 (15a) subject to A∈Aπ,∥A∥∞,0≤d,Ω∈D+, (15b) AIkij=Aij∀j∉Ik,AIkij=0∀j∈Ik, (15c) ΩIkjj=Ωjj∀j∉Ik,ΩIk∈D+. (15d)

From the solution of (15) we obtain an estimate for the non-intervened SEM as well as estimates for the corresponding intervened models . The objective in (15a) is equivalent to that in (7) where we leverage the fact that the union of all intervened graphs results in the non-intervened one under the implicit assumption that no single node has been intervened in every experiment. Alternatively, if some nodes were intervened in all experiments, objective (15a) would still be valid since enforcing zeros in the unobservable portions of does not affect the recovery of the intervened adjacency matrices . The constraints in (15b) impose that has to be consistent with permutation and with bounded in-degree, and has to be a valid covariance matrix for uncorrelated noise. Putting it differently, (15b) enforces for the non-intervened SEM what we impose separately for all SEMs in (7). The constraints in (15c) impose the known relations between the intervened and the non-intervened adjacency matrices. Finally, (15d) constrains the matrices to be consistent with the base model on the non-intervened nodes while still being a valid covariance on the intervened ones.

Even though it might seem that in (15) we are estimating SEMs (the base case plus the intervened ones), from the previous reasoning it follows that the effective number of optimization variables is significantly smaller. To be more specific, for a given , once is fixed then all the adjacency matrices are completely determined. Moreover, for a fixed , the only freedom in corresponds to the diagonal entries associated with intervened nodes in . In this way, it is expected that for a given number of samples, the joint estimation of SEMs obtained from interventional data [cf. (15)] outperforms the corresponding estimation from purely observational data [cf. (7)].

Recalling that we denote by the parameters recovered from the Cholesky decomposition of the true precision matrix under the assumption of consistency with permutation , the following result holds.

###### Corollary 5.1.

If Conditions 4.1-4.8 hold and is chosen as , then there exist constants    such that with probability  ,  the solution to (15) satisfies

 K∑k=1wk∥^AIk−AIk0^π∥2F=O(λ2|G0|). (16)

Furthermore, denoting by the graph associated with the recovered adjacency matrix for the non-intervened model, we have that

 |^G|≍|G0^π|≍|G0|. (17)

The proof is given in Appendix A.4. A quick comparison of Theorem 4.9 and Corollary 5.1 seems to indicate that the consistency guarantees of observational and interventional data are very similar. Nonetheless, recovery from interventional data is strictly better as we argue next. As discussed after Theorem 4.9, the results presented do not guarantee that the permutation recovered coincides with the true permutation of the nodes. In principle, one could recover a spurious permutation (different from the true permutation ) that correctly explains the observed data [cf. (10) and (16)] and leads to sparse graphs [cf. (11) and (17)]. However, the more interventions we have, the smaller the set of spurious permutations that can be recovered, as we illustrate in the following example. Figure 1 portrays the existence of a spurious permutation that could be recovered from observational data but cannot be recovered from interventional data. More precisely, Figure 1(a) presents the two true DAGs we aim to recover, where the second one is obtained by intervening on node 2. By contrast, Figure 1(b) shows the DAGs that are obtained when performing Cholesky decompositions on the true precision matrices under the spurious permutation . Notice that the sparsity levels of the DAGs in both figures are the same. In general, one could recover instead of from observational data, but one would never recover from interventional data. To see this, simply notice from the figure that whereas for the interventional estimate [cf. (15c)], thus, the error terms in (16) cannot vanish for . This example also indicates that it is preferable to intervene on multiple targets in the same experiment instead of doing interventions one at a time. This observation is in accordance with new genetic perturbation techniques, such as Perturb-seq [8].

From a practical perspective, the objective in (15) corresponds to the same scoring function as GIES [16]. Therefore, GIES can be used to obtain an approximate solution to (15). A simulation study using GIES was performed in [16, Section 5.2] showing that in line with the theoretical results obtained in this section, not only identifiability increases, but also the estimates obtained using interventional data are better than with the same amount of purely observational data.

## 6 Experiments

In this section, we present numerical experiments on both synthetic (Section 6.1) and real (Section 6.2) data that support our theoretical findings.

### 6.1 Performance evaluation of joint causal inference

We analyze the performance of the joint recovery of different DAGs where we vary . For all experiments, we set the number of nodes and the total number of observations . In addition, we selected the number of samples from each DAG to be the same, i.e., . For each experiment, the true DAGs were constructed in two steps. First, we generated a core graph that is shared among the DAGs under consideration. We did this by generating a random graph from an Erdős-Rényi model with edges in expectation, and then oriented the edges according to a random permutation of the nodes. In the second step, we sampled additional edges specific to each DAG uniformly at random. This procedure results in the generation of a collection of true underlying DAGs . Associated with each DAG, we generated a true adjacency matrix and a true diagonal error covariance . For the latter, we drew each error variance independently and uniformly from the interval . Regarding the adjacency matrices, we drew the edge weights independently and uniformly from to ensure that they are bounded away from zero. Notice that we did not put any constraints on the edge weights that are in the shared core structure for different DAGs: the same edge can change its weight in different DAGs, or even flip sign.

For each setting, we randomly generated collections of DAGs and data associated with them. We then estimated the DAGs from the data via two different methods: a joint estimation procedure using the jointGES algorithm presented in Section 3.2 and a separate estimation procedure using the well-established GES method [6].

To assess performance of the two algorithms, we considered two standard measures, namely the structural Hamming distance (SHD) [42] and the receiver operating characteristic (ROC) curve. SHD is a commonly used metric based on the number of operations needed to transform the estimated DAG into the true one [19, 42]. Hence, a smaller SHD value indicates better performance. The ROC curve plots the true positive rate against the false positive rate for different choices of tuning parameters.

In Figure 2 (a), we selected the -penalization parameter with scaling constant in both joint and separate estimation and then plotted average SHD as a function of the scaling constant averaged over the DAGs to be recovered and the realizations. The penalization parameter in the second step of the joint estimation procedure was chosen based on -fold cross validation. In Figure 2 (b) we plotted the average ROC curve where for each choice of tuning parameter, we computed the true positive and false positive rates by averaging over the DAGs to be recovered and the realizations. It is clear from the two figures that in general joint inference achieves better performance, which matches our theoretical results in Section 4. However, Figure 2 (a) shows also that jointGES performs worse than separate estimation for small scaling constants (). Note that this is in line with our theoretical findings in Theorem 4.10, which imply that whenever Condition 4.7 – which sometimes is a restrictive assumption – is violated, we need to choose a larger penalization parameter.

### 6.2 Gene regulatory networks of ovarian cancer subtypes

To assess the performance of the proposed joint -penalized maximum likelihood method on real data, we analyzed gene expression microarray data of patients with ovarian cancer [41]. Patients were divided into six subtypes of ovarian cancer, labeled as C1-C6, where C1 is characterized by significant differential expression of genes associated with stromal and immune cell types and with a lower survival rate as compared to the other 5 subtypes. We hence grouped the subtypes C2-C6 together and our goal was to infer the differences in terms of gene regulatory networks in ovarian cancer that could explain the different survival rates. The gene expression data in [41] includes the expression profile of patients with C1 subtype and patients with other subtypes. We implemented our jointGES algorithm (Algorithm 1) to jointly learn two gene regulatory networks: one corresponding to the C1 subtype and another corresponding to the other five subtypes together . In addition, as in [3], we focused on a particular pathway, namely the apoptosis pathway. Using the KEGG database [20, 29] we selected the genes in this pathway that were associated with at most two microarray probes, resulting in a total of genes.

Table 1 lists the number of edges discovered by jointGES as well as for two separate estimation methods, namely using the GES [6] and PC [39] algorithms. All three methods were combined with stability selection [26] in order to increase robustness of the output and provide a fair comparison. As expected, the two graphs inferred using jointGES share a significant proportion of edges, whereas the overlap is markedly smaller for the two separate estimation methods. Interestingly, under all estimation methods the network for the C1 subtype contains fewer edges than the network of the other subtypes, thereby suggesting that could lack some important links that are associated with patient survival.

To obtain more insights into the relevance of the obtained networks, we analyzed the inferred hub nodes in the three networks. For our analysis we defined as hub nodes those nodes, where the sum of the in- and out-degree was larger than 5 in the union of the two DAGs. For jointGES, this union is given by , the graph identified in the first step of Algorithm 1. The hub nodes identified by jointGES are CAPN1, CTSD, LMNB1, CSF2RB, BIRC3. Among these, CAPN1 [12], CTSD [40], LMNB1 [36], and BIRC3 [18] have been reported as being central to ovarian cancer in the existing literature. In addition, CSF2RB was also discovered by joint estimation of undirected graphical models on this data set [3]. The hub nodes discovered by GES are ATF4, BIRC2, CSF2RB, TUBA1C, MAPK3, while PC only discovered the hub node CSF2RB. While we were not able to validate the relevance of any of these genes for ovarian cancer in the literature, interestingly, CSF2RB has been identified as a hub node by all methods, thereby suggesting this gene as an interesting candidate for future genetic intervention experiments.

## 7 Discussion

In this paper we presented jointGES, an algorithm for the joint estimation of multiple related DAG models from independent realizations. Joint estimation is of particular interest in applications where data is collected not from a single DAG, but rather multiple related DAGs, such as gene expression data from different tissues, cell types or from different interventional experiments. JointGES first estimates the union of DAGs by applying a greedy search to approximate the joint -penalized maximum likelihood estimator and then it uses variable selection to discover each DAG as a subDAG of . From an algorithmic perspective, jointGES is to the best of our knowledge the first method to jointly estimate related DAG models in the high-dimensional setting. From a theoretical perspective, we presented consistency guarantees on the joint -penalized maximum likelihood estimator. and showed that the accuracy bound scales with the total number of samples, rather than the number of samples from each DAG. As a corollary to this result, we obtained consistency guarantees for -penalized maximum likelihood estimation of a causal graph from a mix of observational and interventional data. Finally, we validated our results via numerical experiments on simulated and real-world data, showing that the proposed jointGES algorithm for joint inference outperforms separate-inference approaches using well-established algorithms such as PC and GES.

The present work serves as a platform for the potential development of multiple future directions. These directions include: i) relaxing the assumption that all DAGs must be consistent with the same underlying permutation; ii) extending jointGES to the setting where the samples come from related DAGs but it is unknown a priori which particular DAG each sample comes from; this is for example of interest in the analysis of gene expression data from tumors or tissues that consist of a mix of cell types; iii) extending jointGES to allow for latent confounders.

## Appendix A Appendix: Theoretical analysis of statistical consistency results

Throughout the appendix we develop the proofs of Theorems 4.9 and 4.10. To facilitate understanding, we first provide a high-level explanation of the rationale behind the proof. If we have data generated from a single DAG and we are given a permutation consistent with the true DAG a priori, then we can estimate by performing regressions as explained in Section 2.1. By contrast, when the permutation is unknown and we need to consider all the possible permutations, the total number of regressions to run increases to . However, these regressions are not independent and, intuitively, by bounding the noise level of a subset of these regressions, we can derive bounds for the noise on the other ones. We characterize the ‘noise level’ of these regressions by analyzing the asymptotic properties of three random events. More precisely, whenever these events hold – and we show that they hold with high probability –, the noise is small enough so that the error of the -penalized maximum likelihood estimator can also be bounded. Finally, we use this upper bound in the error to show that the recovered graph converges to a minimal I-MAP that is as sparse as the true DAG.

The remainder of this appendix is organized as follows. In Section A.1 we define the three random events previously mentioned and show that each of them holds with high probability. Section A.2 then leverages the definition of these events to show Theorem 4.9, our main result. In Section A.3 we prove Theorem 4.10, which is based on relaxed conditions but similar proof techniques as in the previous section. Finally, Section A.4 fleshes out the proof of Corollary 5.1, our result applicable to the setting for interventional data.

Throughout the appendix, we use the following notation. Let denote the -th column of and denote the -th diagonal entry of . Also, denote by and the -th column of and the