1 Introduction
One important task in biology is to identify the signalling pathways triggered by a stimulus in an interventional experimental setting. The resulting cascade of chemical reactions on proteins in a cell has to be understood and quantified. A well known and publicly available data set for this kind of setup is the data collected and initially analyzed by Sachs2005. It involves measurements of proteins in cells under different experimental conditions. Often the underlying pathways are learned from these cell measurements using graphical models on directed acyclic graphs (DAGs), which are also called Bayesian networks (Pearl1988, Lauritzen1996, KollerFriedman2009). Here the associations between the random variables $X_1, \ldots, X_d$ are modelled by a statistical model on a graph $G = (V, E)$ with node set $V$ and edge set $E$. Using Markov assumptions and allowing only for acyclic graphs, the joint density of $(X_1, \ldots, X_d)$ can be expressed as a product of conditional densities
(1) $f(x_1, \ldots, x_d) = \prod_{j=1}^{d} f\big(x_j \mid x_{pa(j)}\big),$
where $pa(j)$ is the parent set of node $j$, i.e. the set of nodes which have an edge going to $j$. More precisely, $pa(j) = \{k \in V : (k \to j) \in E\}$. Note that $pa(j)$ can be the empty set. The set of observed values of the parent nodes is denoted by $x_{pa(j)}$.
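To make the factorization (1) concrete, the following sketch evaluates a joint distribution as a product of conditionals given the parents on a small hypothetical DAG $1 \to 2$, $1 \to 3$ with binary variables; all probability tables are invented for illustration, and the check that the product sums to one confirms it is a valid joint distribution.

```python
import itertools

# Hypothetical DAG: 1 -> 2, 1 -> 3 (node 1 has no parents).
# Conditional probability tables for binary variables (illustrative values).
p1 = {0: 0.6, 1: 0.4}                              # P(X1)
p2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}    # P(X2 | X1)
p3 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}    # P(X3 | X1)

def joint(x1, x2, x3):
    # Factorization (1): product of conditionals given the parent values
    return p1[x1] * p2[x1][x2] * p3[x1][x3]

# A valid joint distribution must sum to 1 over all configurations
total = sum(joint(*x) for x in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # -> 1.0
```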
Sachs2005 used a single DAG to model the pooled data collected under all experimental conditions. Since the Sachs data involves only continuous data, Gaussian DAGs or undirected Gaussian graphical models (dempster1972covariance) are often utilized. Gaussian DAG models are often constructed using a structural equation model (SEM) (mulaik2009linear, kaplan2008structural).
Many approaches have been followed to learn the underlying pathway network from this data. To name a few, friedman2008sparse used graphical lasso to construct an undirected graph for the pooled data, while ellis2008learning discretized the data and used a graphical model for discrete data. peterson2015bayesian constructed a selection prior for different experimental conditions to conduct a joint Bayesian MCMC analysis. Earlier luo2011bayesian developed a non graphical Bayesian hierarchical approach, where the regression coefficients for each experimental condition are linked by a prior. scutari2013identifying
used bootstrapping methods to identify significant edges for the Sachs data. A penalized maximum likelihood estimator of the interventional Markov equivalence classes under linear Gaussian assumptions was derived in
hauser2015jointly. castelletti2018learning, castelletti2019objective use an objective Bayes approach to learn the Markov equivalence classes in a Gaussian DAG for the Sachs data. ramsey2018fask introduced a fast adjacency skewness (FASK) algorithm, which orients edges derived from a fitted undirected network by incorporating skewness properties of the variables. Genetic algorithms have also been applied to this data set (jose2020bayesian).
Many of the pathway learning approaches assume that there is an underlying joint Gaussian model, in which the conditional densities in (1) are Gaussian. However this might not be true for the data considered. Already hauser2015jointly mentioned that the Gaussian assumption for the Sachs data is questionable. This was also noticed by voorman2014graph and zhang2017mixture. voorman2014graph developed a nonparametric approach to include non Gaussian behavior in a graphical model, while zhang2017mixture built a Bayesian network using a mixture copula to model the conditional distribution of a node given the observed values of its parents, accommodating both the non Gaussianity of the data and the pooling over the different experimental conditions. They followed the approach of elidan2010copula, who was the first to use a copula based representation of the conditional density in (1).
One special feature of the Sachs data is that there exists a biological consent graph, given in Figure 3 of Sachs2005. It consists of 20 edges connecting the eleven nodes of the Sachs data. This consent graph is shown in Panel (a) of Figure 3.
In this paper we want to extend the flexibility by modeling the conditional densities in Equation (1) using a copula based approach. For this we suggest the use of vine copulas (joe2014dependence, czadobook), which have become very popular among practitioners. One major reason for this is the enormous flexibility of this class compared to standard multivariate copula families such as the elliptical or Archimedean copulas. In particular, the vine copula class is constructed using only bivariate copulas, which can be chosen separately and in a data driven way. For the conditional densities needed in the DAG model we use the ones derived from a fitted vine distribution of the node and its parents. The vine distribution is chosen in such a way that the conditional density of a node given its parents is available in closed form. The choice of the vine model is further optimized such that the conditional loglikelihood is maximized as parent nodes are included in a forward selection. For this we use the approach taken in kraus2017d, derived for vine copula based quantile regression models. The forward selection of the parent nodes will also be used to identify which edges in the consent graph are supported by the data, allowing for a potential reduction of the number of edges in the graph.
Overall this approach allows us to fit very general vine copula based Bayesian networks, thus extending the copula Bayesian networks of elidan2010copula. We call the resulting models vine copula based SEMs and demonstrate the need for this novel model class for one experimental setting of the Sachs data using the consent graph. We restrict ourselves to a single experimental setting to avoid having to accommodate the clusters present in the pooled data set arising from the different experimental conditions.
This new approach allows us to quantify the effects of ignoring the non Gaussianity in the data, with regard to the changes in the estimated conditional densities given the parent nodes as well as in simulated conditional quantiles based on the fitted models. In summary, this extension provides an important building block for learning non Gaussian networks. It is easy to fit to a given graph, can remove nonsignificant edges in the graph, and allows one to quantify differences in conditional densities as well as probabilities. For one experimental setting of the Sachs data we show that this approach leads to a better understanding and quantification of the graphical structure.
The paper is organized as follows: In Section 2 we introduce statistical models on DAGs, including Gaussian DAGs, more formally. The needed background on copulas and especially on vine copulas is given in Section 3. This includes the subclass of Dvine copulas. General vine copula based regression models are discussed in Section 4. The included Dvine regression forms the building block of the novel Dvine copula based SEMs introduced in Section 5. In Section 6 we present a Dvine copula based structural equation analysis for the experimental setting of the Sachs data. In Section 7 we provide simulation based evidence for the better fit of the Dvine copula based SEM compared to the standard linear Gaussian Bayesian network fit. In Section 8 we quantify differences between the fitted models based on conditional probabilities and densities. The paper closes with a summary and an outlook on future research directions.
2 Statistical models on DAGs
To describe the behavior of many variables it is important to identify how the variables interact. For many applications it is useful to describe these interactions by a network with directed edges. Following Pearl1988, a Bayesian network is a directed acyclic graph (DAG), where each node $j$ represents a random variable $X_j$. The edges connecting the nodes indicate direct causal influence between the linked variables, and their strengths are quantified by conditional probabilities, or conditional densities in the case of continuous random variables.
Since we are dealing with continuous data in this paper we start with a linear SEM. For this we assume that we have a graph $G = (V, E)$ with node set $V = \{1, \ldots, d\}$ and edge set $E$ of directed weighted edges. Let $A$ be an associated $d \times d$ adjacency matrix with $(k,j)$th element given by weight $a_{kj}$, which will be estimated. Further, $a_{kj} \neq 0$ if and only if $(k \to j) \in E$, i.e. there is an edge from $k$ to $j$. Let $\epsilon = (\epsilon_1, \ldots, \epsilon_d)^\top$ be a $d$ dimensional multivariate normal random vector with zero mean vector and diagonal covariance matrix $\Omega$. Then a linear SEM for $X = (X_1, \ldots, X_d)^\top$ is defined as

(2) $X = A^\top X + \epsilon.$

It follows that $X$ is multivariate normal with zero mean vector and covariance matrix $\Sigma = (I - A^\top)^{-1} \Omega (I - A)^{-1}$. Using conditioning arguments the factorization (1) follows, where the conditional densities are univariate normal densities. In general such a factorization is equivalent to the Markov assumption with respect to the graph (see Theorem 3.27 of Lauritzen1996).
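The relation between the SEM weights, the error variances and the implied covariance can be checked numerically. The following sketch assumes the vector form $X = A^\top X + \epsilon$ for a three node chain $1 \to 2 \to 3$; the weights and error variances are illustrative choices, not estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weighted adjacency for the DAG 1 -> 2 -> 3 (strict upper triangular
# after topological ordering); the weights a_kj are illustrative.
A = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, -0.5],
              [0.0, 0.0, 0.0]])
Omega = np.diag([1.0, 0.5, 0.25])   # diagonal covariance of the errors

# X = A^T X + eps  =>  X = (I - A^T)^{-1} eps
B = np.linalg.inv(np.eye(3) - A.T)
Sigma = B @ Omega @ B.T             # implied covariance of X

# Monte Carlo check of the implied covariance
eps = rng.multivariate_normal(np.zeros(3), Omega, size=200_000)
X = eps @ B.T
print(np.allclose(np.cov(X.T), Sigma, atol=0.05))  # -> True
```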
Since the graph is acyclic there is at least one permutation $\sigma$ of the numbers 1 to $d$ such that $\sigma(k) < \sigma(j)$ for all edges $(k \to j) \in E$. Such a permutation is called a topological order. One such topological order resulting from the consent graph in Figure 3, applied to the eleven protein variables of the Sachs data on the logarithmic scale, is given in Table 1.

Table 1: pip3, plc, pip2, pkc, pka, p38, jnk, raf, mek, erk, akt
Reordering the columns and rows of $A$ according to the topological order allows us to express $A$ as a strict upper triangular matrix, and thus (2) can be solved recursively along the topological order. Utilizing (1) and the topological order, we can therefore express the joint density of the Sachs data with the eleven protein variables based on the consent graph as
(3)  
Hence instead of estimating one eleven dimensional density we can estimate eleven univariate conditional densities, each of which has at most three conditioning variables. We use the special case of a linear Gaussian Bayesian network (LGBN) as a reference model, where the weights of the conditional densities derived from the SEM (2) are estimated. In particular it satisfies

(4) $X_j \mid X_{pa(j)} = x_{pa(j)} \;\sim\; N\Big(\sum_{k \in pa(j)} \beta_{jk}\, x_k,\; \sigma_j^2\Big)$

for regression parameters $\beta_{jk}$ and residual variances $\sigma_j^2$, $j = 1, \ldots, 11$. The corresponding maximized loglikelihoods as well as AIC and BIC statistics are given in Table 2 for the experimental setting of the Sachs data. We see that the protein akt contributes more to the overall fit compared with the other proteins, while pip2 does the opposite. This is expected in this experimental setting, since we have activation of the levels of some proteins and inhibition for akt. Although the SEM (2) and the LGBN (4) induce marginal normality, an inspection of the marginal histograms of the Sachs experiment shows skewed and multimodal histograms (see first column of Figure A4 of Appendix A1), indicating that the reference model might be inadequate. To extend beyond the LGBN we now discuss a copula based approach. This will allow us to construct nonlinear, non Gaussian SEMs.
Node   loglikelihood       AIC       BIC
pip3          986.90   1977.81   1987.29
plc           921.77   1849.53   1863.75
pip2         1249.37   2506.74   2525.70
pkc           952.32   1912.64   1931.60
pka           830.92   1667.84   1686.80
p38           570.72   1149.45   1168.40
jnk           991.87   1991.74   2010.70
raf           969.82   1947.64   1966.60
mek           551.50   1112.99   1136.69
erk           968.07   1944.14   1963.09
akt           196.03    402.06    425.75
Sum:         9189.29  18462.58  18666.37
3 Copulas and vine distributions
For the analysis of large multivariate data sets, multivariate statistical models are required which can adequately describe not only the center of the data but also its tail behavior. For this, standard multivariate distributions such as the multivariate normal or Student t distribution are insufficient. They require that all univariate and multivariate marginal distributions are of the same type and might enforce symmetric tail dependence between pairs of variables. These characteristics are often not satisfied in practice, and the copula approach based on the theorem of Sklar allows one to separate the marginal behavior from the dependence structure. In particular, a $d$ dimensional copula $C$ is a multivariate distribution function on the $d$ dimensional hypercube $[0,1]^d$ with uniformly distributed marginals. The corresponding copula density of an absolutely continuous copula is denoted by $c$. The fundamental theorem of Sklar states that, for a $d$ dimensional random vector $X = (X_1, \ldots, X_d)$ with joint distribution function $F$ and marginal distribution functions $F_1, \ldots, F_d$,

(5) $F(x_1, \ldots, x_d) = C\big(F_1(x_1), \ldots, F_d(x_d)\big)$

for some $d$ dimensional copula $C$. When $F$ is absolutely continuous the associated density can be expressed as

(6) $f(x_1, \ldots, x_d) = c\big(F_1(x_1), \ldots, F_d(x_d)\big) \prod_{j=1}^{d} f_j(x_j)$

with marginal densities $f_j$ and copula density $c$.
For absolutely continuous distributions the copula $C$ is unique. Using (5) or (6), flexible multivariate distributions can be constructed if $d$ dimensional copulas are used. By inversion of (5) we can use any $d$ dimensional distribution function to obtain the corresponding copula. Examples of such copulas are the Gaussian and the Student t copula. Note that using these copula families together with arbitrary margins results in multivariate distributions, sometimes also called meta distributions, which are much more flexible than the multivariate distribution classes used in the inversion. Another class of parametric copulas, built directly using generator functions, is the class of Archimedean copulas. The Gumbel, Clayton and Frank copula families are prime examples. A nice introduction to copulas is given in genest2007, and more theoretical treatments are the books by nelsen2007introduction and Joe1997.
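The meta distribution idea can be illustrated as follows: sample from a Gaussian copula obtained by inversion of (5) and attach arbitrary non-normal margins. The correlation and margins below are arbitrary choices; since Kendall's tau depends on the copula only, it is unchanged by the marginal transformations, which the sketch verifies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Gaussian copula by inversion of (5); rho = 0.7 is an arbitrary illustration.
rho = 0.7
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=20_000)
u = stats.norm.cdf(z)                  # copula data: uniform margins
x1 = stats.expon.ppf(u[:, 0])          # exponential margin
x2 = stats.gamma.ppf(u[:, 1], a=3)     # gamma margin

# Kendall's tau of the Gaussian copula is 2/pi * arcsin(rho); it survives
# the monotone marginal transformations because it depends on C only.
tau_theory = 2 / np.pi * np.arcsin(rho)
tau_emp = stats.kendalltau(x1, x2)[0]
print(abs(tau_emp - tau_theory) < 0.02)  # -> True
```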
From (6) for $d = 2$ we can immediately derive expressions for the conditional density and distribution function, which are needed later. In particular, the conditional density and distribution function can be expressed as

(7) $f_{1|2}(x_1 \mid x_2) = c_{12}\big(F_1(x_1), F_2(x_2)\big)\, f_1(x_1)$

(8) $F_{1|2}(x_1 \mid x_2) = \frac{\partial}{\partial u_2}\, C_{12}\big(F_1(x_1), u_2\big)\Big|_{u_2 = F_2(x_2)}.$
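For copula families with closed form distribution functions, the conditional distribution (8) is simply a partial derivative of the copula. A sketch for the Clayton family (the parameter value is arbitrary), with the closed form verified against a finite difference:

```python
# Conditional distribution (8) for a Clayton copula
# C(u1, u2) = (u1^-t + u2^-t - 1)^(-1/t); theta = 2 is an arbitrary choice.
theta = 2.0

def clayton_cdf(u1, u2):
    return (u1**-theta + u2**-theta - 1.0)**(-1.0 / theta)

def clayton_cond(u1, u2):
    # partial derivative of C with respect to the conditioning argument u2
    return u2**(-theta - 1.0) * (u1**-theta + u2**-theta - 1.0)**(-1.0 / theta - 1.0)

# Check the closed form against a central finite difference.
u1, u2, eps = 0.3, 0.6, 1e-6
fd = (clayton_cdf(u1, u2 + eps) - clayton_cdf(u1, u2 - eps)) / (2 * eps)
print(abs(fd - clayton_cond(u1, u2)) < 1e-6)  # -> True
```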
Since vine copulas are built out of bivariate copulas, we give some properties of bivariate copulas. For this we use pairwise dependence measures such as Kendall's $\tau$ and Spearman's $\rho$, which are completely determined by the copula, to measure the overall strength of the dependence. To assess the tail dependence, upper and lower tail dependence coefficients can be used. It is easy to see that the Gaussian and the Frank copula do not have tail dependence, while the Student t copula has symmetric tail dependence. Further, the Clayton copula has lower but no upper tail dependence, while the Gumbel copula has upper but no lower tail dependence. For more details see for example Chapter 2 of czadobook.
To allow for a visual comparison between different bivariate copula families, marginally normalized contour plots are helpful. For this we consider three different scales: the original scale $x_j$, the copula scale $u_j = F_j(x_j)$ and the marginally normalized scale (z-scale) $z_j = \Phi^{-1}(u_j)$. Here $\Phi$ denotes the distribution function of a standard normal random variable. Comparison of contours on the copula scale for different families is difficult, since copula densities are in general unbounded at the corners of $[0,1]^2$. This is not the case if one works on the z-scale. Here $(Z_1, Z_2)$ has standard normal margins and thus any nonelliptical contour shape indicates a deviation from Gaussian dependence.
We now turn to estimation in a parametric setup. In a copula based model specified by (5) or (6) we have to estimate both the marginal and the copula parameters. For this, joint maximum likelihood estimation can be used if the number of parameters is not too large. However it is more common to use a two step approach: first the marginal parameters are estimated based on i.i.d. observations $x_{ij}$, $i = 1, \ldots, n$, for each margin $j = 1, \ldots, d$. This can be done separately for each of the margins. In a second step pseudo copula data is formed by setting

(9) $\hat{u}_{ij} := \hat{F}_j(x_{ij}), \quad i = 1, \ldots, n, \; j = 1, \ldots, d,$

using the estimated probability integral transform (PIT) $\hat{F}_j$. Then the copula parameters are estimated based on the pseudo copula data. If parametric marginal models are used we speak of an inference for margins (IFM) approach, and if the empirical distribution function is applied to the margins we have a semiparametric approach. The efficiency of the IFM approach has been investigated by joe2005, while the semiparametric approach was proposed by Genest.
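The two step approach can be sketched as follows; a lognormal sample stands in for one margin (the real protein data is not reproduced), the margin is fitted by a parametric MLE on the log scale, and the pseudo copula data (9) is checked for approximate uniformity:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Stand-in for one skewed margin; parameters are arbitrary.
x = rng.lognormal(mean=1.0, sigma=0.5, size=2000)

# Step 1: estimate the margin parametrically (normal MLE on the log scale).
mu_hat, sd_hat = np.log(x).mean(), np.log(x).std()

# Step 2: estimated PIT -> pseudo copula data, which should be ~ uniform.
u = stats.norm.cdf((np.log(x) - mu_hat) / sd_hat)
ks = stats.kstest(u, "uniform")
print(ks.pvalue > 0.01)  # -> True (no evidence against uniformity)
```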
In the case where parametric marginal and copula models do not fit the data, we can also use a nonparametric approach. For the margins, kernel density based estimates of the distribution function can be used. For the bivariate copula case many nonparametric estimators are also available. Their efficiency has been investigated in nagler2014kernel. The results showed that bivariate kernel density based estimation is preferred, and this option is also available in rvinecopulib of rvinecopulib2019 to fit the more general class of vine distributions, which are built using bivariate copulas. We now give a short exposition of this class.
While the catalogue of bivariate parametric copula families is large, this is not the case for dimensions $d > 2$. Therefore the aim of early research on vine copulas was to find a way to construct multivariate copulas using only bivariate copulas as building blocks. The appropriate tool to obtain such a construction is conditioning: Joe96 gave the first pair copula construction of a multivariate copula in terms of distribution functions, while Bedford3, Bedford2 independently developed constructions in terms of densities. Additionally they provided a general framework to identify all possible constructions. We shortly illustrate this construction for $d = 3$, starting with the recursive factorization

(10) $f(x_1, x_2, x_3) = f_3(x_3)\, f_{2|3}(x_2 \mid x_3)\, f_{1|23}(x_1 \mid x_2, x_3)$

and treat each term separately. To determine $f_{1|23}$ we consider the bivariate conditional density $f_{12|3}(x_1, x_2 \mid x_3)$. This density has $F_{1|3}$ and $F_{2|3}$ as marginal distributions ($f_{1|3}$ and $f_{2|3}$ as marginal densities) with associated conditional copula density $c_{12;3}(\cdot, \cdot\,; x_3)$. More specifically, $c_{12;3}$ denotes the copula density associated with the conditional distribution of $(X_1, X_2)$ given $X_3 = x_3$. Applying (6) to $f_{12|3}$ gives

(11) $f_{12|3}(x_1, x_2 \mid x_3) = c_{12;3}\big(F_{1|3}(x_1 \mid x_3), F_{2|3}(x_2 \mid x_3); x_3\big)\, f_{1|3}(x_1 \mid x_3)\, f_{2|3}(x_2 \mid x_3).$
Now $f_{1|23}$ is the conditional density of $X_1$ given $(X_2, X_3) = (x_2, x_3)$, which can be determined using (7) applied to (11), yielding

(12) $f_{1|23}(x_1 \mid x_2, x_3) = c_{12;3}\big(F_{1|3}(x_1 \mid x_3), F_{2|3}(x_2 \mid x_3); x_3\big)\, f_{1|3}(x_1 \mid x_3).$

Finally, direct application of (7) gives

(13) $f_{1|3}(x_1 \mid x_3) = c_{13}\big(F_1(x_1), F_3(x_3)\big)\, f_1(x_1)$

(14) $f_{2|3}(x_2 \mid x_3) = c_{23}\big(F_2(x_2), F_3(x_3)\big)\, f_2(x_2).$

Inserting (12), (13) and (14) into (10) yields a pair copula decomposition of an arbitrary three dimensional density as

(15) $f(x_1, x_2, x_3) = c_{12;3}\big(F_{1|3}(x_1 \mid x_3), F_{2|3}(x_2 \mid x_3); x_3\big)\, c_{13}\big(F_1(x_1), F_3(x_3)\big)\, c_{23}\big(F_2(x_2), F_3(x_3)\big)\, f_1(x_1)\, f_2(x_2)\, f_3(x_3).$
We see that the joint three dimensional density can be expressed in terms of bivariate copulas and (conditional) distribution functions. However this decomposition is not unique, since

(16) $f(x_1, x_2, x_3) = c_{13;2}\big(F_{1|2}(x_1 \mid x_2), F_{3|2}(x_3 \mid x_2); x_2\big)\, c_{12}\big(F_1(x_1), F_2(x_2)\big)\, c_{23}\big(F_2(x_2), F_3(x_3)\big)\, f_1(x_1)\, f_2(x_2)\, f_3(x_3)$

(17) $f(x_1, x_2, x_3) = c_{23;1}\big(F_{2|1}(x_2 \mid x_1), F_{3|1}(x_3 \mid x_1); x_1\big)\, c_{12}\big(F_1(x_1), F_2(x_2)\big)\, c_{13}\big(F_1(x_1), F_3(x_3)\big)\, f_1(x_1)\, f_2(x_2)\, f_3(x_3)$

are two different decompositions obtained by reordering the variables in (10). In these decompositions the conditional copula densities are in general different when the value of the conditioning variable changes. This is intractable for estimation, and therefore often the so called simplifying assumption is made, i.e. the dependence on the conditioning value is ignored. For example, we set $c_{12;3}(\cdot, \cdot\,; x_3) = c_{12;3}(\cdot, \cdot)$ in (15) for every choice of $x_3$. Bivariate copulas are also referred to as pair copulas. Using the simplifying assumption in (15) we speak of a pair copula construction rather than a decomposition. In this setup we can estimate both marginal and copula parameters in a parametric setup, or use a nonparametric approach for both margins and bivariate copulas. Note that the specification of marginal and pair copula distributions can be done independently; thus we have constructed a very flexible class of three dimensional densities.
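The construction (15) under the simplifying assumption can be verified numerically in the Gaussian case, where the pair copula construction with a partial correlation parameter reproduces the trivariate Gaussian copula exactly; all correlation values below are arbitrary illustration choices.

```python
import numpy as np
from scipy import stats

# Pair / partial correlations for the construction (15); arbitrary values.
r13, r23, r12_3 = 0.5, -0.4, 0.3

def gauss_c(u1, u2, r):
    """Bivariate Gaussian copula density."""
    z1, z2 = stats.norm.ppf(u1), stats.norm.ppf(u2)
    num = np.exp(-(r**2 * (z1**2 + z2**2) - 2*r*z1*z2) / (2*(1 - r**2)))
    return num / np.sqrt(1 - r**2)

def h(u1, u2, r):
    """Conditional distribution (8) of the Gaussian copula."""
    z1, z2 = stats.norm.ppf(u1), stats.norm.ppf(u2)
    return stats.norm.cdf((z1 - r*z2) / np.sqrt(1 - r**2))

def pcc_density(u1, u2, u3):
    # (15) on uniform margins: c_13 * c_23 * c_12;3(F(u1|u3), F(u2|u3))
    return (gauss_c(u1, u3, r13) * gauss_c(u2, u3, r23)
            * gauss_c(h(u1, u3, r13), h(u2, u3, r23), r12_3))

# Reference: trivariate Gaussian copula density with the implied rho_12.
r12 = r12_3 * np.sqrt((1 - r13**2) * (1 - r23**2)) + r13 * r23
R = np.array([[1, r12, r13], [r12, 1, r23], [r13, r23, 1]])
z = stats.norm.ppf([0.3, 0.7, 0.55])
ref = stats.multivariate_normal(cov=R).pdf(z) / np.prod(stats.norm.pdf(z))
print(np.isclose(pcc_density(0.3, 0.7, 0.55), ref))  # -> True
```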
This construction principle involving only marginal distributions and pair copulas can be extended to arbitrary dimensions. For arbitrary dimension $d$ we need a building plan identifying which bivariate conditional distributions and their associated bivariate copulas are needed in the construction. This building plan is specified through a set of linked trees, where the nodes of the next tree are the edges of the current tree. Further, the edges of the next tree have to satisfy the so called proximity condition. This tree structure was called a vine tree structure by Bedford3, Bedford2. The corresponding edges identify the pair copulas needed to construct the $d$ dimensional density, called a regular (R) vine distribution. Since the number of such vine tree structures grows super exponentially in $d$ (morales2011count), we need a greedy algorithm such as the one developed by dissmann2013selecting. Under the simplifying assumption, parameter estimation can be performed in a stepwise fashion over the pair copula terms in a tree. This approach was developed by aas2009pair and applied to financial data. There are two simple subclasses of Rvine distributions. The first one, the Dvine distribution, allows as tree structures only paths through all nodes. The class of Cvines occurs when all trees are stars with a root node in the center.
For a detailed introduction to pair copula constructions and the resulting vine distributions consult czadobook and joe2014dependence. The use of such vine copula based models has grown tremendously. aas2016pair and czadonagler2022 give reviews and point to current developments.
4 Vine copula based regression models
After constructing multivariate distributions we are interested in constructing a vine copula based regression model to allow for nonlinear regression effects. In a first approach, chang2019prediction, chang19a, cooke2019vine build a vine distribution for the covariates only and then add the response to the vine tree structure such that the resulting structure remains a vine tree structure and the conditional response distribution can be specified in closed form. In this way numerical integration over the covariates is avoided. This initial focus on the joint distribution of the covariates is not natural in the context of a regression problem; an alternative is therefore to start with the response as first node and add covariates one by one, in such a way that the resulting structure is again a vine tree structure and the associated density of the response given the covariates is available in closed form. Adding covariates stops when the conditional (penalized) loglikelihood no longer increases as a further covariate is added. This approach was first developed for Dvine models by kraus2017d and later extended to certain Rvine structures by zhu2021simplified. tepegjozova2021nonparametric considered both D and Cvines, but generalized the procedure to look two steps ahead before adding a new covariate. While the D and Cvine based regression methods are feasible in large dimensions, the search for Rvine based regressions is restricted to smaller dimensions because of the huge number of allowed Rvine tree structures.
5 Dvine copula based structural equation models
We now extend the LGBN model (4) to allow for nonlinear effects of the parents on a node. For this we recall the factorization (1) and formulate a vine copula based regression model for the conditional density of a node given its parents. If we use the Dvine based regression formulation discussed in Section 4, then we can express the conditional density of node $j$ given its ordered parents $pa(j) = (p_1(j), \ldots, p_{q_j}(j))$, where $q_j$ is the number of parents of node $j$, for $j = 1, \ldots, d$ as

(18) $f\big(x_j \mid x_{pa(j)}\big) = f_j(x_j) \prod_{k=1}^{q_j} c_{j, p_k(j); p_1(j), \ldots, p_{k-1}(j)}\Big( F\big(x_j \mid x_{p_1(j)}, \ldots, x_{p_{k-1}(j)}\big),\; F\big(x_{p_k(j)} \mid x_{p_1(j)}, \ldots, x_{p_{k-1}(j)}\big) \Big),$

where the pair copulas are assumed to satisfy the simplifying assumption and for $k = 1$ the conditioning set is empty. This will now form the building block for the factorization (1), and in view of (4) we call the model specified by (1) and (18) a Dvine copula based SEM.
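A heavily simplified sketch of the forward selection used to order the parents in (18): Gaussian pair copulas stand in for the general parametric and nonparametric families used in the paper, an AIC style penalty replaces the penalized conditional loglikelihood, and, unlike the full algorithm of kraus2017d, the covariate PITs are not updated after each step. The toy data makes the node depend on candidate parent "a" but not on "b".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def fit_pair(u_node, u_cov):
    """Fit a Gaussian pair copula; return loglik, updated node PIT, parameter."""
    zn, zc = stats.norm.ppf(u_node), stats.norm.ppf(u_cov)
    r = np.clip(np.corrcoef(zn, zc)[0, 1], -0.99, 0.99)
    ll = (-0.5 * np.log(1 - r**2)
          - (r**2 * (zn**2 + zc**2) - 2*r*zn*zc) / (2 * (1 - r**2))).sum()
    h = stats.norm.cdf((zn - r*zc) / np.sqrt(1 - r**2))  # node PIT given cov
    return ll, h, r

# toy copula-scale data: the node depends on parent "a", not on "b"
n = 3000
za = rng.normal(size=n)
u = {"a": stats.norm.cdf(za), "b": rng.uniform(size=n)}
u_node = stats.norm.cdf(0.8 * za + 0.6 * rng.normal(size=n))

selected, loglik, remaining = [], 0.0, dict(u)
while remaining:
    fits = {p: fit_pair(u_node, v) for p, v in remaining.items()}
    best = max(fits, key=lambda p: fits[p][0])
    gain = fits[best][0]
    if gain <= 1.0:          # AIC-style penalty: one parameter per pair copula
        break                # no supported parent left -> edge removal
    selected.append(best)
    loglik += gain
    u_node = fits[best][1]   # condition the node on the newly added parent
    del remaining[best]

print(selected[0])  # -> a
```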
The order of the parent nodes in the Dvine based conditional density (18) is a priori not fixed; we will, however, later choose the order obtained by the methods developed by kraus2017d and described in Section 4. Thus we start with the most important parent node with regard to the conditional (penalized) loglikelihood of the node and continue adding parent nodes until this quantity can no longer be improved or all parent nodes are included.
Since the forward selection of the parent nodes also includes the possibility of not including all parents, this gives a way to remove edges from the starting consent graph. Such edges are then not supported by the data. We will now apply this class of copula based SEMs to the experimental setting of the Sachs data.
6 Analysis of experimental setting from the Sachs data
For this analysis we used the data from https://science.sciencemag.org/content/308/5721/523/tabfiguresdata on the logarithmic scale. Recall that using the consent graph given in Figure 3 results in the density decomposition (3) for the eleven protein variables. Next we want to utilize the Dvine copula based SEM introduced in Section 5. This entails that every density term in (3) is modeled using the Dvine regression introduced in Section 4. Using (18), it follows that (3) can be rewritten as
(19)  
Note that if a node has more than one parent, the order of the parents in the Dvine might change depending on the chosen marginals and copulas. In this case we use numbers to denote the position in the order of the Dvine instead of the explicit nodes. Hence, "1" denotes the first parent of the modeled node, "2" the second parent in the Dvine regression, and so on. For our analysis we have 845 observations available for the experiment. In this experiment the reagent AntiCD3/CD28 (general perturbation: activates T cells and induces proliferation and cytokine production; induces signaling through the T cell receptor (TCR), activating ZAP70, Lck, PLCg, Raf, Mek, Erk, and PKC; the TCR signaling converges on the transcription factors NFkB, NFAT, and AP1 to initiate IL2 transcription, Sachs2005) and the AKT inhibitor (inhibitor specific perturbation: binds the inositol pleckstrin domain of AKT and blocks AKT translocation to the membrane, where normally AKT becomes phosphorylated and active [median inhibitory concentration (IC50) 0.5 mM]; inhibition of AKT and of the phosphorylation of AKT substrates enhances cell survival effects, Sachs2005) have an activating effect on the protein levels and an inhibiting effect on akt, according to Figure 2 of Sachs2005.
To estimate the parameters of the marginal and copula terms in Equation (6) we follow the IFM approach discussed in Section 3. So in a first step we have to estimate the margins and then form appropriate pseudo copula data as in (9). In the second step we fit a Dvine copula with the node as first variable and its parents as remaining variables, using the described forward selection approach.
To see how the fit of the model varies when using different marginals (M) or copulas (C), we fit three different copula based SEMs. For the first model we use kernel density estimates as margins and allow for parametric and nonparametric copulas (MC), for the second one we use Gaussian mixture margins, since we observe some multimodal margins, together with the same set of copulas (MC), and for the last model we only allow for Gaussian margins and Gaussian copulas (MC). A nonparametric pair copula is chosen when it fits better than any implemented parametric family; this is implemented in vinereg of vinereg. The fit of the different marginals is shown in Figure A4 of Appendix A1, comparing marginal histograms to the fitted densities and qq plots, and checking whether the data, after applying the PIT using the estimated marginal densities, is approximately uniformly distributed. We observe that using Gaussian margins for the PIT in general does not result in uniformly distributed data, indicating a bad fit. On the other hand, Gaussian mixture margins and kernel density estimates fit the margins very well. It is hard to distinguish between these two models. We also consider the goodness of fit measures loglikelihood, AIC and BIC of the fitted marginals. The results are displayed in Table A1 in Appendix A2 and show that Gaussian margins give the worst fit.
To support the use of copula based SEMs instead of linear Gaussian ones, we show normalized contour plots in Figure A8 of Appendix A1. There we see nonelliptical shapes in the lower triangular panels regardless of which marginal models were used, pointing towards the need to capture non Gaussian dependence.
In the next step we use the fitted marginals to transform the data to the copula scale by applying the estimated PIT. On these new data sets we then fit copulas according to the three model specifications defined above.
Using kernel density estimates for the PIT and then fitting a Dvine regression model for each node with at least one parent results in a loglikelihood of 2316.23, an AIC of 4228.73 and a BIC of 3272.06 for the copula terms. Using Gaussian mixture margins and the same set of copula families results in a loglikelihood of 2292.34, an AIC of 4189.49 and a BIC of 3252.98 for the copula terms. The last model, where Gaussian margins and Gaussian copulas are used, worsens the three measures to a loglikelihood of 1497.84, an AIC of 2971.68 and a BIC of 2914.80 for the copula terms. More detailed results on the copula scale are displayed in Table A2 and Table A3, as well as contour plots of the fitted copulas in Figure A12 of Appendix A3. They show that Gaussian pair copulas are only seldom chosen in the non Gaussian SEMs. Further, the parametric pair copula families contained in vinereg are not sufficient, and many nonparametric pair copulas (tll) are selected. We see strong similarities on the fitted copula scale between the MC model and the MC model.
If a parent is not selected in the Dvine regression for a node, this indicates that the corresponding pair copula density is very close to the bivariate independence copula density ($c(u_1, u_2) = 1$ for every $(u_1, u_2) \in [0,1]^2$) and thus the edge is not supported by the data. This implies that in both copula based non Gaussian SEMs only 17 of the 20 dependencies given by the biological consent graph are present for this experimental setting (see Panel (b) of Figure 3). The removed edges are and . The deletion of implies that there is no pathway from to . This might be a consequence of the inhibiting effect on . Further, the deletion of implies that can only be reached from over , which is sensible since in this experiment is activated. The removal of is also sensible since is inhibited in this setting. The edges and were also identified as edges to be removed from the consent graph in the original analysis of Sachs2005.
The number of supported edges decreases further for the MC model. Here only 12 edges are modelled. The removal of these additional edges (colored red in Panel (b) of Figure 3) is questionable, since, as we show next, the fit of the MC model is inferior to that of the other two copula based SEMs.
We now include the marginal fits of the different SEMs based on Equation (18) to compute overall fit measures, which also allow us to compare with the performance of the standard LGBN. These are contained in Table 7.
Model   raf   mek   plc  pip2  pip3   erk   akt   pka   pkc   p38   jnk    Sum
LGBN    970   552   922  1249   987   968   196   831   952   571   992   9189
MC      902   399   827   921   895   822    79   766   770   246   837   7464
MC      907   416   826   934   917   830   103   768   775   263   845   7587
MC      970   552   922  1249   987   968   197   831   952   571   993   9193

(a) Loglikelihood of the conditional densities

Model    raf   mek   plc  pip2  pip3   erk   akt   pka   pkc   p38   jnk     Sum
LGBN    1948  1113  1850  2507  1978  1944   402  1668  1913  1149  1992   18463
MC      1858   876  1718  1938  1813  1695   236  1552  1556   565  1750   15558
MC      1854   889  1710  1967  1844  1702   276  1555  1569   590  1747   15703
MC      1948  1113  1850  2505  1978  1943   401  1666  1911  1148  1991   18453

(b) AIC of the conditional densities

Model    raf   mek   plc  pip2  pip3   erk   akt   pka   pkc   p38   jnk     Sum
LGBN    1967  1137  1864  2526  1987  1963   426  1687  1932  1168  2011   18666
MC      1989  1057  1870  2165  1868  1817   421  1599  1596   738  1930   17051
MC      1946  1021  1847  2202  1867  1800   441  1597  1611   740  1883   16957
MC      1967  1132  1864  2519  1987  1957   420  1676  1925  1162  2005   18614

(c) BIC of the conditional densities

Model    raf   mek   plc  pip2  pip3   erk   akt   pka   pkc   p38   jnk   Sum
LGBN       4     5     3     4     2     4     5     3     4     4     4    42
MC        28    38    32    48    12    26    39    10     8    37    38   315
MC        20    28    29    49     5    21    35     9     9    32    29   265
MC         4     4     3     3     2     3     4     2     3     3     3    34

(d) Number of (effective) parameters of the conditional densities
We observe that the LGBN and the Dvine based SEM with Gaussian margins and copulas result in almost identical results for all three goodness of fit measures. This is reasonable as both of them specify a multivariate Gaussian distribution following KollerFriedman2009 and MORALES2008699. The parameters of the conditional Gaussian distribution given by each node in the two models are displayed in Table S3 in the Supplement S1. Many similarities between these two models are visible even though all possible dependencies are modelled in the LGBN, as it was fitted optimizing the conditional loglikelihood with all parents of a node, whereas this is not the case for the MC.
Comparing these two models to the MC models with parametric and nonparametric copulas, we see a strong improvement in all goodness of fit measures for the latter. This is no surprise, since we have already seen that Gaussian mixture margins and kernel density estimates provide a much better fit than Gaussian margins, and allowing flexibility in the choice of the pair copula families provides a further advantage. Between the two non Gaussian models only slight differences are visible, mostly due to the large number of effective parameters needed for the kernel density estimates, which influences the BIC.
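The three goodness of fit measures compared above are all derived from the log-likelihood and the (effective) number of parameters of a conditional model. A minimal sketch with illustrative numbers only (the sample size n = 845 matches the data set; the log-likelihood and parameter count are made up):

```python
import numpy as np

def aic_bic(loglik, n_par, n_obs):
    """AIC and BIC from a log-likelihood and an (effective) parameter count.
    n_par may be fractional for nonparametric fits (effective parameters)."""
    aic = -2.0 * loglik + 2.0 * n_par
    bic = -2.0 * loglik + np.log(n_obs) * n_par
    return aic, bic

# illustrative values: note how BIC penalizes the large effective
# parameter counts of kernel density estimates more strongly than AIC
aic, bic = aic_bic(loglik=-900.0, n_par=25.3, n_obs=845)
print(round(aic, 1), round(bic, 1))   # 1850.6 1970.5
```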
7 Model validation
Up to this point we have only compared goodness of fit measures between the four models, but have not evaluated whether they actually reproduce structures visible in the observed data. For this we sampled 845 times from each model, starting in the root node and sampling it from its fitted marginal density. We then follow the topological order, i.e. for each variable in this order we simulate data for the node using the already simulated data of its parents. For the LGBN we use the fact that the conditional distribution of a node given its parents is again normal. To sample from the Dvine copula based SEMs we utilize the algorithm presented in bevacqua2017multivariate.
To compare the induced pairwise dependence structures we consider scatter plots of node pairs for the observed data and for the simulated data sets from the different models in Figure 8. Looking at the scatter plots of the two non Gaussian MC models we observe almost no differences. This is not surprising, as we have already seen that their fitted copulas are almost identical. Comparing these scatter plots to the observed ones from the data set, we see that they closely match: for all variable pairs, even the more complicated shapes, the simulated data of the two fitted non Gaussian SEMs resemble the original data.
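The sampling scheme along the topological order can be sketched as follows. Everything here is a toy stand-in: a hypothetical three-node DAG with Gaussian conditionals takes the place of the fitted margins and Dvine regressions, and all numeric values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical toy DAG pip3 -> plc -> pip2; in the paper each conditional
# would instead be a fitted marginal density or D-vine copula regression
parents = {"pip3": [], "plc": ["pip3"], "pip2": ["pip3", "plc"]}

def sample_node(node, parent_vals):
    # stand-in conditional samplers with made-up parameters
    if node == "pip3":                       # root: sample from its margin
        return rng.normal(3.5, 0.8)
    mean = 1.0 + 0.5 * sum(parent_vals)      # toy linear conditional
    return rng.normal(mean, 0.5)

def ancestral_sample(n, topo_order):
    out = {v: np.empty(n) for v in topo_order}
    for i in range(n):
        for v in topo_order:                 # parents are sampled first
            pv = [out[p][i] for p in parents[v]]
            out[v][i] = sample_node(v, pv)
    return out

samples = ancestral_sample(845, ["pip3", "plc", "pip2"])
print({k: round(v.mean(), 2) for k, v in samples.items()})
```

The same loop applies to all four models; only `sample_node` changes.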
However, comparing the scatter plots of the Gaussian MC model and the LGBN instead, it is clear that both have problems modeling more complicated shapes, as they can only capture elliptically shaped dependencies. The Gaussian MC model appears to do this slightly better: some of its pairs plots are closer to those of the original data set than the ones from the LGBN. Since in Figure 8 we only considered bivariate dependence properties, we additionally check whether high/low values in several variables at once appear with similar frequency in the observed and the simulated data. Hence, we sum over the variables within each sample and fit kernel density estimates to the resulting sums. We can then compare the kernel density estimates fitted to the sums of the observed data and of the simulated data from the investigated models. The results are displayed in Figure 9.
Comparing the fitted kernel density estimates of the sum over all nodes, we see that the ones fitted to the simulated data from the two non Gaussian MC models are very close to the original data set and are only slightly shifted to the left. On the other hand, the one fitted to the sums of the simulated data from the Gaussian MC model does not reach the height of the mode of the original data set, whereas the LGBN overestimates the height of the mode. Hence, considering the ability of the different models to recreate the data set they were fitted on, the two non Gaussian MC models outperform the other two.
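The summing-and-KDE check can be sketched as follows; the observed and simulated arrays below are synthetic stand-ins with an assumed 845 x 11 shape matching the data dimensions:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# stand-ins for the observed data and one model's simulated data (845 x 11)
observed = rng.normal(3.0, 1.0, size=(845, 11))
simulated = rng.normal(3.05, 1.0, size=(845, 11))

# sum over the 11 variables within each sample, then fit KDEs to the sums
kde_obs = gaussian_kde(observed.sum(axis=1))
kde_sim = gaussian_kde(simulated.sum(axis=1))

# compare the two estimated densities on a common grid
grid = np.linspace(20, 46, 200)
max_gap = np.max(np.abs(kde_obs(grid) - kde_sim(grid)))
print(round(max_gap, 4))
```

Plotting `kde_obs(grid)` and `kde_sim(grid)` over the grid reproduces the kind of comparison shown in Figure 9.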
8 Conditional density and medians for given values of parent nodes
To see the effect of allowing a more general dependence structure in the SEM (18) compared to a LGBN, we investigate the behavior of the conditional density of a node given its parents for the studied models. Here we would like to investigate the effect when median and extreme values of the parents are used. If a node only has one parent in the DAG, we chose three empirical quantiles of the parent node (one central and two in the tails) as conditioning values. This is appropriate for the nodes plc and pka, and the associated conditioning values are shown in Figure S1 of Supplement S2.
Since there is no simple equivalent of quantiles in two or three dimensional distributions, a different approach had to be used for all other nodes. If a node has two or three parents, we fitted a two- or three-dimensional kernel density estimate to the joint distribution of the parent nodes using the data. We then manually chose one point close to the mode of the fitted density, two in the tails and two more which are neither close to the mode nor in the tails. This is illustrated in Figure S2 for nodes with two parents and Figure S9 for nodes with three parents, respectively, in Supplement S2. Note that while six nodes have two parents (pip2, pkc, p38, jnk, raf, erk), three of them (p38, jnk and raf) have the same set of parents, namely pka and pkc, so we only obtain four cases for the values of the conditioning variables. The chosen values of the parent nodes are summarized in Table 8.
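Both selection schemes can be sketched in a few lines. The quantile levels (5%, 50%, 95%) and all data below are assumptions for illustration, not the values used in the paper; the grid search for the mode is one simple way to locate a point near the maximum of a fitted two-dimensional KDE:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)

# single parent: use three empirical quantiles as conditioning values
# (levels here are assumed; one central point, two in the tails)
parent = rng.normal(3.5, 0.8, size=845)
q_low, q_mid, q_high = np.quantile(parent, [0.05, 0.50, 0.95])

# two parents: fit a 2-d KDE and locate a point near its mode on a grid
parents2 = rng.multivariate_normal([3.5, 2.4], [[0.6, 0.2], [0.2, 0.4]], 845)
kde = gaussian_kde(parents2.T)                    # expects shape (d, n)
gx, gy = np.meshgrid(np.linspace(1, 6, 60), np.linspace(0.5, 4.5, 60))
dens = kde(np.vstack([gx.ravel(), gy.ravel()]))
mode_point = (gx.ravel()[dens.argmax()], gy.ravel()[dens.argmax()])
print(round(q_mid, 2), [round(c, 2) for c in mode_point])
```

Tail and middle points would be picked manually from the same density surface, as described above.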
Node  Parent  Mode  Middle  Middle  Tail  Tail
plc  pip3  2.68  3.56  4.38  
pip2  plc  2.40  1.80  2.80  3.80  2.90 
pip3  3.50  3.60  3.90  4.30  2.10  
pkc  plc  2.50  1.90  2.90  1.90  4.00 
pip2  4.90  5.20  4.80  2.50  5.00  
pka  pkc  1.80  2.93  3.46  
p38, jnk, raf  pka  6.00  6.00  7.00  7.50  6.50 
pkc  3.00  2.50  3.00  2.50  1.75  
mek  raf  3.62  4.79  3.49  4.43  1.97 
pka  6.10  6.52  7.40  7.64  6.63  
pkc  2.88  3.44  3.10  2.04  2.71  
erk  mek  3.10  3.50  2.90  4.50  3.00 
pka  6.10  6.20  6.40  6.00  7.50  
akt  erk  2.57  4.22  2.10  2.58  1.54 
pka  6.19  7.15  5.89  6.28  6.63  
pip3  3.73  3.68  3.99  5.41  4.66 
In the following we denote the points chosen close to the mode of the fitted parent distribution as "Mode", the points in the tails as "Tail" and the points between them as "Middle". If a node only has one parent, we assign the points from the two outer quantiles to "Tail" and the point from the central quantile to "Mode". Note that since not all dependencies are modelled in the Gaussian MC model, the number of parents of a node can differ between the models.
As earlier, we sample 845 times from each node of the model given the values of the parent nodes and then fit kernel density estimates to these samples. The resulting density plots are displayed in Figure 10. As expected, we observe only small differences in the general shape of the conditional density plots between the two non Gaussian MC models. The same holds when comparing the Gaussian MC model and the LGBN. However, the shapes of the conditional density can be very different between the non Gaussian copula based SEMs and the Gaussian SEMs, thus illustrating very different conditional effects for the chosen parent values.
As a final visualisation of the differences in relevant conditional quantities we consider the conditional median of each node given its parents. We follow the topological order given in Table 1. For the root node pip3 we use a sample quantile of a chosen level as its value. Next we determine the conditional median of plc given this value using the Dvine quantile regression of kraus2017d. For the following node we determine the conditional median given the previously determined values, again using Dvine quantile regression. We continue in this way until all conditional medians are determined. The parents needed for the different models can be found in Table A3 of Appendix A3. We investigate three different quantile levels for the root node and the results are visualized in Figure 14, where we see pronounced differences for several nodes.
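The sequential propagation of conditional medians can be sketched as follows. As a stand-in for the fitted Dvine quantile regressions we use hypothetical linear conditionals, for which the conditional median equals the conditional mean; the DAG fragment and all coefficients are invented for illustration:

```python
import numpy as np

# hypothetical intercepts and slopes standing in for D-vine quantile
# regressions at level 0.5 (conditional medians)
coefs = {
    "plc":  (0.8, {"pip3": 0.5}),
    "pip2": (0.4, {"pip3": 0.6, "plc": 0.3}),
}

def conditional_median(node, values):
    intercept, slopes = coefs[node]
    return intercept + sum(b * values[p] for p, b in slopes.items())

def propagate_medians(root_value, topo_order):
    values = {"pip3": root_value}       # start from a root quantile of pip3
    for node in topo_order:             # follow the topological order
        values[node] = conditional_median(node, values)
    return values

# e.g. start from an upper sample quantile of pip3 and push medians downstream
vals = propagate_medians(4.38, ["plc", "pip2"])
print({k: round(v, 3) for k, v in vals.items()})
```

Repeating this for a lower, central and upper root quantile gives the three curves compared per node in Figure 14.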
9 Summary and future research directions
In this paper we suggested to use a Dvine copula based SEM to analyze data which obeys a graphical structure. At least a starting graph has to be known, but the data does not need to follow a joint Gaussian distribution. The copula regression model for the conditional distribution of a node given its parents is very flexible, since the bivariate building blocks of the Dvine regression can be chosen independently, in addition to the marginal models for the node and its parents.
The proposed approach uses data reduction in two ways, first by the choice of the graph and second by the forward selection of parent nodes ordered by importance. Edges pointing to a node in the starting graph can be removed when the corresponding parent node is not selected as a covariate in the Dvine regression of the node. This allows for edge selection without relying on a Gaussian assumption for the data. This approach identified three edges in the network which are no longer supported under the experimental conditions investigated. The removal of these edges was found to be plausible given the specific experimental conditions.
We illustrated the approach using the experimental setting of the Sachs data. Here we showed a much better fit of the copula based SEM with margins fitted by mixtures of univariate normals or by nonparametric kernel density estimation, together with non Gaussian pair copulas.
These Dvine copula based SEMs are easy to fit and can be applied to large networks, since Dvine regressions are feasible for several hundred covariates. In addition it is easy to simulate from the model and assess conditional densities or probabilities of the nodes given observed values of the selected parent nodes.
These Dvine copula based SEMs can also be utilized to conduct a causal analysis using the pooled Sachs data. For this we would subdivide the data in a node specific way: for a given node we would select only observations from experimental settings which do not inhibit or activate that node. This would remove the interventional effects present in the data; such an analysis is currently under investigation.
While the requirement of an initial graph may seem restrictive, the approach can be combined with any structure learning approach. In particular, to allow for non Gaussian structure selection, the vine copula based approach of bauer2012pair for testing conditional independence within a PC algorithm (kalisch2007estimating, spirtes2011causation) can be used, and the selected graph can then be fitted using the proposed Dvine copula based SEM.
Two immediate extensions of the proposed modeling approach are to allow for Rvine based SEMs or to improve the forward selection procedure by looking two steps ahead instead of one, as proposed by tepegjozova2021nonparametric.
A non Gaussian analysis of the complete Sachs data would require the development of efficient mixtures of vine regression models to accommodate the different clusters induced by the different experimental settings. A first step in this direction is sahin2021vine, who developed a data driven algorithm for the specification of vine distributions in a mixture setup for clustering. Such a development might further improve the fit to the experimental data, since for some nodes we can observe clusters in the data (see Figure A8).
Appendix
A1: Marginal fits for the copula based structural equation model
Kernel density estimates  Gaussian mixture margins  Gaussian margins  

Node  Loglik.  AIC  BIC  # Par.  Loglik.  AIC  BIC  # Par.  Loglik.  AIC  BIC 
pip3  895.20  1813.47  1868.14  11.54  916.83  1843.66  1867.36  5  986.90  1977.81  1987.29 
plc  952.81  1927.03  1977.73  10.70  951.96  1919.92  1957.83  8  979.48  1962.97  1972.44 
pip2  1214.93  2447.95  2490.80  9.04  1225.04  2466.07  2503.99  8  1338.18  2680.35  2689.83 
pkc  771.76  1558.08  1592.58  7.28  777.10  1570.20  1608.11  8  954.10  1912.21  1921.69 
pka  767.31  1552.31  1594.24  8.85  769.54  1555.08  1592.99  8  831.22  1666.44  1675.92 
p38  765.47  1546.82  1584.45  7.94  778.79  1567.58  1591.28  5  925.20  1854.40  1863.88 
jnk  923.74  1874.36  1938.03  13.44  929.73  1875.46  1913.37  8  1008.16  2020.31  2029.79 
raf  921.52  1865.16  1917.58  11.06  926.74  1863.48  1887.18  5  971.98  1947.95  1957.43 
mek  710.24  1450.37  1521.20  14.94  725.23  1460.47  1484.16  5  785.21  1574.41  1583.89 
erk  991.42  1997.43  2032.01  7.30  997.54  1999.07  2008.55  2  997.54  1999.07  2008.55 
akt  865.88  1753.80  1806.03  11.02  880.65  1771.30  1794.99  5  912.43  1828.85  1838.33 
Sum  9780.28  19786.78  20322.79  113.11  9879.15  19892.29  20209.81  67  10690.40  21424.77  21529.04
We see that all three measures are very similar for the fitted Gaussian mixture margins and the kernel density estimates. For the loglikelihood and the AIC the kernel density estimates are slightly better, whereas for the BIC the Gaussian mixture margins marginally outperform the kernel density estimates. This difference is due to the higher number of parameters needed by the kernel density estimates. Compared to these two approaches, fitting Gaussian margins results in an overall worse fit.
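The marginal comparison above can be reproduced in outline by fitting the three candidate margins to one node's data and comparing log-likelihoods. A sketch on synthetic bimodal data; the mixture parameters are fixed at the generating values here for simplicity, whereas in the paper they are estimated, and the in-sample KDE log-likelihood is optimistic since each point contributes its own kernel:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(4)
# synthetic bimodal stand-in for one node's 845 measurements
x = np.concatenate([rng.normal(2.0, 0.4, 500), rng.normal(4.0, 0.5, 345)])

# Gaussian margin: 2 parameters
ll_gauss = norm.logpdf(x, x.mean(), x.std()).sum()

# two-component Gaussian mixture; parameters fixed at the generating
# values here, whereas in the paper they are estimated from the data
w = 500 / 845
ll_mix = np.log(w * norm.pdf(x, 2.0, 0.4)
                + (1 - w) * norm.pdf(x, 4.0, 0.5)).sum()

# kernel density estimate; its in-sample log-likelihood is optimistic
ll_kde = np.log(gaussian_kde(x)(x)).sum()

print(ll_gauss < ll_mix, ll_gauss < ll_kde)   # True True
```

On bimodal data the single Gaussian is clearly dominated, mirroring the pattern in the table above; AIC and BIC then penalize the (effective) parameter counts as discussed.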
A2: Exploring pairwise dependence under different marginal specifications
A3: Copula fits for the copula based SEM
MC  MC  MC  

Node  Order  Loglik.  AIC  BIC  Order  Loglik.  AIC  BIC  Order  Loglik.  AIC  BIC 
plc  pip3  126.09  209.26  107.55  pip3  125.75  209.73  110.76  pip3  57.72  113.43  108.69 
pip2  pip3, plc  293.69  509.66  325.46  pip3, plc  290.73  498.59  302.22  pip3  88.80  175.61  170.87 
pkc  pip2  1.82  1.64  3.09  pip2  1.84  1.68  3.06  pip2  1.73  1.46  3.28 
pka  pkc  1.08  0.16  4.58  pkc  1.13  0.26  4.48  0  0  0  
p38  pkc, pka  519.59  982.02  846.57  pkc, pka  515.32  977.28  850.79  pkc  354.14  706.28  701.54 
jnk  pkc, pka  86.66  124.26  7.99  pkc, pka  84.79  128.34  30.61  pkc  15.54  29.08  24.34 
raf  pka, pkc  20.00  6.88  71.59  pka, pkc  19.32  9.59  59.28  pka, pkc  2.15  0.30  9.18 
mek  raf, pkc, pka  310.82  574.83  463.88  raf, pkc, pka  308.73  571.78  463.55  raf, pkc  232.86  461.72  452.24 
erk  pka  169.47  302.22  215.22  pka  167.42  297.22  208.06  pka  29.16  56.33  51.59 
akt  erk, pka  787.00  1517.80  1384.65  erk, pka  777.31  1495.02  1353.81  erk, pka  715.74  1427.47  1417.99 
Sum  2316.23  4228.73  3272.06  2292.34  4189.49  3252.98  1497.84  2971.68  2914.80
(a) MC
Node  Pair copula  Family  Effective # parameters  Copula loglik.  AIC  BIC  Est. Kendall's tau
plc  plc pip3  tll  21.46  126.09  209.26  107.55  0.24 
pip2  pip2 pip3  tll  19.77  175.60  311.66  217.98  0.05 
pip2 plc; pip3  tll  19.10  118.10  198.00  107.48  0.29  
pkc  pkc pip2  clayton  1.00  1.82  1.64  3.09  0.01 
pka  pka pkc  frank  1.00  1.08  0.16  4.58  0.03 
p38  p38 pkc  tll  27.58  517.89  980.62  849.91  0.60 
p38 pka; pkc  joe  1.00  1.70  1.40  3.34  0.01  
jnk  jnk pkc  tll  23.53  85.59  124.11  12.58  0.15 
jnk pka; pkc  gumbel  1.00  1.08  0.15  4.59  0.00  
raf  raf pka  tll  15.56  18.74  6.36  67.37  0.05 
raf pkc; pka  clayton  1.00  1.26  0.52  4.22  0.01  
mek  mek raf  tll  21.41  304.00  565.19  463.72  0.48 
mek pkc; raf  gaussian  1.00  4.54  7.09  2.35  0.06  
mek pka; raf pkc  gumbel  1.00  2.28  2.55  2.19  0.01  
erk  erk pka  tll  18.36  169.47  302.22  215.22  0.10 
akt  akt erk  tll  26.10  664.89  1277.59  1153.92  0.67 
akt pka; erk  bb8  2.00  122.11  240.21  230.73  0.26 
(b) MC
Node  Pair copula  Family  Effective # parameters  Copula loglik.  AIC  BIC  Est. Kendall's tau
plc  plc pip3  tll  20.88  125.75  209.73  110.76  0.24 
pip2  pip2 pip3  tll  20.32  174.14  307.64  211.35  0.05 
pip2 plc; pip3  tll  21.12  116.59  190.95  90.87  0.29  
pkc  pkc pip2  clayton  1.00  1.84  1.68  3.06  0.01 
pka  pka pkc  frank  1.00  1.13  0.26  4.48  0.03 
p38  p38 pkc  tll  25.69  513.63  975.89  854.14  0.60 
p38 pka; pkc  joe  1.00  1.69  1.39  3.35  0.01  
jnk  jnk pkc  tll  19.62  83.69  128.13  35.14  0.15 
jnk pka; pkc  gumbel  1.00  1.10  0.21  4.53  0.00  
raf  raf pka  tll  13.53  18.11  9.16  54.97  0.05 
raf pkc; pka  clayton  1.00  1.21  0.43  4.31  0.01  
mek  mek raf  tll  20.84  301.89  562.11  463.35  0.48 
mek pkc; raf  gaussian  1.00  4.67  7.33  2.59  0.06  
mek pka; raf pkc  gaussian  1.00  2.17  2.34  2.39  0.01  
erk  erk pka  tll  18.81  167.42  297.22  208.06  0.10 
akt  akt erk  tll  27.80  652.93  1250.27  1118.54  0.67 
akt pka; erk  bb8  2.00  124.38  244.75  235.27  0.26 
(c) MC
Node  Pair copula  Family  Copula loglik.  AIC  BIC  Par.  Est. Kendall's tau
plc  plc pip3  gaussian  57.72  113.43  108.69  0.36  0.24 
pip2  pip2 pip3  gaussian  88.80  175.61  170.87  0.44  0.05 
pkc  pkc pip2  gaussian  1.73  1.46  3.28  0.06  0.01 
p38  p38 pkc  gaussian  354.14  706.28  701.54  0.75  0.60 
jnk  jnk pkc  gaussian  15.54  29.08  24.34  0.19  0.15 
raf  raf pka  gaussian  1.08  0.15  4.59  0.05  0.05 
raf pkc; pka  gaussian  1.08  0.15  4.59  0.05  0.01  
mek  mek raf  gaussian  227.58  453.17  448.43  0.65  0.48 
mek pkc; raf  gaussian  5.27  8.55  3.81  0.11  0.06  
erk  erk pka  gaussian  29.16  56.33  51.59  0.26  0.10 
akt  akt erk  gaussian  530.82  1059.63  1054.89  0.85  0.67 
akt pka; erk  gaussian  184.92  367.84  363.10  0.60  0.26 



References
Supplementary Material
S1: Comparison of linear Gaussian Bayesian network and structural equation model based on a Gaussian copula with Gaussian margins
Node 
