1 Introduction
Throughout their lifetime, organisms express their genetic program, i.e. the instruction manual for molecular actions in every cell. The products of the expression of this program are messenger RNA (mRNA); the blueprints to produce proteins, the cornerstones of the living world. The diversity of shapes and the fate of cells is a result of different readings of the genetic material, probably because of environmental factors, but also because of epigenetic organisational capacities. The genetic material appears regulated to produce what the organism needs in a specific situation. We now have access to rich genomics data sets. We see them as instantaneous images of cell activity from varied angles, through different filters. Patterns of action of living organisms can be deciphered using adequate methods from these data. Probabilistic (in)dependence relationships can most certainly be extracted from observational data, but we want to go one step further. If the data is generated from a causal structure, what clues can be found in the data to help uncover this hidden structure without resorting to intervention? Intervention is not always practical, nor ethical. In our biological network study context, even though modern techniques allow scientists to intervene on the system, e.g. knocking out one gene or blocking a molecular pathway, they are still expensive experiments. We aim to investigate which kinds of causal links can (and which cannot) be discovered from observational data, defined by biologists as wildtype and steadystate.
In short, this chapter is reviewing causal reconstruction in gene network from observational data. More specifically, it blends a presentatioan application on a real network n of Systems Biology elements, methods for gene network reconstruction and concepts in statistical causality in Section 2. This section narrows down the type of system under scrutiny, namely gene regulatory networks, although our work could apply to more detailed representation of biological systems if the data were available. Section 3 introduces, in detail, the class of method we chose to focus on: Bayesian networks. Despite being designed to decipher probabilistic (in)dependencies in the data, we test its capacity to learn causality from observational data. This is the purpose of the preliminary experiments on small networks and an application on a real network in Section 4. All simulations were performed with the aid of the R package bnlearn scutari2009 ; bnR . We conclude with a short discussions in Section 5.
2 Biological networks and causality
Biological system inner organisation is at the origin, and is the essence of the diversity of life we observe around us, while allowing organisms to adapt to their environments. They have received growing attention for the last 20 years edwards1999 ; kitano2002 ; noble2006 , and are complex both in terms of their many different kinds of interacting elements that they constitute; and in terms of the diversity of the responses that they drive. They can be understood at different levels: from the interaction of species within an ecosystem, to the finer molecular grain through interactions between cells or the communications between organs. We focus on the latter situation in the present work.
Networks for biological systems
Biological networks offer an integrated and interpretable tool for biological systems knowledge modelling and visualisation. Nodes in these networks stand for biological entities (e.g. genes, proteins), whilst edges encode interactions between these entities barabasi2004 . The definition of the biological entities is far from trivial  for example the definition of a gene is not unique gericke2007 ; pennisi2007 . In our “small world” modelling representation of the reality mcelreath2015 , we use simplified definitions of these biological elements, and coin them variables in our system. Some nodes can represent a phenotype of interest: e.g. the yield or level of resistance to a disease or to an adverse environmental factor scutari2014 . An edge, on the other hand, has a more diffuse meaning: ranging from a vague association of the measurements linked to the nodes, to a functional dependency between them driven by an explicit mechanism. We will be primarily interested in directed edges; we aim to go beyond coexpression or association networks zhang2005 ; tenenhaus2010 ; rau2015 . In the real biological world, mechanisms exist which trigger pathways within the system of a living organism monod1960 and sometimes between living organisms hentschel2000 ; dupont2015 . On the analytical side of this, we aim to providing a meaningful representation of these interactions, not to say regulations between the components in the system under study. We will use directed acyclic graphs causality , although it is known that gene networks do indeed contain feedback cycles mangan2003 ; zabet2011 ; shojaie2014 . Solutions exists to consider feedback cycles in the Bayesian network framework. For example, time can be used to unfold the successive regulations taking place between the elements of a cycle ghahramani1998 ; friedman1998 ; husmeier2003 . When a timeseries is not available, some work detail the existence of feedback cycles as a super graph associated to a family of distributions instead of a unique factorised distribution tulupyev2005 . This is essentially inspired by an inversion of the technique used to condense (aka contract) all strongly connected components (which contain cycles) into a single composite node harary1965 . Notice that each distribution would then correspond to different causal assumptions which would then need be tested using additional data. In lacerda2008 , the authors give consistency results for an algorithm to recover the true distribution of nonGaussian observational data generated from a cyclic structure.
If one is interested in representing a fine grain of biological details, interconnected networks depicting different types of biological entities can be used. A small example can be found in Figure 1 of Chapter 1 of the present book. Some apparent relationships between genes and phenotypes are often the result of direct regulatory links between different sorts of biological entities. We focus on this gene regulatory network (GRN) with the assumption that it can be deciphered from transcript levels.
Data and reconstruction methods
Modern highthroughput technologies such as RNA sequencing allow the researcher to have access to the simultaneous levels of all transcripts in a biological sample. They provide a partial snapshot of the state of the cell. A first hindrance in analysing such data set is related to their inherent noise and highdimensionality quackenbush2007 ; much more variables are observed and the low number of samples makes the reconstruction task very challenging. A vast number of methods were developed to circumvent this difficulty with difference performance guarantees buhlmann2011 ; verzelen2012 ; giraud2014 . A second obstacle is concerned with the nonlinearity of the relationships oates2014 , sometimes immersed in intricate temporal responses shojaie2010 ; rau2010 ; marchand2014 . Another feature of modern biological data sets is that of missing observations, either due to technical fault or because all relevant variables cannot be monitored chandrasekaran2012 ; adequate techniques need be implemented to deal with this blanchet2009 ; colombo2012 ; fusi2012 ; sadeh2013 . We will explicitly note the difficulty of modelling feedback cycles with traditional (yet advanced) analysis methods mooij2011 .
Many review papers can now be found in bioinformatics literature on gene network reconstruction dejong2004 ; markowetz2007 ; lee2009 ; emmertstreib2012 , and also in the statistical literature maathuis2009 ; colombo2012 ; oates2012 ; fu2013 . Other papers generally compare their relative merits and pitfalls werhli2006 ; altay2010 ; emmertstreib2010 ; marbach2010 . In this vein, the Dialogue for Reverse Engineering Assessment and Methods (DREAM) project, over the last decade, has also ran several challenges related to network reconstruction marbach2012 ; meyer2014 ; hill2016 . Lessons were learned about the power of combining complementary reconstruction methods marbach2012 ; allouche2013 , and such methods showed great performance in reconstructing real networks from genuine data sets related to cancer bontempi2011 ; engelmann2015 , protein signaling sachs2005 ; ness2016 , or to fitnessrelated data gagneur2013 . When only observational data is available, bounds on causal effects can be retrieved (see the discussion on EoC or CoE below as well) and verified by means of intervention (knockout) experiments maathuis2010 ; taruttis2015 . In no instance are we pretending to be exhaustive, since this research area is vast and constantly expanding. Our focus is to test the ability of methods to decipher causal links from steadystate observational data. For example, we ignore singlecell data sachs2005 and gene expression timeseries modelling oates2014 ; michailidis2013 . Neither do we consider the addition/integration of biological knowledge or of complimentary data sets werhli2007 or a supervised framework mordelet2008 . Lastly, we do not cover purely interventional designs eberhardt2005 ; hauser2014 ; meinshausen2016 . In some sense, our work is related to the work of mooij2016 or that of athey2016 , but we explore beyond the causeeffect pair of variables or the treatment effect. Borrowing ideas developed in any of these paradigms would certainly be a winning strategy, if they revealed useful information in our transcript level steadystate context.
The reader interested in a classified list of gene network reconstruction is encouraged to read through Sections 3 and 4 of Chapter 1 of the present book. We note here that correlationbased methods do not lead to causal links between variables. An asymmetry (and hence directions on edges) can only be inferred by considering transcription factor genes as likely sources and other genes as likely targets (see also
chen2008 as a generalisation of agrawal2002 in that it identified transcription factors). In a Gaussian graphical model setting, an attempt to infer directions was made in opgenrhein2007using comparisons between (standardised) partial variances and the notions of relative endogeneous/exogeneous nodes. The rationale of this distinction is that nodes with a higher level of remaining variance are more likely to be source of a directed edge, while those nodes with lower partial variance are more likely to be targets. The simple counterexample in the Box below and in Figure
1 shows how directions can be created artificially. En route to causality, structural equation modelling (SEM, xiong2004 ) assumes that the expression of each gene can be ‘programmed’ as a linear combination of; every other gene expression (for nonzero path coefficients), all other factors, and of a noise term. The last method we mention here is a class of probabilistic graphical model, namely Bayesian Networks. We postpone their detailed presentation until Section 3. To the best of our knowledge, Bayesian Networks were first used for GRN reconstruction by friedman2000 , and their ability to orient some edges and not others is well established causality ; spirtes2005 ; pearl2009 .We consider here the graph in Figure 1 (a), which corresponds to a covariance matrix defined as
Now, assuming enough data is available, the maximum likelihood estimate of the concentration matrix, the inverse of the covariance matrix, which specifies direct relationship in a Gaussian graphical setting, is attained:
Causality
Causality is the intrinsic connection through which one event (the cause) triggers another one (the effect) under certain given conditions. Issues about causality and inductive learning are often ascribed to have originated with Hume’s work on deciphering human cognitive reasoning as a scientific empirical study hume1738 : that causation exists between our own experiments rather than between material facts; something can be learned from data. Major statisticians introduced such concepts in statistical data analysis at the beginning of the 20 century wright1921 ; neyman1923 ; fisher1925 . The modern era of causality started with considerations of randomized vs. non randomized experiments in rubin1974 working with observational data, a review paper linking counterfactual and philosophical considerations holland1986 , and pearl2009 aimed at unifying the different mainstream causal approaches. Science dived into seeking the Effects of Causes (EoC): what is the difference in outcome between the result of applying a treatment, and what would have happened if the treatment had not been applied. The latter of the two questions is referred to as a counterfactual statement. In this framework, the cause is (partly) responsible for the effect, whilst the effect is (partly) dependent on the cause. Note the addition of the word partly here to clearly stress the likely, but not absolutely certain, aspect of the causation mechanism: statistics are concerned with stochastic events, not deterministic laws. Moreover, one cause is not exclusive of other causes for one consequent effect  several causal factors can be the cause of the effect, and can have additive effects or differing interactions with each other. A causal sequence is generated and determined by natural mechanisms, generally several of them. These can be read from a temporal relationship. However, when nature does not explicitly reveal its mechanisms, learning about causality from data is extremely challenging. wainer2014 took a simple example so as to disentangle the research question; whether pupils achieving good results at school are happy because of their performance, or it is their intrinsic happiness which makes them good at school. In such cases, we need some sort of intervention achieved via randomisation of subjects to test hypotheses.
Historically, the language of counterfactual reasoning bottou2013 has been used to define causal relationships from conditional statements (”what if something was different in the past”): if event had have occurred, then would have also occurred. Although the use of the subjunctive mood suggests the antecedent is false, the theory encompasses true or false antecedents. Though not universal, e.g. in a decision theoretic framework dawid2000 , the counterfactual inference mechanism is natural to compute the EoC. causality essentially formalised the use of counterfactuals for statistical causation inference. The ” would have been , had been ” statement means that the relationship can be read as ”if we replace the equation determining by a constant value , then the solution of the equations for variable will be ”. (formally ). This allows the author to define the do
operator and state the property of the modification of the joint distribution under intervention (see Section
3.10). Among the tools used in causal inference, we mention Instrumental Variables tan2006 , Causal Trees athey2016 , Path Analysis wright1921 , Structural Equation Modeling bollen1989 ; tarka2017 , Gcomputation robins1987 , Bayesian DecisionMaking inspired approaches dawid2000 , more recent Kernel Methods lopezpaz2015 , and Probabilistic Causality suppes1970 ; eells1991 (Pearl argued that a confusion was first made with the statement that cause increases probability causality , whereas some assumptions need be checked to link the two notions). Applications of statistical causality range from Psychology buchsbaum2012 , Epidemiology greenland1999 , Social Research verkuyten2002 , Economics spirtes2005 , Project Management cardenas2017 , Marketing gupta2008 , Human kleinberg2011 and Veterinary martin2014 Medicine.We now integrate the concepts of causality into systems biology and gene networks. We will not address the relationships from marker to phenotype, although the kind of relationship can be similar, the causality necessarily originates from the genome wu2010 . frommlet2016 introduces highdimensional techniques for Quantitative Trait Locus (QTL) detection in recent data sets in which genegene interactions are accounted for. The perspective is more a pragmatical one for breeding research to improve traits of interest, rather than to dissect the molecular regulations. eQTL causal networks could be considered if we had genomic data in addition to transcriptomics rakitsch2016 .
Causality in networks
The concept of causality in gene network reconstruction is about distinguishing “direct” regulatory relationships brazhnik2002 between biological components of the network from “association”. Causal inference methods for gene network reconstruction are at the cornerstone of biological knowledge discovery, and may indeed be useful in solving biomedical problems. Randomised trials hu2014 in this context are highly impractical, very costly, and even unethical in most instances guyon2010 . Gene coexpression is arguably not enough djordjevic2014 , and most of the time methods rely on timeseries anjum2009 ; rau2010 ; shojaie2010 ; deng2012 ; krouk2013 ; dondelinger2013 , perturbation cai2013 ; krouk2013 (sometimes combined with observational data rau2013 ; monneret2017 ), or genetical genomics liu2008 ; allouche2013 ; tasaki2015 experiments. We will instead focus purely on observational (aka steadystate wildtype) data. As an example, an application of predicting the phenotype accounting for gene interaction and obtaining ranges of causal effects from a theoretical point of view can be found in maathuis2010 with the R pcalg companion package in kalisch2012 .
Our choice is to test Bayesian Networks (BNs) for their inherent ease of representation of a complex system, with an associated probabilistic distribution under some mild assumptions. gupta2008 proposed that Bayesian networks have a limited applicability in the reconstruction of causal relationships. Methods for BN inference are reviewed in koski2012 and in bnR . Arrows in these graphical networks are more like lines on the sketch of a construction plan  they encode conditional independence relationships, which by definition are symmetrical. Bayesian network arrows need not represent causal relationships, in fact, not even ’influencial’ relationships. It is for this reason that it is perfectly valid to use prognostic or diagnostic models; the former estimates the chances of developing an outcome, and the latter estimates the chances of having this outcome hendriksen2013 . This distinction is also that between Effect of Cause (EoC, e.g. ’ is in state . Will a change in have an effect on ?’) or Cause of Effect (CoE, e.g. ’ has changed its value. Is it because was altered?’)dawid2016 . With the exception of courthouse business, most causal questions  at least in science  are concerned with EoC via adequate designs of experiments sebastiani2011 . CoE questions are hindered by situations where appropriate confounding effects are difficult to disentangle, perhaps due to missing information (for example missing variables sadeh2013 ). In this case, typically, one would need strong prior knowledge, or to make strong assumptions to allow the access of bounds on “Probability of Cause” to be estimated. To return to similar problems in gene network reconstruction, the reader is directed to buhlmann2014 .
Causal interpretation of Bayesian Networks via node intervention
Directed Acyclic Graphs (DAGs) like the one in Figure 3 encode conditional independence relationships between the variables at the vertices. For instance, in this graph, and are independent, conditional on and . and however, are not independent nor conditionally independent on any other subset of nodes. In a pure graph theoretic approach, directed edges (parent/child relations), should not go through any interpretation beyond these independence relationships. Scorebased or independence learning algorithms seek graph structures and conditional probabilities which optimally fit the data. On the frequentist side, there is no reason so as to prefer a member of the equivalence class of a learned graph (see Section 3). On the Bayesian side, prior beliefs can steer our confidence towards one structure or another. Regardless, one is often tempted to say that a directed edge stands for some kind of causal dependence of on . PC asserted that one can retrieve causal relationships from observational data by resorting to considering directed edges of the essential graph. On the other hand, koski2012 claimed that taking the reasoning from immoralities to a causal interpretation is a fallacy. Unless there is clear knowledge in the system, causal statements are often acceptably left vague, and they can only be clarified with the vocabulary of interventions in DAGs pearl2009 . This extends the meaning of the independencies encoded in classical graphical models to an enriched set of distributions under interventions on nodes. An intervention is defined as an operator setting a variable to a specified distribution. The most classical intervention one can think of is to impose a Dirac distribution to one variable  thereby setting it to some fixed value. Other distributions can be ’set’ to any node in a DAG.
From these considerations, we aim to verify whether learning algorithms could infer simulated causal relationships, and in which settings. In other words: what can and cannot be learned from observational data in terms of causality in a ’typical’ biological system?
3 Bayesian Networks as a framework for DAG inference
Bayesian Networks are a now widely used and powerful tool for probabilistic modelling. In this section we introduce this kind of graphical models and discuss their important features.
3.1 Graphs and DAGs
A graph, , is a mathematical construct consisting of a pair, , of a set of nodes (also known as vertices), , and a set of edges, . Each edge itself consists of a pair of nodes representing the endpoints of that particular edge. For example, an edge from node to node would be denoted .
A DAG, or Directed Acyclic Graph, is a graph whose edges are directed; i.e. an edge goes from a to b, not both ways; and which contains no cycles, i.e. for any node , there exists no nontrivial path which both begins and ends at .
If we consider each node as representing a particular event, and each edge as a conditional dependence, we begin to see how a DAG may be used as a graphical model. For example, consider a DAG with node set {(F = You run out of fuel), (L = You’re late)}, and edge set {(F, L)}. We then have a twonode DAG with one edge going from F to L.
3.2 dSeparation
The importance and use of dseparation will become obvious later, for now we simply provide the definition. Judea Pearl, considered by some to be the forefather of modern causal ideologies, defines dseparation as follows (causality, , p.1617):
If , , and are disjoint subsets in a DAG , then is said to dseparate from … if along every path between a node in and a node in there is a node satisfying one of the following conditions;
 1.
has converging arrows, and none of or its descendants are in
 2.
does not have converging arrows and is in .
3.3 Probabilistic Independence
Two variables, and , are said to be probabilistically independent given a third variable , denoted , if
(1) 
Where
(2) 
3.4 Dmaps, imaps, and Perfect Maps
A DAG, , is an independency map, or imap
, of some probability distribution
if and only if dseparation of nodes in implies probabilistic independence of the corresponding variables in , i.e.(3) 
Conversely, a DAG, is a dependency map, or dmap, if and only if
(4) 
A DAG, , is a perfect map of if and only if it is both a dmap and an imap of .
3.5 The Markov Blanket
The Markov blanket of a node, , in a DAG, , denoted , is the minimal set of nodes conditioned on which becomes independent of the rest of the nodes in .
It can be shown (probReasoning, , p.120121) that if satisfies the Markov Condition each node of the DAG is conditionally independent of its nondescendants given its parent variables, consists of the parents, children, and children’s parents of . is illustrated in Figure 2.
3.6 Bayesian Network definition
A Bayesian Network (BN) is a minimal imap of some probability distribution ; that is, a directed acyclic graph, , whose nodewise graphical separations imply probabilistic independencies in , and where any further edge removal from will result in loss of its imapness (probReasoning, , p.119).
A Bayesian Network, , may be expressed as a pair consisting of a DAG, , and a set of parameters for the conditional probabilities of the nodes in , denoted .
Many extensions to the Bayesian Network exist; for example, Objectoriented Bayesian networks koller1997 can be used to form compound models for more complex situations when some elementary blocks are repeated and hierarchically assembled.
Why are they useful?
To begin, notice that in the definition for a BN, we only required to be the set of parameters for the conditional probabilities for each node, not the entire joint distribution. This in itself simplifies expression of the full system. We also have that a BN satisfies the Markov Condition, and as such the maximum set of nodes that a node needs to be conditioned on is simply its Markov Blanket. This can drastically reduce the number of values required to fully specify the model.
3.7 Bayesian Network Learning: An Overview
A sample from a Bayesian Network, , consists of a realisation of each node in . For example, consider the Bayesian Network in Figure 3 as a classic example.
Let node denote the event that it is summer, node denote the event that rain is falling, node that your garden sprinkler is active, and that the grass is wet. A sample from this network could look something like:
Many algorithms exist to reproduce either the structure or probability distribution associated with a BN from a set of such samples, and can be summarised as in the following sections.
3.8 Structure Learning
Learning the structure of a BN is a complex task. The number of possible acceptable edge configurations grows superexponentially as the number of nodes increases, for twonode networks there are 3 possible valid DAG structures. For fournodes there are structures, and for eight nodes there are structures. For a reasonable twentynode network there are possible valid edge configurations Stephen
. This makes exploring the entire search space of plausible (and large enough to be useful) BNs an impractical exercise. Following are overviews, and examples, of the three main heuristic approaches intended to deal with this problem.
ConstraintBased Algorithms
The first general approach is using constraintbased algorithms, which are usually based on Pearl’s IC (Inductive Causation) algorithm (causality, , p.50). With these, we start with a fully saturated, undirected graph and use conditional independence tests such as the loglikelihood or a modified Pearson’s test (bnR, , p.99) to remove edges onebyone. Another pass then adds direction to edges where adequate information is available (see Pearl’s algorithm (causality, , p.50)). Examples of constraint based algorithms include IC (causality, , p.50), Incremental Association Markov blanket (IAMB) IAMB , and PC (PC, , p.84).
ScoreBased Algorithms
The other common approach is one which considers and scores the network as a whole, rather than one edge at a time. Initialise any network of your choosing, then possible future networks are evaluated by providing them with a score; such as the Bayesian Dirichlet Equivalent uniform (BDE), or the Bayesian Information Criterion (BIC) score (bnR, , p.106).
Hybrid Algorithms
Perhaps most commonly used are hybrid algorithms, which, as the name would suggest, are a combination of the two previous approaches. For example, the Sparse Candidate algorithm SparseCandidate uses tests for independence to restrict the search space for the subsequent scorebased search. This general approach is known as the restrictmaximise heuristic, and is implemented in bnlearn as rsmax2.
3.9 Parameter Learning
Once the structure of a network is known, the parameters to fit can be estimated relatively easily, using either; a frequentist maximum likelihood approach, whereby you view the global distribution as a function of the model parameters and do some optimisation over the data; or by Bayesian methods (bnR, , §1.4), where you update a prior belief about the parameters in light of the data (bnR, , p.11). The first method, in the case of discrete Bayesian Networks, reduces to simply counting the occurrences of the data points relevant to each node and using the sample proportions to estimate the parameters. The Bayesian approach assigns a uniform prior over the entire conditional probability table, and then calculates the posterior distribution using the data present. Both methods are implemented in bnlearn in the bn.fit function.
Two Notes on Using Bayesian Methods
Firstly, the use of Bayesian Methods may prove valuable when there is expert information available, as the uniform prior mentioned above does not necessarily need to be uniform  it may just as well represent the belief of an expert in the relevant field. To use the Bayesian approach in parameter learning, method = ’bayes’ must be included in the call to bn.fit. This allows for specification of the prior distribution. The user has some control over the weighting of the prior distribution with respect to the data when calculating the posterior distribution by means of the iss (imaginary sample size) argument.
Secondly, a discussing remark must be made on the scalability of the Bayesian network approach. Continuous efforts are made to improve the network size which can be handled. Just over a decade ago, one could tackle a few hundred nodes in a sparse network brown2004 . In the case of discrete data, the algorithm complexity is exponentially linked to the number of classes to represent factor variable levels. When the data is continuous, several choices are possible citefu2005, from prior discretisation to direct modelling, e.g. relying on kernels. In conjunction to discretisation considerations, the maximum number of parents allowed per node constrains the capacity of the algorithm to run on a given data set vignes2011 ; qi2016 . Beyond computational smart decisions, theoretical studies characterise algorithm complexity mengshoel2010 ; decampos2011 since cooper1990 . This leads to the development of more efficient algorithms handa2003 ; malone2011 ; adabor2015 . Sometimes, developments rely on parallel computation nikolova2012 ; madsen2017 . Bayesian networks can be learnt with thousands of variables and few tens of thousands of edges using most current statistical packages. More specific applications can allow practitioners to perform computations with hundred of thousands of nodes, e.g. to perform variable selection in databases thibault2009 , but this is not yet the case nor a need for biological networks, where a few tens of thousands of nodes is amply enough, and the limiting factor is then rather sample sizes.
3.10 Intervention
Intervening on a network and setting the value of a node has a notably different effect than simply ‘observing’ it. Consider again the DAG in Figure 3, with semantics as defined in Section 3.7.
Were an external deity to intervene on the scenario and turn on the sprinkler, whether or not the sprinkler is on becomes independent of what season it is (as it has been forced on), and the DAG structure changes to that in Figure 4.
In this way, intervention renders a node independent of its parents, and the probability distribution underlying the DAG also gets altered. The modification (i) deletes the conditional probability of the nodes intervened on and (ii) sets their value when they are parents of other nodes in the relevant conditional probabilities. Notice that interventions do not necessarily fix the value of a node to a given value, but could impose a particular distribution. It should be fairly clear, then, that you cannot intervene on a set of samples, as they have already been drawn from a particular distribution which it is now too late to alter. For example, if we have a set of samples from the DAG in Figure 3, and we now want to know what would happen if we turned the sprinkler on (regardless of the season), we cannot do so without further experiment or more information about causality in the system.
Causal discovery in observational studies is mainly based on three axioms which bind the causality terminology to probabilistic distributions. The first one is the causal Markov condition. It states that once all direct causes of an event are known, the event is (conditionally) independent of its causal nondescendants. The second one is the faithfulness condition; it ensures that all dependencies observed in the data are structural, i.e. were generated by the structure of an underlying causal graph. These (in)dependencies are not the result of some cancelling combination of model parameters, an event which often has a zero probability measure with commonly used distributions PC . The last condition is termed causal sufficiency assumption; all the common causes of the measured variables are observed. When this is not the case, hidden variable methods can help stegle2010 .
3.11 Markov Equivalence
Consider a twonode network, , with nodes and , and one edge from to , i.e. . With a trivial application of Bayes’ Rule, we observe:
(5)  
(6)  
(7) 
Clearly, some visually nonidentical networks can encode the same information and dependency structures, and samples from them will be indistinguishable. Such networks are known as Markov Equivalent networks. For example he2013 used an MCMC computational approach to identify the equivalence class of a DAG. Calculating the values of and given and may be done as follows:
(8) 
(9) 
4 Experimental setup: quality of a causal network reconstructed from observational data
The aim of our experiment is to investigate the similarity between the BN used to generate samples, and the BN learned from the data, as we varied the number of samples used in the process. This similarity was measured using the Structural Intervention Distance (SID) SID . We provide the code for our experiments at https://github.com/alexW335/BNCausalExperiments/blob/master/SID.R.
4.1 Method
To be able to more easily investigate the problem at hand, we decided to use a small, 7 node BN as shown in Figure 5.
This particular network was chosen as it possesses many substructures of particular interest when investigating causality, for example the vstructure (or collider) in Figure 6(a), the (causal) chain as shown in Figure 6(b), and the fork (or common cause) seen in Figure 6(c).
The conditional probabilities associated with each node were chosen to show distinct separation in the data, e.g. should be sufficiently different to for ease of DAGedge recovery, i.e. we seek high model identifiability (related to faithfulness of the model). These conditional probabilities are available in the source code.
Data Generation
Samples were generated on demand with the R function rbn bnR from the package bnlearn.
Network Reconstruction
The two phase restrictmaximise heuristic was used to recover the DAG structure from the generated data. The semiparametric test was used to check for conditional independencies, the InterAssociative Markov Blanket (IAMB) algorithm IAMB was used in the restriction phase for locally minimising the search space, a straightforward hillclimbing algorithm was used for exploring the restricted space, and the Bayesian Information Criterion (BIC) score BIC was used as the score by which to compare the possible networks during the search.
Initial Notes on BN Comparison
Initially, two measures were considered for comparing the quality and similarity of the reconstructed Bayesian Network, , to the original Bayesian Network :

Graph Edit Distance The graph edit distance between and , where one edit is either an edge addition, deletion, or reversal. This compares the structural similarities of the DAGs, but fails to account for any differences in the underlying distribution.

KullbackLeibler (KL) Divergence KL Divergence from to may be used to compare the underlying probability distributions of the two BNs. As prior information about the structure of the original network is available in an experimental setting such as this, it is possible to condition the calculation of the KL Divergence on a particular node, say ; i.e. enumerate
(10) rather than
(11) Where means sum over all possible states of , e.g. {, }.
This would give an idea of how accurately the probability distribution was being reproduced, and can be used to give some insight into the causal structure underlying the BN in question.
However, a few issues arise from using these measures. The KL Divergence implementation proves particularly difficult to generalise, and the conditioning on some node requires prior knowledge that the specific node selected would be the best node to look for downflow effects.
Bayesian Network Comparison with SID
One metric was used to compare the two models  the Structural Intervention Distance SID
. This does not consider the probability distributions underlying the BNs in such a way that, for example, the KullbackLeibler Divergence does; it rather contains information about the difference between both structures under the same intervention. This allows some information about the causal structure to be inferred, such as the location of vstructures and forks (See Figure
6). In a nutshell, SID measures the number of interventions which lead to different graphical structures between two DAGs. The smaller the SID, the closer the two DAG structures.4.2 Results
SevenNode Network
As expected, the average SID appears to decrease as the number of samples used to generate the network increased, as per Figure 7.
The data shown in Figure 7 can be well modelled using a Generalised Additive Model (GAM), though the usefulness of such a model is debatable. It is assumed that the SID should asymptotically approach zero (or is at least nonnegative), so any model fitted should also exhibit this behaviour to prove itself a useful predictive tool. This aside, the diagnostic plots for the GAM may be found in Figure 8 to convince the reader that the model is at least naively fitting the observed data well.
Another factor of interest is how the confidence in the edges (both true edges, and those which should not appear) grows with the number of samples used to generate the network. This is plotted in Figure 9 where the confidence at each level of sample count is expressed as the proportion of the thousand BNs generated in which each edge appears. As one would expect, the 7 true edges tend to increase in confidence as a function of the number of samples, and the remaining thirty five edges which are not present in the true BN appear to fluctuate randomly close to zero.
The same data used to produce Figure 9 can be used to construct the average adjacency matrix, and thus the average DAG, produced at each sample count. The average DAG for 100 samples and 2500 samples are shown in Figures 10(a) and 10(b) respectively.
The adjacency matrices for the average DAGs provide more information than the simple graphical view, however they can be a little harder to decode as is evident in Figure 11.
FourNode Network
As well as the illustrative sevennode network, the experiments were run on the Bayesian Network whose topology is shown in Figure 3, with conditional probabilities as represented in Figure 12. Note that this network is unrelated to the similarly structured weather modelling Sectionnetwork of Section 3.7.
This yielded a curve of SID by Samples as shown in Figure 13, displaying markedly more variability than the sevennode case.
This variability is echoed in the edge confidence graph as shown in Figure 14, where we can see that an incorrect edge was predicted with higher frequency than one of the actual edges for low to medium sample counts.
An interesting observation is that the collider
is not picked up easily. If incorrectly constructed, any alterations to the collider would have relatively high effects on the SID, as intervention on the central node has a large impact on the interventional distribution. This could explains the variability seen in Figure 13.
A Real Signalling and Regulatory Gene Network
As an illustration, we applied the presented methodology to a real biological network. We chose the Sachs et al. 2005 sachs2005 11node network. We keep the same nomenclature for genes as the one used in the initial paper, except for Plc, which is denoted Plcg here. We only used the samples without intervention. The network which we considered as the reference network for SID comparison is depicted in Figure 17. This network is ideal for our purpose as it has a large number of available samples, and does not include cycles. In addition, it has intervention data sets which could be used to refine the model. Results are presented in Figures 15, 16 and 17.
The SID performance evolution in Figure 15 can seem a bit disappointing. This seemingly poor performance stems from the fact that as many edges are missing from the reconstructed network, the resulting graph finally reaches three disconnected components (PIP2PIP3Plcg, RafMek and PKCPKAJnkP38AktErk). These disconnected components leave many pairs for which the interventional distribution is not adequately estimated, and hence increase SID SID . We see that adding the first edges even increases the computed SID. Forming a more complete sketch of the reference network helps to decrease the SID with increasing sample size. Notice that the empty network (predicted with no data) and the final 8node network have the same SID measure, , although the predicted network is much more satisfying as it comprises 8 edges: five true positives and 3 reversed edges.
Figure 16 presents the edge confidence as a function of sample size. All true positive edges are identified very quickly (with less than 50 samples), except one (PIP3Plcg) which requires more samples (more than 100) to stand out, becoming stable with more than samples. This may be because it competes with another reversed edge (the green lines in Figure 16). Furthermore, it is not clear whether this edge is indeed in this direction or reversed (see discussion in sachs2005 ). Another predicted edge assigned the wrong direction is also competing with the correct edge as per the reference network (PIP3PIP2), and this was reported as bidirectional in the literature study of sachs2005 . Note that the local PIP2PIP3PlcgPIP2 loop does not contradict the DAG assumption, as this representation is that of the average graph created at that number of samples. The true positive edge with weak confidence is not part of the finally inferred network. When it is in the model, PIP2PIP3 is either not present, or is reversed (hence in the correct direction, as per discussion above). The confidence of missed edges do not stand out from the background noise, i.e. the nonrelevant edges or true negatives. Either the local encoded dependency is already captured by the edges, or it requires interventional data to precisely identify them.
Five edges were correctly retrieved, and three additional edges were retrieved in the opposite direction compared to our reference network. This leads to a directed precision of (undirected precision ) and a recall of (undirected recall ). To be noted is that no false positive edge is predicted. No interventional data was used, while sachs2005 used all the 5400 available samples with several subsets (6) of interventions.
5 Conclusion / discussion
In this chapter, we discussed networks as a powerful approach to modelling complex living systems. We focused on gene regulatory networks, and their reconstruction from observational (aka wildtype) gene transcript data. We reviewed some methods dealing with the concept of causality, and we specifically focused on Bayesian Networks. We introduced BNs in detail, and tested their capacity to retrieve causal links on two toy network examples. Unexpectedly, the two experiments lead to different conclusions. While the learning inference went quite well from a causal relationship perspective for the sevennode network, it showed mixed performance in the fournode network. We were not able to see any trend in classification of directed edges which are correctly predicted versus those which are not correctly predicted. In the fournode network, edges originating from the source node were detected with small sample size without any ambiguity, while it would not have been surprising if the learning algorithm had inverted one and not the other (to avoid creating a new Vstructure). In addition, the converging Vstructure to the sink node was much more difficult to retrieve despite relatively large sample sizes (up to 5,000 samples) and contrasting simulated distributions. The conclusion is quite different in our sevennode network, with all edges but one retrieved without ambiguity with more than 500 samples. Again, the most difficult directed edge to retrieve was one involved in a Vstructure, this time from a source node (). It also required fewer samples than the fournode network, while much more structures are possible with 7 nodes. At the same time, this sevennode network contains more conditional independence than the fournode network. Another surprising result is that edges which could have been inferred in one direction or the other ( could be inverted, or the directed path could be inverted; not both at the same time) were assigned a direction without ambiguity. The most surprising conclusion so far in our experiments is that increasing the sample size seemed to always improve the identification of edge, including its direction, whether it is imposed in the equivalence graph or not. This could be due to not studying enough graphical configurations. This conclusion must also be qualified in that obtaining a few hundred samples for a small 10node example is not very realistic from an experimental point of view.
Future work
Obviously, we only showed very limited experiments in very special cases. For our conclusions to be valid, we would need to extend our experimental setup to (i) a bigger variety of conditional distributions, (ii) more local network motifs e.g. feedforward loops, (iii) question the binary variable framework, and (iv) test other learning inference methods. It would be interesting to test point (i) by checking some well chosen edges for their prediction confidence as a function of the conditional probabilities, with a similar representation to that in Figure
9, but a variation on the conditional distribution(s) on the xaxis instead of the sample size. However, varying a conditional distribution with one parameter in one dimension is technically challenging and necessarily not perfect. For point (ii), a next step would be to build a dictionary gathering the typology of local structures which can be encountered in networks and how well a BN learning algorithm could perform at retrieved directed edges on those structures. For point (iv), we and other colleagues geurts2013noted for example that random forests seem to give relevant directions to edges when used to infer GRN (see
huynhthu2010 ).When performing tests to be presented in this chapter, we also used the graph edit and the KL divergence as potential measure of closeness between the true network and the predicted network. It is known that the graph edit distance is not ideal as two graphs with only one edge difference can lead to quite different causal conclusions and/or independence relationships. The KL divergence proved to be useful to compare probability distributions, yet prohibitively difficult to generalise. Combining KLDivergence with the SID metric could provide an alternative representation of the ability of the learning algorithm to retrieve the distribution  if not the causal relationships. An easily generalised metric combining both, or some variation of the two could be of particular interest.
Acknowledgement
Alex White was partly supported by a Summer Scholarship grant from the Institute of Fundamental Sciences at Massey University (NZ). This work was also supported by a visiting professor scholarship from AixMarseille University granted to Matthieu Vignes in 2017. The material in this chapter slowly hatched after many discussions with a multitude of quantitative field colleagues and biologists during some works we were involved in, e.g. vignes2011 ; vandel2012 ; marchand2014 , but most of our inspiration stems from other works we read about and discussed (e.g. chiquet2009 ; huynhthu2010 ; rau2013 ; gagneur2013 ; vallat2013 to cite only a few). We are very appreciative of the many discussions with Stephen Marsland (Victoria University of Wellington, NZ). Lastly, we are very grateful to reviewers of this chapter for their insightful comments.
Bibliography
 (1) Emmanuel S. Adabor, George K. AcquaahMensah, and Francis T. Oduro. Saga: A hybrid search algorithm for bayesian network structure learning of transcriptional regulatory networks. Journal of Biomedical Informatics, 53:27 – 35, 2015.
 (2) Himanshu Agrawal. Extreme selforganization in networks constructed from gene expression data. Physical Review Letters, 89:268702, 2002.
 (3) David Allouche, Christine CiercoAyrolles, Simon de Givry, Gérald Guillermin, Brigitte Mangin, Thomas Schiex, Jimmy Vandel, and Matthieu Vignes. A Panel of Learning Methods for the Reconstruction of Gene Regulatory Networks in a Systems Genetics Context, pages 9–31. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
 (4) Gökmen Altay and Frank EmmertStreib. Revealing differences in gene network inference algorithms on the network level by ensemble methods. Bioinformatics, 26(14):1738–1744, 2010.
 (5) Shahzia Anjum, Arnaud Doucet, and Chris C. Holmes. A boosting approach to structure learning of graphs with and without prior knowledge. Bioinformatics, 25(22):2929–2936, 2009.
 (6) Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences of the United States of America, 113(27):7353–7360, 2016.
 (7) AlbertLászló Barabási and Zoltán N. Oltvai. Network biology: understanding the cell’s functional organization. Nature Reviews Genetics, 5(2):101–113, 2004.
 (8) Juliette Blanchet and Matthieu Vignes. A modelbased approach to gene clustering with missing observation reconstruction in a markov random field framework. Journal of Computational Biology, 16(3):475–486, 2009.
 (9) Kenneth A. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, NewYork, USA, 1989.
 (10) Gianluca Bontempi, Benjamin HaibeKains, Christine Desmedt, Christos Sotiriou, and John Quackenbush. Multipleinput multipleoutput causal strategies for gene selection. BMC Bioinformatics, 12(1):458, 2011.

(11)
Léon Bottou, Jonas Peters, Joaquin Quiñonero Candela, Denis X. Charles,
D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and
Ed Snelson.
Counterfactual reasoning and learning systems: The example of
computational advertising.
Journal of Machine Learning Research
, 14:3207–3260, 2013.  (12) Paul Brazhnik, Alberto de la Fuente, and Pedro Mendes. Gene networks: how to put the function in genomics. Trends in Biotechnology, 20(11):467–472, 2002.
 (13) Laura E. Brown, Ionnis Tsamardinos, and Constantin F. Aliferis. A novel algorithm for scalable and accurate bayesian network learning. In Proceedings of 11th World Congress in Medical Informatics (MEDINFO ’04), volume 107, pages 711–715, 2004.
 (14) Daphna Buchsbaum, Sophie Bridgers, Deena Skolnick Weisberg, and Alison Gopnik. The power of possibility: causal learning, counterfactual reasoning, and pretend play. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1599):2202––2212, 2012.
 (15) Peter Bühlmann, Markus Kalisch, and Lukas Meier. Highdimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014.

(16)
Peter Bühlmann and Sara van de Geer.
Statistics for HighDimensional Data – Methods, Theory and Applications
. SpringerVerlag, Berlin, Heidelberg, 2011.  (17) Xiaodong Cai, Juan Andrés Bazerque, and Georgios B. Giannakis. Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLOS Computational Biology, 9(5):1–13, 2013.

(18)
Ibsen Chivata Cardenas, Hans Voordijk, and Geert Dewulf.
Beyond theory: Towards a probabilistic causation model to support project governance in infrastructure projects.
International Journal of Project Management, 35(3):432–450, 2017.  (19) Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky. Latent variable graphical model selection via convex optimization. Ann. Statist., 40(4):1935–1967, 08 2012.
 (20) Guanrao Chen, Peter Larsen, Eyad Almasri, and Yang Dai. Rankbased edge reconstruction for scalefree genetic regulatory networks. BMC Bioinformatics, 9(1):75, 2008.
 (21) Julien Chiquet, Aleander Smith, Gilles Grasseau, Catherine Matias, and Christophe Ambroise. SIMoNe: Statistical Inference for MOdular NEtworks. Bioinformatics, 25(3):417–418, 2009.
 (22) Diego Colombo, Marloes H. Maathuis, Markus Kalisch, and Thomas S. Richardson. Learning highdimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1):294–321, 2012.
 (23) Gregory F. Cooper. The computational complexity of probabilistic inference using bayesian belief networks. Artificial Intelligence, 42(2):393–405, 1990.
 (24) A. Philip Dawid. Causal inference without counterfactuals. Journal of the American Statistical Association, 95(450):407–424, 2000.
 (25) A. Philip Dawid, Monica Musio, and Stephen E. Fienberg. From statistical evidence to evidence of causality. Bayesian Analysis, 11(3):725–752, 2016.
 (26) Cassio P. De Campos. New complexity results for map in bayesian networks. In Proceedings of the TwentySecond International Joint Conference on Artificial Intelligence  Volume Volume Three, IJCAI’11, pages 2100–2106. AAAI Press, 2011.
 (27) Hidde de Jong. Modeling and simulation of genetic regulatory systems: A literature review. Journal of Computational Biology, 9(1):67–103, 2004.
 (28) Mo Deng, Amin Emad, and Olgica Milenkovic. Causal compressive sensing for gene network inference, pages 696–699. IEEE, 2012.
 (29) Djordje Djordjevic, Andrian Yang, Armella Zadoorian, Kevin Rungrugeecharoen, and Joshua W.K. Ho. How difficult is inference of mammalian causal gene regulatory networks? PLOS ONE, 9(11):1–10, 2014.
 (30) Frank Dondelinger, Sophie Lèbre, and Dirk Husmeier. Nonhomogeneous dynamic bayesian networks with bayesian regularization for inferring gene regulatory networks with gradually timevarying structure. Machine Learning, 90(2):191–230, 2013.
 (31) PierreYves Dupont, Carla J. Eaton, Jason J. Wargent, Susanne Fechtner, Peter Solomon, Jan Schmid, Robert C. Day, Barry Scott, and Murray P. Cox. Fungal endophyte infection of ryegrass reprograms host metabolism and alters development. New Phytologist, 208(4):1227–1240, 2015.
 (32) Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the TwentyFirst Conference on Uncertainty in Artificial Intelligence, UAI’05, pages 178–184, Arlington, Virginia, United States, 2005. AUAI Press.
 (33) Jeremy S. Edwards and Bernard O. Palsson. Systems properties of the haemophilus influenzae rd metabolic genotype. The Journal of Biological Chemistry, 274(25):17410–17416, 1999.
 (34) Ellery Eells. Probabilistic Causality. Cambridge University Press, Cambridge, MA, USA, 1970.
 (35) Frank EmmertStreib, Galina Glazko, Altay Göokmen, and Ricardo De Matos Simoes. Statistical inference and reverse engineering of gene regulatory networks from observational expression data. Frontiers in Genetics, 3:8, 2012.
 (36) Julia C. Engelmann, Thomas Amann, Birgitta OttRötzer, Margit Nützel, Yvonne Reinders, Jörg Reinders, Wolfgang E. Thasler, Theresa Kristl, Andreas Teufel, Christian G. Huber, Peter J. Oefner, Rainer Spang, and Claus Hellerbrand. Causal modeling of cancerstromal communication identifies pappa as a novel stromasecreted factor activating nfκb signaling in hepatocellular carcinoma. PLoS Computational Biology, 11(5):1–22, 2015.
 (37) Franck F. EmmertStreib and Gökmen Altay. Local networkbased measures to assess the inferability of different regulatory networks. IET Systems Biology, 4:277–288, 2010.
 (38) Ronald A. Fisher. Statistical methods for research workers. Oliver & Boyd, Edinburgh, Scotland, 1925.
 (39) Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe’er. Using bayesian networks to analyze expression data. Journal of Computational Biology, 7(34):601–620, 2000.
 (40) Nir Friedman, Kevin Murphy, and Stuart Russell. Learning the structure of dynamic probabilistic networks. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI’98, pages 139–147, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
 (41) Nir Friedman, Iftach Nachman, and Dana Peér. Learning bayesian network structure from massive datasets: the sparse candidate algorithm. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 206–215. Morgan Kaufmann Publishers Inc., 1999.
 (42) Florian Frommlet, Malgorzata Bogdan, and David Ramsey. Phenotypes and Genotypes. Springer, 2016.
 (43) Fei Fu and Qing Zhou. Learning sparse causal gaussian networks with experimental intervention: Regularization and coordinate descent. Journal of the American Statistical Association, 108(501):288–300, 2013.
 (44) Nicoló Fusi, Oliver Stegle, and Neil D. Lawrence. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLOS Computational Biology, 8(1):1–9, 2012.
 (45) Julien Gagneur, Oliver Stegle, Chenchen Zhu, Petra Jakob, Manu M Tekkedil, Raeka S Aiyar, AnnKathrin Schuon, Dana Pe’er, and Lars M Steinmetz. Genotypeenvironment interactions reveal causal pathways that mediate genetic effects on phenotype. PLoS genetics, 9(9):e1003803, 2013.
 (46) Niklas Markus Gericke and Mariana Hagberg. Definition of historical models of gene function and their relation to students’ understanding of genetics. Science & Education, 16(7):849–881, 2007.
 (47) Pierre Geurts and Vân Anh HuynhThu. personal communication, 2013.
 (48) Zoubin Ghahramani. Learning dynamic bayesian networks. In Adaptive Processing of Sequences and Data Structures, Lecture Notes in Computer Sciences, pages 168–197. SpringerVerlag, 1998.
 (49) Christophe Giraud. Introduction to HighDimensional Statistics. Chapman & Hall/CRC, New York, 2014.
 (50) Sander Greenland, Judea Pearl, and James M. Robins. Causal diagrams for epidemiologic research. Epidemiology, 10(1):37–45, 1999.
 (51) Sumeet Gupta and Hee W. Kim. Linking structural equation modeling to bayesian networks: Decision support for customer retention in virtual communities. European Journal of Operational Research, 190(3):818 – 833, 2008.

(52)
H. Handa and O. Katai.
Estimation of bayesian network algorithm with ga searching for better
network structure.
In
IProceedings of the 2003 nternational Conference on Neural Networks and Signal Processing
, volume 1, pages 436–439, 2003.  (53) F. Harary, R.Z. Norman, and D. Cartwright. Structural models: an introduction to the theory of directed graphs. Wiley, 1965.

(54)
Alain Hauser and Peter Bühlmann.
Two optimal strategies for active learning of causal models from interventional data.
International Journal of Approximate Reasoning, 55(4):926 – 939, 2014. Special issue on the sixth European Workshop on Probabilistic Graphical Models.  (55) Yangbo He, Jinzhu Jia, and Bin Yu. Reversible mcmc on markov equivalence classes of sparse directed acyclic graphs. The Annals of Statistics, 41(4):1742–1779, 2013.
 (56) J. M. T. Hendriksen, G. J. Geersing, K. G. M. Moons, and J. A. H. de Groot. Diagnostic and prognostic prediction models. Journal of Thrombosis and Haemostasis, 11:129–141, 2013.
 (57) Ute Hentschel, Michael Steinert, and Jörg Hacker. Common molecular mechanisms of symbiosis and pathogenesis. Trends in Microbiology, 8(5):226 – 231, 2000.
 (58) Steven M. Hill, Laura M Heiser, […], and Sach Mukherjee. Inferring causal molecular networks: empirical assessment through a communitybased effort. Nature Methods, 13(4):310–318, 2016.
 (59) Paul W. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
 (60) Huining Hu, Zhentao Li, and Adrian R Vetta. Randomized experimental design for causal graph discovery. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2339–2347. Curran Associates, Inc., 2014.
 (61) David Hume. A Treatise of Human Nature. John Noon, London, UK, 17381740.
 (62) Dirk Husmeier. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic bayesian networks. Bioinformatics, 19(17):2271–2282, 2003.
 (63) Vân Anh HuynhThu, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. Inferring regulatory networks from expression data using treebased methods. PLOS ONE, 5:1–10, 2010.
 (64) Isabelle Isabelle Guyon, Dominik Janzing, and Bernhard Schölkopf. Causality: Objectives and assessment. In Isabelle Guyon, Dominik Janzing, and Bernhard Schölkopf, editors, Proceedings of Workshop on Causality: Objectives and Assessment at NIPS 2008, volume 6 of Proceedings of Machine Learning Research, pages 1–42, Whistler, Canada, 2010.
 (65) François Jacob and Jacques Monod. Genetic regulatory mechanisms in the synthesis of proteins. Journal of Molecular Biology, 3(3):318 – 356, 1961.
 (66) Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, and Peter Bühlmann. Causal inference using graphical models with the r package pcalg. Journal of Statistical Software, 47:11, 2012.
 (67) Hiroaiki Kitano. Systems biology: A brief overview. Science, 295(2):1662–1664, 2002.
 (68) Samantha Kleinberg and George Hripcsak. A review of causal inference for biomedical informatics. Journal of Biomedical Informatics, 44(6):1102 – 1112, 2011.
 (69) Daphne Koller and Avi Pfeffer. Objectoriented bayesian networks. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, UAI’97, pages 302–313, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
 (70) Timo J.T. Koski and John M. Noble. A review of bayesian networks and structure learning. Mathematica Applicanda, 40(1):53–103, 2012.
 (71) Gabriel Krouk, Jesse Lingeman, Amy Marshall Colon, Gloria Coruzzi, and Shasha Denis. Gene regulatory networks in plants: learning causality from time and perturbation. Genome Biology, 14(6):123, 2013.

(72)
Gustavo Lacerda, Peter Spirtes, Joseph Ramsey, and Patrik Hoyer.
Discovering cyclic causal models by independent components analysis.
In Proceedings of the TwentyFourth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI08), pages 366–374, Corvallis, Oregon, 2008. AUAI Pres.  (73) WeiPo Lee and WenShyong Tzou. Computational methods for discovering gene networks from expression data. Briefings in Bioinformatics, 10(4):408–423, 2009.
 (74) Bing Liu, Alberto de la Fuente, and Ina Hoeschele. Gene network inference via structural equation modeling in genetical genomics experiments. Genetics, 178(3):1763–1776, 2008.
 (75) David LopezPaz, Krikamol Muandet, Bernhard Schölkopf, and Iliya Tolstikhin. Towards a learning theory of causeeffect inference. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1452–1461, Lille, France, 2015.
 (76) Marloes H. Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. Predicting causal effects in largescale systems from observational data. Nature Methods, 7:47–48, 2012.
 (77) Marloes H. Maathuis, Markus Kalisch, and Peter Bühlmann. Estimating highdimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
 (78) Anders L. Madsen, Frank Jensen, Antonio Salmerón, Helge Langseth, and Thomas D. Nielsen. A parallel algorithm for bayesian network structure learning from large data sets. KnowledgeBased Systems, 117:46–55, 2017.
 (79) Brandon Malone, Changhe Yuan, Eric A. Hansen, and Susan Bridges. Improving the scalability of optimal bayesian network learning with externalmemory frontier breadthfirst branch and bound search. In Proceedings of the TwentySeventh Conference on Uncertainty in Artificial Intelligence, UAI’11, pages 479–488, Arlington, Virginia, United States, 2011. AUAI Press.
 (80) Shmoolik Mangan and Uri Alon. Structure and function of the feedforward loop network motif. Proceedings of the National Academy of Sciences, 100(21):11980–11985, 2003.
 (81) Daniel Marbach, James C Costello, Robert Küffner, Nicole M. Vega, Robert J. Prill, Diogo M. Camacho, Kyle R Allison, The DREAM5 Consortium, Manolis Kellis, James J. Collins, and Gustavo Stolovitzky. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796–804, 2014.
 (82) Daniel Marbach, Robert J. Prill, Thomas Schaffter, Claudio Mattiussi, Dario Floreano, and Gustavo Stolovitzky. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14):6286–6291, 2010.
 (83) Gwenaëlle Marchand, Vân Anh HuynhThu, Nolan C. Kane, Sandrine Arribat, Didier Varès, David Rengel, Sandrine Balzergue, Loren H. Rieseberg, Patrick Vincourt, Pierre Geurts, Matthieu Vignes, and Nicolas B. Langlade. Bridging physiological and evolutionary timescales in a gene regulatory network. New Phytologist, 203(2):685–696, 2014.
 (84) Florian Markowetz and Rainer Spang. Inferring cellular networks – a review. BMC Bioinformatics, 8(6):S5, 2007.

(85)
Stephen Marsland.
Machine learning : an algorithmic perspective.
Chapman & HallCRC machine learning & pattern recognition series. Boca Raton : CRC Press, 2nd edition, 2015.
 (86) Wayne Martin. Making valid causal inferences from observational data. Preventive Veterinary Medicine, 113(3):281–297, 2014. Special Issue: Schwabe Symposium 2012.
 (87) Richard McElreath. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC, Boca Raton, FL, USA, 2015.
 (88) Nicolai Meinshausen, Alain Hauser, Joris M. Mooij, Jonas Peters, Philip Versteeg, and Peter Bühlmann. Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences, 113(27):7361–7368, 2016.
 (89) Ole J. Mengshoel. Understanding the scalability of bayesian network inference using clique tree growth curves. Artificial Intelligence, 174(12):984–1006, 2010.
 (90) Pablo Meyer, Thomas Cokelaer, Deepak Chandran, Kyung Hyuk Kim, PoRu Loh, George Tucker, Mark Lipson, Bonnie Berger, Clemens Kreutz, Andreas Raue, Bernhard Steiert, Jens Timmer, Erhan Bilal, Herbert M. Sauro, Gustavo Stolovitzky, and Julio SaezRodriguez. Network topology and parameter estimation: from experimental design methods to gene regulatory network kinetics using a community based approach. BMC Systems Biology, 8(1):13, 2014.
 (91) George Michailidis and Florence d’Alché Buc. Autoregressive models for gene regulatory network inference: Sparsity, stability and causality issues. Mathematical Biosciences, 246(2):326–334, 2013.
 (92) Gilles Monneret, Florence Jaffrézic, Andrea Rau, Tatiana Zerjal, and Grégory Nuel. Identification of marginal causal relationships in gene networks from observational and interventional expression data. PLOS ONE, 12(3):1–13, 2017.
 (93) Joris M. Mooij, Dominik Janzing, Tom Heskes, and Bernhard Schölkopf. On causal discovery with cyclic additive noise models. In J. ShaweTaylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 639–647. Curran Associates Inc., 2011.
 (94) Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.
 (95) Fantine Mordelet and JeanPhilippe Vert. Sirene: supervised inference of regulatory networks. Bioinformatics, 24(16):i76–i82, 2008.
 (96) Robert O. Ness, Karen Sachs, and Olga Vitek. From correlation to causality: Statistical approaches to learning regulatory relationships in largescale biomolecular investigations. Journal of Proteome Research, 15(3):683–690, 2016.

(97)
Jerzy Neyman.
On the application of probability theory to agricultural experiments. essay on principles. section 9 (translated and edited by d. m. dabrowska and t. p. speed from the polish original, which appeared in roczniki nauk rolniczych tom x (1923) 151 (annals of agricultural science).
Statistical Science, 5(4):465–472, 1990.  (98) O. Nikolova and S. Aluru. Parallel bayesian network structure learning with application to gene networks. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–9, 2012.
 (99) Denis Noble. The Music of Life: Biology beyond genes. Oxford University Press, 2006.
 (100) Chris J. Oates, Frank Dondelinger, Nora Bayani, James Korkola, Joe W. Gray, and Sach Mukherjee. Causal network inference using biochemical kinetics. Bioinformatics, 30(17):i468–i474, 2014.
 (101) Chris. J. Oates and Sach Mukherjee. Network inference and biological dynamics. The Annals of Applied Statistics, 6(3):1209–1235, 2012.
 (102) Rainer OpgenRhein and Korbinian Strimmer. From correlation to causation networks: a simple approximate learning algorithm and its application to highdimensional plant gene expression data. BMC Systems Biology, 1(1):37, 2007.
 (103) Judea Pearl. Probabilistic reasoning in intelligent systems : networks of plausible inference. The Morgan Kaufmann series in representation and reasoning. San Francisco, Calif. : Morgan Kaufmann Publishers, 1988.
 (104) Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3(1):96–146, 2009.
 (105) Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009.
 (106) Elizabeth Pennisi. Dna study forces rethink of what it means to be a gene. Science, 316(5831):1556–1557, 2007.
 (107) Jonas Peters and Peter Bühlmann. Structural intervention distance for evaluating causal graphs. Neural Computation, 27(3):771 – 779, 2015.
 (108) Xiaolong Qi, Yinhuan Shi, Hao Wang, and Yang Gao. Grouping parallel bayesian network structure learning algorithm based on variable ordering. In Hujun Yin, Yang Gao, Bin Li, Daoqiang Zhang, Ming Yang, Yun Li, Frank Klawonn, and Antonio J. TallónBallesteros, editors, Intelligent Data Engineering and Automated Learning – IDEAL 2016, pages 405–415, Cham, 2016. Springer International Publishing.
 (109) John Quackenbush. Extracting biology from highdimensional biological data. Journal of Experimental Biology, 210(9):1507–1517, 2007.
 (110) Barbara Rakitsch and Oliver Stegle. Modelling local gene networks increases power to detect transacting genetic effects on gene expression. Genome Biology, 17(1):33, 2016.
 (111) Andrea Rau, Florence Jaffrézic, JeanLouis Foulley, and Rebecca W. Doerge. An empirical bayesian method for estimating biological networks from temporal microarray data. Statistical Applications in Genetics and Molecular Biology, 9:1, 2010.
 (112) Andrea Rau, Florence Jaffrézic, and Grégory Nuel. Joint estimation of causal effects from observational and intervention gene expression data. BMC Systems Biology, 7(1):111, 2013.
 (113) Andrea Rau, Cathy MaugisRabusseau, MarieLaure MartinMagniette, and Gilles Celeux. Coexpression analysis of highthroughput transcriptome sequencing data with poisson mixture models. Bioinformatics, 31(9):1420–1427, 2015.
 (114) James M. Robins. A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. Journal of chronic diseases, 40(Suppl 2):139S–161S, 1987.
 (115) Don Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701, 1974.
 (116) Karen Sachs, Omar Perez, Dana Pe’er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal proteinsignaling networks derived from multiparameter singlecell data. Science, 308(5721):523–529, 2005.
 (117) Mohammad J. Sadeh, Giusi Moffa, and Rainer Spang. Considering unknown unknowns: Reconstruction of nonconfoundable causal relations in biological networks. Bayesian Anal., 11(20):920–932, 2013.
 (118) Gideon Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 03 1978.
 (119) Marco Scutari and JeanBaptiste Denis. Bayesian networks : with examples in R. Texts in statistical science. Boca Raton; London; New York: CRC Press: Taylor & Francis Group, 2014.
 (120) Marco Scutari, Phil Howell, David J. Balding, and Ian Mackay. Multiple quantitative trait analysis using bayesian networks. Genetics, 198(1):129–137, 2014.
 (121) Mario Scutari. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3):1–22, 2010.
 (122) Paola Sebastiani, Jacqui Milton, and Ling Wang. Designing Microarray Experiments, pages 271–290. Springer, Boston, MA, USA, 2011.
 (123) Ali Shojaie, Alexandra Jauhiainen, Michael Kallitsis, and George Michailidis. Inferring regulatory networks by combining perturbation screens and steady state gene expression profiles. PLOS ONE, 9:1–16, 02 2014.
 (124) Ali Shojaie and George Michailidis. Discovering graphical granger causality using the truncating lasso penalty. Bioinformatics, 26(18):i517–i523, 2010.
 (125) Peter Spirtes. Graphical models, causal inference, and econometric models. Journal of Economic Methodology, 12(1):3–34, 2005.
 (126) Peter Spirtes, Clark N. Glymour, and Richard Scheines. Causation, Prediction, and Search, volume 2nd ed. Peter Spirtes, Clark Glymour, and Richard Scheines ; with additional material by David Hecke of Adaptive Computation and Machine Learning. A Bradford Book, 2000.
 (127) Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M Mooij, and Bernhard Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In J. D. Lafferty, C. K. I. Williams, J. ShaweTaylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1687–1695. Curran Associates, Inc., 2010.
 (128) Patrick Suppes. A Probabilistic Theory of Causality. NorthHolland Publishing, Amsterdam, NL, 1970.
 (129) Zhiqiang Tan. Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association, 101(476):1607–1618, 2006.
 (130) Piotr Tarka. An overview of structural equation modeling: its beginnings, historical development, usefulness and controversies in the social sciences. Quality & Quantity, 2017.
 (131) Franziska Taruttis, Rainer Spang, and Julia C. Engelmann. A statistical approach to virtual cellular experiments: improved causal discovery using accumulation ida (aida). Bioinformatics, 31(23):3807–3814, 2015.
 (132) Shinya Tasaki, Ben Sauerwine, Bruce Hoff, Hiroyoshi Toyoshiba, Chris Gaiteri, and Elias Chaibub Neto. Bayesian network reconstruction using systems genetics data: Comparison of MCMC methods. Genetics, 199(4):973–989, 2015.
 (133) Arthur Tenenhaus, Vincent Guillemot, Xavier Gidrol, and Vincent Frouin. Gene association networks from microarray data using a regularized estimation of partial correlation based on pls regression. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(2):251–262, 2010.

(134)
Grégory Thibault, Alex Aussem, and Stéphane Bonnevay.
Incremental bayesian network learning for scalable feature selection.
In Niall M. Adams, Céline Robardet, Arno Siebes, and JeanFrançois Boulicaut, editors, Advances in Intelligent Data Analysis VIII, pages 202–212, Berlin, Heidelberg, 2009. Springer.  (135) Ioannis Tsamardinos, Constantin F Aliferis, Alexander R Statnikov, and Er Statnikov. Algorithms for large scale markov blanket discovery. In FLAIRS conference, volume 2, pages 376–380, 2003.
 (136) Alexander L. Tulupyev and Sergey I. Nikolenko. Directed cycles in bayesian belief networks: Probabilistic semantics and consistency checking complexity. In Alexander Gelbukh, Álvaro de Albornoz, and Hugo TerashimaMarín, editors, MICAI 2005: Advances in Artificial Intelligence, pages 214–223, Berlin, Heidelberg, 2005. Springer.
 (137) Laurent Vallat, Corey A. Kemper, Nicolas Jung, Myriam MaumyBertrand, Frédéric Bertrand, Nicolas Meyer, Arnaud Pocheville, John W. Fisher, John G. Gribben, and Seiamak Bahram. Reverseengineering the genetic circuitry of a cancer cell with predicted intervention in chronic lymphocytic leukemia. Proceedings of the National Academy of Sciences, 110(2):459–464, 2013.
 (138) Jimmy Vandel, Brigitte Mangin, Matthieu Vignes, Damien Leroux, Olivier Loudet, MarieLaure MartinMagniette, and Simon De Givry. Gene regulatory network inference with extended scores for bayesian networks. Revue d’Intelligence Artificielle, 26(6):679–708, 2012.
 (139) Maykel Verkuyten and Jochem Thijs. School satisfaction of elementary school children: The role of performance, peer relations, ethnicity and gender. Social Indicators Research, 59(2):203–228, 2002.
 (140) Nicolas Verzelen. Minimax risks for sparse regressions: Ultrahigh dimensional phenomenons. Electronic Journal of Statistics, 6:38–90, 2012.
 (141) Matthieu Vignes, Jimmy Vandel, David Allouche, Nidal RamadanAlban, Christine CiercoAyrolles, Thomas Schiex, Brigitte Mangin, and Simon de Givry. Gene regulatory network reconstruction using bayesian networks, the dantzig selector, the lasso and their metaanalysis. PLOS ONE, 6(12):1–15, 2011.
 (142) Howard Wainer. Visual revelations: Happiness and causal inference. CHANCE, 27(4):61–64, 2014.
 (143) Adriano V. Werhli, Marco Grzegorczyk, and Dirk Husmeier. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics, 22(20):2523–2531, 2006.
 (144) Adriano V. Werhli and Dirk Husmeier. Reconstructing gene regulatory networks with bayesian networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology, 6:1, 2007.
 (145) Sewall Wright. Correlation and causation. Journal of agricultural research, 20(7):557–585, 1921.
 (146) Rongling Wu and George Casella. Statistical genetics  associating genotypic differences with measurable outcomes. In J.M. Tanur, editor, Statistics: A Guide to the Unknown, pages 243–254, 2010.
 (147) Momiao Xiong, Jun Li, and Xiangzhong Fang. Identification of genetic networks. Genetics, 166(2):1037–1052, 2004.
 (148) Nicolae Radu Zabet. Negative feedback and physical limits of genes. Journal of Theoretical Biology, 284(1):82 – 91, 2011.
 (149) Bin Zhang and Steve Horvath. A general framework for weighted gene coexpression network analysis. Statistical Applications in Genetics and Molecular Biology, 4:17, 2005.
Comments
There are no comments yet.