1 Introduction
Knowledge bases (KBs) constructed through information extraction from text play an important role in query answering and reasoning. Since automatic extraction methods yield results of varying quality, typically only high-confidence extractions are retained, which ensures precision at the expense of recall. Consequently, the retained extractions are prone to propagating false negatives when used for further reasoning. However, in many problems, empirical observations of entities, or observational data, are readily available and can potentially recover the missing information when fused with noisy extractions. For reasoning tasks where both empirical observations and extractions can be obtained, an open and critical problem is designing methods that exploit both modes of identification.
In this work, we study a particular reasoning task, the problem of discovering causal relationships between entities, known as causal discovery. There are two contrasting types of approaches to discovering causal knowledge. One approach attempts to identify causal relationships from text using automatic extraction techniques, while the other approach infers causation from observational data. For example, prior extraction-based approaches have mined causal links such as regulatory relationships among genes directly from scientific text Poon et al. (2014); Song and Chen (2009). However, the extracted links often miss complex and longer-range patterns that require observational data. On the other hand, given observations alone, extensive work has studied the problem of inferring a network of cause-and-effect relationships among variables Spirtes and Glymour (1991); Chickering (1996). Observational data such as gene expression measurements are used to infer causal relationships such as gene regulation Magliacane et al. (2016). Prior approaches use constraints to find valid causal orientations from observational data Hyttinen et al. (2013, 2014); Magliacane et al. (2016); Spirtes and Glymour (1991); Claassen and Heskes (2011). Although the constraints offer attractive soundness guarantees, the need for observed measurements of variables remains costly and prohibitive when experimental data is unpublished. Extractions such as interactions between genes mined directly from text provide a coarse approximation of unseen observational data. Combining extractions mined from KBs with observed measurements, where available, can alleviate the cost of obtaining experiment-based data.
We propose an approach for fusing noisy extractions with observational data to discover causal knowledge. We introduce a probabilistic model over causal relationships that combines commonly used constraints over observational data with extractions obtained from a KB. The model uses the probabilistic soft logic (PSL) modeling framework to express causal constraints in a natural logical syntax that flexibly incorporates both observational and KB modes of evidence. Our main contributions are:

We introduce the novel problem of combining noisy extractions from a KB with observational data.

We propose a principled approach that uses well-studied constraints to recover long-range patterns and consistent predictions, while cheaply acquired extractions provide a proxy for unseen observations.

We apply our method to gene regulatory networks and show the promise of exploiting KB signals in causal discovery, suggesting a critical new area of research.
We compare our model with a conventional logic-based approach that uses only observational data to perform causal discovery. We evaluate both methods on transcriptional regulatory networks of yeast. Our results validate two strengths of our approach: 1) it achieves performance comparable to the well-studied conventional method, suggesting that noisy extractions are useful approximations of unseen empirical evidence; and 2) global logical constraints over observational data enforce consistency across predictions and bolster our model to perform on par with the competing method. The results suggest promising new directions for integrating knowledge bases in causal reasoning, potentially mitigating the need for expensive observational data.
2 Background on Logical Causal Discovery
The inputs to traditional causal discovery methods are independent observations of a set of variables. The problem of causal discovery is to infer a directed acyclic graph (DAG) over the variables such that each edge X → Y corresponds to X being a direct cause of Y, where changing the value of X changes the value of Y.
Since the graphical model encodes conditional independences among the variables, algorithms exploit the mapping between observed independences in the data and paths in the graph to specify constraints on the output. The PC algorithm Spirtes and Glymour (1991) is a canonical such method that performs independence tests on the observations to rule out invalid causal edges. Constraints over causal graph structure can also be encoded with logic Hyttinen et al. (2013, 2014); Magliacane et al. (2016). In a logical system, independence relations are represented as logical atoms. Logical atoms consist of a predicate symbol with variable or constant arguments and take Boolean or continuous truth values. To avoid confusion with logical variables, for the remainder of this paper we refer to the observed variables as vertices. As inputs to logical causal discovery, we require the following predicates to represent the outcomes of independence tests among the vertices:

The first pair of predicates refers to marginal statistical (in)dependence between two vertices as measured by an independence test with an empty conditioning set.

The second pair corresponds to statistical (in)dependence between two vertices when the independence test is performed conditioned on a set of other vertices.
The outputs of a logical system are represented by the following target predicates:

A causal predicate refers to the absence or presence of a causal edge between two vertices and is instantiated for all pairs of vertices. Finding truth-value assignments to these atoms is the goal of causal discovery.

An ancestral predicate corresponds to the absence or presence of an ancestral edge between any two vertices X and Y, where X is an ancestor of Y if there is a directed causal path from X to Y. We may additionally infer the truth values of ancestral atoms jointly with causal atoms.
Given the independence tests over the vertices as input, the goal of logical causal discovery is to find consistent assignments to the causal and ancestral output atoms.
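The (in)dependence inputs above can be computed with standard statistical tests. The following is a minimal sketch, assuming the partial-correlation test with Fisher's Z transformation that our experiments use (Section 5.3); the function and variable names are illustrative, not from the original system.

```python
import numpy as np
from scipy import stats

def fisher_z_test(data, i, j, cond=(), alpha=0.05):
    """Test whether vertex i is independent of vertex j given the
    vertices in `cond`, on an (n_samples, n_vertices) data matrix."""
    n = data.shape[0]
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)  # precision matrix of the selected block
    # Partial correlation of i and j given the conditioning set.
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.9999, 0.9999)
    # Fisher's Z statistic is approximately standard normal under
    # the null hypothesis of (conditional) independence.
    z = 0.5 * np.log((1 + r) / (1 - r))
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    p_value = 2 * (1 - stats.norm.cdf(stat))
    return p_value > alpha, p_value  # True -> treated as independent
```

With an empty conditioning set this reduces to a plain correlation test; conditioning sets of size one and two match the experimental setup described later.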
3 Using Extractions in Causal Discovery
In the problem of fusing noisy extractions with causal discovery, in addition to the observations we are given evidence from a knowledge base (KB): for each pair of vertices, an affinity score of their interaction based on text extraction.
Extending previous logical methods, we additionally represent this KB evidence in the predicate set with TextAdj, which denotes the absence or presence of an undirected edge, or adjacency, between two vertices as extracted from text. Evidence of adjacencies is critical to inferring causal edges. However, adjacencies in standard causal discovery are inferred from statistical tests alone. In our approach, we replace statistical adjacencies with TextAdj evidence. The goal of fusing KB evidence in logical causal discovery is to find maximally satisfying assignments to the unknown causal atoms based on constraints over both independence and text-based signals. In Section 4, we present a probabilistic logic approach defining constraints using statistical and KB evidence.
4 A Probabilistic Approach to Inferring Causal Knowledge
Our approach uses probabilistic soft logic (PSL) Bach et al. (2017)
to encode constraints for causal discovery. A key advantage of PSL is exact and efficient MAP inference for finding the most probable assignments. We first review PSL and then present our novel constraints that combine statistical and KB information.
4.1 Probabilistic Soft Logic
PSL is a probabilistic programming framework where random variables are represented as logical atoms and dependencies between them are encoded via rules in first-order logic. Logical atoms in PSL take continuous values, and logical satisfaction of a rule is computed using the Łukasiewicz relaxation of Boolean logic. This relaxation into continuous space allows MAP inference to be formulated as a convex optimization problem that can be solved efficiently.
Given continuous evidence variables X and unobserved variables Y, PSL defines the following Markov network, called a hinge-loss Markov random field (HL-MRF), over continuous assignments to Y:

P(Y | X) = (1/Z) exp( − Σ_r w_r φ_r(Y, X) )   (1)

where Z is a normalization constant and each φ_r is an efficient-to-optimize hinge-loss feature function that scores configurations of assignments to Y and X as a linear function of the variable assignments.
An HL-MRF is defined by a PSL model, a set of weighted disjunctions, or rules, where w_r is the weight of the r-th rule. Rules consist of logical atoms and are called ground rules if only constants appear in the atoms. To obtain the HL-MRF, we first substitute logical variables appearing in the rules with constants from the observations, producing ground rules. We observe truth values for a subset of the ground atoms and infer values for the remaining unobserved ground atoms. The ground rules and their corresponding weights map to the feature functions φ_r and weights w_r. To derive each φ_r, the Łukasiewicz relaxation is applied to the ground rule, yielding a hinge penalty for violating the rule. Thus, MAP inference minimizes the weighted rule penalties to find the minimally violating joint assignment for all the unobserved variables: arg min_Y Σ_r w_r φ_r(Y, X).
PSL uses the consensus-based ADMM algorithm to perform exact MAP inference.
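To make the relaxation concrete, here is a toy sketch (not PSL's actual implementation) of the hinge penalty for a single ground rule of the form A ∧ B → C: under the Łukasiewicz semantics its distance to satisfaction is max(0, a + b − 1 − c), and MAP inference minimizes the weighted sum of such penalties.

```python
def hinge_penalty(body, head):
    """Distance to satisfaction of a ground rule (AND of body atoms)
    -> head atom, under the Lukasiewicz relaxation; all truth values
    lie in [0, 1]."""
    # Lukasiewicz conjunction of the body atoms.
    body_truth = max(0.0, sum(body) - (len(body) - 1))
    # The rule is violated to the degree the body outruns the head.
    return max(0.0, body_truth - head)

def map_objective(weighted_ground_rules):
    """Weighted sum of hinge penalties that MAP inference minimizes."""
    return sum(w * hinge_penalty(body, head)
               for w, body, head in weighted_ground_rules)
```

A fully satisfied ground rule contributes zero penalty, and partial violations are penalized linearly, which is what makes the overall MAP problem convex.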
4.2
Our model extends constraints introduced by the PC algorithm Spirtes and Glymour (1991). Whereas PC infers adjacencies from conditional independence tests, our model uses text-based adjacency evidence in all causal constraints. The text-based adjacency evidence bridges domain knowledge contained in KBs with statistical tests that propagate causal information.
Figure 1 shows all the rules used in our model. The first set of rules follows directly from the three constraints introduced by PC. We additionally introduce joint rules that induce dependencies between ancestral and causal structures to propagate consistent predictions. We describe below how the rules upgrade PC to combine KB and statistical signals for causal discovery.
PC-inspired Rules
PC uses conditional (in)dependence and adjacency to rule out violating causal orientations. However, in our model, all adjacencies are directly mined from a KB. Rule C1 discourages causal edges between vertices that are not adjacent based on evidence in text. Rule C2 penalizes simple cycles between two vertices. Rules C3 and C4 capture the first PC rule and orient a chain X − Z − Y as the v-structure X → Z ← Y based on independence criteria. Rule C5 orients a partially directed path X → Z − Y as Z → Y to avoid orienting additional v-structures. Rule C6 maps to the third PC rule: if X → Z and Z → Y, it orients X → Y to avoid a cycle. PC applies these rules iteratively to fix edges, whereas in our model the rules induce dependencies between causal edges to encourage parsimonious joint inferences.
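As a point of reference, the hard version of the v-structure criterion behind rules C3 and C4 can be sketched as follows; in our setting the adjacency input would come from text-based evidence rather than statistical tests, and the hard 0/1 treatment and names here are illustrative only.

```python
def orient_v_structures(adjacent, sepset, vertices):
    """Return directed edges (a, b), meaning a -> b, for each v-structure.

    adjacent: set of frozenset({u, v}) undirected adjacencies (in our
              model these would come from TextAdj evidence).
    sepset:   dict mapping frozenset({x, y}) to the set of vertices that
              rendered x and y (conditionally) independent.
    """
    oriented = set()
    for z in vertices:
        neighbors = [v for v in vertices
                     if v != z and frozenset({v, z}) in adjacent]
        for i, x in enumerate(neighbors):
            for y in neighbors[i + 1:]:
                # Unshielded triple x - z - y with z outside the
                # separating set of x and y: orient x -> z <- y.
                unshielded = frozenset({x, y}) not in adjacent
                if unshielded and z not in sepset.get(frozenset({x, y}), set()):
                    oriented.update({(x, z), (y, z)})
    return oriented
```

PC applies this deterministically; our rules instead penalize soft assignments that disagree with the same pattern, so conflicting evidence degrades gracefully rather than forcing an orientation.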
Joint Rules
Joint rules encourage consistency across ancestral and causal predictions through constraints, such as transitivity, that follow from basic definitions. Rule J1 encodes that causal edges are also ancestral by definition, and rule J2 is the contrapositive that penalizes causal edges to non-descendants. Rule J3 encodes transitivity of ancestral edges, encouraging consistency across predictions. Rule J4 infers causal edges between probable ancestral edges that are adjacent based on textual evidence. Rule J5 orients a chain as a diverging path when the middle vertex is not likely an ancestor of an endpoint. Joint rules give preference to predicted structures that respect both the ancestral and causal graphs.
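A toy sketch of the consistency these joint rules reward, assuming soft truth values in [0, 1] stored in hypothetical Causes and Anc dictionaries; the hinge form mirrors the Łukasiewicz relaxation used for all rules.

```python
def joint_violation(causes, anc, vertices):
    """Total distance to satisfaction of rule J1 (Causes -> Anc) and
    rule J3 (transitivity of Anc) for soft assignments in [0, 1]."""
    total = 0.0
    for x in vertices:
        for y in vertices:
            if x == y:
                continue
            # J1: a causal edge is also ancestral by definition.
            total += max(0.0, causes[(x, y)] - anc[(x, y)])
            for z in vertices:
                if z in (x, y):
                    continue
                # J3: Anc(x, y) AND Anc(y, z) -> Anc(x, z).
                total += max(0.0, anc[(x, y)] + anc[(y, z)] - 1.0
                             - anc[(x, z)])
    return total
```

During MAP inference these violations trade off against the PC-inspired penalties, so a confident ancestral chain can pull up a weakly supported causal edge and vice versa.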
In our evaluation, we investigate the implications of using a noisy extraction-based proxy for adjacency and the benefits of joint modeling.
Rule Type  Rules 

PC-inspired Rules  C1) 
C2)  
C3)  
C4)  
C5)  
C6)  
Joint Rules  J1) 
J2)  
J3)  
J4)  
J5) 
5 Experimental Evaluation
Our experiments investigate the two main claims of our approach:

We study whether the noisy extractions are a suitable proxy for latent adjacencies and give performance similar to a conventional logic-based approach that imputes adjacency values using only observations.

We assess the role of joint ancestral and causal rules over observational data in mitigating noise from the extraction-based evidence.
We evaluate our model on real-world gene regulatory networks in yeast. We compare against a baseline PSL model variant that performs prototypical causal discovery using only observational data; the baseline replaces TextAdj with StandardAdj, adjacencies computed from conditional independence tests.
5.1 Data
Our dataset for evaluation consists of a transcriptional regulatory network across 300 genes in yeast with simulated gene expression from the DREAM4 challenge Marbach et al. (2010); Prill et al. (2010). We snowball sample 10 smaller subnetworks of size 20 with low Jaccard overlap to perform cross-validation. The data contains 210 gene expression measurements simulated from differential-equation models of the system. We perform independence tests on the real-valued measurements, which are known to contribute numerous spurious correlations. In addition to the gene expression data, we model domain knowledge based on undirected protein–protein interaction (PPI) edges extracted from the Yeast Genome Database:
We obtain text-based affinity scores of interaction between pairs of yeast genes from the STRING database Szklarczyk et al. (2017). STRING finds mentions of gene or protein names across millions of scientific articles and computes the co-occurrence of mentions between genes. As an additional step, STRING extracts relations between genes and increases the affinity score if genes are connected by salient terms such as “binds to” or “phosphorylates.”
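These affinity scores can be mapped to soft TextAdj truth values with a simple rescaling. The sketch below assumes STRING's combined scores lie on a 0–1000 scale (the paper states only that scores are rescaled into [0, 1]); the gene identifiers in the test are illustrative examples.

```python
def string_to_textadj(scores):
    """Map {(gene_a, gene_b): combined_score} pairs to soft TextAdj
    truth values in [0, 1], assuming scores on a 0-1000 scale."""
    return {pair: min(score, 1000) / 1000.0
            for pair, score in scores.items()}
```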
5.2 Results
Model Variant  

0.19 0.08  
0.20 0.05  
0.17 0.07  
0.19 0.05 
Adjacency  Precision  Recall 

TextAdj  0.32  0.11 
StandardAdj  0.27  0.30 
We evaluate our model and the observational baseline using 10-fold cross-validation on the DREAM4 networks. The baseline uses the same rules as our approach but computes StandardAdj ground atoms from conditional independence tests, so by definition they never appear as inference targets.
To evaluate the additional benefit of the joint rules, we compare submodels of both methods run with the causal orientation rules only. Table 1 shows average scores of all model variants for the regulatory network prediction task on DREAM4.
Noisy Extractions Maintain Performance
First, we see comparable performance between our model and the observational baseline, answering our first experimental question on how closely noisy extractions approximate adjacencies. In Table 1, there is no statistically significant difference between the scores of the two methods. This comparable performance suggests that noisy extractions can substitute for observational-data computations without significantly degrading performance.
Joint Rules Overcome Noise
Our investigation into model variants sheds light on the second experimental question of how logical rules overcome the noise from extractions. When comparing the PC-only variants of each method, the observational variant gains over the extraction-based one, suggesting that the more sophisticated joint rules are needed to mitigate the noise from KB extractions. The consistency across predictions encouraged by the joint rules bolsters the extraction-based adjacency signal.
Extractions Yield Higher Precision, Lower Recall
To further investigate the extraction evidence mined from STRING, we compare both StandardAdj and TextAdj against gold-standard adjacencies, which we obtain from undirected regulatory links. Table 2
shows the average precision and recall of each adjacency evidence type across the DREAM4 subnetworks. Interestingly,
TextAdj achieves higher precision than its statistical counterpart. However, StandardAdj gains over TextAdj in recall. This result further substantiates the benefit of joint modeling in recovering additional orientations under low-recall inputs. Nonetheless, the comparison points to the need for a deeper understanding of the role KBs play in causal reasoning.

5.3 Experiment Details
To obtain marginal and conditional (in)dependence tests, we use linear and partial correlations with Fisher’s Z transformation. We condition on all sets up to size two. We set rule weights for both PSL models to 5.0, except for rule C2, which is set to 10.0 since it encodes a strong acyclicity constraint. Both models use a threshold α on the p-value to categorize independence tests as independent or dependent. We select α with 10-fold cross-validation: we hold out each subnetwork in turn and use the best average score across the other subnetworks to pick α from a grid of candidate values. The observational baseline selects two different α values, one for binning independence tests and one for computing adjacencies, while our model requires a single α for tests only. We also select rounding thresholds for both PSL models within the same cross-validation framework. Since the p-value is typically small, we rescale the truth values derived from it to reduce the right-skewness of values. We rescale all STRING affinity scores to be between 0 and 1.
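The thresholding step can be sketched as follows, assuming hard 0/1 truth values for the resulting (in)dependence atoms; the paper's actual rescaling of small p-values is not reproduced here.

```python
def categorize_tests(p_values, alpha):
    """Map {(i, j, cond): p_value} test results to soft truth values
    for independence and dependence pseudo-atoms (hard 0/1 here)."""
    indep, dep = {}, {}
    for key, p in p_values.items():
        # Large p-value -> fail to reject independence at level alpha.
        indep[key] = 1.0 if p > alpha else 0.0
        dep[key] = 1.0 - indep[key]
    return indep, dep
```

In the full system these values would be ground evidence atoms for the rules in Figure 1, with alpha chosen by the cross-validation procedure described above.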
6 Related Work
Our work extends constraint-based methods for causal discovery, most notably the PC algorithm Spirtes and Glymour (1991), which first infers adjacencies and maximally orients them using deterministic rules based on conditional independence. PC supports external evidence only in the form of fixed edges or non-edges. Our work is motivated by recent approaches that cast causal discovery as a SAT instance over conditional independence statements Hyttinen et al. (2014); Magliacane et al. (2016); Hyttinen et al. (2013). SAT-based approaches rely on logical representations that more readily admit additional constraints and relations from domain knowledge. So far, however, logical methods have not exploited extraction-based external evidence to identify probable edges.
In a separate vein, prior work has used text mining to identify regulatory networks and genetic interactions solely from scientific literature Rodríguez-Penagos et al. (2007); Song and Chen (2009); Poon et al. (2014). In contrast, our goal is to propose techniques that leverage both statistical test signals and text evidence. The work most similar to ours combines gene expression data with evidence mined from knowledge bases to infer gene regulatory networks Chouvardas et al. (2016). However, that regulatory network inference orients edges using hard-coded knowledge of transcription factors instead of reasoning about causality. In our approach, we propose a principled causal discovery formulation as the basis for incorporating KB evidence.
7 Discussion and Future Work
In this work, we present an initial approach for reasoning with noisy extraction-based evidence directly in a logical system. We benefit from a flexible logical formulation that supports replacing conventional adjacencies computed from observational data with cheaply obtained extractions. Our evaluation suggests that the noisy KB-based proxy signal achieves performance comparable to conventional methods. This promising result points to future research in exploiting KBs for causal reasoning, greatly mitigating the need for costly observational data. We see many directions for future work, including better extraction strategies for mining scientific literature and finding text-based proxies for additional statistical test signals. KBs could provide ontological constraints or semantic information useful for causal reasoning. We additionally plan to study knowledge-based constraints for causal discovery.
References

Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss Markov random fields and probabilistic soft logic. Journal of Machine Learning Research (JMLR), 2017. To appear.
David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121–130. Springer, 1996.
Panagiotis Chouvardas, George Kollias, and Christoforos Nikolaou. Inferring active regulatory networks from gene expression data using a combination of prior knowledge and enrichment analysis. BMC Bioinformatics, 17(5):181, 2016.
Tom Claassen and Tom Heskes. A logical characterization of constraint-based causal discovery. In UAI, 2011.
Antti Hyttinen, Patrik O. Hoyer, Frederick Eberhardt, and Matti Järvisalo. Discovering cyclic causal models with latent variables: A general SAT-based procedure. In UAI, 2013.
Antti Hyttinen, Frederick Eberhardt, and Matti Järvisalo. Constraint-based causal discovery: Conflict resolution with answer set programming. In UAI, 2014.
Sara Magliacane, Tom Claassen, and Joris M. Mooij. Ancestral causal inference. In NIPS, 2016.
Daniel Marbach, Robert J. Prill, Thomas Schaffter, Claudio Mattiussi, Dario Floreano, and Gustavo Stolovitzky. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107:6286–6291, 2010.
Hoifung Poon, Chris Quirk, Charlie DeZiel, and David Heckerman. Literome: PubMed-scale genomic knowledge base in the cloud. Bioinformatics, 30(19):2840–2842, 2014.
Robert J. Prill, Daniel Marbach, Julio Saez-Rodriguez, Peter K. Sorger, Leonidas G. Alexopoulos, Xiaowei Xue, Neil D. Clarke, Gregoire Altan-Bonnet, and Gustavo Stolovitzky. Towards a rigorous assessment of systems biology models: The DREAM3 challenges. PLoS ONE, 5:e9202, 2010.
Carlos Rodríguez-Penagos, Heladia Salgado, Irma Martínez-Flores, and Julio Collado-Vides. Automatic reconstruction of a bacterial regulatory network using natural language processing. BMC Bioinformatics, 8(1):293, 2007.
Yong-Ling Song and Su-Shing Chen. Text mining biomedical literature for constructing gene regulatory networks. Interdisciplinary Sciences: Computational Life Sciences, 1(3):179–186, 2009.
Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9:62–72, 1991.
Damian Szklarczyk, John H. Morris, Helen Cook, Michael Kuhn, Stefan Wyder, Milan Simonovic, Alberto Santos, Nadezhda T. Doncheva, Alexander Roth, Peer Bork, et al. The STRING database in 2017: Quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Research, 45(D1):D362–D368, 2017.