Using Noisy Extractions to Discover Causal Knowledge

11/16/2017 · by Dhanya Sridhar, et al. · University of Maryland; University of California, Santa Cruz

Knowledge bases (KBs) constructed through information extraction from text play an important role in query answering and reasoning. In this work, we study a particular reasoning task, the problem of discovering causal relationships between entities, known as causal discovery. There are two contrasting types of approaches to discovering causal knowledge. One approach attempts to identify causal relationships from text using automatic extraction techniques, while the other approach infers causation from observational data. However, extractions alone are often insufficient to capture complex patterns, and full observational data is expensive to obtain. We introduce a probabilistic method for fusing noisy extractions with observational data to discover causal knowledge. We propose a principled approach that uses the probabilistic soft logic (PSL) framework to encode well-studied constraints to recover long-range patterns and consistent predictions, while cheaply acquired extractions provide a proxy for unseen observations. We apply our method to gene regulatory networks and show the promise of exploiting KB signals in causal discovery, suggesting a critical, new area of research.






1 Introduction

Knowledge bases (KBs) constructed through information extraction from text play an important role in query answering and reasoning. Since automatic extraction methods yield results of varying quality, typically only high-confidence extractions are retained, which ensures precision at the expense of recall. Consequently, the noisy extractions are prone to propagating false negatives when used for further reasoning. However, in many problems, empirical observations of entities, or observational data, are readily available, potentially recovering information when fused with noisy extractions. For reasoning tasks where both empirical observations and extractions can be obtained, an open and critical problem is designing methods that exploit both modes of identification.

In this work, we study a particular reasoning task, the problem of discovering causal relationships between entities, known as causal discovery. There are two contrasting types of approaches to discovering causal knowledge. One approach attempts to identify causal relationships from text using automatic extraction techniques, while the other approach infers causation from observational data. For example, prior extraction-based approaches have mined causal links such as regulatory relationships among genes directly from scientific text Poon et al. (2014); Song and Chen (2009). However, the extracted links often miss complex and longer-range patterns that require observational data. On the other hand, given observations alone, extensive work has studied the problem of inferring a network of cause-and-effect relationships among variables Spirtes and Glymour (1991); Chickering (1996). Observational data such as gene expression measurements are used to infer causal relationships such as gene regulation Magliacane et al. (2016). Prior approaches use constraints to find valid causal orientations from observational data Hyttinen et al. (2013, 2014); Magliacane et al. (2016); Spirtes and Glymour (1991); Claassen and Heskes (2011). Although the constraints offer attractive soundness guarantees, the need for observed measurements of variables remains costly and prohibitive when experimental data is unpublished. Extractions such as interactions between genes mined directly from text provide a coarse approximation of unseen observational data. Combining extractions mined from KBs with observed measurements, where available, can alleviate the cost of obtaining experiment-based data.

We propose an approach for fusing noisy extractions with observational data to discover causal knowledge. We introduce a probabilistic model over causal relationships that combines commonly used constraints over observational data with extractions obtained from a KB. Our model uses the probabilistic soft logic (PSL) modeling framework to express causal constraints in a natural logical syntax that flexibly incorporates both observational and KB modes of evidence. As our main contributions:

  1. We introduce the novel problem of combining noisy extractions from a KB with observational data.

  2. We propose a principled approach that uses well-studied causal discovery constraints to recover long-range patterns and consistent predictions, while cheaply acquired extractions provide a proxy for unseen observations.

  3. We apply our method to gene regulatory networks and show the promise of exploiting KB signals in causal discovery, suggesting a critical, new area of research.

We compare our model with a conventional logic-based approach that uses only observational data to perform causal discovery. We evaluate both methods on transcriptional regulatory networks of yeast. Our results validate two strengths of our approach: 1) our model achieves comparable performance with the well-studied conventional method, suggesting that noisy extractions are useful approximations for unseen empirical evidence; and 2) global logical constraints over observational data enforce consistency across predictions and bolster our model to perform on par with the competing method. The results suggest promising new directions for integrating knowledge bases in causal reasoning, potentially mitigating the need for expensive observational data.

2 Background on Logical Causal Discovery

The inputs to traditional causal discovery methods are independent observations of variables X = {X_1, ..., X_n}. The problem of causal discovery is to infer a directed acyclic graph (DAG) G such that each edge X_i -> X_j corresponds to X_i being a direct cause of X_j, where changing the value of X_i always changes the value of X_j.

Since the graphical model G encodes conditional independences among X, causal discovery algorithms exploit the mapping between observed independences in the data and paths in G to specify constraints on the output. The PC algorithm Spirtes and Glymour (1991) is a canonical such method that performs independence tests on the observations to rule out invalid causal edges. Constraints over causal graph structure can also be encoded with logic Hyttinen et al. (2013, 2014); Magliacane et al. (2016). In a logical causal discovery system, independence relations are represented as logical atoms. Logical atoms consist of a predicate symbol with variable or constant arguments and take Boolean or continuous truth values. To avoid confusion with logical variables, for the remainder of this paper, we refer to the X_i as vertices. As inputs to logical causal discovery, we require the following predicates to represent the outcomes of independence tests among X:

  1. A marginal (in)dependence predicate refers to statistical (in)dependence between vertices X_i and X_j as measured by an independence test with an empty conditioning set.

  2. A conditional (in)dependence predicate corresponds to statistical (in)dependence between vertices X_i and X_j when conditioned on a set S, as measured by the corresponding conditional independence test.

The outputs of a logical causal discovery system are represented by the following target predicates:

  1. A causal predicate refers to the absence or presence of a causal edge X_i -> X_j, and is instantiated for all pairs of vertices. Finding truth value assignments to these atoms is the goal of causal discovery.

  2. An ancestral predicate corresponds to the absence or presence of an ancestral edge between vertices X_i and X_j, where X_i is an ancestor of X_j if there is a directed causal path from X_i to X_j. We may additionally infer the truth values of ancestral atoms jointly with causal atoms.

Given the independence tests over X as input, the goal of logical causal discovery is to find consistent assignments to the causal and ancestral output atoms.

3 Using Extractions in Causal Discovery

In the problem of fusing noisy extractions with causal discovery, in addition to the observations, we are given evidence from a knowledge base (KB): for pairs of variables X_i and X_j, an affinity score of their interaction based on text extraction.

Extending previous logical causal discovery methods, we additionally represent the KB evidence in the predicate set with TextAdj. TextAdj(X_i, X_j) corresponds to the extracted affinity score and denotes the absence or presence of an undirected edge, or adjacency, between X_i and X_j as extracted from text. Evidence of adjacencies is critical to causal inference. However, adjacencies in standard causal discovery are inferred from statistical tests alone. In our approach, we replace statistical adjacencies with TextAdj. The goal of fusing KB evidence in logical causal discovery is to find maximally satisfying assignments to the unknown causal atoms based on constraints over both independence and text-based signals. In Section 4, we present a probabilistic logic approach defining constraints using statistical and KB evidence.
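As an illustration, raw KB affinity scores can be mapped to soft truth values for TextAdj atoms. Below is a minimal Python sketch assuming a simple min-max rescaling into [0, 1]; the function name and gene identifiers are our own, and Section 5.3 only states that affinity scores are rescaled to lie between 0 and 1.

```python
def textadj_truth_values(affinity, lo=None, hi=None):
    """Rescale raw KB affinity scores into [0, 1] soft truth values
    for TextAdj atoms. Min-max rescaling is a hypothetical choice;
    the paper only states that scores are rescaled to [0, 1]."""
    scores = affinity.values()
    lo = min(scores) if lo is None else lo
    hi = max(scores) if hi is None else hi
    span = (hi - lo) or 1.0  # avoid division by zero on constant scores
    return {pair: (s - lo) / span for pair, s in affinity.items()}

# Toy STRING-like affinity scores for (hypothetical) gene pairs.
affinity = {
    ("GENE_A", "GENE_B"): 850,
    ("GENE_A", "GENE_C"): 150,
    ("GENE_B", "GENE_C"): 400,
}
truth = textadj_truth_values(affinity)
```

The resulting continuous values plug directly into PSL, whose atoms take truth values in [0, 1].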

4 A Probabilistic Approach to Inferring Causal Knowledge

Our approach uses probabilistic soft logic (PSL) Bach et al. (2017) to encode constraints for causal discovery. A key advantage of PSL is exact and efficient MAP inference for finding most probable assignments. We first review PSL and then present our novel constraints that combine statistical and KB information.

4.1 Probabilistic Soft Logic

PSL is a probabilistic programming framework where random variables are represented as logical atoms and dependencies between them are encoded via rules in first-order logic. Logical atoms in PSL take continuous values and logical satisfaction of the rule is computed using the Lukasiewicz relaxation of Boolean logic. This relaxation into continuous space allows MAP inference to be formulated as a convex optimization problem that can be solved efficiently.
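As a concrete illustration, the Lukasiewicz relaxation replaces the Boolean connectives with piecewise-linear operators over [0, 1]. A minimal Python sketch (the function names are our own):

```python
def luk_and(a, b):
    """Lukasiewicz t-norm: continuous relaxation of logical AND."""
    return max(0.0, a + b - 1.0)

def luk_or(a, b):
    """Lukasiewicz t-conorm: continuous relaxation of logical OR."""
    return min(1.0, a + b)

def luk_not(a):
    """Lukasiewicz negation."""
    return 1.0 - a
```

On truth values restricted to {0, 1}, these operators reduce exactly to Boolean logic; on the interior of [0, 1], they are piecewise linear, which is what makes MAP inference a convex problem.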

Given continuous evidence variables X and unobserved variables Y, PSL defines the following Markov network, called a hinge-loss Markov random field (HL-MRF), over continuous assignments to Y:

P(Y | X) = (1/Z) exp( - sum_j w_j phi_j(Y, X) )

where Z is a normalization constant, and each phi_j is an efficient-to-optimize hinge-loss feature function that scores configurations of assignments to Y and X as a linear function of the variable assignments.

An HL-MRF is defined by a PSL model M, a set of weighted disjunctions, or rules, where w_j is the weight of the j-th rule. Rules consist of logical atoms and are called ground rules if only constants appear in the atoms. To obtain the HL-MRF, we first substitute logical variables appearing in M with constants from observations, producing ground rules. We observe truth values for a subset of the ground atoms, X, and infer values for the remaining unobserved ground atoms, Y. The ground rules and their corresponding weights map to the feature functions phi_j and weights w_j. To derive each phi_j, the Lukasiewicz relaxation is applied to the corresponding ground rule, yielding a hinge penalty over Y for violating the rule. Thus, MAP inference minimizes the weighted rule penalties to find the minimally violating joint assignment for all the unobserved variables:

argmin_Y sum_j w_j phi_j(Y, X)

PSL uses the consensus-based ADMM algorithm to perform exact MAP inference.
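To make the hinge penalties concrete, the following sketch grounds a single implication rule plus a weak prior and finds the MAP assignment for one unobserved atom by brute-force search over [0, 1]. PSL itself uses consensus ADMM; the rules, weights, and atoms here are illustrative assumptions only.

```python
def hinge_penalty(body, head):
    """Distance to satisfaction of a ground rule body -> head
    under the Lukasiewicz relaxation: max(body - head, 0)."""
    return max(body - head, 0.0)

def map_inference_1d(rules, grid_steps=1000):
    """Brute-force MAP over a single unobserved atom y in [0, 1]:
    minimize the weighted sum of hinge penalties.
    `rules` is a list of (weight, penalty_fn_of_y) pairs."""
    best_y, best_obj = 0.0, float("inf")
    for i in range(grid_steps + 1):
        y = i / grid_steps
        obj = sum(w * pen(y) for w, pen in rules)
        if obj < best_obj:
            best_y, best_obj = y, obj
    return best_y, best_obj

# Observed atom Causes(A, B) = 0.9; infer Anc(A, B).
# A J1-style rule: Causes(A, B) -> Anc(A, B), weight 2.0.
# A weak negative prior on Anc(A, B), weight 1.0 (penalty is y itself).
a = 0.9
rules = [(2.0, lambda y: hinge_penalty(a, y)), (1.0, lambda y: y)]
y_map, _ = map_inference_1d(rules)
```

The implication pushes the inferred atom up toward the observed 0.9 while the prior pulls it down, and the weighted trade-off settles exactly at 0.9, illustrating how soft rules balance competing evidence.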


Our model extends constraints introduced by the PC algorithm Spirtes and Glymour (1991). Whereas PC infers adjacencies from conditional independence tests, our model uses text-based adjacency evidence in all causal constraints. The text-based adjacency evidence bridges domain knowledge contained in KBs with statistical tests that propagate causal information.

Figure 1 shows all the rules used in our model. The first set of rules follows directly from the three constraints introduced by PC. We additionally introduce joint rules that induce dependencies between ancestral and causal structures to propagate consistent predictions. We describe below how our rules upgrade PC to combine KB and statistical signals for causal discovery.

PC-inspired Rules

PC uses conditional (in)dependence and adjacency to rule out violating causal orientations. However, in our model, all adjacencies are directly mined from a KB. Rule C1 discourages causal edges between vertices that are not adjacent based on evidence in text. Rule C2 penalizes simple cycles between two vertices. Rules C3 and C4 capture the first PC rule and orient a chain X_i - X_j - X_k as a v-structure X_i -> X_j <- X_k based on independence criteria. Rule C5 orients the path X_i -> X_j - X_k as X_j -> X_k to avoid orienting additional v-structures. Rule C6 maps to the third PC rule: if X_i -> X_j and X_j -> X_k, it orients the adjacency between X_i and X_k as X_i -> X_k to avoid a cycle. PC applies these rules iteratively to fix edges, whereas in our model the rules induce dependencies between causal edges to encourage parsimonious joint inferences.
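For intuition, the v-structure orientation that rules C3 and C4 encode (the first PC rule) can be sketched imperatively; in PSL the same constraint is soft and weighted rather than deterministic. A Python sketch under our own naming assumptions:

```python
from itertools import combinations

def orient_v_structures(adjacent, sepsets):
    """Sketch of the first PC orientation rule (cf. rules C3/C4):
    for every chain i - j - k with i, k non-adjacent, orient
    i -> j <- k when j is NOT in the set that separates i and k.
    `adjacent` is a set of frozenset pairs; `sepsets` maps a
    non-adjacent pair to its separating set."""
    directed = set()  # (cause, effect) pairs
    nodes = {v for pair in adjacent for v in pair}
    for i, k in combinations(sorted(nodes), 2):
        if frozenset((i, k)) in adjacent:
            continue  # endpoints adjacent: not a candidate v-structure
        sep = sepsets.get(frozenset((i, k)))
        if sep is None:
            continue
        for j in nodes - {i, k}:
            if (frozenset((i, j)) in adjacent
                    and frozenset((j, k)) in adjacent
                    and j not in sep):
                directed.add((i, j))
                directed.add((k, j))
    return directed

# Toy graph: A - B - C, with A and C independent given the empty set.
adjacent = {frozenset(("A", "B")), frozenset(("B", "C"))}
vstruct = orient_v_structures(adjacent, {frozenset(("A", "C")): set()})
```

In our model the `adjacent` input would come from TextAdj evidence rather than from statistical tests, which is exactly the substitution rule C1 and rules C3/C4 make.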

Joint Rules

Joint rules encourage consistency across ancestral and causal predictions through constraints such as transitivity that follow from basic definitions. Rule J1 encodes that causal edges are also ancestral by definition, and rule J2 is the contrapositive, penalizing causal edges to non-descendants. Rule J3 encodes transitivity of ancestral edges, encouraging consistency across predictions. Rule J4 infers causal edges between probable ancestral edges that are adjacent based on textual evidence. Rule J5 orients a chain as a diverging path when one endpoint is not likely an ancestor of the other. Joint rules give preference to predicted structures that respect both ancestral and causal graphs.
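Rules J1 and J3 together imply that the ancestral relation contains the transitive closure of the causal edges. A hard-logic sketch of that closure (PSL enforces it only softly, as weighted rules over continuous truth values):

```python
def ancestral_closure(causal_edges):
    """Ancestral relation implied by rules J1 and J3: every causal
    edge is ancestral (J1), and ancestry is transitive (J3).
    Computed as the transitive closure of the causal edges."""
    anc = set(causal_edges)  # J1: causal implies ancestral
    changed = True
    while changed:
        changed = False
        for (a, b) in list(anc):
            for (c, d) in list(anc):
                # J3: a ancestor of b and b ancestor of d => a ancestor of d
                if b == c and (a, d) not in anc:
                    anc.add((a, d))
                    changed = True
    return anc

anc = ancestral_closure({("X1", "X2"), ("X2", "X3")})
```

In the probabilistic model, a confident causal chain raises the truth value of the long-range ancestral atom rather than fixing it outright, which is how long-range patterns propagate.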

In our evaluation, we investigate the implications of using a noisy extraction-based proxy for adjacency and the benefits of joint modeling.

Figure 1: PSL rules for combining statistical tests and KB evidence in causal discovery. (The rule listing itself, PC-inspired rules C1–C6 and joint rules J1–J5, is rendered as an image in the original and is not reproduced here.)

5 Experimental Evaluation

Our experiments investigate the two main claims of our approach:

  1. We study whether the noisy extractions are a suitable proxy for latent adjacencies and give similar performance to a conventional logic-based approach that imputes adjacency values using only observations.

  2. We examine the role of joint ancestral and causal rules over observational data in mitigating noise from the extraction-based evidence.

We evaluate our model on real-world gene regulatory networks in yeast. We compare against a baseline PSL model variant that performs prototypical causal discovery using only observational data. The baseline replaces TextAdj with StandardAdj, adjacencies computed from conditional independence tests.

5.1 Data

Our dataset for evaluation consists of a transcriptional regulatory network across 300 genes in yeast with simulated gene expression from the DREAM4 challenge Marbach et al. (2010); Prill et al. (2010). We snowball sample 10 smaller subnetworks of size 20 with low Jaccard overlap to perform cross validation. The data contains 210 gene expression measurements simulated from differential equation models of the system. We perform independence tests on the real-valued measurements, which are known to contribute numerous spurious correlations. In addition to the gene expression data, we model domain knowledge based on undirected protein-protein interaction (PPI) edges extracted from the Yeast Genome Database.
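The snowball sampling step can be sketched as a breadth-first expansion from a seed gene. The paper does not specify seed selection or how the Jaccard overlap between subnetworks is controlled, so the following is only an illustrative sketch with a toy network:

```python
from collections import deque

def snowball_sample(adjacency, seed, size):
    """Breadth-first snowball sample of `size` vertices from an
    undirected network, starting at `seed`. A minimal sketch of
    the subnetwork-sampling step; seeds and overlap control are
    left unspecified in the paper."""
    visited, queue = {seed}, deque([seed])
    while queue and len(visited) < size:
        v = queue.popleft()
        for u in adjacency.get(v, ()):
            if u not in visited:
                visited.add(u)
                queue.append(u)
                if len(visited) == size:
                    break
    return visited

# Toy undirected gene network as an adjacency-list dict.
net = {"g1": ["g2", "g3"], "g2": ["g1", "g4"], "g3": ["g1"],
       "g4": ["g2", "g5"], "g5": ["g4"]}
sample = snowball_sample(net, "g1", 3)
```

Repeating this from different seeds and discarding samples with high pairwise Jaccard overlap would yield the kind of low-overlap subnetworks used for cross validation.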

We obtain text-based affinity scores of interaction between pairs of yeast genes from the STRING database Szklarczyk et al. (2017). STRING finds mentions of gene or protein names across millions of scientific articles and computes the co-occurrence of mentions between genes. As an additional step, STRING extracts relations between genes and increases the affinity score if genes are connected by salient terms such as “binds to” or “phosphorylates.”

5.2 Results

Model Variant Score
0.19 0.08
0.20 0.05
0.17 0.07
0.19 0.05
Table 1: Average scores of all model variants (the variant labels in the first column were lost in extraction). Our model achieves comparable performance with the observational-data baseline, suggesting that noisy extractions can approximate unseen adjacencies. Without joint rules, the extraction-based model shows worse performance, pointing to the benefit of sophisticated joint modeling in mitigating noisy extractions.
Adjacency Precision Recall
TextAdj 0.32 0.11
StandardAdj 0.27 0.30
Table 2: Extraction-based adjacencies achieve higher precision but lower recall, further substantiating the need for joint rules in recovering missing causal orientations.

We evaluate our model and the observational baseline using 10-fold cross validation on DREAM4 networks. The baseline uses the same rules as our approach but computes StandardAdj as observed ground atoms derived from conditional independence tests.

To evaluate the additional benefit of joint rules, we compare sub-models of our approach and the baseline run with the causal orientation rules only. Table 1 shows average scores of all model variants for the regulatory network prediction task on DREAM4.

Noisy Extractions Maintain Performance

First, we see comparable performance between our model and the baseline, answering our first experimental question on how closely noisy extractions approximate adjacencies. In Table 1, there is no statistically significant difference between the scores of the two methods. This comparable performance suggests that the noisy extractions can substitute for observational-data computations without significantly degrading performance.

Joint Rules Overcome Noise

Our investigation into model variants sheds light on the second experimental question: how logical rules overcome the noise from extractions. When comparing the PC-only variants of each method, the observational variant gains over the extraction-based one, suggesting that sophisticated joint rules are needed to mitigate the noise from KB extractions. The consistency across predictions encouraged by the joint rules bolsters the extraction-based adjacency signal.

Extractions Yield Higher Precision, Lower Recall

To further investigate the extraction evidence mined from STRING, we compare both StandardAdj and TextAdj against gold-standard adjacencies, which we obtain from undirected regulatory links. Table 2 shows the average precision and recall of each adjacency evidence type across the DREAM4 subnetworks. Interestingly, TextAdj achieves higher precision than its statistical counterpart. However, StandardAdj gains over TextAdj in recall. The result further substantiates the benefit of joint modeling in recovering additional orientations under low-recall inputs. Nonetheless, the comparison points to the need for a deeper understanding of the role KBs play in causal reasoning.

5.3 Experiment Details

To obtain marginal and conditional (in)dependence tests, we use linear and partial correlations with Fisher’s Z transformation. We condition on all sets up to size two. We set rule weights for both PSL models to 5.0, except for rule C2, which is set to 10.0 since it encodes a strong acyclicity constraint. Both models use a threshold α on the p-value to categorize independence tests as independent or dependent. We select α with 10-fold cross validation: we hold out each subnetwork in turn and use the best average score across the other subnetworks to pick α from candidate orders of magnitude. The baseline selects two different α values, one for binning independence tests and one for computing adjacencies, while our model requires a single α for tests only. We also select rounding thresholds for both PSL models within the same cross-validation framework. Since α is typically small, we rescale the resulting truth values to reduce right-skewness. We rescale all STRING affinity scores to be between 0 and 1.
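The independence tests described above can be sketched as follows: given a (partial) correlation r, the Fisher Z transformation yields an approximately normal statistic from which a two-sided p-value is computed and compared against α. A scipy-free Python sketch (the function name and the normal-tail computation via the error function are our own choices):

```python
import math

def fisher_z_pvalue(r, n, num_cond):
    """Two-sided p-value for a (partial) correlation r from n
    samples with |S| = num_cond conditioning variables, via the
    Fisher Z transformation and a normal approximation."""
    r = max(min(r, 0.999999), -0.999999)  # guard against |r| = 1
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - num_cond - 3) * abs(z)
    # Standard normal upper tail via the error function.
    p = 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))
    return p

# A strong correlation over 210 samples is clearly significant...
p_strong = fisher_z_pvalue(0.6, 210, 0)
# ...while a weak partial correlation (|S| = 2) is not, at typical alpha.
p_weak = fisher_z_pvalue(0.05, 210, 2)
```

Binning these p-values against α yields the independence and dependence atoms that ground the PSL rules.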

6 Related Work

Our work extends constraint-based methods for causal discovery, most notably the PC algorithm Spirtes and Glymour (1991), which first infers adjacencies and then maximally orients them using deterministic rules based on conditional independence. PC only supports external evidence in the form of fixed edges or non-edges. Our work is motivated by recent approaches that cast causal discovery as a SAT instance over conditional independence statements Hyttinen et al. (2014); Magliacane et al. (2016); Hyttinen et al. (2013). SAT-based approaches are based on logical representations that more readily admit additional constraints and relations from domain knowledge. However, so far, logical causal discovery methods have not used noisy external evidence to identify probable edges.

In a separate vein, prior work has applied text mining to identify regulatory networks and genetic interactions solely from scientific literature Rodríguez-Penagos et al. (2007); Song and Chen (2009); Poon et al. (2014). In contrast, our goal is to propose techniques that leverage both statistical test signals and text evidence. The work most similar to ours combines gene expression data with evidence mined from knowledge bases to infer gene regulatory networks Chouvardas et al. (2016). However, that regulatory network inference orients edges using hard-coded knowledge of transcription factors instead of reasoning about causality. In our approach, we propose a principled causal discovery formulation as the basis of incorporating KB evidence.

7 Discussion and Future Work

In this work, we present an initial approach for reasoning with noisy extraction-based evidence directly in a logical causal discovery system. We benefit from a flexible logical formulation that supports replacing conventional adjacencies computed from observational data with cheaply obtained extractions. Our evaluation suggests that the noisy KB-based proxy signal achieves comparable performance to conventional methods. The promising result points to future research in exploiting KBs for causal reasoning, greatly mitigating the need for costly observational data. We see many directions of future work, including better extraction strategies for mining scientific literature and finding text-based proxies for additional statistical test signals. KBs could provide ontological constraints or semantic information useful for causal reasoning. We additionally plan to study knowledge-based constraints for causal discovery.


  • Bach et al. [2017] Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss Markov random fields and probabilistic soft logic. Journal of Machine Learning Research (JMLR), 2017. To appear.
  • Chickering [1996] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121–130. Springer, 1996.
  • Chouvardas et al. [2016] Panagiotis Chouvardas, George Kollias, and Christoforos Nikolaou. Inferring active regulatory networks from gene expression data using a combination of prior knowledge and enrichment analysis. BMC bioinformatics, 17(5):181, 2016.
  • Claassen and Heskes [2011] Tom Claassen and Tom Heskes. A logical characterization of constraint-based causal discovery. In UAI, 2011.
  • Hyttinen et al. [2013] Antti Hyttinen, Patrik O Hoyer, Frederick Eberhardt, and Matti Jarvisalo. Discovering cyclic causal models with latent variables: A general sat-based procedure. In UAI, 2013.
  • Hyttinen et al. [2014] Antti Hyttinen, Frederick Eberhardt, and Matti Järvisalo. Constraint-based causal discovery: Conflict resolution with answer set programming. In UAI, 2014.
  • Magliacane et al. [2016] Sara Magliacane, Tom Claassen, and Joris M Mooij. Ancestral causal inference. In NIPS, 2016.
  • Marbach et al. [2010] Daniel Marbach, Robert J Prill, Thomas Schaffter, Claudio Mattiussi, Dario Floreano, and Gustavo Stolovitzky. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the national academy of sciences, 107:6286–6291, 2010.
  • Poon et al. [2014] Hoifung Poon, Chris Quirk, Charlie DeZiel, and David Heckerman. Literome: Pubmed-scale genomic knowledge base in the cloud. Bioinformatics, 30(19):2840–2842, 2014.
  • Prill et al. [2010] Robert J Prill, Daniel Marbach, Julio Saez-Rodriguez, Peter K Sorger, Leonidas G Alexopoulos, Xiaowei Xue, Neil D Clarke, Gregoire Altan-Bonnet, and Gustavo Stolovitzky. Towards a rigorous assessment of systems biology models: the dream3 challenges. PloS one, 5:e9202, 2010.
  • Rodríguez-Penagos et al. [2007] Carlos Rodríguez-Penagos, Heladia Salgado, Irma Martínez-Flores, and Julio Collado-Vides. Automatic reconstruction of a bacterial regulatory network using natural language processing. BMC Bioinformatics, 8(1):293, 2007.
  • Song and Chen [2009] Yong-Ling Song and Su-Shing Chen. Text mining biomedical literature for constructing gene regulatory networks. Interdisciplinary Sciences: Computational Life Sciences, 1(3):179–186, 2009.
  • Spirtes and Glymour [1991] Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social science computer review, 9:62–72, 1991.
  • Szklarczyk et al. [2017] Damian Szklarczyk, John H Morris, Helen Cook, Michael Kuhn, Stefan Wyder, Milan Simonovic, Alberto Santos, Nadezhda T Doncheva, Alexander Roth, Peer Bork, et al. The string database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic acids research, 45(D1):D362–D368, 2017.