Reach Biomedical Information Extraction
Causal precedence between biochemical interactions is crucial in the biomedical domain, because it transforms collections of individual interactions, e.g., bindings and phosphorylations, into the causal mechanisms needed to inform meaningful search and inference. Here, we analyze causal precedence in the biomedical domain as distinct from open-domain, temporal precedence. First, we describe a novel, hand-annotated text corpus of causal precedence in the biomedical domain. Second, we use this corpus to investigate a battery of models of precedence, covering rule-based, feature-based, and latent representation models. The highest-performing individual model achieved a micro F1 of 43 points, approaching the best performers on the simpler temporal-only precedence tasks. Feature-based and latent representation models each outperform the rule-based models, but their performance is complementary to one another. We apply a sieve-based architecture to capitalize on this lack of overlap, achieving a micro F1 score of 46 points.
In the biomedical domain, an enormous amount of information about protein, gene, and drug interactions appears in the form of natural language across millions of academic papers. There is a tremendous ongoing effort [Nédellec et al.2013, Kim et al.2012, Kim et al.2009] to extract individual chemical interactions from these texts, but these interactions are only isolated fragments of larger causal mechanisms such as protein signaling pathways. However, no existing resource, including any database, describes these complete mechanisms in a form that lends itself to causal search or inference. The absence of such a database is not for lack of trying; Pathway Commons [Cerami et al.2011]
aims to address the need, but its authors estimate that it currently covers only 1% of the literature due to the high cost of annotation (personal communication). This issue only grows more pressing with the yearly growth in biomedical publishing, which presents an otherwise insurmountable challenge for biomedical researchers to query and interpret.
The Big Mechanism program [Cohen2015] aims to construct exactly such large-scale mechanistic information by reading and assembling protein signaling pathways that are relevant for cancer, and exploit them to generate novel explanatory and treatment hypotheses. Although prior work [Chambers et al.2014, Mirza2016] has addressed the challenging area of temporal precedence in the open domain, the biomedical domain presents very different data and, consequently, requires novel techniques. Precedence in mechanistic biology is causal rather than temporal. Though event temporality is crucial to understanding electronic health records for individual patients [Bethard et al.2015, Bethard et al.2016], its contribution to the understanding of biomolecular reactions is less clear as these events and processes may repeat in extremely short cycles, continue without end, or overlap in time. At any level of abstraction, causal precedence encodes mechanistic information and facilitates inference over spotty evidence. For the purpose of this work, precedence is defined for two events, A and B, as
A precedes B if and only if the output of A is necessary for the successful execution of B. (See the "precedes" examples in Table 1.)
Very little annotated data exists for causal precedence, especially from efforts focusing on signaling pathways. BioCause [Mihăilă et al.2013], for instance, is centered on connections between claims and evidence and contains only 51 annotated examples of causal precedence (marked in the BioCause corpus as Causality events with Cause and Effect arguments; the remaining 800 annotations are claim-evidence relations). Our work offers three contributions in aid of automatically extracting causal ordering in biomedical text; the corpus, tools, and system introduced in this work are publicly available at https://github.com/myedibleenso/this-before-that. First, we provide and describe a dataset of real text examples, manually annotated for causal precedence. Second, we analyze the efficacy of a battery of different models in automatically determining precedence, built on top of the Reach automatic reading system [Valenzuela-Escárcega et al.2015a, Valenzuela-Escárcega et al.2015c]
and measured against this novel corpus. In particular, we investigate three classes of models: (a) deterministic rule-based models inspired by the precedence sieves proposed by Chambers et al. [2014], (b) feature-based models, and (c) models that rely on latent representations, such as long short-term memory (LSTM) networks [Hochreiter and Schmidhuber1997]. Our analysis indicates that while the top-performing individual model achieves a micro F1 of 43 points, these models are largely complementary, with a combined recall of 58 points. Lastly, we conduct an error analysis of these models to motivate and inform future research.
Table 1: Examples of each relation label (simplified).

E1 precedes E2:
  "A is phosphorylated by B. Following its phosphorylation, A binds with C."
E2 precedes E1:
  "A is phosphorylated by B. Prior to its phosphorylation, A binds with D."
Equivalent:
  "The phosphorylation of A by B." / "A is phosphorylated by B."
E1 specifies E2:
  "A is phosphorylated by B at Site 123." / "A is phosphorylated by B."
E2 specifies E1:
  "A is phosphorylated by B." / "A is phosphorylated by B at Site 123."
Other:
  "B does not regulate C when C is bound to A."
None:
  "A phosphorylates B." / "A ubiquitinates C."
Our corpus annotates several types of relations between mentions of biochemical interactions. Following common terminology promoted by the BioNLP shared tasks, we will interchangeably use "events" to refer to these interactions. To generate candidate events for our planned annotations, we ran the Reach event extraction system [Valenzuela-Escárcega et al.2015a, Valenzuela-Escárcega et al.2015c] over the full text of 500 biomedical papers taken from the Open Access subset of PubMed (http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/); we ignored the "references", "materials", and "methods" sections, which generally do not contain mechanistic information. The events extracted by Reach are biochemical events of two types: simple events such as phosphorylation that modify one or more entities (typically proteins), and nested events (regulations) that have other events as arguments.
To improve the likelihood of finding pairs of events with a relevant link, we filtered event pairs by imposing the following requirements for inclusion in the corpus:
Event pairs must share at least one participant. This constraint is based on the observation that interactions that share participants are more likely to be connected.
Event pairs must be within 1 sentence of each other. Similarly, discourse proximity increases the likelihood of two events being related.
Event pairs must not share the same type. This helps to maximize the diversity of the dataset.
Event pairs must not already be contained in an extracted Regulation event. For example, we did not annotate the relation between the binding and the phosphorylation events in “The binding of X and Y is inhibited by X phosphorylation”, because it is already captured by most state-of-the-art biomedical event extraction systems.
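The four inclusion constraints above can be sketched as a simple filter over candidate pairs; the `Event` class below is a hypothetical simplification of Reach's actual event representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    # Hypothetical stand-in for a Reach event mention.
    label: str                   # e.g. "Phosphorylation", "Binding"
    sentence: int                # sentence index within the paper
    participants: frozenset      # participating entities (e.g. protein names)
    in_regulation: bool = False  # already nested inside a Regulation event?

def keep_pair(e1: Event, e2: Event) -> bool:
    """Apply the four corpus-inclusion constraints to a candidate pair."""
    return (
        bool(e1.participants & e2.participants)          # share >= 1 participant
        and abs(e1.sentence - e2.sentence) <= 1          # within 1 sentence
        and e1.label != e2.label                         # different event types
        # rough proxy for "not already captured by a Regulation event":
        and not (e1.in_regulation and e2.in_regulation)
    )
```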
After applying these constraints, only 1700 event pairs remained. In order to rapidly annotate the event pairs, we developed a browser-based annotation UI that is completely client-side (see Figure 3). Using this tool, we annotated 1000 event pairs for this work; 84 of these were discarded due to severe extraction errors. The annotations include the event spans, event triggers (i.e., the verbal or nominal predicates that indicate the type of interaction such as “binding” or “phosphorylated”), source document, minimal sentential span encompassing both event mentions, and whether or not the event pair involves coreference for either the event trigger or the event participants. For events requiring coreference resolution, we expanded the encompassing span of text to also capture the antecedent. Note that domain-specific coreference resolution is a component of the event extraction system used here [Bell et al.2016].
When describing the relations between these event pairs, we refer to the event that occurs first in text as Event 1 (E1) and the event that follows as Event 2 (E2). Each (E1, E2) pair was assigned one of seven labels: “E1 precedes E2”, “E2 precedes E1”, “Equivalent”, “E1 specifies E2”, “E2 specifies E1”, “Other”, or “None”. Table 1 provides examples for each of these labels. We converged on these labels because they are fundamental to the assembly of causal mechanisms from a collection of events. Collectively, the seven labels address three important assembly tasks: equivalence, i.e., understanding that two event mentions discuss the same event, subsumption, i.e., the two mentions discuss the same event, but one is more specific than the other, and, most importantly, causal precedence, the identification of which is the focus of this work. During the annotation process, we came across examples of other relevant phenomena. We grouped these instances under the label “Other” and leave their analysis for future work.
Though simplified, the examples in Table 1 illustrate that this is a complex task, sensitive to linguistic evidence. For example, the direction of the precedence relation in the first two rows of the table changes based on a single word in the context ("prior" vs. "following").
In terms of the distribution of relations, causal precedence pairs appear more frequently within the same sentence, while cases of the subsumption (“specifies”) and equivalence relations are far more common across sentences (see Figure 1). Coreference is involved in 10–15% of the instances for each relation label (see Figure 2).
The annotation process was performed by two linguists familiar with the biomedical domain. To minimize errors, the annotation task was initially performed together at the same workstation, similar to pair programming. On a randomly selected sample of 100 event pairs, the two annotators had a Cohen's kappa score [Cohen1960] of 0.82, indicating "almost perfect" agreement for the precedes labels [Landis and Koch1977].
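Cohen's kappa can be computed directly from the two annotators' label sequences; a minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels1, labels2):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels1) == len(labels2) and labels1
    n = len(labels1)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    # Expected agreement if the annotators labeled independently,
    # each according to their own label distribution.
    c1, c2 = Counter(labels1), Counter(labels2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)
```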
We have developed both deterministic, interpretable models and automatic machine-learning models for detecting causal precedence in our dataset. Importantly, the models covered in this work focus solely on causal precedence, which is the most complex relation annotated in the dataset introduced above. Thus, for all experiments discussed here, we reduce these annotations to three labels: "E1 precedes E2", "E2 precedes E1", and "Nil", which covers all the other labels in the corpus.
The deterministic models are defined by a small number of hand-written rules using the Odin event extraction framework [Valenzuela-Escárcega et al.2015b]. The number of rules for each model is shown in Table 2, and sharply contrasts with the 92,711 features introduced later (Table 3) that are used by our machine-learning models. To avoid overfitting, all of the deterministic models were created without reference to the annotation corpus, using general linguistic expertise and domain knowledge.
Within sentences, syntactic regularities can be exploited to cover a large variety of grammatical constructions indicating precedence relations. Rules defined over dependency parses [De Marneffe and Manning2008] capture precedence in sentences like the examples below, as well as many others.
[The RBD of PI3KC2B binds HRAS]_after, when [HRAS is not bound to GTP]_before
[The ubiquitination of A]_before is followed by [the phosphorylation of B]_after
Other phrases captured include: "precedes", "due to", "leads to", "results in", etc.
Although syntax operates over single sentences, cross-sentence time expressions can indicate ordering, as shown in the examples below. We exploit these regularities as well by checking for sentence-initial word combinations.
[A is phosphorylated by B]_before. As a downstream effect, [C is …]_after
[A is phosphorylated by B]_before. [C is then …]_after
Other phrases captured include: "Later", "In response", "For this", and "Ultimately".
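These sentence-initial cue checks can be sketched as follows; the cue list is illustrative, not the full inventory used by the Odin rules.

```python
# Illustrative sentence-initial cues signaling that the event in the
# preceding sentence came first; the deployed rules cover more variants.
E1_PRECEDES_CUES = (
    "as a downstream effect",
    "in response",
    "for this",
    "later",
    "ultimately",
)

def cross_sentence_order(second_sentence: str):
    """If the second sentence opens with a temporal cue, predict that the
    event in sentence 1 precedes the event in sentence 2; else abstain."""
    s = second_sentence.lower().lstrip()
    for cue in E1_PRECEDES_CUES:
        if s.startswith(cue):
            return "E1 precedes E2"
    return None
```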
Following Chambers et al. [2014], we use deterministic rules to establish precedence between events with certain verbal tense and aspect. These rules are derived from linguistic analyses of tense and aspect [Reichenbach1947, Derczynski and Gaizauskas2013]. The example below illustrates a case in which we can accurately infer order from this information alone. Because "has been phosphorylated" has past tense and perfective aspect, this model concludes that it precedes "share" (present tense, simple aspect), and thus the binding of histone H2A.
These [PTIP] proteins also share the ability to bind histone H2A (or H2AX in mammals) that has been phosphorylated….
The logic determining which tense-aspect combinations receive which precedence relations is identical to CAEVO, which is open source (https://github.com/nchambers/caevo). However, CAEVO operates over annotations that include gold tense and aspect values, whereas this model additionally detects tense and aspect using Odin rules before applying this logic.
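The core of this sieve can be sketched as a lookup over tense-aspect pairs; only a couple of illustrative combinations from the full CAEVO-style tense x aspect matrix are shown, and the label strings are simplified.

```python
def order_by_tense_aspect(e1, e2):
    """Sketch of CAEVO-style ordering from verbal tense and aspect.

    e1, e2: (tense, aspect) tuples, e.g. ("PAST", "PERFECTIVE").
    Returns a precedence label, or None when the combination is
    uninformative (the real matrix covers many more combinations).
    """
    tense1, aspect1 = e1
    tense2, aspect2 = e2
    # Past perfective vs. simple present: the perfective event came first,
    # as in "share ... H2A that *has been phosphorylated*".
    if (tense1, aspect1) == ("PAST", "PERFECTIVE") and tense2 == "PRESENT":
        return "E1 precedes E2"
    if (tense2, aspect2) == ("PAST", "PERFECTIVE") and tense1 == "PRESENT":
        return "E2 precedes E1"
    return None
```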
Table 3: Overview of the primary features used in the feature-based classifier, grouped into four classes.

Event features (extracted from the two participating events, in isolation):
- Event labels: the taxonomic labels Reach assigned to the event (e.g., phosphorylation → Phosphorylation, AdditiveEvent, …).
- Event trigger: the predicate signaling an event mention (e.g., "phosphorylated", "phosphorylation").
- Event trigger + label: a concatenation of the event's trigger with the event's label.

Event-Event (surface) features (model the lexical context between the two events):
- Token n-grams with entity replacement: n-grams of the tokens in the mention span, where each entity is replaced with its entity label (e.g., "the ABC protein" → "the PROTEIN"). If an entity is shared between the pair of events, it is replaced with the label SHARED.
- Token n-grams with role replacement: n-grams of the tokens in the mention span, where each argument is replaced with its argument role (e.g., "A inhibits the phosphorylation of B" → "CONTROLLER inhibits the CONTROLLED").
- Interceding tokens (n-grams): n-grams (1-3) of the tokens between E1 and E2.

Event-Event (syntax) features (model the syntactic context between the two events):
- Syntactic path from trigger to arguments: variations of the syntactic dependency path from an event's trigger to each of its arguments (unlexicalized path, path + lemmas, trigger → argument role, trigger → argument label, etc.).
- Root-to-trigger path: a concatenation of the syntactic path from the sentential ROOT to an event's trigger (see the example in Figure 4).
- Trigger-to-trigger syntactic paths: the syntactic path from the trigger of E1 to the trigger of E2.
- Shortest syntactic paths: the shortest syntactic path between E1 and E2 (restricted to intra-sentence cases).
- Syntactic distance: the length of each syntactic path (restricted to intra-sentence cases).

Coreference features (capture coreference resolution information that impacts the participating events):
- Event features for anaphors: whether or not an event mention is resolved through coreference. For cases of coreference, the Event features are generated with the prefix "coref-anaphor" for the anaphor span, as in: "A binds with B [E1-antecedent]. This interaction [E1-anaphor] precedes the phosphorylation of C [E2]."
- Resolved arguments: which arguments, if any, were resolved through coreference. For example: "The mutant [theme] binds with B [theme]" [E1] → theme:resolved.
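As one illustration, the "token n-grams with entity replacement" feature might be computed as follows; this is a simplified sketch, and the `entities`/`shared` inputs are hypothetical stand-ins for Reach's richer entity annotations.

```python
def entity_ngrams(tokens, entities, shared, n=2):
    """n-grams over a mention span, with each entity token replaced by its
    entity label, or by SHARED if the entity appears in both events.

    tokens:   the mention-span tokens
    entities: dict mapping entity token -> entity label (e.g. "PROTEIN")
    shared:   set of entity tokens that participate in both events
    """
    replaced = []
    for tok in tokens:
        if tok in shared:
            replaced.append("SHARED")
        elif tok in entities:
            replaced.append(entities[tok])
        else:
            replaced.append(tok)
    # Slide an n-token window over the replaced sequence.
    return [" ".join(replaced[i:i + n]) for i in range(len(replaced) - n + 1)]
```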
Most instances of causal precedence cannot be captured with deterministic rules, because they lack explicit words, phrases, or syntactic structures that unambiguously mark the relation. Using a combination of the surface, syntactic, and taxonomic features outlined in Table 3, we trained a set of statistical classifiers to detect causal precedence relations between pairs of events in our corpus. For training and testing purposes, we treated any instance not labeled as either "E1 precedes E2" or "E2 precedes E1" as a negative example. We examined the following statistical models: a linear kernel SVM [Chang and Lin2011], logistic regression [Fan et al.2008], and random forest (RF) [Surdeanu et al.2014]. For the SVM and logistic regression (LR) models, we also compared the effects of L1 and L2 regularization.
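This evaluation setup can be sketched with scikit-learn as a hypothetical stand-in for the implementations cited above; the feature matrix `X` and labels `y` are assumed to come from the feature extraction described in Table 3.

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import f1_score

def evaluate_classifiers(X, y, n_splits=10):
    """Micro-averaged F1 per classifier family under stratified k-fold CV
    (illustrative stand-ins for the SVM/LR/RF models used in the paper)."""
    models = {
        "SVM-L1": LinearSVC(penalty="l1", dual=False),
        "LR-L2": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = {}
    for name, model in models.items():
        preds = cross_val_predict(model, X, y, cv=cv)
        scores[name] = f1_score(y, preds, average="micro")
    return scores
```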
Due to the complexity of the task and variety of causal precedence instances encountered during the annotation process, it is unclear whether a linear combination of engineered features is sufficient for broad coverage classification. For this reason, we introduce a latent feature representation model using an LSTM [Hochreiter and Schmidhuber1997, Bergstra et al.2010, Chollet2015] to capture underlying semantic features by incorporating long-distance contextual information and selectively persisting memory of previous event pairs to aid in classification.
The basic architecture is shown in Figure 5. The input to this model is the provenance of the relation, i.e., the whole text containing the two events and the text in between. Formally, this is represented as a concatenated sequence of 200-dimensional vectors, where each vector in the sequence corresponds to a token in the minimal sentential span encompassing the event pair being classified. Intuitively, this LSTM "reads" the text from left to right and outputs a classification label from the set of three when done. We consider two variations of this model: the basic model (LSTM), with the vector weights for each token uninitialized, and a second form (LSTM+P), where the vectors are initialized using pre-training. In the pre-training configuration, the vector weights are initialized using word embeddings generated by a word2vec [Mikolov et al.2013, Řehůřek and Sojka2010] model trained on the full text of over 1 million biomedical papers taken from the Open Access subset of PubMed. Because the corpus contains only 1000 annotations, we hypothesized that pre-training could improve prediction of causal precedence by guiding the model with distributional semantic representations specific to this domain.
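The pre-training configuration amounts to seeding the embedding matrix with word2vec vectors where available; a minimal sketch, assuming `pretrained` maps words to 200-dimensional vectors.

```python
import numpy as np

def init_embeddings(vocab, pretrained, dim=200, seed=0):
    """Build an embedding matrix for the LSTM input layer.

    Rows for words with pre-trained word2vec vectors are copied in;
    the remaining rows are randomly initialized. This corresponds to
    the +P (pre-trained) configurations; the plain LSTM/FLSTM models
    would use only the random initialization.
    """
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.05, size=(len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            emb[i] = pretrained[word]
    return emb
```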
Building on this simple blueprint, we designed a three-pronged "pitchfork" (FLSTM) in which the span of E1, the span of E2, and the minimal sentential span encompassing E1 and E2 each serve as a separate input, allowing the model to explicitly address each of them as well as discover how the three inputs relate to one another. This architecture is shown in Figure 6. Each input feeds into its own LSTM and corresponding dropout layer before the three forks are merged via a concatenation of tensors. Like the basic model, one version of the "pitchfork" is trained with vector weights initialized using the pre-trained word embeddings (FLSTM+P).
We summarize the performance of all these models on the dataset introduced above in Table 4. We report micro precision, recall, and F1 scores for each model. With fewer than 200 instances of causal precedence occurring in 1000 annotations, training and testing for both the feature-based classifiers and the latent feature models was performed using stratified 10-fold cross validation. For the latent feature models, training ran for a maximum of 100 epochs with support for early stopping through monitoring of validation loss (the validation set used for each fold came from a different class-balanced fold). Weight updates were made on batches of 32 examples, and all folds completed in fewer than 50 epochs.
The table also includes a sieve-based ensemble system, which performs significantly better than the best-performing single model. In this architecture, the sieves are applied in descending order of precision, so that the positive predictions of the higher precision sieves will always be preferred to contradictory predictions made by subsequent, lower-precision sieves. Figure 7 illustrates that as sieves are added, the F1 score remains fairly constant, while recall increases at the cost of precision.
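The sieve combination itself is simple; a minimal sketch, where each sieve is a function that returns a precedence label or abstains with None.

```python
def sieve_predict(sieves, instance):
    """Apply sieves in descending order of precision.

    The first sieve to make a positive (non-Nil) prediction wins, so a
    lower-precision sieve can never overwrite the prediction of a
    higher-precision one; if every sieve abstains, predict Nil.
    """
    for sieve in sieves:
        label = sieve(instance)
        if label is not None and label != "Nil":
            return label
    return "Nil"
```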
Despite some obvious patterns noted in Table 1, the deterministic models perform the worst, due in large part to the rarity in the corpus of the cues they target. An analysis of this result is given in Section 5. Overall, our top-performing individual model was the linear kernel SVM with L1 regularization. In all cases, the feature-based classifiers outperform the latent feature representations, suggesting that when little data is available, feature-based classifiers capitalizing on high-level linguistic features generalize better than latent feature models. However, as the discussion in Section 5.1 will show, our combined model demonstrates that the latent and feature-based models are largely complementary.
Overall, results are promising, particularly in light of the conscious choice to omit (causal) regulation reactions from this task, as they are already captured by the Reach reading system.
However, the deterministic models created so far have extremely low recall, so low that it is difficult even to estimate their precision. An analysis of the Reichenbach model reveals one source of this low coverage. In short, although writers could describe causal mechanisms using temporal indicators such as tense and aspect, such temporal description is rare enough in this domain that it is not represented in our randomly sampled dataset. Table 5 illustrates the lack of overlap with informative tense-aspect combinations; a single tense is used per passage, and no perfective aspect is used.
Similarly, the time expressions required by the deterministic intra- and inter-sentence precedence rules are rare enough to make them ineffective on this sample.
As Chambers et al. [2014], Mirza [2016], and many other systems have shown, models can be applied sequentially in "sieves" to produce higher-quality output. Ideally, each model in a sieve-based system captures different portions of the data through a mixture of approaches, distinguishing this method from more naive ensembles in which the contributions of a lone component would be washed out. Figure 8 details this observation by showing the coverage differences between the models described here.
We performed an analysis of the false positives shared by all feature-based classifiers, in addition to the false negatives shared by all models. Here we limit our discussion to only the most prominent characteristic shared by the majority of false positives.
More than half of the false positives share contrastive discourse features, suggesting that a model of discourse could improve classifier discrimination. The example below demonstrates such a contrastive structure, in which "whereas" introduces a clause (and event) that is contrasted with, and therefore both temporally and causally distinct from, the following clause (and event). The existence of regular cues like "whereas" indicates that a feature to explicitly model these structures is possible.
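Such a contrastive-marker feature could be as simple as the following sketch; the cue list is illustrative, not a vetted inventory.

```python
# Illustrative contrastive discourse cues; single-word cues are matched
# against the token set, multi-word cues as substrings of the span.
SINGLE_WORD_CUES = {"whereas", "however", "but", "while"}
MULTI_WORD_CUES = ("in contrast",)

def has_contrastive_cue(tokens):
    """Flag spans containing a contrastive discourse marker, which the
    error analysis suggests often signals a non-precedence relation."""
    toks = [t.lower() for t in tokens]
    text = " ".join(toks)
    return bool(SINGLE_WORD_CUES & set(toks)) or any(
        cue in text for cue in MULTI_WORD_CUES
    )
```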
[Whereas PRAS40 inhibits the mTORC1 activity via raptor]_E1, [DEPTOR was identified to interact directly with mTOR in both mTORC1 and mTORC2 complexes]_E2
Though focused on temporal ordering, Chambers et al. [2014] adopt a sieve-based approach, with high-precision deterministic sieves preceding and constraining lower-precision, higher-recall machine-learning sieves. As with our system, the deterministic sieves were linguistically motivated, and they had the additional advantage of operating over time expressions (during, Friday, etc.) as well as events, the former of which are typically lacking in the biomedical domain.
Mirza [2016] implemented a hybrid sieve-based approach for causal relation detection between events that includes a set of causal verb rules with corresponding syntactic dependencies, as well as a feature-based classifier. However, both of these works focus on open-domain texts. To our knowledge, we are the first to investigate causal precedence in the biomedical domain.
These are the first experiments on the automatic annotation of causal precedence in the biomedical domain. Although the dearth of temporal expressions and other regular linguistic cues makes the task especially difficult in this domain, the initial results are promising and demonstrate that a sieve-based combination of the models tested here improves performance over the top-performing individual component. Both the annotation corpus and the models described here represent large steps toward linking automatic reading to larger, more informative biological mechanisms.
This work was funded by the Defense Advanced Research Projects Agency (DARPA) Big Mechanism program under ARO contract W911NF-14-1-0395.
Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.