Progress has been made in automating biological event extraction from biomedical texts [1], but little attention has been given to identifying and associating the biological context in which such events occur. Biological context, however, often plays a critical role in interpreting these events. For example, the following is a summary of a key finding in a paper by Young and Jacks [2]:
Mutations in oncogenes are much more likely to lead to cancer in some tissue types than others, because some tissues express other proteins that counteract the oncogene. For example, in mice, the G12D activating mutation in K-ras causes lung tumors but not muscle-derived sarcomas, because muscle cells express two proteins (Arf and Ink4a) that cause cell division to halt when Ras is overactive.
An automated event extraction system might extract the biochemical event “G12D activates mutation in K-ras”, but without understanding the biological context – of whether this event occurs in lung or muscle tissue – the reader will not understand why the event does or does not lead to cancer.
Biological context is not only important, it also comes in many varieties. Here we focus on biological container context, where a biological “container” may be specified at various levels of granularity, but each level serves to further specify the type of biological system in which an event might occur. From the highest level of granularity, we consider species (human, mouse), then tissue (lung, lymphoid), and finally cell type (t-cell, endothelial). Container contexts across levels often stand in mereological (“part-whole”) relationships, but knowing a finer level of granularity does not always fully determine higher levels. For example, a species may contain several tissue types, but these may also be present in other species.
To fully understand the biological context in which an event occurs, we need to know the container types at each level of specification. In fact, a special case of biological context specification comes in the form of naming the cell line used in experiments. Cell lines comprise a specific cell culture cloned from a single cell and therefore consist of cells with a uniform genetic makeup. Cell lines available for purchase typically specify the cell type, tissue, and species from which they were derived. For example, the PCS-100-020 cell line (http://www.atcc.org/en/Products/Cells_and_Microorganisms/Human_Primary_Cells/Cell_Type/Endothelial_Cells/PCS-100-020.aspx) is derived from endothelial cells of the artery tissue of a human (species).
In this paper we treat the problem of extracting biological container context as one of identifying container context mentions, a problem of named entity recognition (NER), and of associating them with events, a kind of relation extraction. A key challenge for context association is that context mentions are often not found in the same sentence as the event, making this an inter-sentential relation extraction problem. For example, consider the following excerpt [3]:
This route promotes the translocation of Rac1/RhoGDI to F-actin-rich membrane areas, the Pak-dependent release of RAC1 from the complex and Rac1 activation. This pathway is important for optimal Rac1 activation during the signaling of the EGF receptor, integrins and the antigenic T-cell receptor.
Here, the three events in the first sentence (the translocation of Rac1/RhoGDI, the Pak-dependent release of Rac1 from the complex, and Rac1 activation) are associated with the T-cell context in the second sentence.
We make the following contributions in this paper: (1) provide an analysis of the context-event inter-sentential relation extraction problem, (2) develop a corpus of context-event relations for evaluation, and (3) present first results of an inter-sentential context extraction and association model that provides a baseline for future work.
II Related Work
The context association problem relates to two general problems that have been studied in the natural language processing and linguistics communities.
The first problem, relation extraction, has received extensive attention [4, 5], including within the biomedical domain [6, 7], with recent promising results incorporating distant supervision. All of this work, however, focuses on identifying relations among entities within the same sentence. The context association problem, on the other hand, deals with inter-sentential relations, and as Bach and Badaskar (2007) note, “it is not straightforward to modify [sentence-level] algorithms … to capture long range relations.”
Very little prior work has studied inter-sentential relation extraction. A notable exception, Swampillai & Stevenson, combined within-sentence syntactic features with an introduced dependency link between the root nodes of parse trees from different sentences that contain a given pair of entities. Swampillai & Stevenson used these features to train an SVM to extract inter-sentential relations from the MUC6 corpus (https://catalog.ldc.upenn.edu/LDC20003T13). In contrast, our work is within the biomedical domain, requiring the development of a different set of features, and we also develop a novel feature aggregation technique that facilitates improved context association, as described in the following sections.
Context-event association also bears similarity to a second problem, bridging anaphora resolution, which has been primarily investigated theoretically in the linguistics literature. Bridging anaphora resolution aims at identifying associations between entities at the discourse (rather than single-sentence) level. As Irmer notes, the relation between the two entities “is not explicitly stated by linguistics means”, but knowledge of the relation “is necessary for successfully interpreting a discourse.” As in the case of container contexts, for example, the relation may be mereological, e.g., “I looked into the room. The ceiling was very high.” [10, p. 162].
One prior computational model of bridging anaphora makes use of Discourse Representation Theory to create a rule-based system for determining what kind of bridging anaphoric relationship two entities might have. By contrast, Poesio et al. developed a multi-layer perceptron classifier that uses a measure of lexical distance derived from the WordNet database, among other features, to achieve a maximum accuracy score of 79.3% on a small corpus. Both models provide interesting approaches to subclasses of bridging anaphora resolution, but neither generalizes to the biomedical context-event association problem, where a complete reworking of relevant features has been required to successfully associate biological container context with events.
Some prior art exists specifically to contextualize biochemical events. Gerner et al. associate anatomical contextual containers with event mentions that appear in the same sentence, via a set of rules that consider lexical patterns in the case of ambiguity and fall back to token distance if no pattern is matched. Sarafraz elaborates on the same idea by incorporating dependency trees into the rules instead of lexical patterns, as well as introducing a method to detect negations and speculative statements. The method we present in this paper is related to this prior art in the sense that we attribute contextual relations between entities and biochemical events, but we focus on inter-sentential relations instead of intra-sentential ones.
III The Context-Event Relation Corpus
With the help of three biology domain experts, we compiled an annotated corpus of biological container context mentions associated with biochemical events. The corpus consists of 22 biomedical research papers about the Ras cancer pathway. All of the papers are available from the PubMed Open Access repository (http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). The complete set of annotations is also open source and available online (https://ml4ai.github.io/BioContext).
The first step in constructing the annotation corpus involved identifying mentions of biochemical events within the text. Here, a biochemical event is a relation between one or more entities participating in a biochemical reaction or its regulation. A mention of a biochemical event can be identified by a trigger word, where trigger words are usually the name of the chemical reaction, e.g., phosphorylation, ubiquitination, expression, etc. A sentence may contain more than one event. For example, the phrase “phosphorylation of plexin-As by nonreceptor tyrosine kinases Fes and Fps and Fyn” contains a total of four events: one phosphorylation event and three different regulation events (each kinase regulates the phosphorylation). Each trigger word is considered once, and forms the basis for one event in the corpus.
We identified biochemical events using two independent methods. First, we asked our biology domain experts to go through the papers and identify spans of text that they believe express one or more biochemical events. For this task, they were provided with an interface by which they could simply highlight spans of text (or move span boundaries). Contiguous spans of text could contain a mention of more than one biochemical reaction, but we treat each contiguous span of text as just a single mention.
We also used REACH [16, 17], an open source biomedical event extraction system built on top of the ODIN information extraction library, to identify biochemical events. REACH associates with each extracted event the source span of text that served as evidence for the event. We combined any event mentions whose spans overlap and counted them as one single event. In the example above, REACH does identify four separate event mentions, but they would be combined into one event mention text span.
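The span-combination step can be illustrated with a short sketch. This is not the REACH implementation; the `(start, end)` token-offset representation with an exclusive end, and the function name `merge_overlapping_spans`, are assumptions made only for illustration.

```python
from typing import List, Tuple

# Assumed representation: (start, end) token offsets, end exclusive.
Span = Tuple[int, int]


def merge_overlapping_spans(spans: List[Span]) -> List[Span]:
    """Collapse every group of overlapping spans into one covering span."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Four hypothetical event mentions over one phrase, plus one unrelated mention.
example = [(0, 12), (3, 8), (5, 10), (7, 12), (20, 25)]
print(merge_overlapping_spans(example))
```

Under this convention, spans that merely touch (one ends where the next begins) are kept separate, since they share no token.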
The next step in the corpus construction involved identifying any mentions of biological container context. REACH includes a named entity recognition (NER) facility that detects candidate context mentions in text and grounds them to a consistent ID. The NER facility works by matching word tokens against multiple knowledge bases. These knowledge bases are dictionaries that map a sequence of words to a unique grounding identifier. Several different words or phrases may share the same identifier when multiple lexical expressions refer to the same kind of entity, e.g., “human” can be indicated by woman, man, patient, child, etc. This design is inspired by the Linnaeus system, a taxonomy-based NER system for labeling species mentions; in fact, the species NER knowledge base is a subset of the Linnaeus dictionary. Every individual knowledge base in REACH has a category: species, organs, tissue types, cell lines, or cellular components. Each of these categories represents a different notion of biological container, and the categories are not mutually exclusive. The knowledge bases were assembled by scraping specialized websites that contain curated enumerations of entities belonging to those categories. When a sequence of words matches an entry of a knowledge base, the entry’s category together with its grounding identifier constitutes a context type. For this work, we have restricted our use to the knowledge bases of species, tissue types, and cell lines. While context mentions also take up spans of text (usually one to just a few words), they generally do not overlap, unlike event mentions.
With the spans of text identified as containing biological event and context mentions, we then asked the domain expert annotators to identify the context mentions associated with each span of text associated with one or more events. For this task, the annotators were provided an annotation tool that displayed the original text, with spans of text associated with an event highlighted in green and spans of text associated with a context mention highlighted in yellow; the annotators could then select an event and context span and indicate whether they are associated. Each container context mention associated with an event (as text span) is then taken to constitute one positive instance of a context-event relation. Multiple context mentions (whether of the same type, e.g., two instances of human, or different, e.g., one instance of human and one of rat) may be associated with the same event, each comprising a separate context-event relation.
III-A Negative Examples and Extended Positive Examples
The annotation process produced a gold-standard set of event-context associations, but two problems remain. First, the annotations provided by the domain experts consisted of only positive examples. The annotators reported that it was very unnatural to identify explicit negative examples, that is, contexts that were categorically not related to a given event. As our classifier learning framework described in Section IV requires both positive and negative examples, we therefore developed a method for estimating negative instances.
The second, related, problem is that each annotation relates an event to a specific context mention. However, other mentions of that context type might occur in other sentences that the annotators did not label (we return to this distinction between mention and type in Section IV). Again, annotators found it most natural to identify the context mentions that were directly relevant, rather than exhaustively labeling all mentions that might be relevant or irrelevant. We make the simplifying assumption that if an annotator associated one context mention with an event, then for the purposes of constructing a training data set, all other instances of that context type mentioned in the paper are also relevant to associating that context with the event.
We used the REACH-extracted context mentions that our domain experts did not explicitly annotate to build a set of negative example context-event pairs (addressing the first problem) and to extend the number of positive example context mention and event pairs (addressing the second problem). These were constructed as follows. First, each paper (represented in an XML format) was processed by an NLP pipeline to transform it into a plain text representation separated by sentence. This representation allowed us to associate every annotation with its location relative to the sentences in the corpus. Next, we considered all the REACH-extracted context mentions paired with each event mention: if the context mention did not have the same grounding ID as one of the expert-annotated context mentions for that event, it was labeled as a negative example for that event; on the other hand, if it did have the same ID as one of the expert-annotated contexts, then it was labeled a positive example.
The above procedure resulted in two context mention sets for each event, each containing zero or more context mentions that come from sentences throughout the paper: one set representing positive context mention associations, the other representing context mentions whose context types are assumed to not be associated with the event (negative associations).
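The labeling rule above can be sketched for a single event as follows. The function name, the `(grounding_id, sentence_index)` mention representation, and the ID strings are hypothetical and chosen only for illustration; they are not taken from the REACH output format.

```python
from typing import Dict, List, Set, Tuple

# Assumed representation of an extracted context mention:
# (grounding ID, index of the sentence in which it occurs).
Mention = Tuple[str, int]


def label_context_mentions(
    annotated_ids: Set[str], extracted_mentions: List[Mention]
) -> Dict[str, List[Mention]]:
    """Split the extracted context mentions for one event into two sets.

    A mention is positive if its grounding ID matches an expert-annotated
    context for the event, and negative otherwise.
    """
    labeled: Dict[str, List[Mention]] = {"positive": [], "negative": []}
    for mention in extracted_mentions:
        grounding_id, _sentence_index = mention
        key = "positive" if grounding_id in annotated_ids else "negative"
        labeled[key].append(mention)
    return labeled


# Hypothetical IDs: the experts linked this event to one context type only.
annotated = {"taxonomy:9606"}
mentions = [("taxonomy:9606", 0), ("taxonomy:10090", 4), ("taxonomy:9606", 12)]
result = label_context_mentions(annotated, mentions)
print(result)
```

Note that the second mention of the annotated type (in sentence 12) becomes an extended positive example even though no annotator marked it directly.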
In Section V, we evaluate the performance of a set of classifiers designed to label context types associated with events. Each classifier takes as input a paper with already identified events and, based on context mentions extracted by REACH, determines what context types are associated with each event. Rather than test whether each context mention individually indicates that a context type is associated with the event, we instead aggregate the evidence of all context mentions of the same context type (as indicated by the REACH grounding ID). This evidence aggregation is achieved by extracting a feature vector associated with each context mention and event instance and then combining the feature vectors that share the same context type (this is described in Section IV). After aggregation, our data set derived from the annotated context-event associations consists of positive instances of events associated with particular context types, and 20,000 instances of events that are negatively associated with context types.
III-B Inter-Annotator Agreement
A subset of 11 papers was used to analyze inter-annotator agreement. Each paper’s events were considered together with the set of context mentions found in that paper’s text: each potential context type paired with an event was treated as a binary classification task, where each of the three annotators judged whether the context type was associated with the event. Because the set of potential context types varies across papers, we calculated Fleiss’ Kappa scores, κ, measuring the agreement between the three annotators for each context type separately. Figure 1 shows the general distribution of the κ scores and the frequency with which these contexts were associated with events, and Figure 2 shows the κ scores for the top 15 context types by association count.
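For reference, the standard Fleiss’ Kappa computation can be sketched as below. This is a generic textbook formulation, not the script used for the reported scores; the input layout (one row per event, one column per judgment category, each cell holding the number of annotators who chose that category) is an assumption.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' Kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_chance = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    if p_chance == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_chance) / (1.0 - p_chance)


# Three annotators, binary judgment [associated, not associated], four events.
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(fleiss_kappa(unanimous), fleiss_kappa(split))
```

In the setting described above, one such table would be built per context type, with one row for each event that the context type could be paired with.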
In addition to inter-annotator agreement, we also measured the amount of agreement between REACH and our annotators by looking at the degree of overlap between the text spans about events picked out by our annotators and the event spans picked out by REACH. Event spans were taken to be overlapping if they shared at least one word between them, and the REACH spans were considered against the set of manual events that were common to all three annotators. Out of a total of 1629 events, 130 were picked out by only REACH, 626 were picked out by only the annotators, and 873 were identified by both, resulting in a Jaccard similarity index of approximately 0.54 (873/1629).
Qualitatively, the domain experts suggested a number of reasons why the agreement between annotators, for context associations, and with REACH, for event spans, might have varied relatively widely. First, the contexts mentioned in a text are sometimes themselves modified in the course of setting up experimental conditions. Consider the following example from Hazeki et al.:
To further investigate the role of p110 in CpG localization, Cos7 cells were transfected with p110 and its mutant forms (unlike macrophages, Cos7 cells do not express p110).
Some desired property (e.g., the expression of p110) might not usually be found in some context (e.g., the Cos7 cell line), so our annotators sometimes disagreed about whether that context should indeed be associated with subsequent events in the paper.
The annotators also observed that sometimes event spans picked out by REACH properly contained more than one actual event, and might then disagree about whether that span as a whole should be associated with some context.
Finally, the annotators noted that container contexts from the more granular level (e.g., species) might not be salient in papers dealing with very low-level events (e.g., interactions at the molecular level, or crystal structure studies), and therefore disagreed about how to assess granular context associations.
These are important observations that point to the need for further technology development to fully capture the semantics of context. In this work, we have preserved the original annotations, but more sophisticated parsing (e.g., of more of the component structure of biochemical events and of each paper’s experimental setup) will be needed to properly tackle these concerns. We leave these as open problems for future work.
TABLE I: Features used in this work, grouped by category.
| Category | Feature | Description |
|---|---|---|
| General | Sentence distance | No. of sentences separating the event and context mentions |
| General | Dependency distance | No. of edges separating the mentions within the dependency graph |
| General | Context type frequency | No. of context mentions of the same type |
| General | Is context closest | Indicates whether the context mention is the closest one to the event |
| Phi | Is sentence first person | One instance each for the event and context mentions |
| Phi | Is sentence past tense | One instance each for the event and context mentions |
| Phi | Is sentence present tense | One instance each for the event and context mentions |
| Syntactic | Event spanning dependency bigrams | Sequence of dependency bigrams spanning from the event mention |
| Syntactic | Negated event mention | Indicates whether a neg dependency is within 2 degrees in the dependency graph |
| Syntactic | Context spanning dependency bigrams | Sequence of dependency bigrams spanning from the context mention |
| Syntactic | Negated context mention | Indicates whether a neg dependency is within 2 degrees in the dependency graph |
Inter-sentence relation extraction is more challenging than intra-sentence relation extraction primarily because a number of traditional linguistic features, such as information about syntactic dependencies, are unavailable across sentences.
We model inter-sentence context relation extraction as a supervised learning problem. As discussed in Section III-A, we are considering the task of identifying whether a context type is associated with an event mention, given evidence from other context mentions in the text, although our corpus consists of annotations of relations between event and context mentions. To model this task, we aggregate all of the instances of the features associated with each context-mention/event-mention relation that share the same context type. This is done by first constructing a feature vector for each individual context-mention/event-mention relation. We can then consider different feature vector aggregation schemes to construct a single feature representation for the evidence in the paper for the relationship between a context type and event mention.
We begin by describing the features that make up the context-mention/event-mention feature vectors and then describe how they are aggregated. Similar to the representation scheme used by Swampillai & Stevenson, we incorporate local syntactic features associated with the context mentions and events. However, we also incorporate several measures of the distance between context mentions and events.
Table I summarizes the features used for this work, grouped into three functionally similar categories. Features from the general category concern kinds of distances. Sentence distance counts the number of sentences between the context and event mentions: it takes a value of zero if they are in the same sentence, one if they are in adjacent sentences, and so on. Dependency distance is similar in spirit, but counts how many edges separate the two mentions in the dependency parse graph. If the two mentions are not in the same sentence, an artificial edge connecting the roots of the dependency graphs of the two sentences is introduced before the edges are counted. Context type frequency is the count of context mentions of the same type within the current document, and is context closest is a Boolean value that is 1 when the context mention is the closest one to the event mention, and 0 otherwise.
Phi features represent other linguistic characteristics of the sentences containing the mentions. For each of the listed phi features, one instance is created for the sentence containing the context mention and one for the sentence containing the event mention. These features are also Boolean valued, set to 1 whenever the assertion holds true, and 0 otherwise. Part-of-speech tags were used to implement these features, e.g., the past tense feature checks whether the verb’s tag is contained in the set of possible past tense tags (VBD or VBN; see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). The other two features in this category are implemented similarly.
Syntactic features rely on dependency parses of the containing sentences and are dynamically generated. Spanning dependency bigrams are derived from the spanning tree of depth two rooted at the head token of a mention. The bigram features derived from all of the dependency paths are combined in a bag-of-bigrams, where the labels of the edges on a path become the elements of its corresponding bigram. Negated mention features look for the presence of a neg dependency within the spanning tree just described; if present they are set to 1, otherwise 0.
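The two distance features can be sketched as follows, assuming a simplified parse representation: each sentence is given as a root token index plus a list of `(head, dependent)` edges, and a mention is identified by `(sentence index, head token index)`. These data structures and function names are illustrative assumptions rather than the structures used in the actual system.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple

Token = Tuple[int, int]                     # (sentence index, token index)
Parse = Tuple[int, List[Tuple[int, int]]]   # (root token, [(head, dependent)])


def sentence_distance(event: Token, context: Token) -> int:
    """Number of sentences separating the two mentions (0 if same sentence)."""
    return abs(event[0] - context[0])


def dependency_distance(
    parses: Dict[int, Parse], event: Token, context: Token
) -> Optional[int]:
    """Edge count between the mentions, linking sentence roots if needed."""
    graph = defaultdict(set)
    for sent in {event[0], context[0]}:
        _root, edges = parses[sent]
        for head, dep in edges:
            graph[(sent, head)].add((sent, dep))
            graph[(sent, dep)].add((sent, head))
    if event[0] != context[0]:
        # Artificial edge joining the roots of the two dependency graphs.
        root_a = (event[0], parses[event[0]][0])
        root_b = (context[0], parses[context[0]][0])
        graph[root_a].add(root_b)
        graph[root_b].add(root_a)

    # Breadth-first search for the shortest undirected path.
    queue = deque([(event, 0)])
    seen = {event}
    while queue:
        node, dist = queue.popleft()
        if node == context:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # mentions are not connected in the given parses


# Two toy sentences: sentence 0 is rooted at token 1, sentence 1 at token 0.
parses = {0: (1, [(1, 0), (1, 3), (3, 2)]), 1: (0, [(0, 1), (1, 2)])}
print(dependency_distance(parses, (0, 2), (1, 2)))
```

In the toy example the event head is two edges below its sentence root, the context head is two edges below the other root, and the artificial root-to-root edge contributes one more.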
We have explored several feature aggregation methods. The method that we found works best, and that is the basis of the results reported here, is to form a single vector constructed out of statistical summaries of the individual context-mention/event-mention feature vectors. In particular, we compute the average, minimum value, and maximum value for each feature vector element across the feature vectors in the set of context-mention/event-mention feature vectors, resulting in an average, minimum, and maximum value vector (respectively); these are then concatenated to form a final vector three times the length of the original vectors.
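A minimal sketch of this aggregation, using plain Python lists and a hypothetical function name, might look like the following. It assumes all per-mention vectors for one context type have the same length and the same feature order.

```python
from typing import List, Sequence


def aggregate_feature_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the element-wise mean, min, and max of the input vectors."""
    if not vectors:
        raise ValueError("need at least one feature vector to aggregate")
    columns = list(zip(*vectors))  # one tuple of values per feature
    means = [sum(col) / len(col) for col in columns]
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return means + mins + maxs


# Two mentions of the same context type paired with one event; the three
# illustrative features are sentence distance, is-closest, and negated.
per_mention = [[2, 1, 0], [4, 0, 1]]
aggregated = aggregate_feature_vectors(per_mention)
print(aggregated)
```

The result has nine elements for three input features, matching the description of a final vector three times the original length.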
Using the feature representation described in Section IV, we trained and evaluated a number of different supervised learning classifiers for the context-event association task, within a cross-validation evaluation framework.
The intended use case for our proposed method is to take as input a new paper, pre-process it with a machine reading system, such as REACH, in order to extract all context and event mentions, and then run the context-event association classifier to label events by their associated context type. For this reason, the natural unit of input is a single paper. Each paper is therefore a fold in our cross-validation evaluation, and we have a total of 22 papers in our corpus, for 22 folds. We performed leave-one-out cross-validation, iteratively holding out one paper as the test set.
In this evaluation, we used micro-averaged F1 scores. A micro-average score weights the contribution of each fold proportionally to the amount of data it contributes to the overall data set, making the final score more robust to fold results that could contain a proportionally small number of annotated events and therefore not be as representative.
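The difference between micro- and macro-averaging across folds can be made concrete with a small sketch. The per-fold counts below are invented for illustration and are not results from this work.

```python
from typing import List, Tuple

# Per-fold confusion counts: (true positives, false positives, false negatives).
FoldCounts = Tuple[int, int, int]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(folds: List[FoldCounts]) -> float:
    """Pool the counts over all folds, then compute a single F1."""
    tp = sum(f[0] for f in folds)
    fp = sum(f[1] for f in folds)
    fn = sum(f[2] for f in folds)
    return f1_from_counts(tp, fp, fn)


def macro_f1(folds: List[FoldCounts]) -> float:
    """Compute F1 per fold, then take the unweighted mean."""
    return sum(f1_from_counts(*f) for f in folds) / len(folds)


# One large paper classified well and one tiny paper classified poorly.
folds = [(90, 10, 10), (1, 0, 9)]
print(micro_f1(folds), macro_f1(folds))
```

With these invented counts the small fold pulls the macro average down sharply, while the micro average stays close to the score on the large fold, which is the robustness property described above.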
At each iteration, we need to train the model but also search for the combination of features that performs best with that model. In order to have a basis for evaluating how well a feature set performs, we partitioned (uniformly at random) the remaining 21 papers into a validation set of 4 papers and a set of 17 papers to use for training. We then considered different combinations of features by considering the power set (minus the empty feature set) of feature groups as described in Table I. For n feature groups, this gives 2^n - 1 possible combinations of features for each classifier. Because the cardinality of the power set is considerable, the experiments were performed on an HPC cluster (an allocation of computer time from UA Research Computing High Performance Computing at the University of Arizona is gratefully acknowledged) to find the optimal combination of features for each of the machine learning algorithms without relying on feature selection approximation heuristics. For each of these feature sets, we trained a model and evaluated its performance on the validation set. The model with the feature set that achieved the highest validation F1 was then evaluated (with no further changes) on the held-out test set. We then repeated this procedure for each iteration of leave-one-out cross-validation.
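The exhaustive search over feature-group combinations can be sketched as below. The group names and the toy scoring function are placeholders; in practice the score would be the validation F1 of a model trained with the given feature groups, which is not reproduced here.

```python
from itertools import chain, combinations
from typing import Callable, Iterable, Sequence, Tuple


def non_empty_subsets(groups: Sequence[str]) -> Iterable[Tuple[str, ...]]:
    """All 2^n - 1 non-empty combinations of the given feature groups."""
    return chain.from_iterable(
        combinations(groups, size) for size in range(1, len(groups) + 1)
    )


def select_feature_set(
    groups: Sequence[str], validation_score: Callable[[Tuple[str, ...]], float]
) -> Tuple[str, ...]:
    """Return the combination with the highest validation score."""
    return max(non_empty_subsets(groups), key=validation_score)


# Hypothetical feature groups and a stand-in for "train, then score on the
# validation papers": reward two useful groups, penalize set size slightly.
groups = ["distance", "frequency", "syntax"]


def toy_score(subset: Tuple[str, ...]) -> float:
    useful = {"distance", "frequency"}
    return len(useful & set(subset)) - 0.1 * len(subset)


best = select_feature_set(groups, toy_score)
print(best)
```

Because the number of subsets doubles with each added group, this brute-force approach is only practical for a modest number of groups or with substantial compute, which is consistent with the use of a cluster described above.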
A simple but reasonable deterministic classifier was developed to serve as a baseline, described in Algorithm 1. Intuitively, the baseline classifier does the following: given the index of the sentence in which an event occurs, build a two-sided interval of width k sentences around the event sentence and conclude that any context mentioned in the sentences within the window is associated with the event.
This baseline classifier was run within the same cross-validation loop and was “trained” by performing a parameter search for k, with the best k selected according to performance on the validation set. In this way, the predictive capability of the algorithm can be compared in the same terms as the machine learning models.
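The windowing rule can be sketched in a few lines. This is an interpretation of the prose description only, since Algorithm 1 itself is not reproduced here; in particular, treating the window as k sentences on each side of the event sentence is an assumption, as are the function name and the mention representation.

```python
from typing import List, Set, Tuple

# Assumed representation: (context type, index of the sentence it occurs in).
ContextMention = Tuple[str, int]


def baseline_contexts(
    event_sentence: int, context_mentions: List[ContextMention], k: int
) -> Set[str]:
    """Context types mentioned within k sentences of the event sentence."""
    return {
        context_type
        for context_type, sentence_index in context_mentions
        if abs(sentence_index - event_sentence) <= k
    }


# An event in sentence 2 of a toy paper with three context mentions.
toy_mentions = [("human", 0), ("T-cell", 3), ("mouse", 7)]
for k in range(3):
    print(k, sorted(baseline_contexts(2, toy_mentions, k)))
```

Selecting k then amounts to running this rule for each candidate value on the validation papers and keeping the value with the highest F1.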
Beyond the baseline model, we evaluated the following classifiers:
The hyper-parameters for each of the algorithms, such as the regularization coefficient of the logistic regression, the degree of the polynomial kernel, the maximum depth of the trees in the random forest, etc., were tuned by manual exploration. The feed-forward neural network had a single hidden layer.
Figure 3 compares the micro-averaged precision, recall, and F1 scores for each model evaluated in the cross validation. Again, these averages are computed across the 22 folds of cross-validation, where for each model within each fold, a search was performed to find the combination of features that allowed the model to perform best on the validation set for that fold. The dashed line in the figure indicates the micro-averaged F1 score for the baseline classifier. In general, the trained classifiers all achieved average F1 scores higher than the baseline, with the best performing models being the random forest and neural network classifiers.
To test whether each non-baseline classifier performed significantly better than the baseline, we performed a bootstrap resampling test where for each model we uniformly randomly sampled with replacement the same number of context-to-event associations as in the original 22 papers, computed the F1 scores of the model and the baseline on that sample, took the difference of the baseline F1 from the model F1, and repeated this 1000 times (per model). For each non-baseline classifier we found that its F1 scores exceeded the baseline in at least 95% of the cases.
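A paired bootstrap comparison of this kind can be sketched as follows. The instance representation, function names, fixed seed, and toy data are assumptions for illustration; the sketch reports the fraction of resamples in which the model F1 strictly exceeds the baseline F1.

```python
import random
from typing import List, Tuple

# Assumed representation of one context-to-event association:
# (gold label, model prediction, baseline prediction).
Instance = Tuple[bool, bool, bool]


def f1(pairs: List[Tuple[bool, bool]]) -> float:
    """F1 over (gold, predicted) pairs; 0.0 when it is undefined."""
    tp = sum(1 for gold, pred in pairs if gold and pred)
    fp = sum(1 for gold, pred in pairs if not gold and pred)
    fn = sum(1 for gold, pred in pairs if gold and not pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def bootstrap_win_rate(
    instances: List[Instance], n_resamples: int = 1000, seed: int = 0
) -> float:
    """Fraction of resamples where the model F1 exceeds the baseline F1."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_resamples):
        # Sample with replacement, keeping the original data set size.
        sample = rng.choices(instances, k=len(instances))
        model_f1 = f1([(gold, model) for gold, model, _ in sample])
        baseline_f1 = f1([(gold, base) for gold, _, base in sample])
        if model_f1 - baseline_f1 > 0:
            wins += 1
    return wins / n_resamples


# Toy data: the model is always right, the baseline never predicts a relation.
toy = [(True, True, False)] * 20
print(bootstrap_win_rate(toy))
```

Sampling the same instances for both systems in each resample keeps the comparison paired, so the difference reflects the systems rather than the particular sample drawn.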
Figure 4 shows, for each model, the frequency with which each feature ended up being selected as part of the set of features that allowed the model to perform best on the validation papers (as there were a total of 22 cross-validation folds, the maximum possible frequency is 22). This provides some insight into which features, in general, tended to provide more useful information for each model. The spanning dependency bigrams with respect to the event mentions are seldom used by the classifiers, but the dependency bigrams with respect to the context mentions are frequently used, suggesting that syntax is correlated with the presence of a context relation. The is context closest Boolean feature is one of the most frequently used features. This is consistent with the intuition that context information gets established close to the statements of interest, in this case, biochemical reactions and biological processes. Another interesting pattern is that the context type frequency is also almost always used, suggesting that the number of times a context type is mentioned is also highly correlated with whether the context will be associated with an event.
Finally, Figure 5 shows the differences in model performance across the papers. In the figure, each column represents the F1 score of each model on the respective paper, where papers are sorted by the F1 score of the baseline (red x’s) from least to greatest. The x-axis labels reference the last three digits of the paper PubMed ID, and below in parentheses are the total number of context-mention/event-mention candidate relations involved in providing evidence for the context type label. The overall best-performing feed-forward network (whose F1 is denoted by the black x’s) generally performed significantly better than the baseline, except for papers #906 and #001.
In this paper we introduced the problem of extracting and associating biological container context with biochemical events in biomedical texts. We cast this as an inter-sentential relation extraction problem, where the entities being related (in this case, biochemical interaction event mentions and biological container context mentions) can be, and often are, a number of sentences apart from each other. To date, very little work has been done on contextual relation extraction, and more work is needed to develop domain-general techniques. However, we believe our contribution here takes some steps in this direction, and provides a strong baseline for work in the application domain of associating biological container context with biochemical events.
We developed a set of features for this domain and demonstrated their variable use for this task with a variety of state-of-the-art classification methods. The categories of features include syntactic features, distance-based features, phi features, and frequency-based features.
There is ample room for improvement. We believe improvements to discourse modeling and parsing will be a key source of future advances in inter-sentential relation extraction. In particular, biomedical research articles have conventional structure with an expected set of sections: Introduction, Materials and Methods, etc. These sections in turn have different contents, and we are interested in better exploiting the particular discourse properties of each to improve how we extract the associations of information embedded in the paper. For example, a context type mentioned in the Abstract section may be more relevant to events across the paper, whereas a particular cell line mentioned multiple times in a long Methods section could have high importance locally, but be much less relevant in other sections. To leverage this structural properties, discourse-based features could be used in tandem with sequence-aware machine learning algorithms could be used, such as recurrent and LSTM deep neural networks.
The annotated corpus, code, and instructions used to implement the experiments described in this paper can be found at https://ml4ai.github.io/BioContext.
This work was supported by the Defense Advanced Research Projects Agency (DARPA) Big Mechanism [ARO W911NF-14-1-0395]. We also thank the University of Arizona Research High Performance Computing support team.
-  D. Zhou, D. Zhong, and Y. He, “Biomedical relation extraction: From binary to complex,” Computational and Mathematical Methods in Medicine, 2014.
-  N. P. Young and T. Jacks, “Tissue-specific p19Arf regulation dictates the response to oncogenic k-ras,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 22, pp. 10184–10189, 2010.
-  X. R. Bustelo, V. Ojeda, M. Barreira, V. Sauzeau, and A. Castro-Castro, “Rac-ing to the plasma membrane: the long and complex work commute of rac1 during cell signaling,” Small GTPases, vol. 3, no. 1, pp. 60–66, 2012.
-  M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, “Open information extraction from the web,” in Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pp. 2670–2676, 2007.
-  N. Bach and S. Badaskar, “A review of relation extraction,” Literature review for Language and Statistics II, 2007.
-  C. Quan, M. Wang, and F. Ren, “An unsupervised text mining method for relation extraction from biomedical literature,” PLOS One, 2014.
-  K. Fundel, R. Küffner, and R. Zimmer, “RelEx – Relation extraction using dependency parse trees,” Bioinformatics, vol. 23, no. 3, pp. 365–371, 2007.
-  H. Poon, K. Toutanova, and C. Quirk, “Distant supervision for cancer pathway extraction from text,” in Pacific Symposium for Biocomputing, 2015.
-  K. Swampillai and M. Stevenson, “Extracting relations within and across sentences,” in Proceedings of Recent Advances in Natural Language Processing, 2011.
-  M. Irmer, Bridging Inferences in Discourse Interpretation. PhD thesis, University of Leipzig, 2009.
-  S. A. A. d. Freitas, Interpretação automatizada de textos: Processamento de Anáforas. PhD thesis, Universidade Federal do Espírito Santo, Brasil, 2005.
-  M. Poesio, R. Mehta, A. Maroudas, and J. Hitzeman, “Learning to resolve bridging references,” in Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, (Barcelona, Spain), pp. 143–150, July 2004.
-  C. Fellbaum, ed., WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
-  M. Gerner, G. Nenadic, and C. M. Bergman, “An exploration of mining gene expression mentions and their anatomical locations from biomedical text,” in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pp. 72–80, Association for Computational Linguistics, 2010.
-  F. Sarafraz, Finding conflicting statements in the biomedical literature. PhD thesis, University of Manchester, 2012.
-  M. A. Valenzuela-Escárcega, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega, M. Surdeanu, and C. T. Morrison, “Reach.” https://github.com/clulab/reach, 2018.
-  M. A. Valenzuela-Escárcega, Ö. Babur, G. Hahn-Powell, D. Bell, T. Hicks, E. Noriega-Atala, X. Wang, M. Surdeanu, E. Demir, and C. T. Morrison, “Large-scale automated reading with reach discovers new cancer driving mechanisms,” in Proceedings of the BioCreative VI Workshop (BioCreative6), 2017.
-  M. A. Valenzuela-Escárcega, G. Hahn-Powell, T. Hicks, and M. Surdeanu, “A domain-independent rule-based framework for event extraction,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: Software Demonstrations (ACL-IJCNLP), pp. 127–132, ACL-IJCNLP 2015, 2015. Paper available at http://www.aclweb.org/anthology/P/P15/P15-4022.pdf.
-  M. Gerner, G. Nenadic, and C. M. Bergman, “LINNAEUS: A species name identification system for biomedical literature,” BMC Bioinformatics, vol. 11, p. 85, 2010.
-  C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The stanford corenlp natural language processing toolkit,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014.
-  J. L. Fleiss and J. Cohen, “The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability,” Educational and psychological measurement, vol. 33, no. 3, pp. 613–619, 1973.
-  K. Hazeki, Y. Kametani, H. Murakami, M. Uehara, Y. Ishikawa, K. Nigorikawa, S. Takasuga, T. Sasaki, T. Seya, M. Matsumoto, and O. Hazeki, “Phosphoinositide 3-kinase controls the intracellular localization of cpg to limit dna-pkcs-dependent il-10 production in macrophages,” PLoS One, vol. 6, no. 10, 2011.
-  C. D. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval. Cambridge University Press, 2008.
-  P. R. Cohen, Empirical methods for artificial intelligence, vol. 139. MIT press Cambridge, MA, 1995.