Leveraging Dependency Forest for Neural Medical Relation Extraction

11/11/2019 ∙ by Linfeng Song, et al.

Medical relation extraction discovers relations between entity mentions in text, such as research articles. For this task, dependency syntax has been recognized as a crucial source of features. Yet in the medical domain, 1-best parse trees suffer from relatively low accuracies, diminishing their usefulness. We investigate a method to alleviate this problem by utilizing dependency forests. Forests contain many possible decisions and therefore have higher recall but more noise compared with 1-best outputs. A graph neural network is used to represent the forests, automatically distinguishing the useful syntactic information from parsing noise. Results on two biomedical benchmarks show that our method outperforms the standard tree-based methods, giving the state-of-the-art results in the literature.




1 Introduction

The sheer volume of medical articles and their rapid growth prevent researchers from acquiring comprehensive knowledge of the literature by direct reading. This can hamper both medical research and clinical diagnosis. NLP techniques have been used for automating the knowledge extraction process from the medical literature (Friedman et al., 2001; Yu and Agichtein, 2003; Hirschman et al., 2005; Xu et al., 2010; Sondhi et al., 2010; Abacha and Zweigenbaum, 2011). Along this line of work, a long-standing task is relation extraction, which mines factual knowledge from free text by labeling relations between entity mentions. As shown in Figure 1, the sub-clause “previously observed cytochrome P450 3A4 ( CYP3A4 ) interaction of the dual orexin receptor antagonist almorexant” contains two entities, namely “orexin receptor” and “almorexant”. There is an “adversary” relation between these two entities, denoted as “CPR:6”.

Previous work has shown that dependency syntax is important for guiding relation extraction (Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Liu et al., 2015; Gormley et al., 2015; Xu et al., 2015a, b; Miwa and Bansal, 2016; Zhang et al., 2018b), especially in biological and medical domains (Quirk and Poon, 2017; Peng et al., 2017; Song et al., 2018b). Compared with sequential surface-level structures, such as POS tags, dependency trees help to model word-to-word relations more easily by drawing direct connections between distant words that are syntactically correlated. Take the phrase “effect on the medicine” for example; “effect” and “medicine” are directly connected in a dependency tree, regardless of how many modifiers are added in between.

Figure 1: (a) 1-best dependency tree and (b) dependency forest for a medical-domain sentence, where edge label “comp” represents “compound”. Associated mentions are in different colors. Some irrelevant words and edges are omitted for simplicity.

Dependency parsing has achieved an accuracy over 96% in the news domain (Liu and Zhang, 2017; Kitaev and Klein, 2018). However, for the medical literature domain, parsing accuracies can drop significantly (Lease and Charniak, 2005; McClosky and Charniak, 2008; Sagae et al., 2008; Candito et al., 2011). This can lead to severe error propagation in downstream relation extraction tasks, offsetting much of the benefit that relation extraction models can obtain by exploiting dependency trees as a source of external features.

We address the low-accuracy issue in biomedical dependency parsing by considering dependency forests as external features. Instead of 1-best trees, dependency forests consist of dependency arcs and labels that a parser is relatively confident about, therefore having better recall of gold-standard arcs by offering more candidate choices with noise. Our main idea is to let a relation extraction system learn automatically from a forest which arcs are the most relevant through end-task training, rather than relying solely on the decisions of a noisy syntactic parser. To this end, a graph neural network is used for encoding a forest, which in turn provides features for relation extraction. Back-propagation passes loss gradients from the relation extraction layer to the graph encoder, so that the more relevant edges can be chosen automatically for better relation extraction.

Results on BioCreative VI ChemProt (CPR) (Krallinger et al., 2017) and a recent dataset focused on phenotype-gene relations (PGR) (Sousa et al., 2019) show that our method outperforms a strong baseline that uses 1-best dependency trees as features, giving the state-of-the-art accuracies in the literature. To our knowledge, we are the first to study dependency forests for medical information extraction, showing their advantages over 1-best tree structures. Our code is available at http://github.com/freesunshine0316/dep-forest-re.

2 Related work

Syntactic forests

There have been previous studies leveraging constituent forests for machine translation (Mi et al., 2008; Ma et al., 2018; Zaremoodi and Haffari, 2018), sentiment analysis (Le and Zuidema, 2015) and text generation (Lu and Ng, 2011). However, dependency forests have rarely been studied, with one exception being Tu et al. (2010), who use dependency forests to enhance long-range word-to-word dependencies for statistical machine translation. To our knowledge, we are the first to study the usefulness of dependency forests for relation extraction under a strong neural framework.

Graph neural network   Graph neural networks (GNNs) have been successful in encoding dependency trees for downstream tasks, such as semantic role labeling (Marcheggiani and Titov, 2017), semantic parsing (Xu et al., 2018), machine translation (Song et al., 2019; Bastings et al., 2017), relation extraction (Song et al., 2018b) and sentence ordering (Yin et al., 2019). In particular, Song et al. (2018b) showed that, for modeling syntactic trees in relation extraction, GNNs are more effective than DAG networks (Peng et al., 2017), which cause loss of important structural information. We are the first to exploit GNNs for encoding search spaces in the form of dependency forests.

3 Task

Formally, the input to our task is a sentence $s = w_1, w_2, \dots, w_N$, where $N$ is the number of words in the sentence and $w_i$ represents the $i$-th input word. $s$ is annotated with boundary information ($\xi_1{:}\xi_2$ and $\zeta_1{:}\zeta_2$) of target entity mentions ($\xi$ and $\zeta$). We focus on the classic binary relation extraction setting (Quirk and Poon, 2017), where the number of associated mentions is two. The output is a relation from a predefined relation set $R = (r_1, \dots, r_M, \text{None})$, where “None” means that no relation holds for the entities.

Two steps are taken for predicting the correct relation given an input sentence. First, a dependency parser is used to label the syntactic structure of the input. Here our baseline system takes the standard approach, using the 1-best parser output tree $D_T$ as features. In contrast, our proposed model uses the most confident parser forest $D_F$ as features. Given $D_T$ or $D_F$, the second step is to encode both $s$ and $D_T$/$D_F$ using a neural network, before making a prediction.

We make use of the same graph neural network encoder structure to represent dependency syntax information for both the baseline and our model. In particular, a graph recurrent neural network architecture (Beck et al., 2018; Song et al., 2018a; Zhang et al., 2018a) is used, which has been shown effective in encoding graph structures (Song et al., 2019), giving competitive results with alternative graph networks such as graph convolutional neural networks (Marcheggiani and Titov, 2017; Bastings et al., 2017).

4 Baseline: DepTree

As shown in Figure 2, our baseline model stacks a graph recurrent network (GRN), which encodes a 1-best dependency tree $D_T$, on top of a bidirectional LSTM layer, which encodes the input sentence $s$. The two layers extract features from the dependency tree and the sentence, respectively. Similar model frameworks have shown highly competitive performances in previous relation extraction studies (Peng et al., 2017; Song et al., 2018b).

Figure 2: Framework of our baseline and model.

4.1 Bi-LSTM layer

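Given the input sentence $w_1, w_2, \dots, w_N$, we represent each word $w_i$ with its embedding $e_i$ to generate a sequence of embeddings $e_1, e_2, \dots, e_N$. A Bi-LSTM layer is used to encode the sentence:

$$\overleftarrow{h}^{(0)}_i = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}^{(0)}_{i+1}, e_i) \tag{1}$$
$$\overrightarrow{h}^{(0)}_i = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}^{(0)}_{i-1}, e_i) \tag{2}$$

where the state of each word $w_i$ is generated by concatenating the states of both directions:

$$h^{(0)}_i = [\overleftarrow{h}^{(0)}_i; \overrightarrow{h}^{(0)}_i] \tag{3}$$
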
4.2 GRN layer

A 1-best dependency tree can be represented as a directed graph $D_T = (V, E)$, where $V$ includes all words and $E$ represents all dependency edges (Marcheggiani and Titov, 2017). Each triple $(w_i, l, w_j) \in E$ corresponds to a dependency edge, where $w_i$ modifies $w_j$ with an arc label $l$. Each word $w_i$ is associated with a hidden state $h_i$ that is initialized with the Bi-LSTM output $h^{(0)}_i$. The state representation of the entire tree consists of all word states:

$$h = \{h_i\}_{w_i \in V} \tag{4}$$

In order to capture non-local interactions between words, the GRN layer adopts a message passing framework that performs iterative information exchange between directly connected words. As a result, each word state is updated by absorbing larger contextual information through the message passing process, and a sequence of state transitions $h^{(0)}, h^{(1)}, \dots, h^{(T)}$ is generated for the entire tree. The final state is $h^{(T)} = \{h^{(T)}_i\}_{w_i \in V}$, where $T$ is a hyperparameter representing the number of state transitions.

Message passing   The message passing framework takes two main steps within each iteration: message calculation and state update. Take $w_j$ and iteration $t$ as the example. In the first step, separate messages $m^{\uparrow}_j$ and $m^{\downarrow}_j$ are calculated by summing up the messages of its children and parent in the dependency tree, respectively:

$$m^{\uparrow}_j = \sum_{(w_i, l, w_j) \in E_{in}(j)} [h^{(t-1)}_i; e_l] \tag{5}$$
$$m^{\downarrow}_j = \sum_{(w_j, l, w_k) \in E_{out}(j)} [h^{(t-1)}_k; e_{l_{rev}}] \tag{6}$$

where $E_{in}(j)$ and $E_{out}(j)$ represent all edges with a head word $w_j$ and a modifier word $w_j$, respectively, and $e_{l_{rev}}$ represents the embedding of label $l_{rev}$, the reverse version of the original label $l$ (for example, “amod-rev” is the reverse version of “amod”). The message from a child or a parent is obtained by simply concatenating its hidden state with the corresponding edge label embedding.
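The message calculation step can be illustrated with a minimal pure-Python sketch. The function name `messages`, the `(modifier, label, head)` tuple layout and the toy vectors are assumptions made for illustration, not the released implementation; reverse labels are looked up by appending "-rev", following the “amod-rev” example in the text.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def messages(j, edges, h_prev, label_emb):
    """Return (m_up, m_down) for word j.

    edges: list of (modifier, label, head) triples (hypothetical layout).
    h_prev: hidden states from the previous iteration, one list per word.
    label_emb: maps each label and its "-rev" variant to an embedding.
    """
    size = len(h_prev[0]) + len(next(iter(label_emb.values())))
    m_up = [0.0] * size
    m_down = [0.0] * size
    for modifier, label, head in edges:
        if head == j:  # the modifier is a child of word j
            m_up = add(m_up, h_prev[modifier] + label_emb[label])
        if modifier == j:  # the head is the parent of word j
            m_down = add(m_down, h_prev[head] + label_emb[label + "-rev"])
    return m_up, m_down


# Toy 3-word tree with invented values: word 0 modifies word 1, word 1 modifies word 2.
h_prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
label_emb = {"amod": [0.5], "amod-rev": [-0.5], "nsubj": [0.2], "nsubj-rev": [-0.2]}
edges = [(0, "amod", 1), (1, "nsubj", 2)]
m_up, m_down = messages(1, edges, h_prev, label_emb)
```

Here list concatenation (`+` on two lists) plays the role of the vector concatenation in the text, and `add` performs the summation over neighbours.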

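In the second step, GRN uses standard gated operations of LSTM (Hochreiter and Schmidhuber, 1997) to update the hidden state with the previously integrated messages. In particular, a cell $c^{(t)}_j$ is taken to record memory for $h^{(t)}_j$; an input gate $i^{(t)}_j$, an output gate $o^{(t)}_j$ and a forget gate $f^{(t)}_j$ are used to control information flow from the inputs $m^{\uparrow}_j$ and $m^{\downarrow}_j$ to the output $h^{(t)}_j$:

$$\begin{aligned}
i^{(t)}_j &= \sigma(W_i m^{\uparrow}_j + U_i m^{\downarrow}_j + b_i) \\
o^{(t)}_j &= \sigma(W_o m^{\uparrow}_j + U_o m^{\downarrow}_j + b_o) \\
f^{(t)}_j &= \sigma(W_f m^{\uparrow}_j + U_f m^{\downarrow}_j + b_f) \\
u^{(t)}_j &= \tanh(W_u m^{\uparrow}_j + U_u m^{\downarrow}_j + b_u) \\
c^{(t)}_j &= f^{(t)}_j \odot c^{(t-1)}_j + i^{(t)}_j \odot u^{(t)}_j \\
h^{(t)}_j &= o^{(t)}_j \odot \tanh(c^{(t)}_j)
\end{aligned} \tag{7}$$

where $W_x$, $U_x$ and $b_x$ ($x \in \{i, o, f, u\}$) are model parameters, and $c^{(0)}_j$ is initialized as a vector of zeros.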

The same process repeats for $T$ iterations. Starting from $h^{(0)}$ of the Bi-LSTM layer, increasingly more informed hidden states $h^{(t)}$ are obtained as the iteration increases, and $h^{(T)}$ is used as the final representation of each word.
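As a rough illustration of the state update, the sketch below applies one LSTM-style gated step to a single word given its two incoming messages. It assumes the textbook LSTM formulation (sigmoid gates, a tanh candidate); the parameter layout `params[x] = (W, U, b)` and all names are hypothetical and chosen only for this example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(W, U, b, m_up, m_down, activation):
    """activation(W m_up + U m_down + b), applied element-wise."""
    up, down = matvec(W, m_up), matvec(U, m_down)
    return [activation(a + c + d) for a, c, d in zip(up, down, b)]


def grn_update(m_up, m_down, c_prev, params):
    """One gated update for a word; returns the new (hidden, cell) pair."""
    i = gate(*params["i"], m_up, m_down, sigmoid)    # input gate
    o = gate(*params["o"], m_up, m_down, sigmoid)    # output gate
    f = gate(*params["f"], m_up, m_down, sigmoid)    # forget gate
    u = gate(*params["u"], m_up, m_down, math.tanh)  # candidate (assumed tanh)
    c = [fj * cj + ij * uj for fj, cj, ij, uj in zip(f, c_prev, i, u)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# Toy setup: hidden size 2, message size 3, all parameters zero.
zero_W = [[0.0] * 3 for _ in range(2)]
params = {x: (zero_W, zero_W, [0.0, 0.0]) for x in "iofu"}
h_new, c_new = grn_update([1.0, 0.0, 0.5], [1.0, 1.0, -0.2], [2.0, 0.0], params)
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the cell is simply halved; running the update for several iterations over all words gives the sequence of state transitions described above.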

4.3 Relation prediction

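Given $h^{(T)}$ of the GRN encoding, we calculate the representation vectors of the two related entity mentions $\xi$ and $\zeta$ (such as “almorexant” and “orexin receptor” in Figure 1) with mean pooling:

$$h_\xi = f_{mean}(h^{(T)}_{\xi_1:\xi_2}) \tag{8}$$
$$h_\zeta = f_{mean}(h^{(T)}_{\zeta_1:\zeta_2}) \tag{9}$$

where $\xi_1{:}\xi_2$ and $\zeta_1{:}\zeta_2$ represent the spans of $\xi$ and $\zeta$, respectively, and $f_{mean}$ is the mean-pooling function.

Finally, the representations of both mentions are concatenated to be the input of a logistic regression classifier:

$$y = \mathrm{softmax}(W [h_\xi; h_\zeta] + b) \tag{10}$$

where $W$ and $b$ are model parameters.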
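The prediction step can be sketched in a few lines of pure Python. The function names, the inclusive `(start, end)` span convention and the toy values are assumptions for illustration only.

```python
import math


def mean_pool(states, start, end):
    """Average the word states in the inclusive span [start, end]."""
    span = states[start:end + 1]
    return [sum(column) / len(span) for column in zip(*span)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(states, span_a, span_b, W, b):
    """Distribution over relations from two mean-pooled mention vectors."""
    features = mean_pool(states, *span_a) + mean_pool(states, *span_b)
    logits = [sum(w * x for w, x in zip(row, features)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


# Toy final states for a 3-word sentence; mentions cover words 0-1 and word 2.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[0.0] * 4 for _ in range(3)]  # 3 relations, 4 input features
probs = predict(states, (0, 1), (2, 2), W, [0.0, 0.0, 0.0])
```

Because the toy weights are all zero, the resulting distribution is uniform over the three relations.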

5 Model

In this section, we first discuss how to generate high-quality dependency forests, before showing how to adapt GRN to consider the parser probability of each dependency edge.

5.1 Forest generation

Given a dependency parser, generating dependency forests with high recall and low noise is a non-trivial problem. On the one hand, keeping the whole search space gives 100% recall, but introduces maximum noise. On the other hand, using the 1-best dependency tree can result in low recall given an imperfect parser. We investigate two algorithms to generate high-quality forests by judging “quality” from different perspectives: one focusing on arcs, and the other focusing on trees.


Edgewise   This algorithm focuses on the local relation of each individual edge and uses parser probabilities as confidence scores to assess edge qualities. Starting from the whole parser search space, it keeps all the edges with scores greater than a threshold $\gamma$. The time complexity is $O(N^2)$, where $N$ represents the sentence length. (Footnote 1: More accurately, it is $O(N^2 L)$, where $L$ is a constant factor denoting the number of distinct dependency labels. We omit it for simplicity.)
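A minimal sketch of this thresholding, assuming the parser's search space is available as a mapping from labeled arcs to probabilities (the dictionary layout, the function name `edgewise_forest` and the toy numbers are hypothetical):

```python
def edgewise_forest(arc_probs, gamma):
    """Keep every labeled arc whose parser probability exceeds gamma.

    arc_probs maps (modifier, label, head) -> probability.
    """
    return {arc: p for arc, p in arc_probs.items() if p > gamma}


# Toy probabilities for a 3-word sentence, invented for illustration only.
toy_probs = {
    (0, "nsubj", 1): 0.90,
    (2, "obj", 1): 0.55,
    (2, "conj", 0): 0.25,
    (0, "obj", 1): 0.04,
}
forest = edgewise_forest(toy_probs, gamma=0.2)
density = len(forest) / 3  # edges per word
```

The single pass over all candidate arcs is what keeps the cost quadratic in the sentence length (times the number of labels).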


KBestEisner   This algorithm extends the Eisner algorithm (Eisner, 1996) with cube pruning (Huang and Chiang, 2005) for finding the $K$ highest-scored tree structures. The Eisner algorithm is a standard method for decoding 1-best trees for graph-based dependency parsing. Based on bottom-up dynamic programming, it stores the 1-best subtree for each span and takes $O(N^3)$ time for decoding a sentence of $N$ words.

KBestEisner keeps a sorted list of $K$-best hypotheses for each span. Cube pruning (Huang and Chiang, 2005) is adopted to generate the $K$-best list for each larger span from the $K$-best lists of its sub-spans. After the bottom-up decoding, we merge the final $K$-bests by combining identical dependency edges to make the forest. As a result, KBestEisner takes $O(N^3 K \log K)$ time.
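The final merging step can be sketched as a set union over the arcs of the decoded trees. This covers only the merge, not the Eisner decoding or cube pruning themselves; the function name and the two toy trees are invented for illustration.

```python
def merge_kbest(trees):
    """Merge K-best trees into one forest by combining identical arcs.

    Each tree is a set of (modifier, label, head) triples.
    """
    forest = set()
    for tree in trees:
        forest.update(tree)
    return forest


# Two toy 3-arc trees that differ only in how word 3 is attached.
best_1 = {(0, "nsubj", 1), (2, "obj", 1), (3, "punct", 1)}
best_2 = {(0, "nsubj", 1), (2, "obj", 1), (3, "conj", 2)}
forest = merge_kbest([best_1, best_2])
```

Since the two trees share two of their three arcs, the merged forest has only four arcs, which illustrates why forests built from a small number of trees tend to stay sparse.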

Discussions   Edgewise is much simpler and faster than KBestEisner. Compared with the $O(N^3 K \log K)$ time complexity of KBestEisner, Edgewise only takes $O(N^2)$ running time, and each step (storing an edge) runs faster than that of KBestEisner (making a new hypothesis by combining two from sub-spans). Besides, the forests of Edgewise can be denser and provide richer information than those from KBestEisner. This is because KBestEisner only merges $K$ trees, where many edges are shared among them. Also, $K$ cannot be set to a large number (such as 100), because that will cause a dramatic increase of running time.

Compared with KBestEisner, Edgewise suffers from two potential problems. First, Edgewise is not guaranteed to include a 1-best tree in a generated forest, as it makes decisions by considering the individual edges. Second, it is not guaranteed to generate spanning forests; a forest can become disconnected when the threshold $\gamma$ is high. On the other hand, no previous work has shown that the information from the whole tree is crucial for relation extraction. In fact, many previous studies use only the dependency path between the target entity mentions (Bunescu and Mooney, 2005; Airola et al., 2008; Chowdhury et al., 2011; Gormley et al., 2015; Mehryary et al., 2016). We study the effectiveness of both algorithms in our experiments.

5.2 GRN encoding with parser confidence

As illustrated by Figure 1(b), our dependency forests are directed graphs that can be consumed by GRN without any structural changes. For fair comparison, we use the same model as the baseline to encode sentences and forests. Thus our model uses the same number of parameters as our baseline taking 1-best trees.

Since forests contain more than one tree, it is intuitive to consider parser confidence scores for potentially better feature extraction. To this end, we slightly adjust the GRN encoding process without introducing additional parameters. In particular, we enhance the original message sum functions (Equations 5 and 6) by applying the edge probabilities in calculating weighted message sums:

$$m^{\uparrow}_j = \sum_{\epsilon \in E_{in}(j)} p_\epsilon \, [h^{(t-1)}_i; e_l] \tag{11}$$
$$m^{\downarrow}_j = \sum_{\epsilon \in E_{out}(j)} p_\epsilon \, [h^{(t-1)}_k; e_{l_{rev}}] \tag{12}$$

where $\epsilon$ (instead of a triple) is used to represent an edge for simplicity, and $p_\epsilon$ is the parser probability for edge $\epsilon$. The edge probabilities are not adjusted during end-task training.
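To make the weighting concrete, the sketch below computes probability-weighted messages for a word that has two competing candidate heads in a forest. The forest layout (a dictionary from `(modifier, label, head)` to probability), the function name and all numbers are assumptions for illustration.

```python
def weighted_messages(j, forest, h_prev, label_emb):
    """Probability-weighted message sums for word j.

    forest maps (modifier, label, head) -> parser probability.
    """
    size = len(h_prev[0]) + len(next(iter(label_emb.values())))
    m_up, m_down = [0.0] * size, [0.0] * size
    for (modifier, label, head), prob in forest.items():
        if head == j:  # message from a candidate child
            msg = h_prev[modifier] + label_emb[label]
            m_up = [a + prob * b for a, b in zip(m_up, msg)]
        if modifier == j:  # message from a candidate parent
            msg = h_prev[head] + label_emb[label + "-rev"]
            m_down = [a + prob * b for a, b in zip(m_down, msg)]
    return m_up, m_down


# Word 0 has two candidate heads: word 1 (p = 0.75) and word 2 (p = 0.25).
h_prev = [[0.0], [4.0], [8.0]]
label_emb = {"conj": [1.0], "conj-rev": [2.0], "punct": [1.0], "punct-rev": [4.0]}
forest = {(0, "conj", 1): 0.75, (0, "punct", 2): 0.25}
m_up, m_down = weighted_messages(0, forest, h_prev, label_emb)
```

Setting every probability to 1 recovers the unweighted sums of Equations 5 and 6, which is why the adjustment adds no parameters.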

6 Training

Relation loss   Given a set of training instances, each containing a sentence $s$ with two target mentions $\xi$ and $\zeta$, and a dependency structure $D$ (tree or forest), we train our models with a cross-entropy loss between the gold-standard relation $r$ and the model distribution:

$$l_R = -\log p(r \mid s, \xi, \zeta, D; \theta) \tag{13}$$

where $\theta$ represents the model parameters.

Using additional NER loss   For training on BioCreative VI CPR, we follow previous work (Liu et al., 2017; Verga et al., 2018) to take NER loss as additional supervision, though the mention boundaries are known during testing.

$$l_{NER} = -\frac{1}{N} \sum_{n=1}^{N} \log p(t_n \mid s, D; \theta) \tag{14}$$

where $t_n$ is the gold NE tag of $w_n$ with the “BIO” scheme. Both losses are conditionally independent given the deep features produced by our model, and the final loss for BioCreative VI CPR training is

$$l = l_R + l_{NER} \tag{15}$$

7 Experiments

We conduct experiments on two medical benchmarks to test the usefulness of dependency forests.

7.1 Data

BioCreative VI CPR (Krallinger et al., 2017)   This task (Footnote 2: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vi/track-5/) focuses on the relations between chemical compounds (such as drugs) and proteins (such as genes). The full corpus contains 1020, 612 and 800 extracted PubMed (Footnote 3: https://www.ncbi.nlm.nih.gov/pubmed/) abstracts for training, development and testing, respectively. All abstracts are manually annotated with the boundaries of entity mentions and the relations. The data provides three types of NEs: “CHEMICAL”, “GENE-Y” and “GENE-N”, and the relation set contains 5 regular relations (“CPR:3”, “CPR:4”, “CPR:5”, “CPR:6” and “CPR:9”) and the “None” relation.

For efficient generation of dependency structures, we segment each abstract into sentences, keeping only the sentences that contain at least one chemical mention and one protein mention. For any sentence containing several chemical mentions or protein mentions, we keep multiple copies of it, with each copy having a different target mention pair. As a result, we only consider the relations of mentions in the same sentence, assigning the “None” relation to all cross-sentence chemical-protein pairs. By doing this, we effectively sacrifice cross-sentence relations, which has a negative effect on our systems; but this is necessary for efficient generation of dependency structures, since directly parsing a short paragraph is slow and error-prone. (Footnote 4: Peng et al. (2017) describe a solution for cross-sentence cases, which joins different dependency structures by connecting their roots. We leave it for future work.) In total, we obtain 16,107 training, 10,030 development and 14,269 testing instances, of which around 23% have regular relations. The highest achievable recalls for relations on our development and test sets are 92.25 and 92.54, respectively, because of the exclusion of cross-sentence relations in preprocessing. We report F1 scores on the full test set for a fair comparison, using all gold regular relations to calculate recalls.
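The per-pair copying of sentences can be sketched as follows. The instance layout, the function name `make_instances` and the placeholder tokens are hypothetical; the gold label in the toy example is invented to show the default “None” assignment.

```python
from itertools import product


def make_instances(tokens, chemicals, proteins, gold):
    """Create one instance per chemical-protein pair in a sentence.

    chemicals / proteins: lists of (start, end) token spans.
    gold: maps (chemical_span, protein_span) -> relation label; pairs that
    are absent from it receive the "None" relation.
    """
    instances = []
    for chem, prot in product(chemicals, proteins):
        instances.append({
            "tokens": tokens,
            "chemical": chem,
            "protein": prot,
            "relation": gold.get((chem, prot), "None"),
        })
    return instances


# Placeholder sentence with one chemical and two protein mentions.
tokens = ["CHEM_A", "inhibits", "PROT_X", "but", "not", "PROT_Y"]
gold = {((0, 0), (2, 2)): "CPR:4"}
instances = make_instances(tokens, [(0, 0)], [(2, 2), (5, 5)], gold)
```

Sentences with no chemical or no protein mention produce an empty list, which matches the filtering described above.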

Phenotype-Gene relation (PGR) (Sousa et al., 2019)   This dataset concerns the relations between human phenotypes (such as diseases) and human genes, where the relation set is a binary class on whether a phenotype is related to a gene. It has 18,451 silver training instances and 220 high-quality test instances, each containing mention boundary annotations. We hold out the first 15% of the training instances as our development set. Unlike BioCreative VI CPR, almost every relation of PGR is within a single sentence.

7.2 Models

We compare the following models:

  • TextOnly: It does not take dependency structures and directly uses the Bi-LSTM outputs ($h^{(0)}$ in Eq. 3) to make predictions.

  • DepTree: Our baseline using 1-best dependency trees, as shown in Section 4.

  • EdgewisePS and Edgewise: Our models using the forests generated by our Edgewise algorithm with or without parser scores, respectively.

  • KBestEisnerPS and KBestEisner: Our models using the forests generated by our KBestEisner algorithm with or without parser scores, respectively.

7.3 Settings

We take a state-of-the-art deep biaffine parser (Dozat and Manning, 2017), trained on the Penn Treebank (PTB) (Marcus and Marcinkiewicz, 1993) converted to Universal Dependency, to obtain 1-best trees and full search spaces for generating forests. Using standard PTB data split (02–21 for training, 22 for development and 23 for testing), it gives UAS and LAS scores of 95.7 and 94.6, respectively.

For the other hyper-parameters, word embeddings are initialized with the 200-dimensional BioASQ vectors (Footnote 5: http://bioasq.lip6.fr/tools/BioASQword2vec/), pretrained on 10M abstracts of biomedical articles, and are fixed during training. The dimension of hidden vectors in Bi-LSTM is set to 200, and the number of message passing steps is set to 2 based on Zhang et al. (2018b). We use Adam (Kingma and Ba, 2014), with a learning rate of 0.001, as the optimizer. The batch size, coefficient for normalization loss and dropout rate are 20, and 0.1, respectively.

7.4 Analyses of generated forests

$\gamma$   #Edge/#Node   LAS    Conn. Ratio (%)
0.05       2.09          92.5   100.0
0.1        1.57          91.2   99.5
0.2        1.34          90.5   94.2
0.3        1.04          88.0   77.6

$K$        #Edge/#Node   LAS    Conn. Ratio (%)
1          1.00          86.4   100.0
2          1.03          87.3   100.0
5          1.09          89.1   100.0
10         1.14          89.8   100.0

Table 1: Statistics on forests generated with various $\gamma$ (upper half) and $K$ (lower half) on the development set.

Table 1 demonstrates several characteristics of the forests generated by the Edgewise and KBestEisner algorithms of Section 5.1, where “#Edge/#Node” measures the forest density as the number of edges divided by the sentence length, “LAS” represents the oracle LAS score on 100 biomedical sentences with manually annotated dependency trees, and “Conn. Ratio (%)” shows the percentage of forests where both related entity mentions are connected.

Regarding the forest density, forests produced by Edgewise generally contain more edges than those from KBestEisner. Due to the combinatorial property of forests, Edgewise can give many more candidate trees (and sub-trees) for the whole sentence (and each sub-span). This coincides with the fact that the forests generated by Edgewise have higher oracle scores than those generated by KBestEisner.

For connectivity, KBestEisner is guaranteed to generate spanning forests. On the other hand, the connectivity ratio for the forests produced by Edgewise drops as the threshold $\gamma$ increases. More than 94% of the forests remain connected with $\gamma \le 0.2$. Later we will show that good end-task performance can still be achieved with the 94% connectivity ratio. This indicates that losing connectivity for a small portion of the data may not hurt the overall performance.
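One way such a connectivity ratio could be computed is sketched below. It assumes that two mentions count as connected when some path links any of their tokens once arc directions are ignored; the text does not spell out this detail, so the undirected treatment, the function names and the toy inputs are all assumptions.

```python
from collections import defaultdict, deque


def mentions_connected(arcs, span_a, span_b):
    """Breadth-first search over the forest, ignoring arc direction.

    arcs: iterable of (modifier, head) pairs; spans are inclusive (start, end).
    """
    neighbours = defaultdict(set)
    for modifier, head in arcs:
        neighbours[modifier].add(head)
        neighbours[head].add(modifier)
    targets = set(range(span_b[0], span_b[1] + 1))
    start = list(range(span_a[0], span_a[1] + 1))
    seen, queue = set(start), deque(start)
    while queue:
        node = queue.popleft()
        if node in targets:
            return True
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def connectivity_ratio(instances):
    """Percentage of (arcs, span_a, span_b) instances with connected mentions."""
    hits = sum(mentions_connected(*inst) for inst in instances)
    return 100.0 * hits / len(instances)


toy = [
    ([(0, 1), (2, 1)], (0, 0), (2, 2)),  # connected through word 1
    ([(0, 1)], (0, 0), (2, 2)),          # word 2 is isolated
]
ratio = connectivity_ratio(toy)
```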

7.5 Development results

(a) Edgewise
(b) KBestEisner
Figure 3: Development results (F1 score) for our forest generation methods.

Figure 3 shows the development experiments for our forest generation algorithms, where both Edgewise and KBestEisner give consistent improvements over DepTree and TextOnly. Generally, Edgewise gives larger improvements than KBestEisner. The main reason may be that Edgewise generates denser forests, providing richer features. On the other hand, KBestEisner shows only a marginal improvement when $K$ is increased from 5 to 10. This indicates that merging only 10-best trees may be far from sufficient. However, using a much larger $K$ (such as 100) is not practical due to dramatically increased computation time. In particular, the running time of KBestEisner with $K = 10$ is already much longer than that of Edgewise. As a result, Edgewise better serves our goal compared to KBestEisner. This may sound surprising, as Edgewise does not consider tree-level scores. It suggests that relation extraction may not require full dependency tree features. This coincides with previous relation extraction research (Bunescu and Mooney, 2005; Airola et al., 2008), which utilizes the shortest path connecting the two candidate entities in the dependency tree.

Leveraging parser confidence scores also consistently helps both methods. It is especially effective for Edgewise when $\gamma$ is small. This is likely because the parser confidence scores are useful for distinguishing some erroneous dependency arcs when noise is large (e.g. when $\gamma$ is too small). Following the development results, we directly report the performances of EdgewisePS and KBestEisnerPS, setting $\gamma$ and $K$ to 0.2 and 10, respectively, in our remaining experiments.

7.6 Main results on BioCreative VI CPR

Model F1 score
GRU+Attn (Liu et al., 2017)† 49.5
Bran (Verga et al., 2018)† 50.8
TextOnly 50.6
DepTree 51.4
KBestEisnerPS 52.4**
EdgewisePS 53.4**
Table 2: Test results of BioCreative VI CPR. † indicates previously reported numbers. ** means significant over DepTree at $p < 0.01$ with 1000 bootstrap tests (Efron and Tibshirani, 1994).

Table 2 shows the main comparison results on the BioCreative CPR testset, with comparisons to the previous state-of-the-art and our baselines. GRU+Attn (Liu et al., 2017) stacks a self-attention layer on top of GRU (Cho et al., 2014) and embedding layers; Bran (Verga et al., 2018) adopts a biaffine self-attention model to simultaneously extract the relations of all mention pairs. Both methods use only textual knowledge.

TextOnly gives a performance comparable with Bran. With 1-best dependency trees, our DepTree baseline gives better performances than the previous state of the art. This confirms the usefulness of dependency structures and the effectiveness of GRN on encoding these structures. Using dependency forests and parser confidence scores, both KBestEisnerPS and EdgewisePS obtain significantly higher numbers than DepTree. Consistent with the development experiments, EdgewisePS has a higher testset performance than KBestEisnerPS.
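The significance figures in Table 2 rely on bootstrap resampling. A minimal paired-bootstrap sketch is given below; the exact test variant is not specified in the text, so the resampling scheme, the micro-averaged F1 over non-“None” labels and the function names are assumptions made for illustration.

```python
import random


def f1(gold, pred, none="None"):
    """Micro F1 over regular (non-"None") relations."""
    tp = sum(g == p != none for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / sum(p != none for p in pred)
    recall = tp / sum(g != none for g in gold)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap(gold, baseline, system, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which the system does NOT beat
    the baseline; a small value suggests a significant improvement."""
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        g = [gold[i] for i in idx]
        if f1(g, [system[i] for i in idx]) <= f1(g, [baseline[i] for i in idx]):
            not_better += 1
    return not_better / n_samples


# Extreme toy case: the system is always right, the baseline never is.
gold = ["CPR:4"] * 10
p_value = paired_bootstrap(gold, ["None"] * 10, gold)
```

Both systems are scored on the same resampled indices in each round, which is what makes the comparison paired.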

7.7 Analysis

Effectiveness on parsing accuracy

We have shown in Sections 7.5 and 7.6 that a dependency parser trained using a domain-general treebank can produce high-quality dependency forests in a target domain (biomedical) for helping relation extraction. This is based on the assumption that a high-quality treebank of a decent scale is available, which may not be true for low-resource languages. We simulate this low-resource effect by training our parser on much smaller treebanks of 1K or 5K dependency trees, respectively. The LAS scores for the resulting parsers on our 100 manually annotated biomedical dependency trees are 79.3 and 84.2, respectively, while the LAS score for the parser trained with the full treebank is 86.4, as shown in Table 1.

Figure 4 shows the results on the BioCreative CPR development set, where the performance of TextOnly is 51.6. DepTree fails to outperform TextOnly when only 1K or 5K dependency trees are available for training our parser. This is due to the low parsing recall and subsequent noise caused by the weak parsers. It confirms the previous conclusion that dependency structures are highly influential to the performance of relation extraction. Both EdgewisePS and KBestEisnerPS are still more effective than DepTree. In particular, KBestEisnerPS significantly improves over TextOnly with 5K dependency trees, and EdgewisePS is helpful even with 1K dependency trees.

Figure 4: Dev results of BioCreative CPR regarding the dependency parsers trained on different numbers (1K, 5K or Full) of dependency trees.

KBestEisner shows relatively smaller gaps than Edgewise when only a limited number of dependency trees are available. This is probably because considering whole-tree quality helps to better eliminate noise.

Case study   Figure 5 illustrates two major types of errors in BioCreative CPR, which are caused by inaccurate 1-best dependency trees. As shown in Figure 5(a), the baseline system mistakenly predicts a “None” relation for that instance. This is mainly because “STAT3” is incorrectly linked to the main verb “inhibited” with a “punct” relation, but it should be linked to “AKT”. In contrast, our forest contains the correct relation with a probability of 0.18. This is possibly because “AKT and STAT3” fits the common pattern of “A and B” that conjoins two nouns.

Figure 5: Two representative cases in BioCreative CPR, contrasting 1-best trees and forests, where irrelevant content and arcs are omitted for simplicity.

Model F1 score
BO-LSTM (Lamurias et al., 2019)† 52.3
BioBERT (Lee et al., 2019)† 67.2
TextOnly 76.0
DepTree 78.9
KBestEisnerPS 83.6*
EdgewisePS 85.7**
Table 3: Main results on PGR testset. † denotes previous numbers rounded to 3 significant digits. * and ** indicate significance over DepTree at $p < 0.05$ and $p < 0.01$ with 1000 bootstrap tests.

Figure 5(b) shows another type of parsing error that causes end-task mistakes. In this example, the multi-token mention “calcium modulated cyclases” is incorrectly segmented in the 1-best dependency tree, where “modulated” is used as the main verb of the whole sentence, leaving “cyclases” and “calcium” as the object and the modifier of the subject, respectively. However, this mention ought to be a noun phrase with “cyclases” being the head. Our forest helps in this case by providing a more reasonable structure (shown as the yellow dashed arcs), where both “calcium” and “modulated” modify “cyclases”. This is likely because “modulated” can be interpreted as an adjective in addition to being a verb. It shows the advantage of keeping multiple candidate syntactic arcs.

7.8 Main results on PGR

Table 3 shows the comparison with previous work on the PGR testset, where our models are significantly better than the existing models. This is likely because the previous models do not utilize all the information from inputs: BO-LSTM only takes the words (without arc labels) along the shortest dependency path between the target mentions; the pretrained weights of BioBERT are kept constant during training for relation extraction.

With 1-best trees, DepTree is 2.9 points better than TextOnly, confirming the usefulness of dependency structures. Leveraging dependency forests, both KBestEisnerPS and EdgewisePS significantly outperform DepTree with $p$-values of 0.003 and 0.024, respectively. This further confirms the usefulness of dependency forests for medical relation extraction.

7.9 Main results on SemEval-2010 task 8

Model F1 score
C-GCN (Zhang et al., 2018b)† 84.8
C-AGGCN (Guo et al., 2019)† 85.7
DepTree 84.6
KBestEisnerPS 85.8
EdgewisePS 86.3
Table 4: Main results on SemEval-2010 task 8 testset. † denotes previous numbers.

In addition to the biomedical domain, leveraging dependency forests applies to other domains as well. As shown in Table 4, we conduct a preliminary study on SemEval-2010 task 8 (Hendrickx et al., 2009), a widely used benchmark for news-domain relation extraction. It is a public dataset, containing 10,717 instances (8,000 for training and development, 2,717 for testing) with 19 relations: 9 directed relations and a special “Other” class. Both C-GCN and C-AGGCN use a network similar to ours, stacking a graph neural network for encoding trees on top of a Bi-LSTM layer for encoding sentences.

DepTree achieves performance similar to C-GCN and is slightly worse than C-AGGCN, with one potential reason being that C-AGGCN takes more parameters. Using forests, both KBestEisnerPS and EdgewisePS outperform DepTree with the same number of parameters, and their performances are comparable to and slightly better than that of C-AGGCN, respectively. Again, EdgewisePS is better than KBestEisnerPS, showing that the former is a better way of generating forests.

8 Conclusion

We have proposed two algorithms for generating high-quality dependency forests for relation extraction, and studied a graph recurrent network for effectively distinguishing useful features from noise in parsing forests. Experiments on two biomedical relation extraction benchmarks show the superiority of forests versus tree structures, without introducing any additional model parameters. Our deep analyses indicate that the main advantage comes from alleviating out-of-domain parsing errors.


Research supported by NSF award IIS-1813823.


  • A. B. Abacha and P. Zweigenbaum (2011) Automatic extraction of semantic relations between medical entities: a rule based approach. Journal of biomedical semantics 2 (5). Cited by: §1.
  • A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski (2008) All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC bioinformatics 9 (11), pp. S2. Cited by: §5.1, §7.5.
  • J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Simaan (2017) Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Cited by: §2, §3.
  • D. Beck, G. Haffari, and T. Cohn (2018) Graph-to-sequence learning using gated graph neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §3.
  • R. Bunescu and R. Mooney (2005) A shortest path dependency kernel for relation extraction. In Proceedings of the conference on human language technology and empirical methods in natural language processing, Cited by: §1, §5.1, §7.5.
  • M. Candito, E. H. Anguiano, and D. Seddah (2011) A word clustering approach to domain adaptation: effective parsing of biomedical texts. In Proceedings of the 12th International Conference on Parsing Technologies, Cited by: §1.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §7.6.
  • F. Md. Chowdhury, A. Lavelli, and A. Moschitti (2011) A study on dependency tree kernels for automatic extraction of protein-protein interaction. In Proceedings of BioNLP 2011 Workshop, Cited by: §5.1.
  • A. Culotta and J. Sorensen (2004) Dependency tree kernels for relation extraction. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, Cited by: §1.
  • T. Dozat and C. D. Manning (2017) Deep biaffine attention for neural dependency parsing. In Proceedings of International Conference on Learning Representations, Cited by: §7.3.
  • B. Efron and R. J. Tibshirani (1994) An introduction to the bootstrap. CRC press. Cited by: Table 2.
  • J. M. Eisner (1996) Three new probabilistic models for dependency parsing: an exploration. In Proceedings of the 16th conference on Computational linguistics-Volume 1, Cited by: §5.1.
  • C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17 (1). Cited by: §1.
  • M. R. Gormley, M. Yu, and M. Dredze (2015) Improved relation extraction with feature-rich compositional embedding models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1774–1784. Cited by: §1, §5.1.
  • Z. Guo, Y. Zhang, and W. Lu (2019) Attention guided graph convolutional networks for relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 241–251. Cited by: Table 4.
  • I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz (2009) SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In Proceedings of SemEval, pp. 94–99. Cited by: §7.9.
  • L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia (2005) Overview of BioCreative: critical assessment of information extraction for biology. BioMed Central. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8). Cited by: §4.2.
  • L. Huang and D. Chiang (2005) Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology, Cited by: §5.1, §5.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §7.3.
  • N. Kitaev and D. Klein (2018) Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §1.
  • M. Krallinger, O. Rabal, S. A. Akhondi, et al. (2017) Overview of the BioCreative VI chemical-protein interaction track. In Proceedings of the VI BioCreative challenge evaluation workshop, Cited by: §1, §7.1.
  • A. Lamurias, D. Sousa, L. A. Clarke, and F. M. Couto (2019) BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies. BMC bioinformatics 20 (1), pp. 10. Cited by: Table 3.
  • P. Le and W. Zuidema (2015) The forest convolutional network: compositional distributional semantics with a neural chart and without binarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Cited by: §2.
  • M. Lease and E. Charniak (2005) Parsing biomedical literature. In International Conference on Natural Language Processing, Cited by: §1.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2019) BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746. Cited by: Table 3.
  • J. Liu and Y. Zhang (2017) In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics 5. Cited by: §1.
  • S. Liu, F. Shen, Y. Wang, M. Rastegar-Mojarad, R. K. Elayavilli, V. Chaudhary, and H. Liu (2017) Attention-based neural networks for chemical protein relation extraction. In Proceedings of the BioCreative VI Workshop, Cited by: §6, §7.6, Table 2.
  • Y. Liu, F. Wei, S. Li, H. Ji, M. Zhou, and H. Wang (2015) A dependency-based neural network for relation classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Cited by: §1.
  • W. Lu and H. T. Ng (2011) A probabilistic forest-to-string model for language generation from typed lambda calculus expressions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cited by: §2.
  • C. Ma, A. Tamura, M. Utiyama, T. Zhao, and E. Sumita (2018) Forest-based neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §2.
  • D. Marcheggiani and I. Titov (2017) Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Cited by: §2, §3, §4.2.
  • M. P. Marcus and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn treebank. Computational Linguistics 19 (2). Cited by: §7.3.
  • D. McClosky and E. Charniak (2008) Self-training for biomedical parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Cited by: §1.
  • F. Mehryary, J. Björne, S. Pyysalo, T. Salakoski, and F. Ginter (2016) Deep learning with minimal training data: TurkuNLP entry in the BioNLP shared task 2016. In Proceedings of the 4th BioNLP Shared Task Workshop, Cited by: §5.1.
  • H. Mi, L. Huang, and Q. Liu (2008) Forest-based translation. In Proceedings of ACL-08: HLT, Cited by: §2.
  • M. Miwa and M. Bansal (2016) End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §1.
  • N. Peng, H. Poon, C. Quirk, K. Toutanova, and W. Yih (2017) Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5, pp. 101–115. Cited by: §1, §2, §4, footnote 4.
  • C. Quirk and H. Poon (2017) Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of the 15th Conference of the European Chapter of the ACL (EACL-17), Cited by: §1, §3.
  • K. Sagae, Y. Miyao, R. Sætre, and J. Tsujii (2008) Evaluating the effects of treebank size in a practical application for parsing. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, Cited by: §1.
  • P. Sondhi, M. Gupta, C. Zhai, and J. Hockenmaier (2010) Shallow information extraction from medical forum data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Cited by: §1.
  • L. Song, D. Gildea, Y. Zhang, Z. Wang, and J. Su (2019) Semantic neural machine translation using AMR. Transactions of the Association for Computational Linguistics 7, pp. 19–31. Cited by: §2, §3.
  • L. Song, Y. Zhang, Z. Wang, and D. Gildea (2018a) A graph-to-sequence model for AMR-to-text generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1616–1626. Cited by: §3.
  • L. Song, Y. Zhang, Z. Wang, and D. Gildea (2018b) N-ary relation extraction using graph-state LSTM. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2226–2235. Cited by: §1, §2, §4.
  • D. Sousa, A. Lamúrias, and F. M. Couto (2019) A silver standard corpus of human phenotype-gene relations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Cited by: §1, §7.1.
  • Z. Tu, Y. Liu, Y. Hwang, Q. Liu, and S. Lin (2010) Dependency forest for statistical machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Cited by: §2.
  • P. Verga, E. Strubell, and A. McCallum (2018) Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Cited by: §6, §7.6, Table 2.
  • H. Xu, S. P. Stenner, S. Doan, K. B. Johnson, L. R. Waitman, and J. C. Denny (2010) MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association 17 (1). Cited by: §1.
  • K. Xu, Y. Feng, S. Huang, and D. Zhao (2015a) Semantic relation classification via convolutional neural networks with simple negative sampling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Cited by: §1.
  • K. Xu, L. Wu, Z. Wang, M. Yu, L. Chen, and V. Sheinin (2018) Exploiting rich syntactic information for semantic parsing with graph-to-sequence model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Cited by: §2.
  • Y. Xu, L. Mou, G. Li, Y. Chen, H. Peng, and Z. Jin (2015b) Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Cited by: §1.
  • Y. Yin, L. Song, J. Su, J. Zeng, C. Zhou, and J. Luo (2019) Graph-based neural sentence ordering. In Proceedings of IJCAI, Cited by: §2.
  • H. Yu and E. Agichtein (2003) Extracting synonymous gene and protein terms from biological literature. Bioinformatics 19. Cited by: §1.
  • P. Zaremoodi and G. Haffari (2018) Incorporating syntactic uncertainty in neural machine translation with a forest-to-sequence model. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1421–1429. Cited by: §2.
  • Y. Zhang, Q. Liu, and L. Song (2018a) Sentence-state LSTM for text representation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 317–327. Cited by: §3.
  • Y. Zhang, P. Qi, and C. D. Manning (2018b) Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Cited by: §1, §7.3, Table 4.