
Combining Long Short Term Memory and Convolutional Neural Network for Cross-Sentence n-ary Relation Extraction

We propose in this paper a combined model of Long Short-Term Memory and Convolutional Neural Networks (LSTM-CNN) that exploits word embeddings and positional embeddings for cross-sentence n-ary relation extraction. The proposed model brings together the properties of both LSTMs and CNNs to simultaneously exploit long-range sequential information and capture the most informative features, both essential for cross-sentence n-ary relation extraction. The LSTM-CNN model is evaluated on a standard dataset for cross-sentence n-ary relation extraction, where it significantly outperforms baselines such as CNNs, LSTMs and also a combined CNN-LSTM model. The paper also shows that the LSTM-CNN model outperforms the current state-of-the-art methods on cross-sentence n-ary relation extraction.






Research in the field of relation extraction has largely focused on identifying binary relations between two entities in a single sentence, known as intra-sentence relation extraction [Bach and Badaskar2007]. However, relations can exist among more than two entities appearing across consecutive sentences. For example, in the text span comprising the two consecutive sentences in Listing 1, there exists a ternary relation response across the three entities EGFR, L858E and gefitinib, which appear across the sentences. This relation extraction task, focusing on identifying relations among more than two entities, appearing either in a single sentence or across sentences, is known as cross-sentence n-ary relation extraction.

Listing 1: Text span of two consecutive sentences

  1. The deletion mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted in 10. All patients were treated with gefitinib and showed a partial response.

This paper focuses on the cross-sentence n-ary relation extraction task. Formally, let {e_1, …, e_n} be the set of entities in a text span containing m consecutive sentences. For example, in the text span comprising 2 sentences (m = 2) in Listing 1 above, cancer patients with the mutation (L858E) in the gene (EGFR) showed a partial response to the drug (gefitinib). Thus, a ternary relation response(EGFR, L858E, gefitinib) exists among the three entities spanning the two sentences in Listing 1. The entities in the text span can appear either in a single sentence (m = 1) or in multiple sentences (m > 1). Thus, given an instance defined as the combined sequence of tokens x in the m consecutive sentences and a set of entities {e_1, …, e_n}, the cross-sentence n-ary relation extraction task is to predict an n-ary relation (if one exists) among the entities in x.

Cross-sentence n-ary relation extraction is particularly challenging compared to intra-sentence relation extraction for several reasons. Lexico-syntactic pattern-based relation extraction methods [Hearst1992, Brin1998, Agichtein and Gravano2000] have been shown to be highly effective for intra-sentence relation extraction. Unfortunately, such pattern-based methods cannot be readily applied to cross-sentence n-ary relation extraction because it is difficult to match lexico-syntactic patterns across longer text spans covering multiple sentences. Features extracted from the dependency parse trees of individual sentences [Culotta and Sorensen2004, Bunescu and Mooney2005, Fundel, Küffner, and Zimmer2006, Xu et al.2015, Miwa and Bansal2016] have been found to be extremely useful for intra-sentence relation extraction. However, it is not obvious how to merge dependency parse trees from different sentences to extract path-based features for cross-sentence relation extraction. Moreover, difficulties in coreference resolution and discourse analysis further complicate the problem of detecting relations among entities across sentences [Elango2005].

The principal challenges for cross-sentence n-ary relation extraction arise from (a) the difficulty of handling the long-range sequences that result from combining multiple sentences, (b) modeling the contexts of words related to different entities present in different sentences, and (c) representing a variable-length text span containing an n-ary relation with a fixed-length representation. To address these issues, we propose a combined model consisting of a Long Short-Term Memory unit and a Convolutional Neural Network (lstm_cnn) that exploits both word embedding and positional embedding features for cross-sentence n-ary relation extraction. The LSTM is used as the first layer to encode the combined set of sentences expressing an n-ary relation, thereby capturing long-range sequential information. The hidden state representations obtained from the lstm are then fed to the cnn to identify the most salient features for relation classification. Our main contributions in this paper can be summarised as follows:

  • Propose an lstm_cnn model that exploits word embedding and position embedding features for cross-sentence n-ary relation extraction. We compare the proposed model against multiple baselines such as cnn, lstm and a combined cnn_lstm model. Experimental results show that the proposed model significantly outperforms all baselines.

  • Evaluate the proposed model against the state-of-the-art (SOTA) for cross-sentence n-ary relation extraction on two different benchmark datasets. Results show that the proposed model significantly outperforms the current SOTA methods for cross-sentence n-ary relation extraction.

Related Work

There is a large body of research on intra-sentence relation extraction [Bach and Badaskar2007]. However, our main focus in this paper is cross-sentence relation extraction, so we limit the discussion below to it. Research on cross-sentence relation extraction has extensively used features drawn from dependency trees [Swampillai and Stevenson2010, Quirk and Poon2016, Peng et al.2017], tree kernels [Moschitti, Patwardhan, and Welty2013, Nagesh2016], and graph LSTMs [Peng et al.2017]. Further, studies on inter-sentence relation extraction have limited their attention to extracting binary relations present across sentences [Swampillai and Stevenson2010, Quirk and Poon2016, Moschitti, Patwardhan, and Welty2013, Nagesh2016]. Recently, Peng et al. (2017) proposed graph LSTMs to handle not only binary relations but also n-ary relations across sentences. Although graph LSTMs are useful for modeling n-ary relations across sentences, the process of creating directed acyclic graphs covering words in multiple sentences is complex and error-prone: it is not obvious where to connect two parse trees, and parse errors compound during the graph creation step. Moreover, the co-reference resolution and discourse features used by Peng et al. (2017) do not always improve the performance of cross-sentence relation extraction.

We present a neural network-based approach that does not rely on heavy syntactic features such as dependency trees, co-reference resolution or discourse features for cross-sentence n-ary relation extraction. Although previous studies have explored LSTMs and CNNs separately for cross-sentence n-ary relation extraction, we propose a combined lstm_cnn network that simply takes as input the combined sequence of sentences containing the n-ary relation. While LSTMs generate features that preserve long-range relations among words in the combined sequence of sentences, CNNs can generate different weighted combinations of those features and select the most informative ones via pooling. Although several recent studies have explored combining CNNs and RNNs for various NLP tasks such as text classification [Lai et al.2015, Lee and Dernoncourt2016, Hsu et al.2017] and sentiment analysis [Wang, Jiang, and Luo2016], to the best of our knowledge we are the first to propose a combined lstm_cnn model for cross-sentence n-ary relation extraction.

Cross-Sentence n-ary Relation Extraction

The architecture of the proposed lstm_cnn+wf+pf model, a combined lstm_cnn using word features (wf) and positional features (pf) for cross-sentence n-ary relation extraction, is shown in Figure 1. Next, we describe the different components of the proposed model.

Figure 1: Architecture of the lstm_cnn+wf+pf model for cross-sentence n-ary relation extraction. The input to the network is the sequence of tokens from the text span (with two sentences and three entities) shown in Listing 1. The position features are derived for the highlighted entities (e_1 and e_n).

Input Representation

The input to the lstm_cnn model is the combined sequence of tokens x in a text span comprising m consecutive sentences in which an n-ary relation exists among n entities. The sequence of tokens is transformed into a combination of word embeddings and position embeddings as follows:

Word Embeddings

The transformation of words into lower-dimensional vectors has been observed to capture semantic and syntactic information about the words [Mikolov et al.2013, Pennington, Socher, and Manning2014]. Thus, each word in the combined sequence is mapped to a d_w-dimensional embedding vector using a look-up matrix W ∈ R^{d_w × |V|}, where |V| is the number of unique words in the vocabulary.

Position Features

Following Zeng et al. (2014), positional features (PFs) are used to encode the positions of entities for cross-sentence n-ary relation extraction. Given the n entity mentions in the sequence x, PFs could be defined with respect to each of the n entities; however, the proposed model considers only the first entity e_1 and the last entity e_n when creating position embeddings for the input sequence, because preliminary experiments showed that using all n PFs decreases the performance of the model. Thus, the model defines two sets of PFs for the entities e_1 and e_n, respectively, each encoding the relative distance from the current word to the respective entity. The position embedding matrices are randomly initialised, and the relative distances of words to the entities are transformed into real-valued vectors by look-up into the position embedding matrices.

Thus, for models using position features, an instance is transformed into a matrix X ∈ R^{s×d} combining the word embeddings and position embeddings, where s is the length of the combined token sequence and d = d_w + 2·d_p (d_w and d_p being the dimensions of the word and position embeddings, respectively).
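As a rough illustration of this input representation (a hedged numpy sketch; names such as `position_features`, the clipping range and the toy dimensions are ours, not from the paper's code), the relative distances to e_1 and e_n can be computed and turned into the combined embedding matrix as follows:

```python
import numpy as np

def position_features(seq_len, entity_idx, max_dist=50):
    """Relative distance from each token to the entity at entity_idx,
    clipped to [-max_dist, max_dist] and shifted to be a valid row index."""
    dist = np.arange(seq_len) - entity_idx
    return np.clip(dist, -max_dist, max_dist) + max_dist

def encode_instance(token_ids, e1_idx, en_idx, W_word, W_pos1, W_posn):
    """Concatenate word embeddings with the two position embeddings,
    giving a matrix of shape (s, d_w + 2*d_p)."""
    pf1 = position_features(len(token_ids), e1_idx)
    pfn = position_features(len(token_ids), en_idx)
    return np.concatenate(
        [W_word[token_ids], W_pos1[pf1], W_posn[pfn]], axis=1)

rng = np.random.default_rng(0)
W_word = rng.uniform(-0.25, 0.25, (1000, 300))  # |V| x d_w look-up matrix
W_pos1 = rng.uniform(-0.25, 0.25, (101, 100))   # (2*max_dist+1) x d_p
W_posn = rng.uniform(-0.25, 0.25, (101, 100))
X = encode_instance([4, 17, 42, 7, 99], e1_idx=1, en_idx=4,
                    W_word=W_word, W_pos1=W_pos1, W_posn=W_posn)
# X has shape (5, 300 + 100 + 100) = (5, 500)
```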

LSTM Layer

Although RNNs are useful for learning from sequential data, they are known to suffer from exploding or vanishing gradients, which makes it difficult for them to learn long-distance correlations in a sequence [Hochreiter and Schmidhuber1997, Hochreiter et al.2001]. To specifically address the problem of learning long-range dependencies, the LSTM [Hochreiter and Schmidhuber1997] was proposed; it maintains a separate memory cell that updates and exposes its content only when deemed necessary. Given the long-range sequential information resulting from the combined set of sentences expressing an n-ary relation, the LSTM is an excellent choice for learning long-range dependencies. Thus, as shown in Figure 1, the transformed vector representation combining word embeddings and position features is provided as input to the LSTM layer. The LSTM unit at each time step t is defined as a collection of vectors in R^k comprising the following components: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t and a hidden state h_t, where k is the number of LSTM units and the entries of the gating vectors i_t, f_t and o_t lie in [0, 1]. The three adaptive gates i_t, f_t and o_t depend on the previous hidden state h_{t-1} and the current input x_t (Equations 1-3). The candidate update vector g_t (Equation 4) is also computed for the memory cell:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (1)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (2)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (3)
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g) (4)

The current memory cell c_t is a combination of the previous cell content c_{t-1} and the candidate content g_t, weighted respectively by the forget gate f_t and the input gate i_t (Equation 5):

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t    (5)

The hidden state h_t, which is the output of the LSTM unit, is computed as:

h_t = o_t ⊙ tanh(c_t)    (6)

Here σ denotes the sigmoid function and ⊙ denotes element-wise multiplication.
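A single step of these gate, cell and hidden-state updates can be sketched in plain numpy (a hypothetical illustration with toy dimensions, not the paper's TensorFlow implementation; parameter names W_*, U_*, b_* follow the conventional LSTM notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM time step. P holds the weight matrices W_*, U_*
    and biases b_* for the input/forget/output gates and candidate."""
    i = sigmoid(P["W_i"] @ x_t + P["U_i"] @ h_prev + P["b_i"])  # input gate
    f = sigmoid(P["W_f"] @ x_t + P["U_f"] @ h_prev + P["b_f"])  # forget gate
    o = sigmoid(P["W_o"] @ x_t + P["U_o"] @ h_prev + P["b_o"])  # output gate
    g = np.tanh(P["W_g"] @ x_t + P["U_g"] @ h_prev + P["b_g"])  # candidate
    c = f * c_prev + i * g          # new memory cell content
    h = o * np.tanh(c)              # new hidden state
    return h, c

rng = np.random.default_rng(0)
d, k = 500, 300                     # input and hidden dimensions (toy values)
P = {}
for gate in "ifog":
    P[f"W_{gate}"] = rng.normal(0, 0.1, (k, d))
    P[f"U_{gate}"] = rng.normal(0, 0.1, (k, k))
    P[f"b_{gate}"] = np.zeros(k)

h, c = np.zeros(k), np.zeros(k)
for t in range(5):                  # run over a short toy input sequence
    h, c = lstm_step(rng.normal(size=d), h, c, P)
# h is one hidden-state vector of the sequence fed to the CNN layer
```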

CNN Layer

Let h_i ∈ R^k be the k-dimensional hidden state vector corresponding to the i-th token in the combined sequence x. The hidden state vectors for a sequence of length s are represented as

H = h_1 ⊕ h_2 ⊕ … ⊕ h_s

where ⊕ denotes vector concatenation. In general, let h_{i:i+j} refer to the concatenation of the hidden state vectors h_i, h_{i+1}, …, h_{i+j}. The convolution operation involves a filter w ∈ R^{l·k}, which is applied to a window of l hidden state vectors to generate a new feature. For instance, a feature c_i is generated from a window of hidden state vectors h_{i:i+l-1}:

c_i = f(w · h_{i:i+l-1} + b)

Here b ∈ R is a bias term and f is a non-linear function such as the rectified linear unit (ReLU). This filter is applied to each possible window of hidden state vectors in the combined sequence to produce a feature map

c = [c_1, c_2, …, c_{s-l+1}]

Max-pooling is applied over the feature map to take the maximum value ĉ = max{c} as the feature corresponding to this particular filter. The model uses multiple filters and selects the most important feature (the one with the highest value) for each feature map. Finally, the use of multiple filters with varying window sizes results in a fixed-length vector z ∈ R^{n_f · n_w}, where n_f is the number of filters and n_w is the number of different window sizes.
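The convolution and max-pooling over the LSTM hidden states can be sketched in plain numpy as follows (a hedged illustration; the function name `conv_maxpool` and the toy dimensions are ours, not the paper's code):

```python
import numpy as np

def conv_maxpool(H, filters, window, b):
    """Slide each filter over windows of `window` consecutive hidden
    states, apply ReLU (c_i = relu(w . h_window + b)), then max-pool
    each feature map down to a single scalar."""
    s, k = H.shape
    feats = []
    for w_vec, b_j in zip(filters, b):
        c = [max(0.0, w_vec @ H[i:i + window].ravel() + b_j)
             for i in range(s - window + 1)]
        feats.append(max(c))        # max-pooling over the feature map
    return np.array(feats)

rng = np.random.default_rng(0)
H = rng.normal(size=(40, 300))      # 40 hidden states from the LSTM layer
z_parts = []
for window in (3, 4, 5):            # the window sizes used in the paper
    filters = rng.normal(0, 0.1, (10, window * 300))  # 10 filters per size
    z_parts.append(conv_maxpool(H, filters, window, np.zeros(10)))
z = np.concatenate(z_parts)         # fixed-length vector: 10 filters x 3 sizes
```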

Predicting -ary Relations

The task of predicting n-ary relations is modeled both as a binary and as a multi-class classification problem. The output feature vector z obtained from the convolution and max-pooling operations is passed to a softmax layer to obtain a probability distribution over the relation labels. Dropout [Srivastava et al.2014] is used on the output layer to prevent over-fitting. Thus, given a set of instances, with each instance being a text span comprising m consecutive sentences (combined sequence of tokens x), entity mentions e_1, …, e_n and an n-ary relation label y, the cross-entropy loss for this prediction is defined as:

J(θ) = − Σ_{i=1}^{T} log p(y_i | x_i; θ)

where T indicates the total number of text spans and θ indicates the parameters of the model.
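The softmax layer and cross-entropy objective can be sketched as follows (a minimal numpy illustration; variable names and toy sizes are ours):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(Z, W, b, y):
    """Negative log-likelihood summed over instances: for each pooled
    feature vector z_i, score the labels with W @ z_i + b, normalise
    with softmax, and accumulate -log p(y_i | x_i)."""
    loss = 0.0
    for z_i, y_i in zip(Z, y):
        p = softmax(W @ z_i + b)
        loss -= np.log(p[y_i])
    return loss

rng = np.random.default_rng(0)
n_labels, dim = 5, 30                    # e.g. 4 relation labels + 'none'
W = rng.normal(0, 0.1, (n_labels, dim))  # softmax weight matrix
b = np.zeros(n_labels)
Z = rng.normal(size=(8, dim))            # 8 toy pooled feature vectors
y = rng.integers(0, n_labels, size=8)    # toy gold labels
loss = cross_entropy_loss(Z, W, b, y)    # non-negative scalar
```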

Implementation details

The proposed model is implemented using TensorFlow [Abadi et al.2016] and will be made publicly available upon paper acceptance. The hyper-parameters of the models were set based on preliminary experiments on an independent development dataset. Training was performed using mini-batch stochastic gradient descent (SGD) with a batch size of 10. The models were trained for at most 30 epochs, which was sufficient for convergence. The dimension of the hidden vectors for the LSTM was set to 300. The window sizes for the CNN were set to 3, 4 and 5, and experiments were conducted with the number of filters set to 10 and 128. Word embeddings were initialised using the publicly available 300-dimensional GloVe word vectors trained on a 6-billion-token corpus of Wikipedia and web text [Pennington, Socher, and Manning2014]. The dimension of the position embeddings was set to 100, initialised randomly in [-0.25, 0.25].
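For reference, the reported hyper-parameters can be gathered into a single configuration sketch (key names are illustrative, not taken from the released code):

```python
# Hyper-parameters reported in the paper, collected as one config dict.
# Key names are our own; only the values come from the text above.
CONFIG = {
    "optimizer": "sgd",                 # mini-batch gradient descent
    "batch_size": 10,
    "max_epochs": 30,
    "lstm_hidden_dim": 300,
    "cnn_window_sizes": (3, 4, 5),
    "cnn_num_filters": (10, 128),       # the two settings compared
    "word_embedding": {"type": "glove", "dim": 300,
                       "corpus_tokens": 6_000_000_000},
    "position_embedding": {"dim": 100, "init_range": (-0.25, 0.25)},
    "dropout_layer": "output",
}
```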



Datasets

We conduct experiments using the following datasets.

Quirk and Poon (qp) Dataset

We use the dataset developed by Quirk and Poon (2016) and Peng et al. (2017) for the task of cross-sentence n-ary relation extraction. Distant supervision was used to extract relations involving drug-gene-mutation triples from the biomedical literature available in PubMed Central. The idea of minimal span [Quirk and Poon2016] was used to avoid co-occurrences of the same entity triples, and spans were limited to 3 consecutive sentences to avoid candidates where the triples are far apart in the span. A total of 59 drug-gene-mutation triples was used to obtain 3,462 ternary relation instances and 3,192 binary relation instances (involving drug-mutation pairs) as positive examples. The dataset contains instances with ternary and binary relations appearing either in a single sentence or across sentences. Each instance is labeled with one of four labels: 'resistance', 'resistance or non-response', 'response' and 'sensitivity'; the label 'none' is used for negative instances. Negative examples were generated by randomly sampling co-occurring entity triples without known interactions, following the same restrictions used for obtaining positive examples, and were sampled in the same number as the positive examples to create a balanced dataset.

Chemical Induced Disease (cid) Dataset

We also evaluate the proposed model on the cid dataset, which provides binary relation instances between chemicals and the diseases they induce. We followed the methodology of Gu, Qian, and Zhou (2016) to obtain relation instances from the corpus. Accordingly, for the training, development and test sets respectively, a total of 1206, 1999 and 1330 positive instances were obtained for binary relations in single sentences, and a total of 702, 788 and 786 positive instances for binary relations across sentences. Negative instances were created following the same restrictions, but without any known interactions between the entities.

SemEval-2010 Task 8 (se) Dataset

The SemEval-2010 Task 8 dataset [Hendrickx et al.2009] is a standard dataset used for intra-sentence relation extraction. The se dataset defines 9 relation types between nominals; the relation 'other' is used to denote the negative type. The dataset consists of 8,000 training and 2,717 test sentences.

Evaluation Metrics

We conduct five-fold cross-validation and report average test accuracy over the held-out folds for the qp dataset, since prior work [Peng et al.2017] follows the same evaluation measure. To avoid contamination between training and test data, held-out evaluation in each fold is based on grouping instances by entity pair (binary relations) or entity triple (ternary relations). For example, for binary relations, the instances relating to the first 70% of the entity pairs drawn from a unique list of entity pairs are used as the training set; instances relating to the next 10% and the last 20% are used as the development and test sets, respectively. For the cid dataset, Precision, Recall and F-score on the test set are reported, since the corpus is already divided into training, development and test sets, and since previous studies [Gu, Qian, and Zhou2016, Gu et al.2017, Zhou et al.2016] report performance with the same measures. For the se dataset, we used 10% of randomly selected instances from the training set as the development set. To evaluate on the test set, the official task setting [Hendrickx et al.2009] was followed and we report the official macro-averaged F1-score over the 9 relation types.

Baseline models

The proposed lstm_cnn+wf+pf model is evaluated against the following baseline models: (a) cnn+wf: a cnn model using word features alone; (b) cnn+wf+pf: a cnn model using word features and positional features; (c) lstm+wf: an lstm model using word features alone; (d) lstm+wf+pf: an lstm model using word features and positional features; (e) cnn_lstm+wf: a model that begins with a cnn layer followed by an lstm layer and uses word features only; (f) cnn_lstm+wf+pf: a model that begins with a cnn layer followed by an lstm layer and employs word features and position features; (g) lstm_cnn+wf: a model that begins with an lstm layer followed by a cnn layer and employs word features only.

Results and Discussion

Performance of the proposed model.

The performance of the proposed lstm_cnn+wf+pf model for cross-sentence n-ary relation extraction on the qp dataset is shown in Tables 1 and 2. As seen in Tables 1 and 2, the lstm_cnn+wf+pf model achieves statistically significant improvements in accuracy over all baseline models (cnn+wf, cnn+wf+pf, lstm+wf, lstm+wf+pf, cnn_lstm+wf, cnn_lstm+wf+pf and lstm_cnn+wf), for both cross-sentence ternary and binary relation extraction. That the combined lstm_cnn model outperforms the cnn and lstm models in isolation indicates the usefulness of such combined models for relation extraction. Combining lstm and cnn brings together the strength of lstms in learning from long input sequences and the ability of cnns to identify salient features from the lstm's hidden-state output sequence for cross-sentence n-ary relation extraction.

These results show that a combined model of lstm and cnn, using word features (wf) together with positional features (pf), outperforms the evaluated strong baselines. Interestingly, the use of wf alone already helps the combined model (lstm_cnn) achieve higher performance than the other baselines, particularly for extracting binary relations in single sentences and across sentences, and also ternary relations in single sentences (Tables 1 and 2). However, it is the addition of pf that drastically improves relation extraction performance. The pf clearly helps the combined lstm_cnn model by providing a useful encoding of the positions of words relative to the entities in the text span, which leads to higher accuracy.

Further, the higher performance achieved in extracting both ternary and binary relations, particularly from cross-sentence text spans, which are longer sequences, indicates that the lstm_cnn+wf+pf model is well suited to extracting relations from longer sequences. Furthermore, the lstm_cnn+wf+pf model's superior performance in extracting ternary and binary relations from single sentences also indicates its suitability for relation extraction within single sentences. The evaluation of lstm_cnn+wf+pf on the SemEval-2010 Task 8 dataset (a standard dataset for intra-sentence relation extraction), presented later in this section, further confirms that the combined model (lstm_cnn) performs better than cnn and lstm employed in isolation for relation extraction in single sentences.

Model            single sentence    cross sentences
                 f=10     f=128     f=10     f=128
cnn+wf           72.5     75.5      75.2     76.3
cnn+wf+pf        73.3     73.9      78.5     78.7
lstm+wf          -        75.0      -        78.2
lstm+wf+pf       -        74.5      -        78.9
cnn_lstm+wf      77.6     75.4      76.9     75.3
cnn_lstm+wf+pf   72.0     53.0      76.8     62.6
lstm_cnn+wf      78.3     78.4      77.5     78.8
lstm_cnn+wf+pf   73.1     79.6*     80.5     82.9*

Table 1: Average test accuracy in five-fold cross-validation for drug-gene-mutation ternary interactions on the qp dataset. f - number of filters. The lstm+wf and lstm+wf+pf models do not use filters.
Model            single sentence    cross sentences
                 f=10     f=128     f=10     f=128
cnn+wf           68.9     72.4      73.2     76.6
cnn+wf+pf        74.0     74.2      81.3     81.3
lstm+wf          -        75.4      -        80.3
lstm+wf+pf       -        74.4      -        80.8
cnn_lstm+wf      71.2     72.3      76.5     76.5
cnn_lstm+wf+pf   74.7     56.2      81.2     74.4
lstm_cnn+wf      74.9     76.7      79.7     82.0
lstm_cnn+wf+pf   85.3     85.8*     85.1     88.6*

Table 2: Average test accuracy in five-fold cross-validation for drug-mutation binary interactions on the qp dataset. f - number of filters. The lstm+wf and lstm+wf+pf models do not use filters.

Where exactly does lstm_cnn model score?

To assess the contribution of lstm_cnn+wf+pf against the baseline models, we divided each dataset into three groups based on the distance between the entities e_1 and e_n in the text span. Specifically, we calculated the average number of tokens (x̄) between e_1 and e_n and the standard deviation (σ) over the lengths of the token sequences between e_1 and e_n in the dataset. Thus, if L is the total number of tokens between e_1 and e_n, the dataset was divided into the following three groups: (a) short-distance spans (L < x̄ − σ); (b) medium-distance spans (x̄ − σ ≤ L ≤ x̄ + σ); (c) long-distance spans (L > x̄ + σ). Analysing the performance of the models on groups of spans divided in this manner provides insights into their performance on different sequence lengths and into the contribution of different features for relation extraction.
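The mean-and-standard-deviation grouping can be sketched as follows (a hypothetical helper; the exact inclusivity of the thresholds is our choice):

```python
import numpy as np

def group_spans(lengths):
    """Split spans into short / medium / long groups by mean +/- std
    of the token distance between the first and last entity."""
    lengths = np.asarray(lengths)
    mean, std = lengths.mean(), lengths.std()
    groups = {"short": [], "medium": [], "long": []}
    for L in lengths:
        if L < mean - std:
            groups["short"].append(int(L))
        elif L <= mean + std:
            groups["medium"].append(int(L))
        else:
            groups["long"].append(int(L))
    return groups

# Toy distances between e_1 and e_n for eight spans
g = group_spans([10, 20, 30, 40, 44, 50, 60, 90])
```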

The performance of the various models on the three groups of spans, divided based on the number of tokens between entities e_1 and e_n in the text span, is provided in Table 3. As seen in Table 3, the proposed lstm_cnn+wf+pf model scores higher particularly for medium-distance and long-distance spans. For example, for medium-distance and long-distance spans involving ternary relations across sentences, the lstm_cnn+wf+pf model predicts ternary relations correctly for a higher percentage of spans than any baseline. Similarly, its percentage of correct predictions for binary relation extraction in single sentences and across sentences is significantly higher than that of the other models. These results clearly indicate that the combined lstm_cnn model is more useful than the cnn and lstm models in isolation for cross-sentence n-ary relation extraction, particularly where the distance between the first (e_1) and the last entity (e_n) is large. In other words, the combined lstm_cnn models are more useful in extracting relations from larger spans of consecutive sentences.

Further, the highest margin between lstm_cnn+wf+pf and the baselines is recorded for binary interactions in single sentences and across sentences, with accuracies of 85.8 and 88.6, respectively (Table 2). This is followed by ternary interactions in single sentences and across sentences, with accuracies of 79.6 and 82.9, respectively (Table 1). It is interesting to note that the average number of tokens (x̄) between entities in the text spans relating to binary and ternary interactions in single sentences and across sentences is of the order of 19, 29, 34 and 44, respectively. Based on these results, it can be broadly concluded that the contribution of pf decreases as the distance between entities in the text span increases.

Model            short (%)  medium (%)  long (%)

drug-gene-mutation - ternary relations - cross sentence (x̄=44)
cnn+wf           82.9       74.9        79.8
cnn+wf+pf        84.7       76.5        80.3
lstm+wf          46.2       77.0        79.5
lstm+wf+pf       54.2       77.6        80.4
cnn_lstm+wf      51.4       74.9        79.0
cnn_lstm+wf+pf   86.2       74.8        78.8
lstm_cnn+wf      52.0       76.0        79.1
lstm_cnn+wf+pf   81.3       81.3        82.9

drug-gene-mutation - ternary relations - single sentence (x̄=34)
cnn+wf           20.0       73.1        86.6
cnn+wf+pf        10.0       72.0        83.4
lstm+wf          20.0       73.5        85.8
lstm+wf+pf       20.0       73.0        85.6
cnn_lstm+wf      20.0       76.2        87.3
cnn_lstm+wf+pf   20.0       69.7        88.8
lstm_cnn+wf      20.0       76.8        88.0
lstm_cnn+wf+pf   20.0       79.5        86.6

drug-mutation - binary relations - cross sentence (x̄=29)
cnn+wf           0.0        79.6        78.1
cnn+wf+pf        20.0       83.9        82.7
lstm+wf          20.0       80.7        79.9
lstm+wf+pf       20.0       81.2        80.5
cnn_lstm+wf      20.0       78.0        81.3
cnn_lstm+wf+pf   20.0       84.8        87.3
lstm_cnn+wf      20.0       81.6        83.2
lstm_cnn+wf+pf   20.0       90.9        90.2

drug-mutation - binary relations - single sentence (x̄=19)
cnn+wf           16.1       73.5        66.6
cnn+wf+pf        18.4       74.8        67.3
lstm+wf          17.6       77.7        66.5
lstm+wf+pf       16.9       75.7        64.9
cnn_lstm+wf      15.3       72.7        62.5
cnn_lstm+wf+pf   19.2       76.8        65.8
lstm_cnn+wf      16.1       76.4        67.6
lstm_cnn+wf+pf   17.6       84.9        86.5

Table 3: Performance of the models on the short-, medium- and long-distance groups of spans, where L is the number of tokens between entities e_1 and e_n, x̄ the average number of tokens between e_1 and e_n, and σ the standard deviation over the lengths of tokens.

lstm_cnn vs. cnn_lstm.

The results shown above clearly indicate that it is more useful to start with an lstm layer followed by a cnn layer (lstm_cnn model) than to use a cnn_lstm model for cross-sentence n-ary relation extraction. As seen from Tables 1 and 2, the lstm_cnn models perform significantly better than the cnn_lstm models for both ternary and binary relations, in single sentences and across sentences. An lstm_cnn model is more useful in that it first learns from the sequential information available in the input, which the cnn max-pooling layer then exploits to identify salient features. In the cnn_lstm model, by contrast, although the cnn layer with max-pooling as the first component helps in identifying salient features from the input, the output of the cnn layer does not retain sequential information; feeding this cnn output feature vector, stripped of sequential information, to the lstm layer results in poor performance. This indicates that an lstm_cnn model is more useful than a cnn_lstm model for cross-sentence n-ary relation extraction. Further, as the results show, adding position embeddings to the cnn_lstm model (cnn_lstm+wf+pf) results in poorer performance than using word embeddings alone (cnn_lstm+wf). This is particularly true for ternary relation extraction (Table 1). Also, as seen in Table 1, using a higher number of filters while combining word embeddings and position embeddings dramatically lowers performance, indicating that position embeddings together with a higher number of filters are not useful for cnn_lstm models. However, it is also worth noting that, as seen in Table 3, the cnn_lstm+wf+pf model extracts ternary relations in single sentences for the highest percentage of long-distance spans (88.8%), indicating that cnn_lstm models are useful in certain cases.

cnn and lstm models.

The results provided above clearly show that, when used in isolation, lstm-based models are more useful than cnn-based models for cross-sentence n-ary relation extraction. Interestingly, the use of pf helps only for longer sequences (accuracies of 78.9 (lstm+wf+pf) vs. 78.2 (lstm+wf) and 80.8 (lstm+wf+pf) vs. 80.3 (lstm+wf) for cross-sentence ternary drug-gene-mutation relations (Table 1) and cross-sentence binary drug-mutation relations (Table 2), respectively). For shorter sequences, however, the use of pf decreases accuracy (74.5 (lstm+wf+pf) vs. 75.0 (lstm+wf) and 74.4 (lstm+wf+pf) vs. 75.4 (lstm+wf) for single-sentence ternary (Table 1) and binary (Table 2) relations, respectively). The contribution of wf in the cnn model improves significantly with a higher number of filters, so much so that the model performs better than when combining wf and pf. This is particularly true for extracting ternary relations in single sentences (Table 1).

n-positional embeddings.

Given n entities in the text span, the proposed lstm_cnn+wf+pf model employs only e_1 and e_n to create positional embeddings. However, one could also create n positional embeddings, one for each of the n entities in the text span. To this end, we evaluated the lstm_cnn+wf+pf model using n positional embeddings. The use of n positional embeddings resulted in lower accuracies of 80.5 and 77.9 (compared to 82.5 and 79.6 using position embeddings for e_1 and e_n) for ternary relation extraction across sentences and in single sentences, respectively. This indicates that using positional embeddings only for e_1 and e_n is more useful for cross-sentence relation extraction.

Comparison against the state-of-the-art.

As seen from Table 4, the proposed lstm_cnn+wf+pf model outperforms various state-of-the-art methods for cross-sentence n-ary relation extraction on the qp dataset. These methods include graph lstm [Peng et al.2017], feature-based models [Quirk and Poon2016], RNN-based networks such as bilstm [Miwa and Bansal2016] and tree-lstm, and also multi-task learning combined with bilstm and graph lstm [Peng et al.2017]. The strength of the proposed model comes from the fact that the previous state-of-the-art methods rely heavily on syntactic features such as dependency trees, co-reference and discourse features, which are time-consuming to obtain and less accurate, particularly in the biomedical domain. In contrast, the proposed lstm_cnn+wf+pf model does not use any such sophisticated features, relying instead on the much simpler wf and pf. The ability to deliver significantly higher performance with much simpler features makes the proposed lstm_cnn+wf+pf model an attractive choice for cross-sentence n-ary relation extraction.

Model                               Single Sent.   Cross Sents.
drug-gene-mutation - ternary relations
feature-based                       74.7           77.7
bilstm                              75.3           80.1
graph lstm-embed                    76.5           80.6
graph lstm-full                     77.9           80.7
bilstm+multi-task                   -              82.4
graph lstm+multi-task               -              82.0
lstm_cnn+wf+pf (proposed model)     79.6           82.9
drug-mutation - binary relations
feature-based                       73.9           75.2
bilstm                              73.9           76.0
bilstm-shortest-path                70.2           71.7
tree-lstm                           75.9           75.9
graph lstm-embed                    74.3           76.5
graph lstm-full                     75.6           76.7
bilstm+multi-task                   -              78.1
graph lstm+multi-task               -              78.5
lstm_cnn+wf+pf (proposed model)     85.8           88.5

Table 4: Average test accuracy in five-fold cross-validation of the proposed model and state-of-the-art methods on n-ary cross-sentence relation extraction (q&p dataset)

The performance of the lstm_cnn+wf+pf model on the cid dataset is provided in Table 5. As seen in Table 5, the lstm_cnn+wf+pf model achieves a statistically significant improvement in extracting binary relations from text spans with two sentences (n = 2) over methods based on supervised learning with linguistic features and maximum entropy models. The lstm_cnn+wf+pf model also performs well in extracting binary relations in single sentences (n = 1). The combined lstm_cnn+wf+pf model achieves a higher F-score (0.63) than various state-of-the-art methods on the cid dataset, as shown in Table 5 (note that the state-of-the-art methods listed in Table 5 do not use any knowledge base or the development set for learning the model). The combination of lstm and cnn provides a slight improvement over using cnn and lstm separately on the cid dataset. The cnn-based models proposed by [Nguyen and Verspoor2018] achieve a higher recall, but tend towards a lower precision. The same is the case with cnn+me+pp [Gu et al.2017] and cnn [Zhou et al.2016]. On the other hand, lstms achieve higher precision but suffer from poor recall (lstm, lstm+svmp [Zhou et al.2016]). Compared to cnn models and lstm models, the combined lstm_cnn achieves a higher precision while not losing recall, resulting in a higher F-score on the cid dataset.
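The precision/recall trade-off discussed above follows directly from the F-score being the harmonic mean of P and R, which rewards balanced values over skewed ones. A small illustrative computation (the values are taken from Table 5):

```python
def f1(p, r):
    """F-score as the harmonic mean of precision and recall."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Balanced P/R (the combined lstm_cnn pattern) beats skewed P/R
# even when the skewed pair has a higher arithmetic mean:
print(round(f1(0.63, 0.63), 2))  # → 0.63 (proposed model)
print(round(f1(0.55, 0.68), 2))  # → 0.61 (e.g. cnn+me+pp)
```

This is why the high-recall/low-precision cnn baselines and the high-precision/low-recall lstm baselines both end up below the combined model despite strong individual metrics.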

Model                                              P      R      F
Single sentences (text span where n = 1)
linguistic features [Gu, Qian, and Zhou2016]       0.67   0.68   0.68
cnn [Gu et al.2017]                                0.59   0.55   0.57
lstm_cnn+wf+pf (proposed model)                    0.69   0.70   0.69

Across sentences (text span where n = 2)
linguistic features [Gu, Qian, and Zhou2016]       0.51   0.29   0.37
maximum entropy [Gu et al.2017]                    0.51   0.07   0.11
lstm_cnn+wf+pf (proposed model)                    0.57   0.57   0.57*

All text spans
linguistic features + me [Gu, Qian, and Zhou2016]  0.62   0.55   0.58
cnn+me [Gu et al.2017]                             0.60   0.59   0.60
cnn+me+pp [Gu et al.2017]                          0.55   0.68   0.61
cnn [Zhou et al.2016]                              0.41   0.55   0.47
lstm [Zhou et al.2016]                             0.54   0.51   0.53
lstm+svmp [Zhou et al.2016]                        0.64   0.49   0.56
lstm+svm+pp [Zhou et al.2016]                      0.55   0.68   0.61
svm [Xu et al.2016]                                0.55   0.68   0.61
                                                   0.54   0.69   0.61
cnn+cnnchar [Nguyen and Verspoor2018]              0.57   0.68   0.62
                                                   0.56   0.68   0.62
lstm_cnn+wf+pf (proposed model)                    0.63   0.63   0.63

Table 5: Comparison of the performance of lstm_cnn+wf+pf with state-of-the-art models on the cid dataset. n = number of sentences in the text span; P = precision; R = recall; F = F-score; * denotes a statistically significant improvement.

Performance of the lstm_cnn model on the se dataset.

To examine the performance of the proposed model on a standard relation extraction dataset, the lstm_cnn model was evaluated on the se dataset [Hendrickx et al.2009]. The lstm_cnn+wf and lstm_cnn+wf+pf models achieved F1-scores of 71.6 and 81.5, respectively. These scores are better than the F1-score of 69.7 obtained by the cnn with wf, and 78.9 obtained by the cnn with wf and pf combined, further suggesting that combining lstm and cnn is useful for relation extraction.


To conclude, we presented in this paper a combined lstm_cnn model that exploits both word embeddings and positional embeddings for the task of cross-sentence n-ary relation extraction. The experimental results clearly establish that combining lstms and cnns harnesses both the strength of lstms in learning from longer sequences and the ability of cnns to learn salient features, both of which are vital for cross-sentence n-ary relation extraction. The comparison with state-of-the-art results further confirms the usefulness of the combined lstm and cnn model for cross-sentence n-ary relation extraction.


  • [Abadi et al.2016] Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
  • [Agichtein and Gravano2000] Agichtein, E., and Gravano, L. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries, 85–94. ACM.
  • [Bach and Badaskar2007] Bach, N., and Badaskar, S. 2007. A survey on relation extraction. Language Technologies Institute, Carnegie Mellon University.
  • [Brin1998] Brin, S. 1998. Extracting patterns and relations from the world wide web. In International Workshop on The World Wide Web and Databases, 172–183. Springer.
  • [Bunescu and Mooney2005] Bunescu, R. C., and Mooney, R. J. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the conference on human language technology and empirical methods in natural language processing, 724–731. Association for Computational Linguistics.
  • [Culotta and Sorensen2004] Culotta, A., and Sorensen, J. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd annual meeting on association for computational linguistics, 423. Association for Computational Linguistics.
  • [Elango2005] Elango, P. 2005. Coreference resolution: A survey. University of Wisconsin, Madison, WI.
  • [Fundel, Küffner, and Zimmer2006] Fundel, K.; Küffner, R.; and Zimmer, R. 2006. Relex—relation extraction using dependency parse trees. Bioinformatics 23(3):365–371.
  • [Gu et al.2017] Gu, J.; Sun, F.; Qian, L.; and Zhou, G. 2017. Chemical-induced disease relation extraction via convolutional neural network. Database 2017.
  • [Gu, Qian, and Zhou2016] Gu, J.; Qian, L.; and Zhou, G. 2016. Chemical-induced disease relation extraction with various linguistic features. Database 2016.
  • [Hearst1992] Hearst, M. A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics-Volume 2, 539–545. Association for Computational Linguistics.
  • [Hendrickx et al.2009] Hendrickx, I.; Kim, S. N.; Kozareva, Z.; Nakov, P.; Ó Séaghdha, D.; Padó, S.; Pennacchiotti, M.; Romano, L.; and Szpakowicz, S. 2009. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, 94–99. Association for Computational Linguistics.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • [Hochreiter et al.2001] Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J.; et al. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
  • [Hsu et al.2017] Hsu, S. T.; Moon, C.; Jones, P.; and Samatova, N. 2017. A hybrid cnn-rnn alignment model for phrase-aware sentence classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, 443–449.
  • [Lai et al.2015] Lai, S.; Xu, L.; Liu, K.; and Zhao, J. 2015. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, 2267–2273.
  • [Lee and Dernoncourt2016] Lee, J. Y., and Dernoncourt, F. 2016. Sequential short-text classification with recurrent and convolutional neural networks. arXiv preprint arXiv:1603.03827.
  • [Mikolov et al.2013] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [Miwa and Bansal2016] Miwa, M., and Bansal, M. 2016. End-to-end relation extraction using lstms on sequences and tree structures. arXiv preprint arXiv:1601.00770.
  • [Moschitti, Patwardhan, and Welty2013] Moschitti, A.; Patwardhan, S.; and Welty, C. 2013. Long-distance time-event relation extraction. In IJCNLP, 1330–1338.
  • [Nagesh2016] Nagesh, P. 2016. Exploiting tree kernels for high performance chemical induced disease relation extraction. In 4TH ANNUAL DOCTORAL COLLOQUIUM,  15.
  • [Nguyen and Verspoor2018] Nguyen, D. Q., and Verspoor, K. 2018. Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings. arXiv preprint arXiv:1805.10586.
  • [Peng et al.2017] Peng, N.; Poon, H.; Quirk, C.; Toutanova, K.; and Yih, W.-t. 2017. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics 5:101–115.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.
  • [Quirk and Poon2016] Quirk, C., and Poon, H. 2016. Distant supervision for relation extraction beyond the sentence boundary. arXiv preprint arXiv:1609.04873.
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research 15(1):1929–1958.
  • [Swampillai and Stevenson2010] Swampillai, K., and Stevenson, M. 2010. Inter-sentential relations in information extraction corpora. In LREC.
  • [Wang, Jiang, and Luo2016] Wang, X.; Jiang, W.; and Luo, Z. 2016. Combination of convolutional and recurrent neural network for sentiment analysis of short texts. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2428–2437.
  • [Xu et al.2015] Xu, Y.; Mou, L.; Li, G.; Chen, Y.; Peng, H.; and Jin, Z. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In EMNLP, 1785–1794.
  • [Xu et al.2016] Xu, J.; Wu, Y.; Zhang, Y.; Wang, J.; Lee, H.-J.; and Xu, H. 2016. Cd-rest: a system for extracting chemical-induced disease relation in literature. Database 2016.
  • [Zeng et al.2014] Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; Zhao, J.; et al. 2014. Relation classification via convolutional deep neural network. In COLING, 2335–2344.
  • [Zhou et al.2016] Zhou, H.; Deng, H.; Chen, L.; Yang, Y.; Jia, C.; and Huang, D. 2016. Exploiting syntactic and semantics information for chemical–disease relation extraction. Database 2016.