Attending to All Mention Pairs for Full Abstract Biological Relation Extraction

10/23/2017 ∙ by Patrick Verga, et al. ∙ University of Massachusetts Amherst ∙ Chan Zuckerberg Initiative LLC

Most work in relation extraction forms a prediction by looking at a short span of text within a single sentence containing a single entity pair mention. However, many relation types, particularly in biomedical text, are expressed across sentences or require a large context to disambiguate. We propose a model to consider all mention and entity pairs simultaneously in order to make a prediction. We encode full paper abstracts using an efficient self-attention encoder and form pairwise predictions between all mentions with a bi-affine operation. An entity-pair wise pooling aggregates mention pair scores to make a final prediction while alleviating training noise by performing within document multi-instance learning. We improve our model's performance by jointly training the model to predict named entities and adding an additional corpus of weakly labeled data. We demonstrate our model's effectiveness by achieving the state of the art on the Biocreative V Chemical Disease Relation dataset for models without KB resources, outperforming ensembles of models which use hand-crafted features and additional linguistic resources.




1 Introduction

With few exceptions, nearly all work in relation extraction focuses on classifying a short span of text within a single sentence containing a single entity pair mention. However, relationships between entities are often expressed across sentence boundaries or require a larger context to disambiguate. For example, in the Biocreative V CDR dataset (Section 3), 30% of relations are expressed across sentence boundaries, such as in the following excerpt:

Treatment of psoriasis with azathioprine. Azathioprine treatment benefited 19 (66%) out of 29 patients suffering from severe psoriasis. Haematological complications were not troublesome and results of biochemical liver function tests remained normal. Minimal cholestasis was seen in two cases and portal fibrosis of a reversible degree in eight. Liver biopsies should be undertaken at regular intervals if azathioprine therapy is continued so that structural liver damage may be detected at an early and reversible stage.

Though the entities’ mentions never occur in the same sentence, the above example expresses that the chemical entity azathioprine can cause the side effect fibrosis. To extract relations that are expressed across sentence boundaries, we propose Bi-affine Relation Attention Networks (BRAN), which predict relationships between all mention pairs within a document simultaneously. We efficiently encode full-paper abstracts using self-attention over byte-pair encoded sub-word tokens. This allows the model to consider a wider context between distant mention pairs, as well as correlations between multiple mentions. Making simultaneous predictions also allows us to apply within-document multi-instance learning, which leverages document-level annotation and alleviates the noise caused by a lack of mention-level annotation. We demonstrate state of the art performance for a model using no external knowledge base resources in experiments on the Biocreative V CDR dataset.

2 Model

Our model first contextually encodes input token embeddings. These contextual embeddings are used to predict both entities and relations. The relation extraction module converts each token to a head and tail representation. These representations are used to form mention-pair predictions using a bi-affine operation with respect to learned relation embeddings. Finally the mention-level predictions are pooled to form an entity-level prediction.

Figure 1: The relation extraction architecture. Inputs are contextually encoded using the Transformer (Vaswani et al., 2017). Each transformed token is then passed through a head and tail MLP to produce two separate versions of each token. A bi-affine operation is then performed between each head and tail token with respect to each relation’s embedding matrix, producing a pairwise-relation affinity tensor. Finally, the scores for cells corresponding to the same entity pair are pooled with a separate LogSumExp operation for each relation to get a final score. The colored tokens illustrate calculating the score for a given pair of entities; the model is only given entity information when gathering scores to pool from the affinity matrix.

2.1 Inputs

Our model takes in a sequence of $N$ token embeddings in $\mathbb{R}^d$. Because the original Transformer has no recurrence, convolutions, or other mechanism of modeling position information, the model relies on positional embeddings which are added to the input token embeddings. (Even though our final model incorporates some convolutions, we retain the position embeddings.) We learn a position embedding matrix $P \in \mathbb{R}^{m \times d}$ which contains a separate $d$-dimensional embedding for each position, limited to $m$ possible positions. Our final input representation for token $x_i$ is:

$$x_i = s_i + p_i$$

where $s_i$ is the token embedding for $x_i$ and $p_i$ is the positional embedding for the $i$th position. If $i$ exceeds $m$, we use a randomly initialized vector in place of $p_i$.
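The input representation described above (token embedding plus position embedding, with a fallback vector for positions beyond the learned table) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `embed_inputs`, the zero-based position indexing, and the use of a single shared fallback vector for every overflowing position are assumptions.

```python
import numpy as np

def embed_inputs(token_ids, token_table, pos_table, overflow_vec):
    """Add a position embedding to each token embedding.

    token_table:  (vocab, d) token embeddings
    pos_table:    (m, d) learned position embeddings
    overflow_vec: (d,) vector used for positions >= m (assumed to be
                  one shared vector for all overflowing positions)
    """
    m = pos_table.shape[0]
    tokens = token_table[token_ids]                      # (N, d)
    positions = np.arange(len(token_ids))
    in_range = (positions < m)[:, None]                  # (N, 1)
    # Clamp the index so the lookup is valid; overflowing rows are replaced below.
    pos = np.where(in_range,
                   pos_table[np.minimum(positions, m - 1)],
                   overflow_vec[None, :])                # (N, d)
    return tokens + pos

rng = np.random.default_rng(0)
d, m, vocab = 4, 3, 10
token_table = rng.normal(size=(vocab, d))
pos_table = rng.normal(size=(m, d))
overflow_vec = rng.normal(size=d)
x = embed_inputs(np.array([1, 5, 2, 7, 7]), token_table, pos_table, overflow_vec)
```

In the example, the last two tokens sit past the `m = 3` learned positions, so they receive the fallback vector instead of a learned position embedding.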


We tokenize the text using byte pair encoding (Gage, 1994; Sennrich et al., 2015) which is well suited for biological data for a number of reasons. First, biological entities often have unique mentions made up of meaningful subcomponents, such as ‘1,2-dimethylhydrazine’. By learning sub-word representations the model is able to make predictions on rare or unknown words. Additionally, tokenization of chemical entities is challenging, lacking a universally agreed upon algorithm (Krallinger et al., 2015).

Byte pair encoding constructs a vocabulary of sub-word pieces beginning with single characters. Then, the algorithm iteratively merges the most frequent co-occurring tokens into a new token, which is added to the vocabulary. This procedure continues until a pre-defined vocabulary size is met.
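The merge procedure can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: words are treated independently, "co-occurring" is interpreted as adjacent symbols within a word, and ties between equally frequent pairs are broken lexicographically (the tie-breaking rule is not specified in the text). The function name `learn_bpe` and the toy word list are hypothetical.

```python
from collections import Counter

def learn_bpe(words, vocab_size):
    """Learn BPE merges until the vocabulary reaches vocab_size or no pairs remain."""
    corpus = Counter(tuple(w) for w in words)       # symbol tuple -> frequency
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merged = best[0] + best[1]
        merges.append(best)
        vocab.add(merged)
        # Rewrite every word, replacing occurrences of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = learn_bpe(["methyl", "methyl", "ethyl", "methane"], vocab_size=12)
```

On this toy input the shared chemical stem is merged into a single sub-word unit, which is the behavior that makes the encoding useful for rare entity names.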

2.2 Transformer

We use the Transformer self-attention model (Vaswani et al., 2017) to encode tokens by aggregating over their context in the entire sequence. The Transformer is made up of $B$ blocks. Each Transformer block, $\text{Transformer}_k$, has its own set of parameters and is made up of two subcomponents, multi-head attention and a series of convolutions. The output $b_i^{(k)}$ of block $k$ is connected to its input $b_i^{(k-1)}$ with a residual connection (He et al., 2016):

$$b_i^{(k)} = b_i^{(k-1)} + \text{Transformer}_k\left(b_i^{(k-1)}\right)$$

where $b_i^{(0)}$ is the input representation of token $i$.

2.2.1 Multi-headed Attention

Multi-head attention applies self-attention multiple times over the same inputs using separate parameters (attention heads) and combines the results, as an alternative to applying one pass of attention with more parameters. The intuition behind this modeling decision is that dividing the attention into multiple heads makes it easier for the model to learn to attend to different types of relevant information with each head. The self-attention updates each input token representation by aggregating information from all tokens in the sequence, weighted by their importance.

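Each input is projected to a key $k$, value $v$, and query $q$, using separate affine transformations with ReLU activations. Here $k$, $v$, and $q$ are each in $\mathbb{R}^{d/H}$, where $H$ is the number of heads. The attention weights $a_{ijh}$ for head $h$ between tokens $i$ and $j$ are computed with scaled dot-product attention:

$$a_{ijh} = \sigma\left(\frac{q_{ih}^\top k_{jh}}{\sqrt{d}}\right)$$

$$o_{ih} = \sum_j v_{jh} \odot a_{ijh}$$

where $\odot$ is element-wise multiplication and $\sigma$ indicates a softmax along the $j$th dimension. The scaled attention is meant to aid optimization by flattening the softmax and better distributing the gradients (Vaswani et al., 2017).

The outputs of the individual heads of the multi-headed attention are concatenated, denoted $[\cdot\,;\cdot]$, into $o_i$. All layers in the network use residual connections between the output of the multi-headed attention and its input. Layer normalization (Ba et al., 2016), denoted LN, is then applied to the output:

$$o_i = [o_{i1}; \ldots; o_{iH}]$$

$$m_i = \text{LN}\left(b_i^{(k-1)} + o_i\right)$$

where $b_i^{(k-1)}$ is the block input for token $i$.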
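The attention sub-layer described above can be sketched in NumPy. This is an illustrative sketch, not the released model: bias terms are omitted from the projections, layer normalization is applied without a learned gain or bias, ReLU is assumed as the projection activation, and the logits are scaled by the square root of the model dimension (Vaswani et al. scale by the square root of the per-head key dimension). All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)       # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head_attention(b, Wq, Wk, Wv, num_heads):
    """b: (N, d) block input; Wq, Wk, Wv: (d, d) projection matrices."""
    N, d = b.shape
    dh = d // num_heads

    def project(W):
        # Projection with ReLU, then split into heads: (N, d) -> (H, N, d/H).
        z = np.maximum(b @ W, 0.0)
        return z.reshape(N, num_heads, dh).transpose(1, 0, 2)

    q, k, v = project(Wq), project(Wk), project(Wv)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (H, N, N), scaling is an assumption
    attn = softmax(logits, axis=-1)                  # softmax over attended tokens j
    heads = attn @ v                                 # (H, N, d/H)
    o = heads.transpose(1, 0, 2).reshape(N, d)       # concatenate the heads
    return layer_norm(b + o), attn                   # residual connection, then LN

rng = np.random.default_rng(0)
N, d, H = 5, 8, 2
b = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
m_out, attn = multi_head_attention(b, Wq, Wk, Wv, H)
```

Each row of `attn[h]` is a distribution over all tokens in the sequence, so every token can draw on context from anywhere in the abstract in a single step.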

2.2.2 Feed-Forward

The second part of the transformer block is a stack of convolutional layers. The sub-network used in Vaswani et al. (2017) uses two width-1 convolutions. We add a third middle layer with kernel width 5, which we found to perform better. Many relations are expressed concisely by the immediate local context (“Michele’s husband Barack”, “labetalol-induced hypotension”). Adding this explicit n-gram modeling is meant to ease the burden on the model of learning to attend to local features entirely on its own. We use $C_w(\cdot)$ to denote a convolutional layer with convolutional kernel width $w$. Then the convolutional portion of the transformer block is given by:

$$t_i^{(0)} = \text{ReLU}\left(C_1(m_i)\right)$$

$$t_i^{(1)} = \text{ReLU}\left(C_5(t_i^{(0)})\right)$$

$$t_i^{(2)} = C_1(t_i^{(1)})$$

where $m_i$ is the layer-normalized attention output for token $i$, the dimensions of $t_i^{(0)}$ and $t_i^{(1)}$ are in $\mathbb{R}^{4d}$, and that of $t_i^{(2)}$ is in $\mathbb{R}^d$. ReLU is the rectified linear activation function (Glorot et al., 2011).
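The width-1, width-5, width-1 stack can be sketched as three zero-padded 1-D convolutions over the token axis. This is a minimal NumPy sketch under assumptions: biases are omitted, padding is "same" with zeros, and the hidden width is left as a free parameter of the weight shapes. The helper names `conv1d_same` and `feed_forward` are hypothetical.

```python
import numpy as np

def conv1d_same(x, W):
    """Zero-padded 1-D convolution over the token axis.

    x: (N, d_in); W: (width, d_in, d_out) with odd width. Returns (N, d_out).
    """
    width = W.shape[0]
    pad = width // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    N = x.shape[0]
    # Sum the contribution of each kernel offset.
    return sum(xp[k:k + N] @ W[k] for k in range(width))

def feed_forward(m, W0, W1, W2):
    relu = lambda z: np.maximum(z, 0.0)
    t0 = relu(conv1d_same(m, W0))    # width 1: per-token projection up
    t1 = relu(conv1d_same(t0, W1))   # width 5: mixes a local 5-token window
    return conv1d_same(t1, W2)       # width 1: per-token projection back down

rng = np.random.default_rng(0)
N, d, hidden = 8, 4, 16
m = rng.normal(size=(N, d))
W0 = rng.normal(size=(1, d, hidden))
W1 = rng.normal(size=(5, hidden, hidden))
W2 = rng.normal(size=(1, hidden, d))
out = feed_forward(m, W0, W1, W2)

# Perturbing token 0 can only change outputs within two positions of it.
m_perturbed = m.copy()
m_perturbed[0] += 1.0
out_perturbed = feed_forward(m_perturbed, W0, W1, W2)
```

The perturbation check at the end illustrates the point of the middle layer: each output position sees exactly a 5-token window, which gives the model explicit n-gram features without relying on attention.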

2.3 Bi-affine Pairwise Scores

We project each contextually encoded token through two separate MLPs to generate two new versions of each token, $e_i^{head}$ and $e_i^{tail}$, corresponding to whether it will serve as the first or second argument of a relation.

For each head, tail, relation triple, we calculate a score using a bi-affine operator to create an $N \times L \times N$ tensor $A$ of pairwise affinity scores for a sequence of $N$ tokens:

$$A_{ilj} = \left(e_i^{head} \mathbf{L}\right) e_j^{tail}$$

where $\mathbf{L}$ is a $d \times L \times d$ tensor, a learned embedding matrix for each of the $L$ relations.
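The bi-affine scoring step reduces to a single tensor contraction. The sketch below uses `np.einsum` and assumes the head and tail representations have already been produced by the two MLPs; the function name `biaffine_scores` and the variable names are hypothetical.

```python
import numpy as np

def biaffine_scores(e_head, e_tail, rel):
    """Score every (head token, relation, tail token) triple.

    e_head, e_tail: (N, d) head and tail versions of each token
    rel:            (d, R, d) one d x d matrix per relation
    Returns an (N, R, N) tensor of pairwise affinity scores.
    """
    return np.einsum('id,drk,jk->irj', e_head, rel, e_tail)

rng = np.random.default_rng(0)
N, d, R = 3, 4, 2
e_head = rng.normal(size=(N, d))
e_tail = rng.normal(size=(N, d))
rel = rng.normal(size=(d, R, d))
A = biaffine_scores(e_head, e_tail, rel)
```

Entry `A[i, r, j]` equals `e_head[i] @ rel[:, r, :] @ e_tail[j]`, so all mention pairs and all relations are scored in one pass over the encoded document.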

2.4 Entity Level Prediction

Our data is weakly labeled in that there are labels at the entity level but not the mention level, making the problem a form of strong-distant supervision (Mintz et al., 2009). In distant supervision, edges in a knowledge graph are heuristically applied to sentences in an auxiliary unstructured text corpus, often applying the edge label to all sentences containing the subject and object of the relation. Because this process is imprecise and introduces noise into the training data, methods like multi-instance learning were introduced (Riedel et al., 2010; Surdeanu et al., 2012). In multi-instance learning, rather than looking at each distantly labeled mention pair in isolation, the model is trained over the aggregate of these mentions and a single update is made. More recently, the weighting function of the instances has been expressed as neural network attention (Verga and McCallum, 2016; Lin et al., 2016; Yaghoobzadeh et al., 2017).

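We aggregate over all representations for each mention pair in order to produce per-relation scores for each entity pair. For each entity pair $(p^{head}, p^{tail})$, we select the vectors in the pairwise affinity tensor $A$ where $i \in p^{head}$, $j \in p^{tail}$:

$$\text{scores}(p^{head}, p^{tail}) = \log \sum_{i \in p^{head},\, j \in p^{tail}} \exp\left(A_{ij}\right)$$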

The LogSumExp scoring function is a smooth approximation to the max function and has the benefits of aggregating information from multiple predictions and propagating dense gradients as opposed to the sparse gradient updates of the max (Das et al., 2017).
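The pooling step can be sketched as a numerically stable LogSumExp over the cells of the affinity tensor that belong to one entity pair, computed separately for each relation. The function name `entity_pair_scores` and the mention index lists in the example are hypothetical.

```python
import numpy as np

def entity_pair_scores(A, head_mentions, tail_mentions):
    """Pool mention-pair scores into one score per relation.

    A:             (N, R, N) pairwise affinity tensor
    head_mentions: token indices of the head entity's mentions
    tail_mentions: token indices of the tail entity's mentions
    Returns an (R,) vector of entity-level scores.
    """
    sub = A[head_mentions][:, :, tail_mentions]          # (|head|, R, |tail|)
    peak = sub.max(axis=(0, 2), keepdims=True)           # subtract max for stability
    pooled = peak + np.log(np.exp(sub - peak).sum(axis=(0, 2), keepdims=True))
    return pooled.reshape(-1)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2, 6))
head_mentions, tail_mentions = [0, 3], [2, 4, 5]
scores = entity_pair_scores(A, head_mentions, tail_mentions)
max_scores = A[head_mentions][:, :, tail_mentions].max(axis=(0, 2))
```

The pooled score is always at least the maximum mention-pair score and exceeds it by at most the log of the number of pooled cells, which is the sense in which LogSumExp is a smooth stand-in for max.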

2.5 Named Entity Recognition

In addition to making pair-level relation predictions, the final transformer output $b_i^{(B)}$ can be used to make entity type predictions. Our model uses a linear classifier which takes $b_i^{(B)}$ as input and predicts the entity label for each token to produce per-class scores $c_i$:

$$c_i = W_{\text{NER}}\, b_i^{(B)}$$

We encode entity labels using the BIO encoding. We apply tags to the byte-pair tokenization by treating each sub-word within a mention span as an additional token with a corresponding B or I label.
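Projecting mention spans onto sub-word tokens can be sketched as follows. This is an illustrative helper, not the authors' preprocessing code: the function name `bio_tags`, the span format (start and exclusive end in sub-word indices), and the sub-word segmentation in the example are all hypothetical.

```python
def bio_tags(pieces, mentions):
    """Assign a BIO tag to every sub-word token.

    pieces:   list of sub-word tokens
    mentions: list of (start, end, entity_type) with end exclusive,
              expressed in sub-word indices
    """
    tags = ["O"] * len(pieces)
    for start, end, entity_type in mentions:
        tags[start] = f"B-{entity_type}"          # first sub-word of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"          # remaining sub-words
    return tags

# Hypothetical segmentation; a learned BPE vocabulary may split differently.
pieces = ["aza", "thio", "prine", "can", "cause", "fib", "rosis"]
tags = bio_tags(pieces, [(0, 3, "Chemical"), (5, 7, "Disease")])
# tags == ["B-Chemical", "I-Chemical", "I-Chemical", "O", "O", "B-Disease", "I-Disease"]
```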

We train the NER and relation objectives jointly, sharing all embeddings and Transformer parameters. We penalize the named entity updates with a hyperparameter $\lambda$.


3 Experiments

We perform experiments on the Biocreative V chemical disease relation extraction (CDR) dataset (Li et al., 2016a; Wei et al., 2016). The dataset was derived from the Comparative Toxicogenomics Database (CTD), which curates interactions between genes, chemicals, and diseases (Davis et al., 2008). These annotations are only at the document level and do not contain mention annotations. The CDR dataset is a subset of these original annotations supplemented with human-annotated, entity-linked mention annotations. The relation annotations in this dataset are also at the document level only. In addition to the gold CDR data, Peng et al. (2016) add 15,448 additional PubMed abstracts annotated in the CTD dataset. We consider this same set of abstracts as additional training data (which we denote +Data). Since this data does not contain entity annotations, we take the annotations from Pubtator (Wei et al., 2013), a state of the art biological named entity tagger and entity linker. In our experiments we only evaluate our relation extraction performance, and all models (including baselines) use gold entity annotations for predictions. We compare against the previous best reported results on this dataset not using knowledge base features. (The highest reported score is from Peng et al. (2016) but uses explicit lookups into the CTD knowledge base for the existence of the test entity pair.) Each of the baselines is an ensemble method that makes use of additional parse and part-of-speech features. Gu et al. (2017) use a CNN sentence classifier while Zhou et al. (2016a) use an LSTM. Both make cross-sentence predictions with featurized classifiers.

3.1 Results

In Table 1 we show results outperforming the baselines despite using no syntactic or linguistic features. We show performance averaged over 20 runs with 20 random seeds as well as an ensemble of their averaged predictions. We see a further boost in performance by adding in the additional weakly labeled data. Table 2 shows the effects of removing pieces of our model. ‘CNN only’ removes the multi-head attention component from the transformer block, ‘no width-5’ replaces the width-5 convolution of the feed-forward component of the transformer with a width-1 convolution, and ‘no NER’ removes the named entity recognition multi-task objective (Section 2.5).


Model                  P      R      F1
Gu et al. (2016)       62.0   55.1   58.3
Zhou et al. (2016a)    55.6   68.4   61.3
Gu et al. (2017)       55.7   68.1   61.3
BRAN                   55.6   70.8   62.1 ± 0.8
  + Data               64.0   69.2   66.2 ± 0.8
BRAN (ensemble)        63.3   67.1   65.1
  + Data               65.4   71.8   68.4
Table 1: Precision, recall, and F1 results on the Biocreative V CDR Dataset.

Model                  P      R      F1
BRAN (Full)            55.6   70.8   62.1 ± 0.8
  – CNN only           43.9   65.5   52.4 ± 1.3
  – no width-5         48.2   67.2   55.7 ± 0.9
  – no NER             49.9   63.8   55.5 ± 1.8
Table 2: Results on the Biocreative V CDR Dataset showing precision, recall, and F1 for various model ablations.

3.2 Implementation Details

The CDR dataset is concerned with extracting only chemically induced disease relationships (drug-related side effects and adverse reactions) concerning the most specific entity in the document. For example, ‘tobacco causes cancer’ could be marked as false if the document contained the more specific ‘lung cancer.’ This can cause true relations to be labeled as false, harming evaluation performance. To address this we follow Gu et al. (2016, 2017) and filter hypernyms according to the hierarchy in the MeSH controlled vocabulary. All entity pairs within the same abstract that do not have an annotated relation are assigned the NULL label.
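One way such a hypernym filter could work is sketched below. This is an illustration under assumptions, not the procedure of Gu et al.: it assumes each disease has one or more hierarchical tree numbers in which an ancestor's number is a dotted prefix of its descendants', and it drops an unannotated chemical-disease pair when the same chemical has an annotated relation with a more specific disease. The function names and the tree numbers in the example are made up.

```python
def is_hypernym(general, specific, tree_numbers):
    """True if a tree number of `general` is a strict ancestor of one of `specific`."""
    return any(s.startswith(g + ".")
               for g in tree_numbers[general]
               for s in tree_numbers[specific])

def filter_hypernym_negatives(candidates, positives, tree_numbers):
    """Drop unannotated pairs whose disease is a hypernym of an annotated disease."""
    kept = []
    for chemical, disease in candidates:
        if (chemical, disease) in positives:
            kept.append((chemical, disease))
            continue
        shadowed = any(c == chemical and is_hypernym(disease, d, tree_numbers)
                       for c, d in positives)
        if not shadowed:
            kept.append((chemical, disease))   # remains as a NULL-labeled pair
    return kept

# Made-up tree numbers for illustration only.
tree_numbers = {"cancer": ["X1"], "lung cancer": ["X1.2"], "cough": ["Y3"]}
candidates = [("tobacco", "cancer"), ("tobacco", "lung cancer"), ("tobacco", "cough")]
positives = {("tobacco", "lung cancer")}
kept = filter_hypernym_negatives(candidates, positives, tree_numbers)
# kept == [("tobacco", "lung cancer"), ("tobacco", "cough")]
```

The pair (tobacco, cancer) is removed rather than labeled NULL, so the model is not penalized for predicting a relation that is true but less specific than the annotation.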

The model is implemented in Tensorflow (Abadi et al., 2015). The byte pair vocabulary is generated over the training dataset, either just the gold CDR data with budget 2500 or gold CDR data plus additional data from Section 3 (Peng et al., 2016) with budget 10000. All embeddings are 64 dimensional. Token embeddings are pre-trained using skipgram (Mikolov et al., 2013) over a random subset of 10% of all PubMed abstracts with window size 10 and 20 negative samples. The number of transformer block repeats is . We optimize the model using Adam (Kingma and Ba, 2015) with best parameters chosen for , , chosen from the development set. The learning rate is set to and batch size 32. In all of our experiments we set the number of attention heads to .

We clip the gradients to norm 10 and apply noise to the gradients (Neelakantan et al., 2015) with . We tune the decision threshold and perform early stopping on the development set. We apply dropout (Srivastava et al., 2014) to the input layer, randomly replacing words with a special UNK token with keep probability . We additionally apply dropout to the input (word embedding + position embedding), interior layers, and final state. At each step, we randomly sample a positive or negative (NULL class) minibatch with probability . We merge the train and development sets and randomly take 850 abstracts for training and 150 for early stopping. Our reported results are averaged over 10 runs and using different splits. All baselines train on both the train and development set.
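The gradient clipping and gradient noise steps can be sketched as below. This is a sketch under several assumptions: clipping is by global norm across all parameters, noise is added after clipping, and the noise is zero-mean Gaussian with a variance that decays with the training step as eta / (1 + step)^gamma, the schedule proposed by Neelakantan et al. (2015). The values `eta=0.1` and `gamma=0.55` are placeholders rather than the settings used in the paper, and the function name is hypothetical.

```python
import numpy as np

def clip_and_add_noise(grads, step, max_norm=10.0, eta=0.1, gamma=0.55, rng=None):
    """Clip gradients by global norm, then add annealed Gaussian noise.

    grads: list of NumPy arrays, one per parameter
    step:  zero-based training step used to anneal the noise
    """
    if rng is None:
        rng = np.random.default_rng()
    global_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))   # shrink only if too large
    sigma = np.sqrt(eta / (1.0 + step) ** gamma)         # noise std at this step
    return [g * scale + rng.normal(0.0, sigma, size=g.shape) for g in grads]

# With eta=0 the noise vanishes, which isolates the clipping behavior.
clipped = clip_and_add_noise([np.full(4, 100.0)], step=0, eta=0.0)
noisy = clip_and_add_noise([np.ones((2, 3)), np.ones(5)], step=10,
                           rng=np.random.default_rng(0))
```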

4 Related work

Relation extraction is a heavily studied area in the NLP community. Most work focuses on news and web data (Doddington et al., 2004; Riedel et al., 2010; Hendrickx et al., 2009). There is also a considerable body of work in supervised biological relation extraction including protein-protein (Pyysalo et al., 2007; Poon et al., 2014; Mallory et al., 2015), drug-drug (Segura-Bedmar et al., 2013), and chemical-disease (Gurulingappa et al., 2012; Li et al., 2016a) interactions, and more complex events (Kim et al., 2008; Riedel et al., 2011). Recent neural network approaches to relation extraction have focused on CNNs (dos Santos et al., 2015; Zeng et al., 2015) or LSTMs (Miwa and Bansal, 2016; Verga et al., 2016; Zhou et al., 2016b) and replacing stage-wise information extraction pipelines with a single end-to-end model (Miwa and Bansal, 2016; Ammar et al., 2017; Li et al., 2017).

A few exceptions exist that perform cross-sentence relation extraction (Swampillai and Stevenson, 2011; Quirk and Poon, 2017; Peng et al., 2017). Most similar to our work is Peng et al. (2017), which uses a variant of an LSTM to encode document-level syntactic parse trees. Our work differs in several key ways. It operates over raw tokens, negating the need for part-of-speech or parse features, which can lead to cascading errors. We also use a feed-forward neural architecture which encodes long sequences far more efficiently than the graph LSTM network of Peng et al. (2017). Finally, our model considers all mention pairs rather than a single mention pair at a time.

Pairwise bilinear models have also been used extensively in knowledge graph link prediction (Nickel et al., 2011; Li et al., 2016b), sometimes restricting the bilinear relation matrix to be diagonal (Yang et al., 2015) or diagonal and complex (Trouillon et al., 2016). Our model is similar to recent approaches in neural graph-based parsing where bilinear parameters are used to score a head-dependent relationship (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017).

5 Conclusion

We present a bilinear relation attention network that simultaneously produces predictions for all mention pairs within a document. With this model we are able to outperform the previous state of the art on the Biocreative V CDR dataset. Our model also lends itself to other tasks such as hypernym prediction, coreference resolution, and entity resolution. We plan to investigate these directions in future work.


  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from
  • Ammar et al. (2017) Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. nucleus 2(e2):e2.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 .
  • Das et al. (2017) Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. 2017. Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, Valencia, Spain, pages 132–141.
  • Davis et al. (2008) Allan Peter Davis, Cynthia G Murphy, Cynthia A Saraceni-Richards, Michael C Rosenstein, Thomas C Wiegers, and Carolyn J Mattingly. 2008. Comparative toxicogenomics database: a knowledgebase and discovery tool for chemical–gene–disease networks. Nucleic acids research 37(suppl_1):D786–D792.
  • Doddington et al. (2004) George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extraction (ace) program tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation.
  • dos Santos et al. (2015) Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 626–634.
  • Dozat and Manning (2017) Timothy Dozat and Christopher D Manning. 2017. Deep biaffine attention for neural dependency parsing. 5th International Conference on Learning Representations .
  • Gage (1994) Philip Gage. 1994. A new algorithm for data compression. The C Users Journal 12(2):23–38.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. pages 315–323.
  • Gu et al. (2016) Jinghang Gu, Longhua Qian, and Guodong Zhou. 2016. Chemical-induced disease relation extraction with various linguistic features. Database 2016.
  • Gu et al. (2017) Jinghang Gu, Fuqing Sun, Longhua Qian, and Guodong Zhou. 2017. Chemical-induced disease relation extraction via convolutional neural network. Database 2017.
  • Gurulingappa et al. (2012) Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of biomedical informatics 45(5):885–892.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 770–778.
  • Hendrickx et al. (2009) Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics, pages 94–99.
  • Kim et al. (2008) Jin-Dong Kim, Tomoko Ohta, and Jun’ichi Tsujii. 2008. Corpus annotation for mining biomedical events from literature. BMC bioinformatics 9(1):10.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations (ICLR). San Diego, California, USA.
  • Kiperwasser and Goldberg (2016) Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional lstm feature representations. Transactions of the Association for Computational Linguistics 4:313–327.
  • Krallinger et al. (2015) Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Donghong Ji, Daniel M Lowe, et al. 2015. The chemdner corpus of chemicals and drugs and its annotation principles. Journal of cheminformatics 7(S1):S2.
  • Li et al. (2017) Fei Li, Meishan Zhang, Guohong Fu, and Donghong Ji. 2017. A neural joint model for entity and relation extraction from biomedical text. BMC bioinformatics 18(1):198.
  • Li et al. (2016a) Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016a. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database 2016.
  • Li et al. (2016b) Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel. 2016b. Commonsense knowledge base completion. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1445–1455.
  • Lin et al. (2016) Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 2124–2133.
  • Mallory et al. (2015) Emily K Mallory, Ce Zhang, Christopher Ré, and Russ B Altman. 2015. Large-scale extraction of gene interactions from full-text literature using deepdive. Bioinformatics 32(1):106–113.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
  • Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, pages 1003–1011.
  • Miwa and Bansal (2016) Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1105–1116.
  • Neelakantan et al. (2015) Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. 2015. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807 .
  • Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th international conference on machine learning (ICML-11). Bellevue, Washington, USA, pages 809–816.
  • Peng et al. (2017) Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics 5:101–115.
  • Peng et al. (2016) Yifan Peng, Chih-Hsuan Wei, and Zhiyong Lu. 2016. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of cheminformatics 8(1):53.
  • Poon et al. (2014) Hoifung Poon, Kristina Toutanova, and Chris Quirk. 2014. Distant supervision for cancer pathway extraction from text. In Pacific Symposium on Biocomputing Co-Chairs. pages 120–131.
  • Pyysalo et al. (2007) Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and Tapio Salakoski. 2007. Bioinfer: a corpus for information extraction in the biomedical domain. BMC bioinformatics 8(1):50.
  • Quirk and Poon (2017) Chris Quirk and Hoifung Poon. 2017. Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, Valencia, Spain, pages 1171–1182.
  • Riedel et al. (2011) Sebastian Riedel, David McClosky, Mihai Surdeanu, Andrew McCallum, and Christopher D. Manning. 2011. Model combination for event extraction in bionlp 2011. In Proceedings of BioNLP Shared Task 2011 Workshop. Association for Computational Linguistics, Portland, Oregon, USA, pages 51–55.
  • Riedel et al. (2010) Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. Machine learning and knowledge discovery in databases pages 148–163.
  • Segura-Bedmar et al. (2013) Isabel Segura-Bedmar, Paloma Martínez, and María Herrero Zazo. 2013. Semeval-2013 task 9 : Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, Atlanta, Georgia, USA, pages 341–350.
  • Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 .
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research 15(1):1929–1958.
  • Surdeanu et al. (2012) Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Jeju Island, Korea, pages 455–465.
  • Swampillai and Stevenson (2011) Kumutha Swampillai and Mark Stevenson. 2011. Extracting relations within and across sentences. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011. RANLP 2011 Organising Committee, Hissar, Bulgaria, pages 25–32.
  • Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning. pages 2071–2080.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 .
  • Verga et al. (2016) Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2016. Multilingual relation extraction using compositional universal schema. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 886–896.
  • Verga and McCallum (2016) Patrick Verga and Andrew McCallum. 2016. Row-less universal schema. In Proceedings of the 5th Workshop on Automated Knowledge Base Construction. Association for Computational Linguistics, San Diego, CA, pages 63–68.
  • Wei et al. (2013) Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. Pubtator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research 41.
  • Wei et al. (2016) Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C Wiegers, and Zhiyong Lu. 2016. Assessing the state of the art in biomedical relation extraction: overview of the biocreative v chemical-disease relation (cdr) task. Database 2016.
  • Yaghoobzadeh et al. (2017) Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schütze. 2017. Noise mitigation for neural entity typing and relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, Valencia, Spain, pages 1183–1194.
  • Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In 3rd International Conference for Learning Representations (ICLR). San Diego, California, USA.
  • Zeng et al. (2015) Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 1753–1762.
  • Zhou et al. (2016a) Huiwei Zhou, Huijie Deng, Long Chen, Yunlong Yang, Chen Jia, and Degen Huang. 2016a. Exploiting syntactic and semantics information for chemical–disease relation extraction. Database 2016.
  • Zhou et al. (2016b) Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016b. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, pages 207–212.