Making complex decisions in domains like biomedicine and clinical treatments requires access to information and facts in a form that can be easily viewed by experts and is computable by reasoning algorithms. The predominant paradigm for storing this type of data is the knowledge graph. Many of these facts are populated through hand curation by human experts, inevitably leading to high levels of incompleteness [5, 6]. To address this, researchers have focused on automatically constructing knowledge bases by directly extracting information from text.
This procedure can be broken down into three major components: identifying mentions of entities in text [36, 25, 40], linking mentions of the same entity together into a single canonical concept [11, 18, 35], and identifying relationships occurring between those entities [8, 45, 44].
These three stages are nearly always treated as separate serial components in an extraction pipeline, and current state-of-the-art approaches train separate machine learning models for each component, each with its own distinct training data. More precisely, this data consists of mention-level supervision, that is, individual instances of entities and relations that are identified and demarcated in text. This type of data can be prohibitively expensive to acquire, particularly in domains like biomedicine where expert knowledge is required to understand and annotate relevant information.
In contrast, forms of distant supervision are readily available as database entries in existing knowledge bases. This type of information encodes global properties about entities and their relationships without identifying specific textual instances of those facts. This form of distant supervision has been successfully applied to relation extraction models [31, 41, 37]. However, all of these methods still consume entity linking decisions as a preprocessing step, and unfortunately, accurate entity linkers and the mention-level supervision required to train them do not exist in many domains.
In this work, we instead develop a method to simultaneously link entities in the text and extract their relationships (see Fig. 1). Our proposed method, called SNERL (Simultaneous Neural Entity-Relation Linker), can be trained by leveraging readily available resources from existing knowledge bases and does not utilize any mention-level supervision. In experiments performed on two different biomedical datasets, we show that our model is able to substantially outperform a state-of-the-art pipeline of entity linking and relation extraction by jointly training and testing the two tasks together.
In this section, we describe the proposed model, Simultaneous Neural Entity-Relation Linker (SNERL), and how it is trained. The input to the model is the full title and abstract of an article, and the output is the predicted graph of entities and relations represented in the text (see Fig. 1). This is done by first encoding the text using self-attention to obtain a contextualized representation of each entity mention in the input. These contextualized representations are then used to predict both the distribution over entities at the mention level and the distribution over relations at the mention-pair level. These predicted probabilities are then combined for each mention pair and pooled at the document level to get a final probability for predicting each tuple for the text (see Fig. 2).
Let $[n]$ denote the set of natural numbers $\{1, \dots, n\}$. Each document consists of a set of words $x_i$ indexed by $i \in [N]$, where $x_i \in [V]$ and $V$ is the vocabulary size. Entity mentions in the document are found using a named entity recognition (NER) system. Let $m_k \in [N]$ for $k \in [M]$ be the set of mention start indices for the document, where $M$ is the number of mentions in the document. For each mention string we generate up to C candidate entities (see Candidate Generation for details). Let $\mathcal{E}$ be the set of all entities. Each document is annotated with the graph of entities and relations, given as a set of tuples $(e_1, r, e_2)$, where $e_1, e_2 \in \mathcal{E}$ and $r \in [R]$. This is obtained from a knowledge base under the strong distant supervision assumption [31] (see Experiments section for details). Let $\mathcal{E}_d$ be the set of entities in the annotations for the document $d$. $[a; b]$ denotes concatenation of vectors $a$ and $b$.
The initial input to our model is the full title and abstract of a biomedical article from PubMed (https://www.ncbi.nlm.nih.gov/pubmed/). The sequence is tokenized and each token is mapped to a $d$-dimensional word embedding. The sequence of word embeddings is the input to our text encoder. The text encoder is based on the Transformer architecture of Vaswani et al. (2017). The Transformer applies multiple blocks of multi-head self-attention followed by width-1 convolutions. We follow Verga et al. (2018) and add additional width-5 convolutions between blocks. The reader is referred to the Supplementary for the specific details. After multiple blocks of transformations, the text encoder generates position- and context-informed hidden representations for each word in the document. The output of the text encoder is a $d$-dimensional contextualized embedding $h_i$ for each token $x_i$:

$$h_1, \dots, h_N = \mathrm{encoder}(x_1, \dots, x_N)$$
From an efficiency perspective, we only encode the document once and use the contextualized token representations to predict both the entities and the relations.
From the contextualized token representations, we obtain a document representation by pooling over the set of token vectors with both an element-wise mean and an element-wise max. Now, for each mention, we generate candidate entities. Such a candidate generation step is often used in entity-linking models and, in many domains, such as for Wikipedia entities, high-quality candidates can be generated by using prior linking counts of mention surface forms to entities obtained from Wikipedia anchor texts [16, 35]. However, such high-quality candidate generation is not available in the biomedical domain, and so we resort to an approximate string matching approach for generating candidate entities.
Candidate Generation: Each mention was first normalized by removing all punctuation, lower-casing, and then stemming. Next, these strings were converted to TF-IDF vectors consisting of both word and character n-grams. We considered character n-grams of lengths two to five, and for words we considered unigrams and bigrams. The same procedure was also applied to convert all canonical string names and synonyms for entities in our knowledge base. Finally, candidates for each mention were generated by ranking all entities in the knowledge base by their cosine similarity to the mention.
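The candidate generation procedure described here can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: stemming is omitted, the IDF weighting is a common smoothed variant, and the helper names (`normalize`, `ngram_features`, `CandidateIndex`) and the toy entity identifiers are hypothetical.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lower-case and strip punctuation (stemming is omitted in this sketch)."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def ngram_features(text):
    """Count word 1-2 grams and character 2-5 grams of a normalized string."""
    words = normalize(text)
    feats = Counter()
    for n in (1, 2):
        for i in range(len(words) - n + 1):
            feats["w:" + " ".join(words[i:i + n])] += 1
    chars = " ".join(words)
    for n in range(2, 6):
        for i in range(len(chars) - n + 1):
            feats["c:" + chars[i:i + n]] += 1
    return feats


class CandidateIndex:
    """TF-IDF index over entity names; ranks entities by cosine similarity."""

    def __init__(self, names_to_entity):
        # names_to_entity: canonical name or synonym string -> entity id
        entries = [(ngram_features(n), e) for n, e in names_to_entity.items()]
        df = Counter()
        for feats, _ in entries:
            df.update(feats.keys())
        total = len(entries)
        self.idf = {f: math.log((1 + total) / (1 + c)) + 1.0 for f, c in df.items()}
        self.vectors = [(self._weight(f), e) for f, e in entries]

    def _weight(self, feats):
        """Return an L2-normalized TF-IDF vector (features unseen in the index are dropped)."""
        vec = {f: tf * self.idf[f] for f, tf in feats.items() if f in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {f: v / norm for f, v in vec.items()}

    def candidates(self, mention, k=25):
        """Return up to k (entity, similarity) pairs, best first."""
        query = self._weight(ngram_features(mention))
        best = {}
        for vec, entity in self.vectors:
            sim = sum(w * vec.get(f, 0.0) for f, w in query.items())
            best[entity] = max(best.get(entity, 0.0), sim)  # best synonym wins
        return sorted(best.items(), key=lambda kv: -kv[1])[:k]


index = CandidateIndex({
    "aspirin": "CHEM:1",
    "acetylsalicylic acid": "CHEM:1",
    "acetaminophen": "CHEM:2",
    "hepatitis": "DIS:1",
})
top = index.candidates("Aspirin.", k=2)
```

Scoring each entity by its best-matching name or synonym is one reasonable way to collapse several strings into a single candidate; other aggregation choices are possible.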
For each candidate entity $e$ with type $t$, we generate a $d$-dimensional entity embedding by adding an entity-specific embedding and a $d$-dimensional entity type embedding. The entity-specific embedding can be learned, or it can be a pre-trained embedding obtained from another source such as entity descriptions [16, 49] or a graph embedding method. Now, for the $k$-th mention in the document, with starting index $m_k$, we take the contextualized token representation at $m_k$ as the mention representation and define a score for predicting the candidate entity $e$ for this mention using the candidate representation, the document representation, and the mention representation. This score is passed through a softmax function, normalizing over the set of candidates for the mention, to get a probability for linking the mention to entity $e$.
We thus obtain an $M \times C$ matrix of linking probabilities for the document, where M is the maximum number of entity mentions in the document and C is the maximum number of candidates per mention. Note that there is no direct mention-level supervision available to train these probabilities.
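As a rough illustration of how such a matrix of linking probabilities can be produced, the sketch below applies a softmax over the candidates of each mention. The score used here (a dot product between a candidate embedding and the sum of the mention and document vectors) is a hypothetical stand-in for the learned scoring function, and all vectors are toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linking_probabilities(mention_vecs, doc_vec, candidate_vecs):
    """Return one probability row per mention, normalized over its candidates.

    candidate_vecs[k] holds the embeddings of the candidates of mention k.
    The dot-product score is an assumption made for this sketch only.
    """
    rows = []
    for mention, cands in zip(mention_vecs, candidate_vecs):
        context = [m + d for m, d in zip(mention, doc_vec)]
        rows.append(softmax([dot(c, context) for c in cands]))
    return rows


probs = linking_probabilities(
    mention_vecs=[[1.0, 0.0], [0.0, 1.0]],
    doc_vec=[0.1, 0.1],
    candidate_vecs=[
        [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],  # candidates of mention 0
        [[0.0, 2.0], [1.0, 0.0]],               # candidates of mention 1
    ],
)
```

Each row sums to one, so a mention always distributes its full probability mass over its own candidate set.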
Given the contextualized mention representation, we obtain a head and a tail representation for each mention, so that it can serve as the head or tail entity of a relation tuple. This is done by using two MLPs to project each mention representation.
The head and tail representations are then passed through an MLP to predict a score for every relation for a pair of mentions $m_i$ and $m_j$. We pass this score vector through a sigmoid function to get the probability of predicting each relation from the mention pair.
We thus obtain an $M \times M \times R$ tensor of probabilities for predicting all relations from all pairs of entity mentions, where R is the maximum number of relations.
Combining entity and relation predictions
To predict the graph of entities and relations from the document, we need to assign a probability to every possible relation tuple. We first obtain the probability of predicting a tuple from a mention pair by combining the probability for predicting the candidates for each of the mentions (1) and the relation prediction probability (2). If an entity is not a candidate for a mention, then its entity prediction probability is zero for that mention.
Then, the probability of extracting the tuple from the entire document can be obtained by pooling over all mention pairs. For example, we can use max-pooling, which corresponds to the inductive bias that, in order to extract a tuple, we must find at least one mention pair for the corresponding entities in the document that is evidence for the tuple.
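The combination and max-pooling steps can be sketched as follows. This is an illustration only: it assumes the mention-level probabilities are combined by a simple product, and the function name `tuple_probability` and the toy entities and relation are hypothetical.

```python
def tuple_probability(link_probs, rel_probs, head, rel, tail):
    """Max-pool the mention-pair probability of (head, rel, tail) over a document.

    link_probs[i]: dict entity -> P(entity | mention i); an entity that is not a
                   candidate for mention i is treated as having probability 0.
    rel_probs[(i, j)]: dict relation -> P(relation | mention i, mention j).
    """
    best = 0.0
    for (i, j), rels in rel_probs.items():
        p = (link_probs[i].get(head, 0.0)
             * link_probs[j].get(tail, 0.0)
             * rels.get(rel, 0.0))
        best = max(best, p)  # hard max over mention pairs
    return best


link_probs = [
    {"aspirin": 0.9, "acetaminophen": 0.1},   # mention 0
    {"ulcer": 0.8, "ulcerative colitis": 0.2},  # mention 1
    {"aspirin": 0.6, "aspartame": 0.4},        # mention 2
]
rel_probs = {(0, 1): {"causes": 0.5}, (2, 1): {"causes": 0.9}}

p_doc = tuple_probability(link_probs, rel_probs, "aspirin", "causes", "ulcer")
p_none = tuple_probability(link_probs, rel_probs, "ibuprofen", "causes", "ulcer")
```

Here the second mention pair provides the strongest evidence for the tuple, while an entity that is a candidate for no mention can never be extracted.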
Soft maximum pooling:
It has been observed previously [43, 12] that the hard max operation is not ideal for pooling evidence, as it leads to very sparse gradients. Recent methods thus use the logsumexp function for pooling over logits, which allows for denser gradient updates. However, we cannot use the logsumexp function in our case to pool over the probabilities (3), as the result of logsumexp over independent probabilities is not guaranteed to be a probability (in $[0, 1]$). Thus, we use a different operator that is considered a smooth relaxation of the maximum. Given a set of elements $\{x_1, \dots, x_n\}$, the smooth-maximum (smax) with temperature $\tau$ is defined as:

$$\mathrm{smax}_\tau(x_1, \dots, x_n) = \frac{\sum_{i=1}^{n} x_i\, e^{\tau x_i}}{\sum_{i=1}^{n} e^{\tau x_i}}$$

Note that for $\tau \to \infty$ the result of smax tends to the maximum of the set, and for $\tau = 0$ the result is the average of the set. Thus, smax can smoothly interpolate between these extremes. We use this smax pooling over probabilities in (4) with a learned temperature $\tau$.
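The behaviour of smooth-maximum pooling can be checked numerically. The sketch below assumes the common softmax-weighted-average form of the smooth maximum, with weights proportional to `exp(tau * x)`; the function names and the toy probabilities are illustrative, and the temperature is fixed here rather than learned.

```python
import math


def smax(values, tau):
    """Smooth maximum: average of the values weighted by exp(tau * value)."""
    top = max(values)
    # Shifting by the maximum keeps the exponentials bounded for tau >= 0.
    weights = [math.exp(tau * (v - top)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def logsumexp(values):
    """Log of the sum of exponentials, computed stably."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


probs = [0.9, 0.8, 0.1]
mean_like = smax(probs, tau=0.0)    # equals the arithmetic mean
max_like = smax(probs, tau=100.0)   # approaches the maximum
lse = logsumexp(probs)              # exceeds 1, so it is not a probability
```

Because smax is a convex combination of its inputs, pooling a set of probabilities always yields a value between their minimum and maximum, which is the property logsumexp lacks.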
We are given ground-truth annotations for the set of tuples in the document. We train using the cross-entropy loss on the predicted tuple probabilities (4). Since we only have a subset of positive annotations, there is uncertainty in the set of negatives, and we deal with this by up-weighting the positive annotations by a weight $w$ in the cross-entropy loss. Let $y_t = 1$ if the document is annotated with the relation tuple $t$ and 0 otherwise, and let $p_t$ be its predicted probability in (4); we then maximize:

$$\sum_{t} \big[\, w\, y_t \log p_t + (1 - y_t) \log (1 - p_t) \,\big]$$
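A positive-weighted cross-entropy objective of this kind can be written in a few lines. This is a sketch under assumptions: the function name, the clamping constant `eps`, and the toy tuples are hypothetical, and the value of the positive weight is arbitrary.

```python
import math


def weighted_log_likelihood(predicted, positives, pos_weight, eps=1e-12):
    """Weighted cross-entropy objective (to be maximized) over candidate tuples.

    predicted: dict tuple -> predicted probability
    positives: set of annotated tuples; every other tuple is treated as a negative
    pos_weight: multiplier applied to the positive terms
    """
    total = 0.0
    for tup, p in predicted.items():
        p = min(max(p, eps), 1.0 - eps)  # clamp so the logarithms stay finite
        if tup in positives:
            total += pos_weight * math.log(p)
        else:
            total += math.log(1.0 - p)
    return total


gold = {("aspirin", "causes", "ulcer")}
uncertain = {("aspirin", "causes", "ulcer"): 0.5, ("aspirin", "treats", "ulcer"): 0.5}
confident = {("aspirin", "causes", "ulcer"): 0.9, ("aspirin", "treats", "ulcer"): 0.1}

obj_uncertain = weighted_log_likelihood(uncertain, gold, pos_weight=2.0)
obj_confident = weighted_log_likelihood(confident, gold, pos_weight=2.0)
```

With a positive weight above one, missing an annotated tuple costs more than wrongly predicting an unannotated one, which reflects the uncertainty in the negatives.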
In addition, since we can obtain document-level entity annotations from the set of annotated relation tuples, we can provide additional document-level entity supervision to better train our entity linking probabilities. To do this, we perform max-pooling over all mentions for each candidate entity for the document in (1), to obtain a document-level entity prediction score. We compute a weighted cross-entropy for these document-level predictions, again up-weighting the positive entities. In summary, we combine the graph prediction and document-level entity prediction objectives, similar to multi-task learning [9]: writing $\mathcal{L}_{\text{tuple}}$ for the tuple-level objective above and $\mathcal{L}_{\text{entity}}$ for the analogous weighted cross-entropy over the set of entities in the annotation, we maximize:

$$\mathcal{L}_{\text{tuple}} + \lambda\, \mathcal{L}_{\text{entity}}$$

Note that since we only have some positive annotations, there could be many mentions in the document for which the correct entity is not annotated. Thus, we down-weight the document-entity prediction term by $\lambda$ in the objective.
Technical Details: Since the set of all entities can be very large, in order to improve training efficiency we subsample the unannotated entities used as negative entities to a fixed maximum number per document. Pooling over the joint mention-level probability (4) requires a large intermediate tensor whose size grows with L, the total number of candidate entities for the document. Since this can be computationally prohibitive, we compute the top-$k$ mentions per candidate entity based on the predicted probabilities (1) and only backpropagate the gradients through the top-$k$. We consider $k$ as a hyperparameter and tune it on the validation set.
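The selection of the highest-scoring mentions per candidate entity can be illustrated with plain Python. This sketch only shows the selection step; in a real training setup the gradient would be restricted to the selected mentions by the autodiff framework. The function name and the toy probabilities are hypothetical.

```python
def top_k_mentions(link_probs, k):
    """For each candidate entity, keep the indices of its k best-scoring mentions.

    link_probs[i]: dict entity -> P(entity | mention i)
    Returns a dict entity -> list of mention indices, best first.
    """
    per_entity = {}
    for idx, row in enumerate(link_probs):
        for entity, p in row.items():
            per_entity.setdefault(entity, []).append((p, idx))
    return {
        entity: [idx for _, idx in sorted(scored, key=lambda t: -t[0])[:k]]
        for entity, scored in per_entity.items()
    }


link_probs = [
    {"a": 0.9, "b": 0.1},  # mention 0
    {"a": 0.2, "b": 0.8},  # mention 1
    {"a": 0.6},            # mention 2
]
selected = top_k_mentions(link_probs, k=2)
```

Pooling then only needs to consider pairs of selected mentions for a given pair of entities, which bounds the size of the intermediate computation by the chosen k.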
Our experimental setting is that, for each test document (title and abstract), the model should predict the full graph of entity-relationships expressed in that document (a single example is depicted in Fig. 1). Thus, we evaluate on micro-averaged precision, recall and F1 for predicting the entire set of annotated relation tuples across documents. Our results show significant improvement in F1 over a state-of-the-art pipelined approach. Hyperparameters are in the Supplementary.
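Micro-averaged precision, recall and F1 over predicted tuple sets can be computed as in the following sketch. The function name and the toy documents are illustrative; counts are accumulated across all documents before the ratios are taken, which is what makes the average "micro".

```python
def micro_prf(predicted_by_doc, gold_by_doc):
    """Micro-averaged precision, recall and F1 over per-document tuple sets."""
    tp = fp = fn = 0
    for doc_id in set(predicted_by_doc) | set(gold_by_doc):
        pred = predicted_by_doc.get(doc_id, set())
        gold = gold_by_doc.get(doc_id, set())
        tp += len(pred & gold)   # tuples predicted and annotated
        fp += len(pred - gold)   # tuples predicted but not annotated
        fn += len(gold - pred)   # annotated tuples that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


predicted = {
    "doc1": {("aspirin", "causes", "ulcer"), ("aspirin", "treats", "ulcer")},
    "doc2": {("ibuprofen", "treats", "pain")},
}
gold = {
    "doc1": {("aspirin", "causes", "ulcer")},
    "doc2": {("ibuprofen", "treats", "pain"), ("ibuprofen", "causes", "nausea")},
}
precision, recall, f1 = micro_prf(predicted, gold)
```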
All of our models use the same neural architecture described earlier, consume the same predicted entity mentions from an external NER model, and differ in how they produce entity linking decisions. The first two baselines take hard entity linking decisions as inputs and do not do any internal entity linking inference. Both these baselines are equivalent to the BRAN model from Verga et al. (2018), with two different ways of obtaining entity links for that model. This is a state-of-the-art pipelined approach to entity-relation extraction.
We used an MLP as the relation scoring function for BRAN (similar to the SNERL model) as it performed better in experiments compared to the biaffine function used in the original paper.
BRAN (Top Candidate) produces entity linking decisions based on the highest scoring candidate entity (as described in the ‘Candidate Generation’ section).
BRAN (Linker) produces entity linking decisions from a trained state-of-the-art entity linker. We followed BRAN and obtained entity links from Wei et al. (2013).
SNERL is our proposed model that does not take in any hard entity linking decisions as input and instead jointly predicts the full set of entities and relations within the text. For this model we considered 25 candidates per mention.
Our first set of experiments is on the CTD dataset first introduced in Verga et al. (2018). The data is derived from annotations in the Comparative Toxicogenomics Database [13], a curated knowledge base containing relationships between chemicals, diseases, and genes. Each fact additionally contains a reference to the document (a scientific publication) where the annotator identified the relationship. Thus, these are used to obtain annotator-identified entity-relationships in a given scientific publication. This type of document annotation is fairly common in biomedical knowledge bases, further motivating this work. This allows us to treat these annotations as a form of strong distant supervision [31]. Here annotations are at the document level rather than the mention level (as in typical supervised learning) or corpus level (as in standard distant supervision).
|Entity linking method|Oracle recall|
|---|---|
|Top 1 Candidates|67.0%|
|Top 25 Candidates|80.0%|
An aspect of the document-level supervision is that the original facts were annotated over complete documents. However, due to paywalls we often only have access to titles and abstracts of papers. Therefore, there is no guarantee that the relationship is actually expressed in the title or abstract. Verga et al. (2018) thus filtered the CTD dataset to only consider those entity-relation tuples where both entities are found in the text, for some mention, by the external entity linker. This ensures that all filtered tuples can be predicted by the model. However, this removes many correct entity-relationships that were indeed present but were filtered because those entities cannot be predicted by the entity linking model. We remedy this and create a more challenging train/development/test split from the entire CTD annotations, where we keep all entity-relationships in which the participating entities are a candidate for some mention in the document. That is, for each annotated tuple between entities $e_1$ and $e_2$ in document $d$, we consider that tuple if both $e_1$ and $e_2$ are candidates for some mention in $d$. Note that we used the candidate-generation approach described previously for generating the candidates for the mentions (we consider up to 250 candidate entities per mention for the data filtering). Dataset statistics are in the Supplementary. We consider this as the Full CTD Dataset as it does not give an advantage to any particular entity linking model, but for completeness we also evaluate on the subset filtered according to the original paper (we refer to this as BRAN-filtered).
To illustrate how the cascading errors of a pipelined approach of first predicting entity links and then predicting relations can degrade performance, we computed an oracle recall for the tuple prediction task on the development set of CTD. For this, we assume perfect accuracy on relation prediction, so the recall on tuple extraction is limited only by the entity linking accuracy. We consider three methods for entity linking: predicting the top candidate (based on the string similarity score from candidate generation), an oracle which can select the correct entity (if present) from the top 25 candidates, and the trained entity linking model used by BRAN. Table 1 shows the results. We can see that errors from the entity linking step significantly restrict the model's performance in pipelined approaches. On the other hand, if the model can infer the entity links (from the top 25 candidates) jointly with the relations, it ameliorates this problem of cascading errors, potentially leading to much higher recall.
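The oracle recall described here can be sketched as follows, assuming that a gold tuple counts as recoverable whenever both of its entities are among the entities the linking method makes available for that document. The function name and the toy data are hypothetical.

```python
def oracle_recall(gold_by_doc, linked_by_doc):
    """Recall of tuple extraction under perfect relation prediction.

    gold_by_doc: dict doc id -> set of (head, relation, tail) gold tuples
    linked_by_doc: dict doc id -> set of entities available from entity linking
    """
    reachable = total = 0
    for doc_id, tuples in gold_by_doc.items():
        linked = linked_by_doc.get(doc_id, set())
        for head, _rel, tail in tuples:
            total += 1
            if head in linked and tail in linked:
                reachable += 1
    return reachable / total if total else 0.0


gold = {"doc1": {("aspirin", "causes", "ulcer"), ("aspirin", "causes", "nausea")}}
top1_links = {"doc1": {"aspirin", "ulcer"}}             # one entity per mention
top25_links = {"doc1": {"aspirin", "ulcer", "nausea"}}  # any of the candidates

recall_top1 = oracle_recall(gold, top1_links)
recall_top25 = oracle_recall(gold, top25_links)
```

A larger candidate set can only add reachable tuples, so the oracle recall is non-decreasing in the number of candidates considered.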
|Model|Precision|Recall|F1|
|---|---|---|---|
|BRAN (Top Candidate)|30.5|29.5|30.0|
Results on Full CTD data:
In Table 2, we can see that the SNERL model, which jointly considers both entities and relations, drastically outperforms the models that take hard linking decisions from an external model. This is primarily due to a huge drop in recall caused by cascading errors.
Results on BRAN Filtered CTD data:
We also report results using the original filtering approach of Verga et al. (2018). Importantly, this approach gives a substantial advantage to the BRAN (Linker) baseline, as the data is filtered to only consider the relationships for which it could potentially make a prediction. In Table 3, we can see that in spite of this disadvantage, the SNERL model is able to perform comparably to the BRAN (Linker) baseline.
|Model|Precision|Recall|F1|
|---|---|---|---|
|BRAN (Top Candidate)|43.0|49.0|45.8|
CDR Entity Linking Performance
In order to evaluate how much of the success of the SNERL model can be attributed to the entity linking component (1), we evaluated its performance on the BioCreative V Chemical Disease Relation dataset (CDR) introduced in Wei et al. (2015). Similar to the CTD dataset, CDR was also originally derived from the Comparative Toxicogenomics Database. Expert annotators chose 1,500 of those documents and exhaustively annotated all mentions of chemicals and diseases in the text. Additionally, each mention was assigned its appropriate entity linking decision. We use this dataset as a gold standard to validate our entity linking models. Note that we do not use this data for training, but only for evaluation.
We use the model that was trained on the CTD data and make it predict entities for every mention on the test set of CDR. We follow the standard practice of using the gold mention boundaries for evaluation only, so as not to confound the entity linking performance with mention-detection performance. In Table 4, we see that SNERL does learn to link entities better than the top candidate. As is common when evaluating on this data, we consider document-level rather than mention-level entity linking evaluation, that is, how the set of predicted entities compares to the gold set annotated in the document. Note that the SNERL model additionally benefits from jointly predicting entities and relations. A breakdown of the results into Chemical and Disease prediction performance can be found in the Supplementary.
To further probe the performance of our model we created a dataset of disease / phenotype (aka symptom) relations. The goal here is to identify specific symptoms caused by a disease. This type of information is particularly important in clinical treatments as it can lead to earlier diagnosis of rare diseases, faster application of appropriate interventions, and better overall outcomes for patients. This task also serves to further motivate our methods as accurate entity linking models for phenotypes are not readily available, nor is sufficient mention-level training data to build a supervised classifier.
Relation Annotations: We created this dataset with a similar technique to the construction of the CTD dataset. We started from the relations in the Human Phenotype Ontology [22] that were annotated with a document containing that relationship.
Mention Detection: For disease mention detection we followed the same procedure as for the CTD dataset and used the annotated mentions from Wei et al. (2013). Because there is not a readily available phenotype tagger, we trained our own model to identify mentions of phenotypes in text. We trained an iterated dilated convolution model (Strubell et al., 2017; https://github.com/iesl/dilated-cnn-ner). Our training data came from Groza et al. (2015), which we split into train, dev, and test sets (see Supplementary). Our NER model achieved a micro F1 score of 72.57.
We observed that disease and phenotype entity spans are often overlapping and nested. We thus over-generate the set of mentions by taking the predictions from both taggers and adding them to the set of all mentions for the document, since our model is able to pool over all these mentions even if they overlap.
Entity Linking: We followed a procedure similar to the one described in the section Predicting entities to generate phenotype entity linking candidates. Using the small set of gold entity-linked text mentions from Groza et al. (2015), we were able to estimate our candidates' entity linking accuracy. Our top candidate achieved an accuracy of 46.8%, while the recall for 100 candidates was 76.5%. This demonstrates the additional difficulty of the disease-phenotype dataset, as these candidate accuracies are much lower than the results for the CTD data. See Supplementary Figure 1 for recall of the candidate set at different values of K.
The annotations of Köhler et al. (2018) make use of several disease vocabularies from the OMIM [19], Orphanet and DECIPHER [7] databases. For generating disease candidates, we use disease name strings from all of these. The external entity linker that we used from Wei et al. (2013) links diseases to the MeSH disease vocabulary. To align these with our disease-phenotype relation annotations, we use the MEDIC database [14] for mapping OMIM disease terms into the MeSH vocabulary.
The final dataset annotations were selected by filtering based on entities that can be found in the document when considering up to 250 candidates per mention. See Supplementary for dataset statistics.
Pre-training Entity Embedding
Since the dataset has many unseen entities at test time, we need a method to address these unseen entities, as generating the linking probabilities in (1) requires an entity embedding. For this, we obtained entity descriptions for the phenotypes and encoded them using pre-trained sentence embeddings from BioSentVec [10]. However, not all test entities have descriptions. So, in addition to the descriptions, we trained a graph embedding model, DistMult, on the graph obtained from the set of all annotations in the Human Phenotype Ontology, excluding the dev/test annotations. We project both these pre-trained embeddings using a learned linear transformation and sum the projected description and graph embeddings to obtain the entity-specific embedding.
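The construction of the entity-specific embedding can be sketched as two linear projections followed by a sum. This is an illustration under assumptions: the projection matrices are toy values rather than learned parameters, an entity without a description is assumed to contribute only its graph embedding, and the function names are hypothetical. The `distmult_score` helper shows the standard DistMult triple score (an element-wise product summed over dimensions), which is the quantity such a graph embedding model is trained on.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def entity_embedding(desc_vec, graph_vec, w_desc, w_graph):
    """Sum of the linearly projected description and graph embeddings.

    desc_vec may be None for entities without a description; in this sketch
    such entities fall back to the projected graph embedding alone.
    """
    projected_graph = matvec(w_graph, graph_vec)
    if desc_vec is None:
        return projected_graph
    projected_desc = matvec(w_desc, desc_vec)
    return [d + g for d, g in zip(projected_desc, projected_graph)]


def distmult_score(head, relation, tail):
    """DistMult score of a triple: sum over dimensions of head * relation * tail."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


w_desc = [[1.0, 0.0], [0.0, 1.0]]  # projects a 2-d description embedding
w_graph = [[2.0], [0.0]]           # projects a 1-d graph embedding

with_description = entity_embedding([1.0, 2.0], [3.0], w_desc, w_graph)
without_description = entity_embedding(None, [3.0], w_desc, w_graph)
score = distmult_score([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
```

Projecting both sources into the same space lets the two pre-trained embeddings have different dimensionalities while still being summed.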
We use the same baselines as before. For BRAN (Linker), the disease entity links come from Wei et al. (2013), and since we do not have access to an accurate pretrained phenotype entity linking model, this model also uses the top phenotype candidate as a hard phenotype entity linking decision.
Our disease-phenotype results show a similar trend to those from the CTD experiments. Overall, the BRAN (Top Candidate) model performs the worst and the SNERL model outperforms both models that use hard entity linking decisions.
Overall, our results indicate that this particular task is extremely challenging. This is likely due to a combination of several difficulties. The first is that the candidate set itself is not as accurate as the one from the CTD experiment, which we can see by comparing the top candidate accuracy of 46.8% with the Top Candidate results in Table 4. Since we rely on the candidate set to filter the annotations for the documents, we might end up with a significant number of annotations that are not present in the title and abstract. Secondly, the amount of training data is significantly less (see Supplementary) than in the CTD experiments, requiring research into unsupervised approaches for this data. Lastly, dealing with out-of-vocabulary entities at test time required additional pre-training, and our analysis indicated that these pre-trained embeddings are not highly predictive for mention-level disambiguation due to the sparsity of the graph training data. Looking into more sophisticated embedding methods [49, 18, 4] and methods for dealing with unseen entities would be an important problem for future work.
|Model|Precision|Recall|F1|
|---|---|---|---|
|BRAN (Top Candidate)|8.9|5.3|6.6|
Extracting entities and relations from text has been widely studied over the past few decades. In the biomedical domain specifically, there has been substantial progress on entity mention detection [17, 47], entity linking (often referred to as normalization in the bio NLP community) [26, 27, 29, 28], and relation extraction [48, 23]. There have also been numerous works that have identified both entity mentions and relationships from text, in both the general domain and the biomedical domain [30, 1, 44]. Leaman and Lu (2016) showed that jointly considering named entity recognition (NER) and linking led to improved performance.
A few works have shown that jointly modeling relations and entity linking can improve performance. Le and Titov (2018) improved entity linking performance by modeling latent relations between entities. This is similar to coherence models in entity linking, which consider the joint assignment of all linking decisions, but is more tractable as it focuses only on pairs of entities in a short context. Luan et al. (2018) created a multi-task learning model for predicting entities, relations, and coreference in scientific documents. This model required supervision for all three tasks, and predictions amongst the different tasks were made independently rather than jointly. To the best of our knowledge, SNERL is the first model that simultaneously links entities and predicts relations without requiring expensive mention-level annotation.
In this paper, we presented a novel method, SNERL, to simultaneously predict entity linking and entity relation decisions. SNERL can be trained without any mention-level supervision for entities or relations, and instead relies solely on weak and distant supervision at the document level, which is readily available in many biomedical knowledge bases. The proposed model compares favorably with a state-of-the-art pipeline approach to relation extraction by avoiding cascading errors, while requiring less expensive annotation, opening possibilities for knowledge extraction in low-resource and expensive-to-annotate domains.
We thank Andrew Su and Haw-Shiuan Chang for early discussions on the disease-phenotype task. This work was supported in part by the UMass Amherst Center for Data Science and the Center for Intelligent Information Retrieval, in part by the Chan Zuckerberg Initiative under the project Scientific Knowledge Base Construction, and in part by the National Science Foundation under Grant No. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
- [1] (2017) The AI2 system at SemEval-2017 Task 10 (ScienceIE): semi-supervised end-to-end entity and relation extraction. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 592–596.
- [2] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
- [3] (2015) Content driven user profiling for comment-worthy recommendations of news and blog articles. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 195–202.
- [4] (2019) A2N: attending to neighbors for knowledge graph inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4387–4392.
- [5] (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (suppl_1), pp. D267–D270.
- [6] (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247–1250.
- [7] (2013) DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation. Nucleic Acids Research 42 (D1), pp. D993–D1000.
- [8] (2007) Learning to extract relations from the web using minimal supervision. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 576–583.
- [9] (1993) Multitask learning: a knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pp. 41–48.
- [10] (2018) BioSentVec: creating sentence embeddings for biomedical texts. arXiv preprint arXiv:1810.09302.
- [11] (2007) Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
- [12] (2017) Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1, pp. 132–141.
- [13] (2018) The Comparative Toxicogenomics Database: update 2019. Nucleic Acids Research 47 (D1), pp. D948–D954.
- [14] (2012) MEDIC: a practical disease vocabulary used at the Comparative Toxicogenomics Database. Database 2012.
- [15] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [16] (2017) Deep joint entity disambiguation with local neural attention. arXiv preprint arXiv:1704.04920.
- [17] (2018) Marginal likelihood training of BiLSTM-CRF for biomedical named entity recognition from disjoint label sets. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2824–2829.
- [18] (2017) Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2681–2690.
- [19] (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 33 (suppl_1), pp. D514–D517.
- [20] (2010) Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), Vol. 3, pp. 3–3.
- [21] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [22] (2018) Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Research 47 (D1), pp. D1018–D1027.
- [23] (2017) Overview of the BioCreative VI chemical-protein interaction track. In Proceedings of the Sixth BioCreative Challenge Evaluation Workshop, Vol. 1, pp. 141–146.
- [24] (2004) Integrated annotation for biomedical information extraction. In HLT-NAACL 2004 Workshop: Linking Biological Literature, Ontologies and Databases.
- [25] (2016) Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
- [26] (2008) BANNER: an executable survey of advances in biomedical named entity recognition. In Biocomputing 2008, pp. 652–663.
- [27] (2013) DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 29 (22), pp. 2909–2917.
- [28] (2016) TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics 32 (18), pp. 2839–2846.
- [29] (2015) tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of Cheminformatics 7 (1), pp. S3.
- [30] (2017) A neural joint model for entity and relation extraction from biomedical text. BMC Bioinformatics 18 (1), pp. 198.
- [31] (2009) Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011.
- [32] (2016) End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 1105–1116.
- [33] (2018) Hierarchical losses and new resources for fine-grained entity typing and linking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 97–109.
-  (2017) Clinical practice guidelines for rare diseases: the orphanet database. PloS one 12 (1), pp. e0170365. Cited by: Disease-Phenotype Relations.
DeepType: multilingual entity linking by neural type system evolution.
Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Introduction, Predicting entities.
-  (2009) Design challenges and misconceptions in named entity recognition. In Proceedings of the thirteenth conference on computational natural language learning, pp. 147–155. Cited by: Introduction.
-  (2013) Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 74–84. Cited by: Introduction.
-  (2015) Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27 (2), pp. 443–460. Cited by: Predicting entities.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: Appendix E.
-  (2017) Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2670–2680. Cited by: Introduction.
-  (2012) Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp. 455–465. Cited by: Introduction.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: Appendix B, Appendix E, Methodology.
-  (2017) Generalizing to unseen entities and entity pairs with row-less universal schema. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1, pp. 613–622. Cited by: Soft maximum pooling:.
-  (2018-06) Simultaneously Self-attending to All Mentions for Full-Abstract Biological Relation Extraction. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), New Orleans, Louisiana. Cited by: Appendix B, Introduction, Experiments, Related Work.
-  (2016) Relation classification via multi-level attention cnns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 1298–1307. Cited by: Introduction.
-  (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic acids research 41 (W1), pp. W518–W522. Cited by: Notations:, Baselines.
-  (2015) GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed research international 2015. Cited by: Related Work.
-  (2016) Assessing the state of the art in biomedical relation extraction: overview of the biocreative v chemical-disease relation (cdr) task. Database 2016. Cited by: Related Work.
-  (2016) Representation learning of knowledge graphs with entity descriptions. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: Predicting entities, Results.
-  (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575. Cited by: Predicting entities, Pre-training Entity Embedding.
Appendix A Supplementary Material
Appendix B Transformer Text Encoder
The model takes a sequence of $N$ word embeddings as input, $x_1, \ldots, x_N$. Since the Transformer has no innate notion of position, the model relies on positional embeddings $p_1, \ldots, p_N$, which are added to the input word embeddings. The positional embeddings are also learned parameters of the model. Thus, we get the token representations:
$$h_i^{(0)} = x_i + p_i$$
The Transformer is made up of $B$ blocks. Each Transformer block, denoted $\mathrm{transformer}_k$, has its own set of parameters and consists of two components: multi-head attention followed by a series of convolutions. The output for token $i$ of block $k$, $h_i^{(k)}$, is connected to its input $h_i^{(k-1)}$ with a residual connection:
$$h_i^{(k)} = h_i^{(k-1)} + \mathrm{transformer}_k\big(h_i^{(k-1)}\big)$$
In each block, multi-head attention applies self-attention multiple times over the same inputs using separately normalized parameters (attention heads) and combines the results. Each head in multi-head self-attention updates its input by performing a weighted sum over all tokens in the sequence, weighted by their importance for modeling token $i$. Refer to Vaswani et al. [42] for details of multi-head attention. The outputs of the individual attention heads are concatenated to give the output of multi-head attention at the $i$-th token. This is followed by layer normalization and two width-1 convolutions. Following [44], we add a third layer with kernel width 5 convolutions, which allows explicit n-gram modeling useful for relation extraction. This gives the output at the $i$-th token for the $k$-th transformer block in (6).
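The sequence of representations at each token obtained after the $B$ blocks of processing described above is the final output of the transformer text encoder: $h_1^{(B)}, \ldots, h_N^{(B)}$, where $h_i^{(B)}$ denotes the output of the last block at token $i$.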
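The encoder described in this appendix can be sketched in a few lines of NumPy. This is a minimal illustration and not the released implementation: the function names, the parameter dictionary, the ReLU activations between convolutions, the zero "same" padding, the scaled dot-product scores, and the layer normalization without learned gain and bias are all assumptions made to keep the example short.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize each token vector; learned gain/bias omitted (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(x, wq, wk, wv):
    # x: (N, d); wq, wk, wv: (heads, d, d_head).
    q = np.einsum("nd,hde->hne", x, wq)
    k = np.einsum("nd,hde->hne", x, wk)
    v = np.einsum("nd,hde->hne", x, wv)
    scores = np.einsum("hne,hme->hnm", q, k) / np.sqrt(q.shape[-1])
    heads = np.einsum("hnm,hme->hne", softmax(scores), v)
    # Concatenate the per-head outputs at every token: (N, heads * d_head).
    return np.concatenate(list(heads), axis=-1)


def conv1d(x, kernel):
    # x: (N, d_in); kernel: (width, d_in, d_out); zero "same" padding.
    width = kernel.shape[0]
    pad = width // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([
        np.einsum("wd,wde->e", padded[i:i + width], kernel)
        for i in range(x.shape[0])
    ])


def transformer_block(x, p):
    # Attention, layer norm, two width-1 convolutions, one width-5 convolution.
    a = multi_head_attention(x, p["wq"], p["wk"], p["wv"])
    t = layer_norm(a)
    t = np.maximum(conv1d(t, p["c1a"]), 0.0)  # ReLU is an assumption
    t = np.maximum(conv1d(t, p["c1b"]), 0.0)
    t = conv1d(t, p["c5"])
    return x + t  # residual connection to the block input


def encode(word_emb, pos_emb, blocks):
    h = word_emb + pos_emb  # token representations
    for p in blocks:
        h = transformer_block(h, p)
    return h  # one vector per token after the last block


def init_block(rng, d, heads, scale=0.1):
    # Hypothetical random initialization, for illustration only.
    d_head = d // heads
    return {
        "wq": rng.normal(0, scale, (heads, d, d_head)),
        "wk": rng.normal(0, scale, (heads, d, d_head)),
        "wv": rng.normal(0, scale, (heads, d, d_head)),
        "c1a": rng.normal(0, scale, (1, d, d)),
        "c1b": rng.normal(0, scale, (1, d, d)),
        "c5": rng.normal(0, scale, (5, d, d)),
    }
```

A width-1 convolution acts on each token independently, so it is equivalent to a per-token linear layer; only the width-5 layer mixes information from neighbouring tokens, which is what provides the explicit n-gram modeling.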
Appendix C CTD Dataset
The number of documents in each split of the CTD dataset is given in Table 6. There are 19,933 entities and 14 relation types in this data. The numbers of entity-relationship tuples in the train/dev/test splits are given in Table 7.
Appendix D Entity Linking on CDR
Appendix E Implementation Details
All word embeddings are randomly initialized. Text is tokenized using the Genia tokenizer [24]. We used dropout [39] at the input word embeddings (), on attention weights in the transformer [42] (), after the head and tail projection MLPs (), and after the first layers of the relation () and linking () MLPs. We also apply word dropout to the input, replacing words with a special UNK token (). Note that the dropout values reported here are keep probabilities. We used Adam [21] for optimization with a learning rate of 0.001. We tuned the dropout rates, the weight for the cross-entropy term, the number of transformer blocks, the number of attention heads, the number of negative samples, the number of top mentions, and the weight for the objective.
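Because the dropout values are reported as keep probabilities, a value close to 1 means light regularization. The sketch below illustrates the two kinds of dropout mentioned above under stated assumptions: the function names, the `unk_id` argument, and the inverted-dropout rescaling by `1 / keep_prob` are illustrative choices, not details taken from the paper.

```python
import numpy as np


def dropout(x, keep_prob, rng):
    """Standard dropout parameterized by the probability of KEEPING a unit."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    mask = rng.random(x.shape) < keep_prob
    # Inverted dropout (assumption): rescale kept units so the expected
    # activation is unchanged at training time.
    return np.where(mask, x / keep_prob, 0.0)


def word_dropout(token_ids, keep_prob, unk_id, rng):
    """Replace each input token by the UNK id with probability 1 - keep_prob."""
    token_ids = np.asarray(token_ids)
    keep = rng.random(token_ids.shape) < keep_prob
    return np.where(keep, token_ids, unk_id)
```

With `keep_prob=1.0` both functions return their input unchanged, which is the behaviour expected at evaluation time.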
On CTD, full dataset, the best hyperparameters were: , , , , , , , , , . We used embedding dimension .
On CTD, BRAN-filtered dataset, the best hyperparameters were: , , , , , , , , , . We used embedding dimension .
On the Disease-Phenotype dataset, the best hyperparameters were: , , , , , , , , , . We used embedding dimension .
Appendix F Disease-Phenotype Dataset
Our final NER model achieved a micro F1 score of 75.00 on the development set and 72.57 on the test set. The train/dev/test split consisted of 173/23/23 documents with 1294/118/160 mentions, respectively.
Table 10 shows the train/dev/test splits for the disease-phenotype relation extraction dataset.
Figure 3 shows the recall of the phenotype candidate set.