Identifying the standardized concepts referred to in a scientific article or a clinical record is a key component of biomedical natural language processing, enabling harmonization across different documents for search and semantic analysis Jovanovic2017. Medical entity linking, also referred to as medical concept normalization (MCN), is the task of harmonizing different surface forms for the same concepts, so that documents mentioning Amyotrophic lateral sclerosis and Lou Gehrig’s Disease can be correctly analyzed as referring to the same disease. MCN has historically relied heavily on string matching approaches metamap; ctakes, leveraging broad-coverage controlled vocabularies such as the Systematized Nomenclature of Medicine (SNOMED) snomed and RxNorm rxnorm, which map a wide variety of surface forms to canonical identifiers. However, string matching approaches are susceptible to a variety of issues, including sensitivity to misspellings Lai2015; lack of coverage, especially in specialized domains and languages other than English Skeppstedt2012; Kuang2015; and mismatch between patient and provider language Zeng2006; Park2016a.
Deep learning techniques offer an alternative approach to address these challenges in a rapidly scalable way Zhao2019AAAI; Weegar2019. However, the sheer number of standardized medical concepts to choose from poses a significant challenge for broad-coverage application of deep learning: the Unified Medical Language System, or UMLS umls, a large-scale resource combining over 100 biomedical terminologies and commonly used for medical entity linking, contains over 4.2 million unique concepts. For any given application, most of these concepts are irrelevant, and should thus be ignored as candidate labels Figueroa2009. One significant resource provided by the UMLS that has not yet been systematically explored for deep medical entity linking is the semantic type assigned to each concept: by identifying the appropriate semantic type for a medical entity mention, the set of candidate labels can be reduced by an order of magnitude.
In this paper, we investigate the impact of incorporating a mention disambiguation step into existing medical entity linkers. For this, we propose a novel method, MedType, which assigns a semantic type to each identified mention based on its context in the text and utilizes that type to refine the list of candidate concepts. Since the medical domain suffers from a dearth of training data, we create a novel corpus for medical entity linking, WikiMed, for pre-training MedType. We also exploit the distant supervision paradigm distant_sup to create a large-scale noisily annotated corpus, PubMedDS, which further enhances MedType's predictions. Our contributions can be summarized as follows:
We probe the impact of incorporating a mention disambiguation step in existing entity linkers. For this, we propose MedType, a novel deep learning system for detecting the semantic type of a medical entity mention based on its context.
We present two large-scale datasets for medical entity linking research: WikiMed and PubMedDS. WikiMed includes over 650,000 mentions normalized to concepts in UMLS. We also utilize distant supervision to create a noisily annotated corpus, PubMedDS, with more than 5 million normalized mentions spanning 3.5 million documents.
Through extensive experiments, we demonstrate that MedType, trained on our novel datasets, prunes out a significant number of irrelevant candidate concepts and gives state-of-the-art performance for medical entity linking.
MedType's source code and the datasets proposed in this paper are publicly available at http://github.com/svjan5/medtype.
2 Related Work
Entity Linking: Identifying the database concepts and named entities mentioned in a text is an essential task for semantic understanding of natural text and for information extraction end-to-end-EL. It involves two steps: (i) detecting mentions of entities of interest in the text (NER); and (ii) linking the extracted mentions to entries in a database such as Wikipedia or Freebase (entity linking, EL). A variety of techniques have been investigated to improve linking performance: entity typing deeptype2018; onoe2019fine, using densified knowledge graphs ELDEN, leveraging dynamic graph convolution networks dynamicgcn-EL, and performing multi-task joint learning of NER and EL joint-learning.
Medical Entity Linking: Entity linking has long been a prominent task in biomedical NLP, with multiple heavily-used tools for identifying medical concepts in biomedical literature and clinical text metamap1; ctakes. Medical entity linking has historically relied on large, expert-curated vocabularies of standardized medical terminology for string matching-based approaches, with great success Meystre2008; Jovanovic2017. However, deep learning-based medical EL is a relatively nascent area of research, aimed at addressing the limitations of string matching Soysal2018; Zhao2019a; Tutubalina2018; triplet-network. These approaches typically choose from the entire set of medical concepts in a given vocabulary (often hundreds of thousands of choices), thus trading the coverage limitations of string matching for potential overgeneration of candidates for a given medical concept mention.
Entity Typing: One strategy that informs candidate generation, among many other downstream NLP tasks, is entity typing: this refers to the act of assigning a semantic type to mentions in the text. Entity typing improves performance in diverse downstream NLP tasks such as co-reference resolution durrett2014joint, relation extraction yaghoobzadeh-etal-2017-noise, and question answering das-QA, among others. Fine-grained entity type information has proven to be highly effective for entity linking deeptype2018; ling-etal-2015-design. Consequently, recent research has focused on fine-grained entity typing using a wide array of techniques such as incorporating hierarchical losses through bi-linear mappings murty-etal-2018-hierarchical, distant supervision using head-words choi-etal-2018-ultra, semi-supervised graph-based classification HMGCN, contextualized word embeddings with latent type representations lin-ji-2019-attentive and enhanced representations using knowledge bases ernie.
Entity Typing in Medical Domain: In the medical domain, the role of entity typing in selecting good candidates for medical concept mentions was recognized in some of the earliest rule-based medical entity extraction tools Aronson1994. However, its use in deep learning-based systems has been very limited. semantictype utilize neural language modeling frameworks to identify the semantic type of a mention in medical text, but do not apply their predictions downstream; by contrast, MedLinker utilizes approximate dictionary matching heuristics with specialized neural language models to improve both medical entity typing and entity linking in biomedical literature. However, these works have not explored the efficacy of utilizing type information within the entity linking task itself, a key component of identifying high-quality candidate concepts and the main focus of this work.
3 Method Overview
Entity linking is defined as the task of identifying mentions in a given textual passage and linking them to an entity in a predefined knowledge graph. In the medical domain, most existing entity linkers ctakes; metamap; quick_umls adopt a two-step approach: the first step recognizes named entities in the text (NER), and the second generates a set of candidate entities for each identified mention. However, in most existing systems, the inability to resolve ambiguity among possible concepts has been identified as their greatest weakness metamap. For instance, the mention ‘cold’ can refer to distinct concepts such as common cold (disease), cold temperature (natural phenomenon), or the cold brand of chlorpheniramine-phenylpropanolamine (pharmacologic substance). Thus, including an additional entity disambiguation step in existing entity linkers has the potential to improve their performance. Prior work deeptype has demonstrated the effectiveness of this approach for the wikification task, which involves linking mentions to Wikipedia. However, it has not yet been explored in the medical domain.
MedType: In this work, we probe the impact of incorporating an entity disambiguation step into several existing medical entity linking systems. For this, we propose MedType, which takes in the set of generated candidates for a mention and filters them based on the mention's context in the text. MedType utilizes recent advances in deep learning-based language modeling elmo; bert for encoding context. The overall architecture of MedType is shown in Figure 1, and further details of the method are presented in Section 4. Several existing medical entity linkers, such as cTAKES ctakes, QuickUMLS quick_umls, and MetaMap metamap, are non-neural and predominantly rely on substring matching for linking mentions to entities; even for neural linkers like SciSpacy scispacy, MedType allows them to utilize context for refining their output and to leverage advances in deep learning-based techniques. In addition, we propose two large-scale annotated datasets, WikiMed and PubMedDS, which can serve as a valuable resource for the community; in this paper, we demonstrate their effectiveness for the entity disambiguation task. More details on dataset creation are provided in Section 5.
Formally, the entity linking task is defined as follows. Let $\mathcal{E}$ be a predefined set of entities in a knowledge graph and $\mathcal{D} = \{w_1, w_2, \ldots, w_N\}$ be a given unstructured text with $N$ tokens. The entity linking task involves identifying mentions of the form $m = (w_i, \ldots, w_j)$ in $\mathcal{D}$ and mapping them to an entity $e \in \mathcal{E}$. In this paper, we define $\mathcal{E}$ as the set of entities in the Unified Medical Language System, or UMLS umls, a large-scale compendium of multiple source vocabularies which provides broad coverage of around 4.2 million biomedical concepts. Most existing entity linking methods follow a two-step procedure: (1) Named Entity Recognition (NER), the task of recognizing concept mentions in the text, and (2) Candidate Generation, which involves generating a probable set of candidates, along with scores, for each identified mention $m$. Given the prominence of ambiguity among mentions in text, such as cold, which can refer to distinct entities in different scenarios, including an additional step for disambiguating mentions can be helpful for handling polysemy.
As depicted in Figure 1, our proposed method, MedType, takes in the generated candidate set for a given mention and outputs a filtered set based on the context of the mention in the text. This is obtained by predicting the semantic type of the mention from its context. The UMLS Semantic Network maps all UMLS concepts into one or more semantic types, such as Anatomical Structure and Pharmacologic Substance, out of a total of 127 types. MedType thus models entity typing as a multi-label classification problem, and the predicted types of a given mention are used to prune its candidate set. As many of these types are very fine-grained and have sparse coverage in entity linking corpora, we use the coarse-grained Semantic Groups developed by sem_grouping and is-a relationships in the Semantic Network to relabel these types into 24 semantic groups.
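The type-based pruning step above admits a compact sketch. This is a minimal illustration under our reading of the method; the function and variable names are hypothetical, and the fallback to the unfiltered list when no candidate matches is our assumption, not necessarily the actual implementation:

```python
def prune_candidates(candidates, predicted_types, concept_types):
    """Keep only candidates whose semantic groups intersect the predicted types.

    candidates      -- list of (concept_id, score) pairs from a base entity linker
    predicted_types -- set of semantic groups predicted for the mention
    concept_types   -- dict mapping concept_id -> set of semantic groups (from UMLS)
    """
    pruned = [(cid, score) for cid, score in candidates
              if concept_types.get(cid, set()) & predicted_types]
    # Fall back to the original candidate list if filtering removes everything,
    # so the disambiguation step cannot leave the base linker with no output.
    return pruned if pruned else candidates
```

For example, for the mention cold with predicted group Disorders, the candidate for cold temperature (group Phenomena) would be dropped while common cold would survive.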
Context Encoder: For training MedType, we take input data of the form $(x, y)$, where $x$ denotes a mention $m$ along with its context, comprising the neighboring tokens in a window of size $k$, i.e., $x = (w_{i-k}, \ldots, w_i, \ldots, w_j, \ldots, w_{j+k})$, and $y$ is the semantic type. Motivated by the ability of Transformer-based models transformer to handle polysemous tokens and their superior modeling of long-range dependencies, we utilize a pre-trained BERT bert encoder and fine-tune it for our type prediction task. We feed the mention concatenated with its context as input to the encoder, along with positional information that allows the model to distinguish between them. Finally, the embedding corresponding to the initial [CLS] token is passed to a feed-forward classifier for 24-way prediction of semantic types.
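The construction of the encoder input can be sketched as follows. This is a simplified illustration: the helper name is hypothetical, whitespace tokenization stands in for BERT wordpiece tokenization, and a real implementation would emit subword IDs rather than strings:

```python
def build_typing_input(tokens, i, j, k):
    """Construct the mention/context pair fed to the type classifier.

    tokens -- tokenized document
    i, j   -- inclusive span of the mention m = (tokens[i], ..., tokens[j])
    k      -- context window size on each side of the mention
    Returns (mention, context) strings, e.g. to be joined as
    "[CLS] mention [SEP] context [SEP]" for a BERT-style encoder.
    """
    mention = " ".join(tokens[i:j + 1])
    left = tokens[max(0, i - k):i]           # up to k tokens before the mention
    right = tokens[j + 1:j + 1 + k]          # up to k tokens after the mention
    context = " ".join(left + [mention] + right)
    return mention, context
```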
5 Training Corpora
The availability of large-scale public datasets has helped foster research in various domains imagenet; snli. However, obtaining and sharing medical data presents a major obstacle given the sensitive nature of the data and privacy concerns med_data_issues. Moreover, crowd-sourced annotation of medical information requires expensive and difficult-to-obtain expertise to ensure data quality med_data_annotation. We overcome these challenges by automatically creating two large training datasets for entity disambiguation.
5.1 Transfer Learning with WikiMed
Wikipedia, though not a medical data source, does include many mentions of medical concepts that can inform entity typing models. In order to leverage transfer learning, an effective tool for addressing insufficient training data in specialized domains transfer1; transfer2; transfer3, we developed a Wikipedia dataset for medical entity linking, which we use to pre-train MedType before fine-tuning it on medical data. We utilize three knowledge sources to map hyperlinks in Wikipedia articles to UMLS entities: Wikidata wikidata, Freebase freebase, and the NCBI Taxonomy ncbi_mapping. This maps approximately 60,500 Wikipedia page IDs to UMLS concepts, with which we extract 652,000 relevant mentions from 6 million Wikipedia articles to create the WikiMed dataset. Further statistics of WikiMed are summarized in Table 1. Since Wikipedia is human-curated, we assume that most of its linkages are of sufficiently high quality to yield accurate mappings. Thus, we also utilize WikiMed as one of the evaluation datasets for assessing the performance of medical entity linkers on non-medical domain text. WikiMed is an order of magnitude larger than previous medical entity linking datasets such as MedMentions medmentions and the NCBI disease corpus ncbi_data, and covers medical entities from diverse semantic types. We provide a detailed comparison of the semantic type coverage of these datasets in Appendix Table 5.
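The core of the WikiMed extraction can be sketched as a short routine. All names here are hypothetical, and the prebuilt `page_to_cui` dictionary stands in for the actual cross-referencing through Wikidata, Freebase, and the NCBI Taxonomy:

```python
def extract_wikimed_mentions(articles, page_to_cui):
    """Convert Wikipedia hyperlinks into entity-linking annotations.

    articles    -- iterable of (text, links) pairs; each link is a
                   (start, end, page_id) tuple marking a hyperlinked span
    page_to_cui -- mapping from a Wikipedia page ID to a UMLS CUI, assumed
                   to have been built from external knowledge sources
    Yields (mention_text, cui) pairs for links whose target maps into UMLS;
    links to pages outside the mapping are silently skipped.
    """
    for text, links in articles:
        for start, end, page_id in links:
            cui = page_to_cui.get(page_id)
            if cui is not None:
                yield text[start:end], cui
```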
5.2 Distant supervision in biomedical language: PubMedDS
Distant supervision distant_sup enables automatic generation of noisy training data and has been exploited for several tasks ds_event_extraction; ds_entity_linking; ds_emotion1; ds_re_reside, including identifying potential mentions of medical concepts ds_entity_linking_medical. In order to create a large-scale training dataset for medical entity linking drawn from biomedical language, we use distant supervision on PubMed abstracts to generate PubMedDS. Unlike Wikipedia, PubMed abstracts do not include a priori links to database entities; we therefore run a state-of-the-art biomedical NER model scispacy on 20 million PubMed abstracts to extract medical entity mentions. We then make use of the Medical Subject Headings (MeSH) tags assigned to each PubMed article to filter the extracted entity mentions: a mention is retained only when it exactly matches the name of one of the provided MeSH headers. The UMLS maps MeSH headers to UMLS concept identifiers, which we utilize to obtain the semantic type of each extracted mention. This procedure yields PubMedDS, a dataset with 5 million noisily-annotated mentions, which we utilize for pre-training MedType.
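The exact-match filtering rule can be rendered as a minimal sketch. The helper and argument names are hypothetical, and this deliberately simplifies the actual pipeline (e.g., it ignores mention spans and case normalization):

```python
def distant_annotations(ner_mentions, mesh_headers, mesh_to_cui):
    """Filter NER mentions by exact match against an article's MeSH headers.

    ner_mentions -- mention strings extracted by an NER model from one abstract
    mesh_headers -- dict mapping MeSH header names to MeSH IDs for the article
    mesh_to_cui  -- UMLS-provided mapping from MeSH ID to UMLS CUI
    Returns (mention, cui) pairs; a mention is kept only when its surface form
    exactly matches one of the article's MeSH header names.
    """
    kept = []
    for mention in ner_mentions:
        mesh_id = mesh_headers.get(mention)
        if mesh_id is not None and mesh_id in mesh_to_cui:
            kept.append((mention, mesh_to_cui[mesh_id]))
    return kept
```

The strictness of this rule is what trades coverage for precision: most NER mentions are discarded, but the survivors inherit a reliable concept label from the article's curated MeSH indexing.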
6 Experimental Setup
6.1 Datasets
In our experiments, we evaluate the models on four benchmark datasets for medical entity linking, as well as our novel WikiMed dataset. Dataset statistics are provided in Table 1.
NCBI The NCBI-disease corpus of ncbi_data consists of 793 PubMed abstracts annotated with disease mentions and their corresponding concepts in the MEDIC vocabulary MEDIC.
Bio CDR The CDR corpus cdr_data consists of 1,500 PubMed abstracts annotated with mentions of chemicals, diseases, and relations between them. These mentions were normalized to their unique concept identifiers, using MeSH as the controlled vocabulary.
ShARe The ShARe corpus data of share-2014 comprises 431 anonymized clinical notes, obtained from the MIMIC II clinical dataset mimic2 and annotated with disorder mentions.
MedMentions The MedMentions data of medmentions consists of 4,392 PubMed abstracts annotated with a broad range of biomedical concept mentions. Each mention is labelled with a unique concept identifier and a semantic type, using the UMLS as the target ontology.
6.2 Medical Entity Linkers
To evaluate the effectiveness of MedType, we utilize it to filter the candidate medical concepts identified by the following entity linking models.
MetaMap metamap1 leverages a knowledge-intensive approach based on symbolic NLP and linguistic techniques to map biomedical mentions in text to UMLS concepts. MetaMap has since evolved metamap to incorporate additional features such as detection of author-defined acronyms and negations.
cTAKES ctakes uses a terminology-agnostic dictionary look-up algorithm for mapping named entities to UMLS concepts. We utilize the Clinical Pipeline of cTAKES augmented with the LVG Annotator (https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+LVG).
MetaMapLite metamaplite re-implements the basic functionalities of MetaMap with an additional emphasis on real-time processing and competitive performance.
QuickUMLS quick_umls is a fast, unsupervised algorithm that leverages approximate dictionary-matching techniques for mapping biomedical entities in text to UMLS concepts.
ScispaCy scispacy builds upon the robust spaCy library spacy for several biomedical and scientific text-processing applications such as parsing, named entity recognition, and entity linking using different neural architectures.
7 Results
In this section, we provide details of the following major findings:
7.1 Entity Linking with MedType
To demonstrate the effectiveness of incorporating MedType for entity disambiguation, we compare its performance with the entity linkers listed in Section 6.2. We report precision, recall, and F1-score for partial mention and entity identifier match, as proposed by partial_metric. We also report scores with strict mention/entity ID match (used in TAC KBP 2013) and with entity ID match only, in Appendix Tables 6 and 7, respectively. For computing the scores, we use a publicly available entity linking evaluation library (https://github.com/wikilinks/neleval). To highlight the maximum possible improvement from entity disambiguation, we also report results with semantic type information from an oracle. As described in Section 4, our experiments use 24 coarse-grained semantic types instead of the 127 types provided by UMLS. Hence, for comparison, we present results with both fine-type (Oracle (F)) and coarse-type (Oracle (C)) oracles.
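As a reference point for how the reported scores are computed, a strict-match variant of the precision/recall/F1 computation can be sketched as follows. This is a simplified stand-in: the paper's numbers come from the neleval library's partial-match scoring, which additionally credits overlapping spans, whereas this sketch only counts exact span-and-concept matches:

```python
def strict_prf(gold, pred):
    """Micro precision/recall/F1 over annotation sets.

    gold, pred -- iterables of (doc_id, span, concept_id) tuples; a prediction
    counts as a true positive only if the exact tuple appears in the gold set.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```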
The overall results are presented in Table 2. For each model, we report its default performance and the change in scores when used with different entity disambiguation methods. The results for MedType are obtained after training it on WikiMed, PubMedDS, and the training split of the corresponding datasets. Overall, we find that across all models and datasets, MedType gives a substantial improvement in performance; moreover, it causes no degradation in any setting. Thus, including an entity disambiguation module only enhances entity linking systems. Further, we note that the gain with MedType is comparable to the improvement from using an oracle in most settings. The small improvement in the case of cTAKES is attributed to the fact that it generates substantially fewer candidates for detected mentions, leaving the disambiguation module little scope for improvement. The results overall support the central thesis of this work: an entity disambiguation module helps improve entity linking.
Table 3 (fragment): |Training data (T)|87.5|88.2|70.2|73.3|
7.2 Effect of Transfer Learning and Distant Supervision on Entity Disambiguation
In this section, we analyze the impact of using our proposed datasets, WikiMed and PubMedDS, for entity disambiguation. For this, we compare the performance of MedType trained on different datasets. We use the area under the precision-recall curve (AUC) as our evaluation metric. In our results, Training data (T) denotes MedType trained using the training split of the corresponding dataset. The effect of WikiMed and PubMedDS is analyzed by first pre-training on each of them and then fine-tuning the model on the training split. Finally, the combined model utilizes both datasets: it concatenates the BERT encodings from the WikiMed- and PubMedDS-pretrained models and passes them to a classifier trained using the training dataset. The overall results are presented in Table 3. We find that there is a substantial gain in performance when using WikiMed and PubMedDS along with the training data. Further, we find that the combined model, which incorporates the benefits of both corpora, gives the best performance on all datasets. This shows that the two datasets contain complementary information relevant for entity disambiguation. Overall, we obtain an average absolute increase of 6.7, 6.6, and 9.4 AUC from WikiMed, PubMedDS, and the combined model, respectively.
7.3 Analyzing Improvement from WikiMed and PubMedDS
In this section, we investigate the cause of the improvement in entity disambiguation performance from the WikiMed and PubMedDS datasets discussed in the previous section (Section 7.2). Here, we report F1-scores on prominent semantic types for the T model and for the models pre-trained on WikiMed and PubMedDS. In Figure 2, we report the improvement in performance for the two datasets on which we obtain the highest gains. The results show that for semantic types such as Disorder, utilizing WikiMed and PubMedDS gives absolute increases of 32 and 12 points on NCBI and of 9 and 20 points on ShARe, respectively. Moreover, on semantic types such as Sign or Symptom, for which the training data provide no coverage, utilizing the additional corpora leads to a drastic increase in performance on NCBI and a substantial gain on the ShARe dataset. These results thus explain the overall gain we obtain from utilizing these corpora.
Combining MedType with Biomedical WSD Methods: Disambiguating the candidate concepts produced by medical entity extraction pipelines has been a long-standing area of research, with several tools developed to integrate with existing pipelines. The YTEX suite of algorithms Garla2011; Garla2012 extends both MetaMap and cTAKES with a disambiguation module that helps to reduce noise considerably, although Osborne2013 found that it often over-filtered correct concepts. These methods can be combined with MedType to create a multi-stage filtering approach for disambiguation: MedType performs coarse filtering to a high-confidence set based on predicted type, a key step for narrowing down over-generated candidate sets in open-ended deep learning systems; disambiguation methods can then perform fine-grained selection of the correct candidate to further improve entity linking performance. We highlight this as an important direction for future work on medical entity linking.
Novel, high-precision datasets: Automatic dataset creation is prone to a variety of errors. In order to assess the quality of the distant supervision used to create the PubMedDS dataset, we identified the subset of documents overlapping with three manually-annotated datasets using PubMed abstracts: MedMentions, NCBI, and Bio CDR. We note that all the documents in the three datasets are part of PubMedDS. This allowed us to evaluate the precision and coverage of our distantly-supervised mentions with respect to human annotations. The results of this analysis are reported in Table 4. Reflecting the strict requirements for keeping a mention in our dataset (identification with a NER tool and exact match to a provided MeSH header), we find that the coverage of all possible mentions in the document is low, but precision is relatively high at around 72%. Thus, the data that are provided in PubMedDS are of high quality for training entity linking models.
WikiMed was created using human-provided links and expert-curated mappings, and can, therefore, be considered inherently high precision. However, Wikipedia editors are encouraged to only link early occurrences of an entity in the articles, as opposed to exhaustive annotations; thus, as with our findings in PubMedDS, we can consider WikiMed to be relatively low in coverage of all true mentions in the documents. This is reflected in the low precision numbers reported in Table 2: by relying on string matching heuristics, the entity linking toolkits are identifying concept mentions that were not linked to the relevant page by humans. Thus, our novel datasets are not good fits for training medical NER models but do provide a reliable signal for entity linking.
We presented MedType, a novel tool for improving medical entity linking by predicting the semantic type of a medical concept mention and filtering out candidate concepts of the wrong type. MedType improves entity linking performance for a variety of popular medical entity extraction toolkits across several benchmark datasets, clearly demonstrating the utility of a type-based filtering step in medical entity linking. We further presented two novel large-scale, automatically-created datasets of medical entity mentions: WikiMed, a Wikipedia-based dataset for cross-domain transfer learning, and PubMedDS, a distantly-supervised dataset of medical entity mentions in biomedical abstracts. Pre-training on these datasets substantially improves performance, and we share them with the community as a resource for medical entity linking research.
Appendix A Mention Category Distribution
|Activities & Behaviors|4|7|1|12,249|420|164,915|
|Chemicals & Drugs|0|32,436|1|46,420|19,033|3,458,278|
|Concepts & Ideas|0|0|1|60,475|1,618|230,383|
|Disease or Syndrome|10,760|22,603|5,895|11,709|53,564|438,327|
|Genes & Molecular Sequences|20|0|0|5,582|341|3,702|
|Mental or Behavioral Dysfunction|293|3,657|410|2,463|14,444|94,852|
|Sign or Symptom|211|9,844|2,687|1,809|2,103|233,122|