Biomedical Entity Linking with Contrastive Context Matching

by Shogo Ujiie, et al.

We introduce BioCoM, a contrastive learning framework for biomedical entity linking that uses only two resources: a small-sized dictionary and a large number of raw biomedical articles. Specifically, we build training instances from raw PubMed articles by dictionary matching and use them to train a context-aware entity linking model with contrastive learning. At inference time, we predict the normalized biomedical entity through a nearest-neighbor search. Our results show that BioCoM substantially outperforms state-of-the-art models, especially in low-resource settings, by effectively using the context of entities.





1 Introduction

Biomedical entity normalization, often referred to as entity linking in general domains, is the task of mapping entity mentions to unified concepts in a biomedical knowledge graph. It is a useful preprocessing step for many downstream tasks (e.g., information extraction Lee et al. (2016) and relation extraction Xu et al. (2016)). The task is challenging because similar-looking words can have different meanings and the same concept can have different surface forms. For example, while “TDP-43 proteinopathy” and “TdP” appear similar, they refer to different concepts: “TDP-43 proteinopathy” is a generic term for diseases caused by abnormality of the “TDP-43” protein in the nervous system (e.g., amyotrophic lateral sclerosis), whereas “TdP” is a type of arrhythmia.

A common approach to normalizing biomedical entities is to link each entity mention to the most similar synonym (and the corresponding concept) defined in a dictionary, where similarity is computed over representations of entities and synonyms. Various approaches have been introduced to obtain these representations, such as supervised learning on large labeled datasets Mondal et al. (2019); Sung et al. (2020); Schumacher et al. (2020); Ujiie et al. (2021) and self-supervised learning on a large dictionary Liu et al. (2021). Although these methods achieve high performance, they rely on large-scale, difficult-to-obtain resources (e.g., manually annotated datasets or a large-sized dictionary).

Figure 1: Overview of our framework, BioCoM. We build training instances from raw biomedical articles by dictionary matching and use them to train a context-aware biomedical entity normalization model with contrastive learning.

This study aims to develop a biomedical entity normalization method for diseases that is effective even in limited-resource scenarios. Specifically, our work makes the following assumptions: manually annotated data are unavailable, and only a small-sized dictionary (i.e., one in which the number of synonyms per concept is small) is accessible. These assumptions are practical in medical domain studies because biomedical concept annotation is costly, requiring domain knowledge. Note that diseases are among the most important biomedical entities; therefore, this study, like most studies in the biomedical text processing community, focuses on the normalization of disease mentions Leaman et al. (2013); Lou et al. (2017).

This study proposes BioCoM, a contrastive learning framework for context-aware biomedical entity normalization. BioCoM uses only a small-sized dictionary and a large number of raw biomedical articles. Figure 1 illustrates the overview of BioCoM. To use the context of entities as a training signal, we construct a corpus of entity-linked sentences, i.e., a set of sentences that contain synonyms in the dictionary, from raw biomedical articles. We show that the context obtained from this corpus strongly benefits the entity linking model: BioCoM can obtain cues for linking from the words surrounding the entities, such as synonyms that are not in the dictionary but are present in the sentence. Thus, it works well even when the dictionary is small.

The contributions of this study are as follows: (i) we introduce a framework for biomedical entity normalization that requires only a small-sized dictionary and a large number of raw biomedical articles; (ii) we evaluate limited-resource settings without manually annotated datasets or other large external resources; and (iii) we achieve state-of-the-art performance on three biomedical entity linking datasets, with particularly large gains in limited-resource scenarios. The code is available at

2 Related Work

Distributed representations of entities play an important role in biomedical entity normalization tasks Mondal et al. (2019); Li et al. (2017); Schumacher et al. (2020). Recently, bidirectional encoder representations from transformers (BERT) models Devlin et al. (2019) pre-trained on large biomedical corpora have shown their effectiveness in biomedical entity normalization. For example, BioSyn Sung et al. (2020), which uses BioBERT Lee et al. (2020) to obtain entity representations and fine-tunes them using synonym marginalization, has outperformed many existing methods. However, this approach requires a large annotated dataset for training.

Conversely, SapBert Liu et al. (2021) uses only a large biomedical knowledge graph, rather than an annotated dataset, for fine-tuning the entity representation and yields state-of-the-art performance in the biomedical entity normalization task. However, SapBert’s performance is highly dependent on the size of the knowledge graph, and we have experimentally confirmed that the normalization performance degrades significantly when available resources are limited.

Our framework, BioCoM, is in line with SapBert in that it does not use an annotated training dataset. However, BioCoM additionally considers the context of entities, using contextual entity representations derived from entity-linked sentences.

3 Methods

We propose a contrastive learning framework called BioCoM for context-aware biomedical entity normalization that requires neither manually annotated datasets nor large-scale dictionaries. “Context-aware” indicates that BioCoM takes both entity mentions and entire sentences as input. Formally, given a sentence and its i-th entity mention m_i, we predict the corresponding concept c ∈ C, where C is the set of concepts in the dictionary. In this framework, input entity mentions and synonyms are encoded into contextual representations. These representations are learned from a corpus of entity-linked sentences through contrastive learning He et al. (2020); Chen et al. (2020). At inference time, the input entity is normalized by context matching against all the entities in the corpus. We first describe the corpus construction process, followed by the training procedure and inference.

3.1 Entity-Linked Sentences

To obtain the context of the dictionary’s synonyms, we require a corpus of entity-linked sentences, i.e., sentences in which the entities (here, synonyms) are linked to their corresponding concepts. We build this corpus from raw biomedical articles by dictionary matching against the synonyms (Figure 1). We assume access to a database D of mention–concept pairs, where (m_j^i, c_j^i) is the j-th entity mention and concept pair in the i-th sentence. Please refer to Section 4.1 for details of the corpus used in our experiments.

3.2 Contrastive Learning

To conduct context matching, the entities in the database should be embedded into a feature space in which entities with the same concept are close to each other. Hence, we adopt contrastive learning, a framework for learning representations from similar/dissimilar pairs. In this study, the positive (similar) and negative (dissimilar) pairs come from the mini-batch of sentences: given an entity in the mini-batch, entities with the same concept within the mini-batch are treated as positive samples, while all others are negative samples.

Regarding the loss function, we use the Multi-Similarity loss Wang et al. (2019), a metric-learning loss function that considers the relative similarities between positive and negative pairs. Let B denote the set of entities in the mini-batch, and let P_i and N_i denote the sets of positive and negative samples for entity e_i, respectively. We define the cosine similarity of two entities e_i and e_j as S_ij, resulting in a similarity matrix S. Based on S, P_i, and N_i, the training objective is:

L = (1/|B|) Σ_{i ∈ B} { (1/α) log[1 + Σ_{k ∈ P_i} exp(−α(S_ik − λ))] + (1/β) log[1 + Σ_{k ∈ N_i} exp(β(S_ik − λ))] }

where α and β are the temperature scales and λ is the offset applied on S. For pair mining, we follow the original paper Wang et al. (2019).
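The Multi-Similarity loss can be sketched in plain Python as follows. This is a minimal reference implementation for illustration, not the authors’ code; the function and variable names are ours, and pair mining is omitted for brevity.

```python
import math

def multi_similarity_loss(S, labels, alpha=2.0, beta=50.0, lam=1.0):
    """Sketch of the Multi-Similarity loss (Wang et al., 2019).

    S[i][j]   : cosine similarity between entities i and j in the mini-batch.
    labels[i] : concept ID of entity i; entities sharing a concept are positives.
    alpha and beta are the temperature scales; lam is the offset applied on S.
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        # Positive and negative similarity sets for entity i (self excluded).
        pos = [S[i][k] for k in range(n) if k != i and labels[k] == labels[i]]
        neg = [S[i][k] for k in range(n) if labels[k] != labels[i]]
        pos_term = math.log(1 + sum(math.exp(-alpha * (s - lam)) for s in pos)) / alpha
        neg_term = math.log(1 + sum(math.exp(beta * (s - lam)) for s in neg)) / beta
        total += pos_term + neg_term
    return total / n
```

The loss decreases as positives move toward similarity 1 and negatives fall below the offset, which is the behavior the training objective above encourages.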

3.3 Context Matching

Our approach identifies a concept for an entity mention m by context matching against the database. Specifically, the problem is a k-nearest neighbor search Fix and Hodges (1989) over the contextual entity representations: the k neighbors N_k(m) are retrieved from the database based on cosine similarity with the contextual representation of the target entity mention m. During inference, we predict the concept that occurs most frequently in N_k(m).
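The inference step, i.e., a cosine-similarity nearest-neighbor search followed by a majority vote over concepts, can be sketched as follows. This is an illustrative pure-Python sketch with hypothetical names; a real system would use an approximate nearest-neighbor index over the BERT vectors.

```python
from collections import Counter

def predict_concept(query_vec, database, k=10):
    """Sketch of context matching: k-NN over contextual entity vectors.

    database  : list of (vector, concept_id) pairs built from the
                entity-linked sentences.
    query_vec : contextual representation of the input mention.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    # Take the k database entries most similar to the query ...
    neighbors = sorted(database, key=lambda e: cos(query_vec, e[0]), reverse=True)[:k]
    # ... and predict the most frequent concept among them.
    return Counter(c for _, c in neighbors).most_common(1)[0][0]
```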

4 Experiments

4.1 Experimental Setting


Herein, we use MEDIC Davis et al. (2012) as the dictionary. MEDIC lists 13,063 diseases from OMIM and MeSH and has over 70,000 synonyms linked to the concepts. To limit the dictionary’s size, we randomly sampled half of the synonyms for each concept. The resulting dictionary used in our experiments has three synonyms per concept on average.

PubMed abstracts were used to construct the entity-linked sentences. Articles that appear in the test datasets were filtered out, yielding approximately 100M sentences. The synonyms in MEDIC that appear in these sentences were then linked to their corresponding concepts. For any two overlapping synonyms in a sentence, we chose the longest match.
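The longest-match dictionary linking described above can be sketched as a greedy scan over tokens. This is an assumption-laden illustration (token-level matching, lower-cased synonyms), not the authors’ implementation.

```python
def link_entities(tokens, synonym_to_concept):
    """Sketch of dictionary matching with longest match.

    synonym_to_concept maps lower-cased synonym token tuples to concept IDs.
    Returns (start, end, concept) spans over the token list.
    """
    max_len = max((len(s) for s in synonym_to_concept), default=0)
    links, i = [], 0
    while i < len(tokens):
        # Try the longest window first, so that of two overlapping
        # synonyms the longer match wins.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in synonym_to_concept:
                links.append((i, i + n, synonym_to_concept[span]))
                i += n
                break
        else:
            i += 1
    return links
```

Applied to 100M sentences, scans like this produce the database of mention–concept pairs used for training and inference.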

Test Datasets & Evaluation Metric

We evaluated BioCoM on three datasets for the biomedical entity normalization task: the NCBI disease corpus (NCBID) Doğan et al. (2014), BioCreative V Chemical Disease Relation (BC5CDR) Li et al. (2016), and MedMentions Mohan and Li (2018). Following previous studies D’Souza and Ng (2015); Mondal et al. (2019), we used accuracy as the evaluation metric.

Because BC5CDR and MedMentions contain mentions whose concepts are not in MEDIC, those mentions were filtered out during evaluation. We refer to the filtered datasets as “BC5CDR-d” and “MedMentions-d,” respectively.

Model Details

The contextual representation of each entity was obtained from PubMedBERT Gu et al. (2020), a BERT model Devlin et al. (2019) trained on a large number of PubMed abstracts. Specifically, the entity tokens were wrapped with the special tokens [ENT] and [/ENT] to mark the beginning and end of the mention, and the modified sequences were fed to PubMedBERT. We use the representation of [ENT] as the contextual representation Soares et al. (2019) of the entity that follows. A diagram of this entity representation can be found in the Supplementary Materials.
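The marker-insertion step can be sketched at the token level as follows; this is an illustrative helper (names ours), and in practice [ENT] and [/ENT] would be registered as special tokens in the tokenizer so they are never split.

```python
def mark_entity(tokens, start, end):
    """Sketch: wrap the entity span tokens[start:end] with [ENT]/[/ENT]
    marker tokens before encoding. The hidden state at the position of
    the inserted [ENT] token is then used as the contextual entity
    representation."""
    return tokens[:start] + ["[ENT]"] + tokens[start:end] + ["[/ENT]"] + tokens[end:]
```

The index of "[ENT]" in the returned sequence tells which encoder hidden state to extract as the entity vector.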

The number of nearest neighbors k was chosen to yield the highest performance on the development set of MedMentions-d.

4.2 Results

Methods NCBID BC5CDR-d MedMentions-d
TF-IDF 45.3 59.9 57.8
PubMedBERT 34.6 42.2 45.2
SapBert 56.5 65.2 65.1
BioCoM 60.3 69.0 71.4
Table 1: Accuracy on three datasets. Note that the dictionary and target concepts that are used for training and inference are different from previous work Liu et al. (2021).
Label: D014178 Input: … phenotype B-I for B-ALL, phenotype … the presence of translocation t(4;11))
SapBert [D054868] partial 11q monosomy syndrome
BioCoM [D014178] … involved in genetic translocations characteristic of b-cell acute lymphoblastic leukemia (B-cell ALL).
Label: D002292 Input: High occurrence of non-clear cell renal cell carcinoma
SapBert [D002292] renal cell carcinomas
BioCoM [D002289] … used to treat renal cell carcinoma, non-small-cell lung cancer and colon cancer …
Table 2: Predicted samples from SapBert and BioCoM in MedMentions-d. The [predicted concept] and the nearest neighbor are shown. Entity mentions and the cue phrases are written in bold and italic, respectively.

We compared BioCoM with three baseline models: tf-idf, PubMedBERT Gu et al. (2020), and SapBert Liu et al. (2021). None of these models requires manually annotated datasets, and all of them use only entity mentions as inputs. Please refer to the Supplementary Materials for the details of these models.

Table 1 shows that BioCoM obtains significant improvements over the baseline models for all the datasets. This demonstrates that the context-matching approach – based on the contextual entity representations learned from the automatically constructed corpus – is effective for biomedical entity normalization.

5 Discussion

5.1 Effect of the Dictionary Size

To determine the effect of the number of synonyms, we performed experiments with different dictionary sizes: 39,959, 86,205, 151,559, and 218,325 synonyms. To augment the synonyms, we used those from UMLS Bodenreider (2004), a large biomedical knowledge graph, in addition to the remaining synonyms from MEDIC.

Figure 2 illustrates the accuracy on MedMentions-d for each dictionary size. BioCoM’s advantage is largest when the number of synonyms in the available dictionary is small, while it remains superior even with the fully expanded dictionary.

Figure 2: Effect of the number of synonyms on MedMentions-d.

5.2 Qualitative Analysis

Here, we qualitatively analyze BioCoM to understand its behavior. Table 2 shows the predicted concepts and the nearest neighbors for input entity mentions in the MedMentions-d.

In the first example, BioCoM predicts the concept correctly, whereas SapBert fails to find the correct answer. This is because BioCoM uses the context as a cue to normalize the entity mention (“B-ALL” in the input sentence and “B-cell ALL” in the nearest-neighbor sentence). This result demonstrates the effectiveness of the context-matching approach and shows that BioCoM successfully learns contextual entity representations by focusing on cues within the sentences.

The second example shows an erroneous prediction by BioCoM owing to an annotation mistake in the automatically constructed database: the entity mention “carcinoma, non-small-cell lung” is wrongly recognized in place of “renal cell carcinomas” in the nearest-neighbor sentence. This happens because BioCoM uses the longest-match algorithm to recognize entity mentions, and “carcinoma, non-small-cell lung” is the longer match. The result is a serious mismatch: BioCoM successfully focuses on “renal cell carcinomas” to encode the entities, but the linked mention is “carcinoma, non-small-cell lung.” This mismatch could be addressed by adopting a named entity recognition system, if available.

6 Conclusion

We introduced BioCoM, a contrastive learning framework for context-aware biomedical entity normalization without manually annotated datasets or large-sized dictionaries. Our experiments on three datasets showed that our model significantly outperformed state-of-the-art models, especially when the available resources are limited.


  • Bodenreider (2004) Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res., 32(Database issue):D267–70.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proc. of ICML, volume 119, pages 1597–1607.
  • Davis et al. (2012) Allan Peter Davis, Thomas C Wiegers, Michael C Rosenstein, and Carolyn J Mattingly. 2012. MEDIC: a practical disease vocabulary used at the comparative toxicogenomics database. Database, 2012:bar065.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, pages 4171–4186.
  • Doğan et al. (2014) Rezarta Islamaj Doğan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform., 47:1–10.
  • D’Souza and Ng (2015) Jennifer D’Souza and Vincent Ng. 2015. Sieve-Based entity linking for the biomedical domain. In Proc. of ACL-IJCNLP, pages 297–302.
  • Fix and Hodges (1989) Evelyn Fix and J L Hodges. 1989. Discriminatory analysis. nonparametric discrimination: Consistency properties. Int. Stat. Rev., 57(3):238–247.
  • Gu et al. (2020) Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proc. of CVPR, pages 9729–9738.
  • Leaman et al. (2013) Robert Leaman, Rezarta Islamaj Dogan, and Zhiyong Lu. 2013. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917.
  • Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
  • Lee et al. (2016) Sunwon Lee, Donghyeon Kim, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Sunkyu Kim, Aik-Choon Tan, and Jaewoo Kang. 2016. BEST: Next-Generation biomedical entity search tool for knowledge discovery from biomedical literature. PLoS One, 11(10):e0164680.
  • Li et al. (2017) Haodi Li, Qingcai Chen, Buzhou Tang, Xiaolong Wang, Hua Xu, Baohua Wang, and Dong Huang. 2017. CNN-based ranking for biomedical entity normalization. BMC Bioinformatics, 18(Suppl 11):385.
  • Li et al. (2016) Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016:baw068.
  • Liu et al. (2021) Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. 2021. Self-alignment pretraining for biomedical entity representations. In Proc. of NAACL-HLT, pages 4228–4238.
  • Lou et al. (2017) Yinxia Lou, Yue Zhang, Tao Qian, Fei Li, Shufeng Xiong, and Donghong Ji. 2017. A transition-based joint model for disease named entity recognition and normalization. Bioinformatics, 33(15):2363–2371.
  • Mohan and Li (2018) Sunil Mohan and Donghui Li. 2018. Medmentions: A large biomedical corpus annotated with umls concepts. In Proc. of Automated Knowledge Base Construction (AKBC).
  • Mondal et al. (2019) Ishani Mondal, Sukannya Purkayastha, Sudeshna Sarkar, Pawan Goyal, Jitesh Pillai, Amitava Bhattacharyya, and Mahanandeeshwar Gattu. 2019. Medical entity linking using triplet network. In Proc. of ClinicalNLP, pages 95–100.
  • Schumacher et al. (2020) Elliot Schumacher, Andriy Mulyar, and Mark Dredze. 2020. Clinical concept linking with contextualized neural representations. In Proc. of ACL, pages 8585–8592.
  • Soares et al. (2019) Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proc. of ACL, pages 2895–2905.
  • Sung et al. (2020) Mujeen Sung, Hwisang Jeon, Jinhyuk Lee, and Jaewoo Kang. 2020. Biomedical entity representations with synonym marginalization. In Proc. of ACL, pages 3641–3650.
  • Ujiie et al. (2021) Shogo Ujiie, Hayate Iso, Shuntaro Yada, Shoko Wakamiya, and Eiji Aramaki. 2021. End-to-end biomedical entity linking with span-based dictionary matching. In Proc. of the 20th Workshop on Biomedical Language Processing, pages 162–167.
  • Wang et al. (2019) Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. 2019. Multi-similarity loss with general pair weighting for deep metric learning. In Proc. of CVPR.
  • Xu et al. (2016) Jun Xu, Yonghui Wu, Yaoyun Zhang, Jingqi Wang, Hee-Jin Lee, and Hua Xu. 2016. CD-REST: a system for extracting chemical-induced disease relation in literature. Database, 2016.

Appendix A Data Sampling Strategy

For every mini-batch, we randomly chose a fixed number of concepts and, for each concept, randomly sampled two sentences containing one of its synonyms. This guarantees that at least one positive pair exists for each concept within the mini-batch. We chose 16 concepts per mini-batch, thus obtaining 32 unique sentences per mini-batch.
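This batch-construction strategy can be sketched as follows; the function and data-structure names are illustrative, not the authors’ code.

```python
import random

def sample_minibatch(concept_to_sentences, n_concepts=16, rng=None):
    """Sketch of mini-batch sampling: pick n_concepts concepts and two
    entity-linked sentences per concept, so that every sampled concept
    has at least one positive pair within the mini-batch."""
    rng = rng or random.Random()
    # Only concepts with at least two linked sentences can form a positive pair.
    eligible = [c for c, sents in concept_to_sentences.items() if len(sents) >= 2]
    batch = []
    for c in rng.sample(eligible, n_concepts):
        for sent in rng.sample(concept_to_sentences[c], 2):
            batch.append((sent, c))
    return batch  # 2 * n_concepts (sentence, concept) training instances
```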

Appendix B Dataset Details

We used three datasets in our experiments: the NCBI disease corpus (NCBID) Doğan et al. (2014), the BioCreative V Chemical Disease Relation (BC5CDR) task corpus Li et al. (2016), and MedMentions Mohan and Li (2018). Table 3 lists the number of documents and disease mentions in each dataset.

dataset # of documents # of mentions
NCBID 100 960
BC5CDR 500 4,424
MedMentions 879 3,795
Table 3: Number of documents and entity mentions in each dataset.


The NCBI disease corpus consists of 793 PubMed titles and abstracts, each containing manually annotated disease mentions. The corpus is split into training (593), development (100), and test (100) sets. Each disease mention is mapped to a CUI contained in the MEDIC dictionary Davis et al. (2012).


BC5CDR is a chemical-induced disease relation extraction corpus comprising 1,500 PubMed abstracts, equally divided into training, development, and test sets. BC5CDR provides manually annotated disease and chemical entities, which are mapped to the MEDIC and Comparative Toxicogenomics Database (CTD) chemical dictionaries, respectively. We used only the disease entities linked to concepts in MEDIC.


MedMentions is a large-scale annotated resource for biomedical concept recognition. It provides more than 4,000 PubMed abstracts and over 350,000 mentions linked to concepts in the Unified Medical Language System (UMLS). For our experiments, we mapped the UMLS concept unique identifiers to MEDIC entities. Because UMLS contains many kinds of medical concepts, such as drugs and genes, we filtered out the entities that did not refer to diseases. We refer to this filtered dataset as MedMentions-d.

Appendix C Hyperparameters

During training, we randomly selected 16 concepts and extracted two sentences for each concept (i.e., 32 sentences per mini-batch). Models were trained with the temperature scale α set to 2, the temperature scale β set to 50, the offset λ set to 1, and the learning rate set to 1e-5.

Appendix D Baseline Models

The baseline entity representations used herein are as follows:

  • Tf-idf: Tf-idf over character-level uni-grams and bi-grams. This achieves competitive performance, as reported in previous work Sung et al. (2020).

  • PubMedBERT: The representations of [CLS] from PubMedBERT without fine-tuning.

  • SapBert: The representations of [CLS] from PubMedBERT trained similar to that in Liu et al. (2021). Note that the size of the dictionary used in training and inference is equal to that used for our model, and is different from the original paper.

Appendix E Entity Tag

Methods NCBID BC5CDR MedMentions
BioCoM-all 60.3 69.0 71.4
BioCoM-one 59.3 66.9 70.7
Table 4: Entity tags. “all” means that all entities in a sentence are linked, and “one” denotes that only one entity is linked per sentence.

In practice, disease mentions must be identified somehow (e.g., by a named entity recognition system). Because the database constructed with our method links only the concepts in the dictionary, a gap exists between the input sentence and the database. This may cause unexpected attention to the [ENT] tags of mentions not included in the dictionary. One possible way to alleviate this problem is to link only one entity per sentence: if a sentence has multiple entities, we create multiple copies of the sentence, each with a single entity linked. We trained and evaluated the model in this setting on MedMentions.

Table 4 shows the result. The model trained on sentences in which only one entity is linked performs worse. One possible reason is the smaller number of entities per mini-batch, so that the number of negative samples is insufficient for training.

Appendix F Effect of the Corpus Size

Figure 3: Number of sentences vs accuracy

When a large number of documents is used for the nearest-neighbor search, using all sentences for normalization is memory-intensive. We therefore investigated how memory usage can be reduced at prediction time. One possible way is to randomly sample the sentences in the database. We evaluated the model with 1, 5, 10, and 20 sampled sentences per dictionary synonym on MedMentions-d.
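The subsampling step can be sketched as follows; this is an illustrative helper with hypothetical names, capping the number of database entries kept per dictionary synonym.

```python
import random

def subsample_database(synonym_to_entries, per_synonym=20, rng=None):
    """Sketch of the memory-reduction step: keep at most `per_synonym`
    entity-linked sentences per dictionary synonym in the k-NN database."""
    rng = rng or random.Random()
    kept = []
    for entries in synonym_to_entries.values():
        k = min(per_synonym, len(entries))
        kept.extend(rng.sample(entries, k))
    return kept
```

Capping per synonym, rather than sampling uniformly over all sentences, limits the footprint of very common dictionary names while keeping rare synonyms fully represented.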

Figure 3 shows the results. Surprisingly, we found that the accuracy of the model with random sampling was higher than that of the model using all the sentences. One possible reason is that dictionary names consisting of common words (e.g., cancer and tumor) are linked in many sentences, and these sentences introduce harmful noise into our model. Regarding the number of sentences, although the model with 20 sentences per concept performed best, the model can normalize mentions with even fewer sentences. This implies that our model learns robust contextual entity representations.

Label: D000077273 Input: … histopathology consisted of papillary thyroid carcinoma (PTC) (n = 91, 86.7%).
SapBert [C536943] TCC
BioCoM [D000077273] … for detecting cervical lymph node (LN) metastasis in papillary thyroid carcinoma (PTC).
Label: D015430 Input: … generations on offspring metabolic traits , including weight and fat gain, …
SapBert [D015430] gain, weight
BioCoM [D001835] … more elderly and had lower weights , body mass indices and arm and calf circumferences …
Table 5: More prediction samples from SapBert and BioCoM. Each example shows the [predicted concept] and the nearest neighbors. Entity mentions and the cue phrases are written in bold and italic forms, respectively.