We introduce BioCoM, a contrastive learning framework for biomedical entity linking that uses only two resources: a small-sized dictionary and a large number of raw biomedical articles. Specifically, we build training instances from raw PubMed articles by dictionary matching and use them to train a context-aware entity linking model with contrastive learning. At inference time, we predict the normalized biomedical entity through a nearest-neighbor search. Results show that BioCoM substantially outperforms state-of-the-art models, especially in low-resource settings, by effectively using the context of the entities.
Biomedical entity normalization, often referred to as entity linking in general domains, is the task of mapping entity mentions to unified concepts in a biomedical knowledge graph. This is a useful preprocessing step for many downstream tasks (e.g., information extraction Lee et al. (2016) and relation extraction Xu et al. (2016)). The task is challenging because similar-looking words can have different meanings and the same concept can have different surface forms. For example, while “TDP-43 proteinopathy” and “TdP” appear similar, they refer to different concepts: “TDP-43 proteinopathy” is a generic term for diseases caused by abnormality of the “TDP-43” protein in the nervous system (e.g., amyotrophic lateral sclerosis), whereas “TdP” is a type of arrhythmia.
The standard approach to normalizing biomedical entities links each entity mention to the most similar synonym (and the corresponding concept) defined in a dictionary, where similarity is computed between representations of entities and synonyms. Various approaches have been introduced to obtain these representations, such as supervised learning on large amounts of labeled data Mondal et al. (2019); Sung et al. (2020); Schumacher et al. (2020); Ujiie et al. (2021) and self-supervised learning on a large dictionary Liu et al. (2021). Although these methods have achieved high performance, they rely on large-scale, difficult-to-obtain resources (e.g., manually annotated datasets or a large-sized dictionary).
This study aims to develop a biomedical entity normalization method for diseases that is effective even in limited-resource scenarios. Specifically, our work rests on the following assumptions: no manually annotated data is available, and we have access only to a small-sized dictionary (i.e., a dictionary in which the number of synonyms per concept is small). These are practical assumptions in medical domain studies, as biomedical concept annotation is costly because it requires domain knowledge. Diseases are among the most important biomedical entities; therefore, this study, like most studies in the biomedical text processing community Leaman et al. (2013); Lou et al. (2017), focuses on the normalization of disease mentions.
This study proposes BioCoM, a contrastive learning framework for context-aware biomedical entity normalization. BioCoM uses only a small-sized dictionary and a large number of raw biomedical articles. Figure 1 illustrates the overview of BioCoM. To build training instances that capture the context of entities, we construct a corpus of entity-linked sentences, i.e., a set of sentences that contain synonyms in the dictionary, from raw biomedical articles. We show that the context obtained from this corpus strongly benefits the entity-linking model: BioCoM can obtain cues for linking from the words surrounding the entities, such as synonyms that are not in the dictionary but are present in the sentence. Thus, it works well even when the dictionary is small.
The contributions of this study are as follows: (i) a framework for biomedical entity normalization that requires only a small-sized dictionary and a large number of raw biomedical articles; (ii) an evaluation in limited-resource settings without manually annotated datasets or other large external resources; and (iii) state-of-the-art performance on three biomedical entity linking datasets, with particularly significant gains in limited-resource scenarios. The code is available at https://github.com/ujiuji1259/biocom.
Distributed representations of entities play an important role in biomedical entity normalization tasks Mondal et al. (2019); Li et al. (2017); Schumacher et al. (2020). Recently, bidirectional encoder representations from transformers (BERT) models Devlin et al. (2019) pre-trained on large biomedical corpora have shown their effectiveness in this task. For example, BioSyn Sung et al. (2020), which uses BioBERT Lee et al. (2020) to obtain entity representations and fine-tunes them using synonym marginalization, has outperformed many existing methods. However, this approach requires a large annotated dataset for training.
In contrast, SapBert Liu et al. (2021) uses only a large biomedical knowledge graph, rather than an annotated dataset, to fine-tune the entity representations and yields state-of-the-art performance on the biomedical entity normalization task. However, SapBert’s performance is highly dependent on the size of the knowledge graph, and we have experimentally confirmed that its normalization performance degrades significantly when available resources are limited.
Our framework, BioCoM, is in line with SapBert in that it does not use an annotated training dataset. However, BioCoM additionally considers the context of entities using contextual entity representations derived from entity-linked sentences.
We propose a contrastive learning framework called BioCoM for context-aware biomedical entity normalization that requires neither manually annotated datasets nor large-scale dictionaries. “Context-aware” indicates that BioCoM takes both the entity mention and the entire sentence as input. Formally, given a sentence and its $i$-th entity mention $m_i$, we predict the corresponding concept $c_i \in C$, where $C$ is the set of concepts in the dictionary. In this framework, input entity mentions and synonyms are encoded into contextual representations, which are learned from a corpus of entity-linked sentences through contrastive learning He et al. (2020); Chen et al. (2020). At inference time, the input entity is normalized by context matching against all the entities in the corpus. We first describe the corpus construction process, followed by the training procedure and inference.
To obtain the context of the synonyms in the dictionary, we require a corpus of entity-linked sentences, i.e., sentences in which the entities (here, synonyms) are linked to their corresponding concepts. This corpus is built from raw biomedical articles by dictionary matching against the synonyms (Figure 1). We assume access to a database $D$ whose entries $(m_{ij}, c_{ij})$ denote the $j$-th entity mention and concept pair in the $i$-th sentence. Please refer to Section 4.1 for details of the corpus used in our experiments.
To conduct context matching, the entities in the database should be embedded into a feature space in which entities with the same concept lie close to each other. Hence, we adopt contrastive learning, a framework for learning representations from similar/dissimilar pairs. In this study, the positive (similar) and negative (dissimilar) pairs come from the mini-batch of sentences: given an entity in the mini-batch, entities with the same concept within the mini-batch are treated as positive samples, while all others are negative samples.
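The in-batch pair construction can be sketched as follows (a minimal NumPy illustration under our own naming, not code from the released repository): given the concept labels of the entities in a mini-batch, we derive boolean masks marking the positive and negative pairs.

```python
import numpy as np

def in_batch_pairs(concepts):
    """Build positive/negative pair masks from in-batch concept labels.

    concepts : length-n list of concept IDs for the entities in a mini-batch.
    Returns boolean (n, n) matrices `pos` and `neg`: entities sharing a
    concept are positives (excluding self-pairs); all other pairs are
    negatives.
    """
    labels = np.asarray(concepts)
    # same[i, j] is True when entities i and j carry the same concept.
    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)  # drop self-pairs
    neg = ~same
    return pos, neg
```

These masks directly determine which similarity-matrix entries enter the positive and negative terms of the loss below.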
Regarding the loss function, we use the Multi-Similarity loss Wang et al. (2019), a metric-learning loss that considers relative similarities between positive and negative pairs. Let us denote the set of entities in the mini-batch by $B$, and the sets of positive and negative samples for entity $i$ by $P_i$ and $N_i$, respectively. We define the cosine similarity of two entities $i$ and $j$ as $S_{ij}$, resulting in a similarity matrix $S$. Based on $S$, $P_i$, and $N_i$, the training objective is:

$$\mathcal{L} = \frac{1}{|B|} \sum_{i \in B} \left[ \frac{1}{\alpha} \log\!\left(1 + \sum_{j \in P_i} e^{-\alpha (S_{ij} - \lambda)}\right) + \frac{1}{\beta} \log\!\left(1 + \sum_{j \in N_i} e^{\beta (S_{ij} - \lambda)}\right) \right]$$

where $\alpha$ and $\beta$ are the temperature scales and $\lambda$ is the offset applied on $S$. For pair mining, we follow the original paper Wang et al. (2019).
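The Multi-Similarity loss can be sketched as follows (a minimal NumPy version that omits the pair-mining step of Wang et al. (2019); names and defaults are illustrative, not from the released code):

```python
import numpy as np

def multi_similarity_loss(S, labels, alpha=2.0, beta=50.0, lam=1.0):
    """Multi-Similarity loss (Wang et al., 2019) over one mini-batch.

    S      : (n, n) cosine-similarity matrix of the batch entities.
    labels : length-n concept IDs; entities sharing an ID are positives.
    alpha, beta : temperature scales; lam : offset applied on S.
    """
    n = S.shape[0]
    total = 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [j for j in range(n) if labels[j] != labels[i]]
        # Positive term pulls same-concept entities above the offset lam;
        # negative term pushes other entities below it.
        p = np.log1p(np.sum(np.exp(-alpha * (S[i, pos] - lam)))) / alpha if pos else 0.0
        q = np.log1p(np.sum(np.exp(beta * (S[i, neg] - lam)))) / beta if neg else 0.0
        total += p + q
    return total / n
```

A batch whose positives are already similar and negatives dissimilar yields a lower loss than the reverse, which is the gradient signal that clusters same-concept entities.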
Our approach identifies a concept for an entity mention by context matching against the database. Specifically, the problem is defined as a $k$-nearest neighbor search Fix and Hodges (1989) over the contextual entity representations. The $k$ neighbors $N_k$ are retrieved from the database based on cosine similarity with the contextual representation of the target entity mention. During inference, we predict the concept that occurs most frequently among $N_k$.
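This inference step can be sketched as follows (a brute-force NumPy sketch with illustrative names; over a database of this scale one would presumably use an approximate nearest-neighbor index instead):

```python
import numpy as np
from collections import Counter

def predict_concept(query_vec, db_vecs, db_concepts, k=10):
    """Predict a concept by k-NN context matching over the entity database.

    query_vec   : (d,) contextual representation of the target mention.
    db_vecs     : (n, d) representations of the entities in the database.
    db_concepts : length-n list of concept IDs for the database entities.
    Returns the concept occurring most frequently among the k nearest
    neighbors under cosine similarity.
    """
    # Cosine similarity equals the dot product of L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q
    topk = np.argsort(-sims)[:k]  # indices of the k most similar entities
    return Counter(db_concepts[j] for j in topk).most_common(1)[0][0]
```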
Herein, we use MEDIC Davis et al. (2012) as the dictionary. MEDIC lists 13,063 diseases from OMIM and MeSH and has over 70,000 synonyms linked to these concepts. To limit the dictionary’s size, we randomly sampled half of the synonyms for each concept; the resulting dictionary used in our experiments has three synonyms per concept on average.
PubMed abstracts were used to construct the entity-linked sentences. Articles that appear in the test datasets were filtered out, yielding approximately 100M sentences. The synonyms in MEDIC that appear in these sentences were then linked to the corresponding concepts; for any two overlapping synonyms in a sentence, we chose the longest match.
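The longest-match linking can be sketched as follows (a simplified token-level version under our own naming; the released pipeline may differ in tokenization and matching details):

```python
def link_synonyms(tokens, synonym_to_concept):
    """Greedy longest-match dictionary linking over a token sequence.

    tokens             : list of lowercased tokens of one sentence.
    synonym_to_concept : dict mapping space-joined synonym strings to
                         concept IDs.
    Returns a list of (start, end, concept) spans; when two synonyms
    overlap, the longer match wins, as in our corpus construction.
    """
    max_len = max((len(s.split()) for s in synonym_to_concept), default=0)
    spans, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so longer synonyms shadow
        # shorter ones starting at the same position.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + n])
            if cand in synonym_to_concept:
                match = (i, i + n, synonym_to_concept[cand])
                break
        if match:
            spans.append(match)
            i = match[1]  # resume after the matched span
        else:
            i += 1
    return spans
```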
We evaluated BioCoM on three datasets for the biomedical entity normalization task: the NCBI disease corpus (NCBID) Doğan et al. (2014), BioCreative V Chemical Disease Relation (BC5CDR) Li et al. (2016), and MedMentions Mohan and Li (2018). Following previous studies D’Souza and Ng (2015); Mondal et al. (2019), we used accuracy as the evaluation metric.
Because BC5CDR and MedMentions contain mentions whose concepts are not in MEDIC, these mentions were filtered out during the evaluation. We refer to the filtered datasets as “BC5CDR-d” and “MedMentions-d,” respectively.
The contextual representation of each entity was obtained from PubMedBERT Gu et al. (2020), which was trained on a large number of PubMed abstracts using BERT Devlin et al. (2019). Specifically, the entity tokens were wrapped with special tokens, [ENT] and [/ENT], marking the beginning and end of the mention, and the modified sequences were fed to PubMedBERT. We use the representation of [ENT] as the contextual representation Soares et al. (2019) of the entity that follows it. A diagram of this entity representation can be found in the Supplementary Materials.
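The marker insertion can be sketched as follows (a simplified token-level illustration; in practice [ENT] and [/ENT] would also be registered as special tokens in the tokenizer vocabulary so they are not split):

```python
def wrap_mention(tokens, start, end):
    """Wrap the mention span tokens[start:end] with [ENT] / [/ENT] markers.

    Returns the modified token list and the index of the [ENT] token,
    whose encoder output serves as the contextual entity representation.
    """
    wrapped = tokens[:start] + ["[ENT]"] + tokens[start:end] + ["[/ENT]"] + tokens[end:]
    return wrapped, start  # [ENT] lands exactly at position `start`
```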
The number of nearest neighbors $k$ was set to the value that yielded the highest performance on the development set of MedMentions-d.
Label: D014178 | Input: … phenotype B-I for B-ALL, phenotype … the presence of translocation t(4;11)
SapBert: [D054868] partial 11q monosomy syndrome
BioCoM: [D014178] … involved in genetic translocations characteristic of b-cell acute lymphoblastic leukemia (B-cell ALL).

Label: D002292 | Input: High occurrence of non-clear cell renal cell carcinoma
SapBert: [D002292] renal cell carcinomas
BioCoM: [D002289] … used to treat renal cell carcinoma, non-small-cell lung cancer and colon cancer …
We compared the performance of BioCoM with three baseline models: tf-idf, PubMedBERT Gu et al. (2020), and SapBert Liu et al. (2021). None of these models requires manually annotated datasets, and all use only entity mentions as input. Please refer to the Supplementary Materials for the details of these models.
Table 1 shows that BioCoM obtains significant improvements over the baseline models for all the datasets. This demonstrates that the context-matching approach – based on the contextual entity representations learned from the automatically constructed corpus – is effective for biomedical entity normalization.
To determine the effect of the number of synonyms, we performed experiments with different numbers of synonyms in the dictionary: 39,959, 86,205, 151,559, and 218,325. To augment the synonyms, we added ones from UMLS Bodenreider (2004), a large biomedical knowledge graph, to those from MEDIC.
Figure 2 illustrates the accuracy on MedMentions-d for each dictionary size. BioCoM yields especially large gains when the number of synonyms in the available dictionary is small, while remaining superior with the fully expanded dictionary.
Here, we qualitatively analyze BioCoM to understand its behavior. Table 2 shows the predicted concepts and the nearest neighbors for input entity mentions in MedMentions-d.
In the first example, BioCoM predicts the concept correctly, while SapBert fails to find the correct answer. This is because BioCoM utilizes the context as a cue to normalize the entity mention (“B-ALL” in the input sentence and “B-cell ALL” in the nearest-neighbor sentence). This result demonstrates the effectiveness of the context-matching approach and shows that BioCoM successfully learns contextual entity representations by focusing on cues within the sentences.
The second example shows an erroneous prediction by BioCoM owing to an annotation mistake in the database: the entity mention “carcinoma, non-small-cell lung” is wrongly recognized in place of “renal cell carcinomas” in the nearest-neighbor sentence. This happens because BioCoM uses the longest-match algorithm to recognize entity mentions, and “carcinoma, non-small-cell lung” yields the longer match. The result is a serious mismatch: BioCoM successfully focuses on “renal cell carcinomas” to encode the entity, but the linked mention is “carcinoma, non-small-cell lung.” This mismatch could be addressed by adopting a named entity recognition system, if available.
We introduced BioCoM, a contrastive learning framework for context-aware biomedical entity normalization without manually annotated datasets or large-sized dictionaries. Our experiments on three datasets showed that our model significantly outperformed state-of-the-art models, especially when the available resources are limited.
For every mini-batch, we randomly chose a fixed number of concepts and, for each concept, randomly sampled two sentences containing one of its synonyms. This guarantees that at least one positive pair exists for each concept within the mini-batch. We chose 16 concepts per mini-batch, thus obtaining 32 unique sentences per mini-batch.
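This sampling scheme can be sketched as follows (an illustrative sketch; `concept_to_sentences` is a hypothetical name for an index from each concept to the entity-linked sentences containing one of its synonyms):

```python
import random

def sample_batch(concept_to_sentences, n_concepts=16, per_concept=2, rng=None):
    """Sample a mini-batch that guarantees a positive pair per concept.

    concept_to_sentences : dict mapping a concept ID to the entity-linked
                           sentences containing one of its synonyms.
    Picks `n_concepts` concepts having at least `per_concept` sentences and
    draws `per_concept` sentences from each, so every sampled entity has at
    least one in-batch positive.
    """
    rng = rng or random.Random()
    eligible = [c for c, sents in concept_to_sentences.items()
                if len(sents) >= per_concept]
    batch = []
    for c in rng.sample(eligible, n_concepts):
        for sent in rng.sample(concept_to_sentences[c], per_concept):
            batch.append((c, sent))
    return batch  # n_concepts * per_concept (concept, sentence) pairs
```

With the defaults, each call returns 16 concepts times 2 sentences, i.e., the 32 sentences per mini-batch used in our experiments.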
We used three datasets in our experiments: the NCBI disease corpus (NCBID) Doğan et al. (2014), the BioCreative V Chemical Disease Relation (BC5CDR) task corpus Li et al. (2016), and MedMentions Mohan and Li (2018). We list the number of documents and disease mentions in each dataset in Table 3.
dataset | # of documents | # of mentions
The NCBI disease corpus consists of 793 PubMed titles and abstracts, each of which contains manually annotated disease mentions. The documents are split into training (593), development (100), and test (100) sets. Each disease mention is mapped to a CUI contained in the MEDIC dictionary Davis et al. (2012).
BC5CDR is a chemical-induced disease relation extraction task comprising 1,500 PubMed abstracts, equally split into training, development, and test sets. BC5CDR provides manually annotated disease and chemical entities, which are mapped to the MEDIC and Comparative Toxicogenomics Database (CTD) chemical dictionaries, respectively. We used only the disease entities linked to concepts in MEDIC.
MedMentions is a large-scale annotated resource for recognizing biomedical concepts. It provides more than 4,000 PubMed abstracts and over 350,000 mentions linked to concepts in the Unified Medical Language System (UMLS). For our experiments, we converted the UMLS concept unique identifiers to MEDIC entities. As UMLS covers many varieties of medical concepts, such as drugs and genes, we filtered out the entities that do not refer to diseases. We refer to this subset as MedMentions-d.
At training time, we randomly selected 16 concepts and extracted two sentences for each concept (i.e., 32 sentences per mini-batch). Models were trained with the temperature scales $\alpha$ set to 2 and $\beta$ set to 50, the offset $\lambda$ set to 1, and the learning rate set to 1e-5.
The baseline entity representations used herein are as follows:
Tf-idf: tf-idf over character-level uni-grams and bi-grams. This achieves competitive performance, as reported in previous work Sung et al. (2020).
PubMedBERT: The representations of [CLS] from PubMedBERT without fine-tuning.
SapBert: the representations of [CLS] from PubMedBERT trained similarly to Liu et al. (2021). Note that the dictionary used for training and inference is the same size as that used for our model, and thus differs from the original paper.
In practice, disease mentions must first be identified somehow (e.g., by a named entity recognition system). As the database constructed with our method is linked only to the concepts in the dictionary, a gap exists between the input sentence and the database. This may cause unexpected attention to the [ENT] tags of mentions that are not included in the dictionary. One possible way to alleviate this problem is to link only one entity per sentence: if a sentence has multiple entities, we create multiple copies of the sentence, each linking a single entity. We trained and evaluated the model in this setting on MedMentions.
Table 4 shows the results. The model using sentences in which only one entity is linked performs poorly. One possible reason is the small number of entities in each mini-batch; thus, the number of negative samples is insufficient for training.
When a large number of documents is used for the nearest-neighbor search, using all sentences for normalization is memory-intensive. We therefore investigated how memory usage at prediction time can be reduced. One possible way is random sampling of the sentences in the database. We evaluated the model with 1, 5, 10, and 20 sampled sentences per dictionary synonym on MedMentions-d.
Figure 3 shows the results. Surprisingly, we found that the accuracy of the model with random sampling was higher than that of the model using all the sentences. One possible reason is that dictionary names consisting of common words (e.g., cancer and tumor) are linked in many sentences, and these sentences introduce harmful noise for our model. Regarding the number of sentences, although the model with 20 sentences per concept performed best, the model can normalize mentions with even fewer sentences. This implies that our model learns robust contextual entity representations.
Label: D000077273 | Input: … histopathology consisted of papillary thyroid carcinoma (PTC) (n = 91, 86.7%).
BioCoM: [D000077273] … for detecting cervical lymph node (LN) metastasis in papillary thyroid carcinoma (PTC).

Label: D015430 | Input: … generations on offspring metabolic traits, including weight and fat gain, …
SapBert: [D015430] gain, weight
BioCoM: [D001835] … more elderly and had lower weights, body mass indices and arm and calf circumferences …