
SMedBERT: A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text Mining

Recently, the performance of Pre-trained Language Models (PLMs) has been significantly improved by injecting knowledge facts to enhance their language understanding abilities. For medical domains, background knowledge sources are especially useful, because the massive number of medical terms and their complicated relations are difficult to understand from text alone. In this work, we introduce SMedBERT, a medical PLM trained on large-scale medical corpora, incorporating deep structured semantic knowledge from the neighbors of linked entities. In SMedBERT, mention-neighbor hybrid attention is proposed to learn heterogeneous-entity information, which infuses the semantic representations of entity types into the homogeneous neighboring entity structure. Apart from integrating knowledge as external features, we propose to employ the neighbors of linked entities in the knowledge graph as additional global contexts of text mentions, allowing them to communicate via shared neighbors and thus enriching their semantic representations. Experiments demonstrate that SMedBERT significantly outperforms strong baselines in various knowledge-intensive Chinese medical tasks. It also improves the performance of other tasks such as question answering, question matching and natural language inference.





1 Introduction

Pre-trained Language Models (PLMs) learn effective context representations with self-supervised tasks and have achieved strong results in various NLP tasks (DBLP:conf/aaai/WangKMYTACFMMW19; DBLP:conf/acl/NanGSL20; DBLP:conf/acl/LiuGFYCJLD20). In addition, Knowledge-Enhanced PLMs (KEPLMs) (DBLP:conf/acl/ZhangHLJSL19; DBLP:conf/aaai/LiuZ0WJD020; DBLP:journals/corr/abs-1911-06136) further benefit language understanding by grounding these PLMs with high-quality, human-curated knowledge facts, which are difficult to learn from raw texts.

In the literature, a majority of KEPLMs (DBLP:journals/corr/abs-2009-02835; DBLP:conf/aaai/HayashiHXN20; DBLP:conf/coling/SunSQGHHZ20) inject information of entities corresponding to mention-spans from Knowledge Graphs (KGs) into contextual representations. However, those KEPLMs only utilize the linked entity in the KG as auxiliary information and pay little attention to the neighboring structured semantics information of the entity linked with text mentions. In the medical context, there exists complicated domain knowledge, such as relations and medical facts among medical terms (rotmensch2017learning; DBLP:journals/artmed/LiWYWLJSTCWL20), which is difficult to model using previous approaches. To address this issue, we consider leveraging structured semantics knowledge in medical KGs from two aspects. (1) Rich semantic information from the neighboring structures of linked entities, such as entity types and relations, is highly useful for medical text understanding. As in Figure 1, “新型冠状病毒” (novel coronavirus) can be the cause of many diseases, such as “肺炎” (pneumonia) and “呼吸综合征” (respiratory syndrome). (Footnote 2: Although we focus on Chinese medical PLMs here, the proposed method can be easily adapted to other languages, which is beyond the scope of this work.) (2) Additionally, we leverage the neighbors of linked entities as global “contexts” to complement the plain-text contexts used in (DBLP:journals/corr/abs-1301-3781; DBLP:conf/emnlp/PenningtonSM14). The structured knowledge contained in neighbouring entities can act as a “knowledge bridge” between mention-spans, facilitating the interaction of different mention representations. Hence, PLMs can learn better representations for rare medical terms.

In this paper, we introduce SMedBERT, a KEPLM pre-trained over large-scale medical corpora and medical KGs. To the best of our knowledge, SMedBERT is the first PLM with structured semantics knowledge injected in the medical domain. Specifically, the contributions of SMedBERT mainly include two modules:

Mention-neighbor Hybrid Attention: We fuse the embeddings of the node and type of linked-entity neighbors into contextual target mention representations. The type-level and node-level attentions help to learn the importance of entity types and the neighbors of linked-entity, respectively, in order to reduce the knowledge noise injected into the model. The type-level attention transforms the homogeneous node-level attention into a heterogeneous learning process of neighboring entities.

Mention-neighbor Context Modeling: We propose two novel self-supervised learning tasks for promoting interaction between a mention-span and its corresponding global context, namely masked neighbor modeling and masked mention modeling. The former enriches the representations of the “context” neighboring entities based on the well-trained “target word” mention-span, while the latter gathers that information back from the neighboring entities to the masked target, such as a low-frequency mention-span that is poorly represented.


In the experiments, we compare SMedBERT against various strong baselines, including mainstream KEPLMs pre-trained over our medical resources. The underlying medical NLP tasks include: named entity recognition, relation extraction, question answering, question matching and natural language inference. The results show that SMedBERT consistently outperforms all the baselines on these tasks.

2 Related Work

PLMs in the Open Domain. PLMs have gained much attention recently, proving successful for boosting the performance of various NLP tasks (DBLP:journals/corr/abs-2003-08271). Early works on PLMs focus on feature-based approaches to transform words into distributed representations (DBLP:conf/icml/CollobertW08; DBLP:conf/nips/MikolovSCCD13; DBLP:conf/emnlp/PenningtonSM14; DBLP:conf/naacl/PetersNIGCLZ18). BERT (DBLP:conf/naacl/DevlinCLT19) (as well as its robustly optimized version RoBERTa (DBLP:journals/corr/abs-1907-11692)) employs bidirectional transformer encoders (DBLP:conf/nips/VaswaniSPUJGKP17) and self-supervised tasks to generate context-aware token representations. Further performance improvements are mostly based on the following three types of techniques: self-supervised tasks (DBLP:journals/tacl/JoshiCLWZL20), transformer encoder architectures (DBLP:conf/nips/YangDYCSL19) and multi-task learning (DBLP:conf/acl/LiuHCG19).

Knowledge-Enhanced PLMs. As existing BERT-like models only learn knowledge from plain corpora, various works have investigated how to incorporate knowledge facts to enhance the language understanding abilities of PLMs. KEPLMs are mainly divided into the following three types. (1) Knowledge-enhanced by Entity Embedding: ERNIE-THU (DBLP:conf/acl/ZhangHLJSL19) and KnowBERT (DBLP:conf/emnlp/PetersNLSJSS19) inject linked-entity as heterogeneous features learned by KG embedding algorithms such as TransE (DBLP:conf/nips/BordesUGWY13). (2) Knowledge-enhanced by Entity Description: E-BERT (DBLP:journals/corr/abs-2009-02835) and KEPLER (DBLP:journals/corr/abs-1911-06136) add extra description text of entities to enhance semantic representation. (3) Knowledge-enhanced by Triplet Sentence: K-BERT (DBLP:conf/aaai/LiuZ0WJD020) and CoLAKE (DBLP:conf/coling/SunSQGHHZ20) convert triplets into sentences and insert them into the training corpora without pre-trained embedding. Previous studies on KG embedding (DBLP:conf/conll/NguyenSQJ16; DBLP:conf/esws/SchlichtkrullKB18) have shown that utilizing the surrounding facts of entity can obtain more informative embedding, which is the focus of our work.

Figure 2: Model overview of SMedBERT. The left part is our model architecture and the right part is the details of our model including hybrid attention network and mention-neighbor context modeling pre-training tasks.

PLMs in the Medical Domain. PLMs in the medical domain can be generally divided into three categories. (1) BioBERT (DBLP:journals/bioinformatics/LeeYKKKSK20), BlueBERT (DBLP:conf/bionlp/PengYL19), SciBERT (DBLP:conf/emnlp/BeltagyLC19) and ClinicalBERT (DBLP:journals/corr/abs-1904-05342) apply continual learning on medical domain texts, such as PubMed abstracts, PMC full-text articles and MIMIC-III clinical notes. (2) PubMedBERT (DBLP:journals/corr/abs-2007-15779) learns weights from scratch using PubMed data to obtain an in-domain vocabulary, alleviating the out-of-vocabulary (OOV) problem. This training paradigm needs the support of large-scale domain data and resources. (3) Some other PLMs use domain self-supervised tasks for pre-training. For example, MC-BERT (DBLP:journals/corr/abs-2008-10813) masks Chinese medical entities and phrases to learn complex structures and concepts. DiseaseBERT (DBLP:conf/emnlp/HeZZCC20) leverages medical terms and their categories as the labels to pre-train the model. In this paper, we utilize both domain corpora and neighboring entity triplets of mentions to enhance the learning of medical language representations.

3 The SMedBERT Model

3.1 Notations and Model Overview

In the PLM, we denote the hidden features of the tokens as $\{h_1, h_2, \ldots, h_N\}$, where $N$ is the maximum input sequence length, and the total number of pre-training samples as $M$. Let $E$ be the set of mention-spans $e_m$ in the training corpora. Furthermore, the medical KG consists of the entity set $\mathcal{E}$ and the relation set $\mathcal{R}$. The triplet set is $S = \{(h, r, t) \mid h \in \mathcal{E}, r \in \mathcal{R}, t \in \mathcal{E}\}$, where $h$ is the head entity with relation $r$ to the tail entity $t$. The embeddings of entities and relations trained on the KG by TransR (DBLP:conf/aaai/LinLSLZ15) are represented as $\Gamma_{ent}$ and $\Gamma_{rel}$, respectively. The neighboring entity set recalled from the KG by $e_m$ is denoted as $\mathcal{N}_{e_m} = \{e_m^1, e_m^2, \ldots, e_m^K\}$, where $K$ is the threshold of our PEPR algorithm. We denote the number of entities in the KG as $Z$. The dimensions of the hidden representation in the PLM and the KG embeddings are $d_1$ and $d_2$, respectively.

The main architecture of our model is shown in Figure 2. SMedBERT mainly includes three components: (1) Top-K entity sorting determines which K neighbouring entities to use for each mention. (2) Mention-neighbor hybrid attention infuses the structured semantics knowledge into the encoder layers, and includes type attention, node attention and a gated position infusion module. (3) Mention-neighbor context modeling, which includes masked neighbor modeling and masked mention modeling, encourages mentions to leverage and interact with neighbouring entities.

3.2 Top-K Entity Sorting

Previous research shows that simple neighboring entity expansion may induce knowledge noise during PLM training (DBLP:conf/aaai/WangKMYTACFMMW19). In order to recall the most important neighboring entity set from the KG for each mention, we extend the Personalized PageRank (PPR) (ilprints422) algorithm to filter out trivial entities. (Footnote 3: We name our algorithm Personalized Entity PageRank, abbreviated as PEPR.) Recall that the iterative process in PPR is

$$V_i = (1 - \alpha) A \cdot V_{i-1} + \alpha P$$

where $A$ is the normalized adjacency matrix, $\alpha$ is the damping factor, $P$ is the uniformly distributed jump probability vector, and $V$ is the iterative score vector for each entity.

PEPR specifically focuses on learning the weight for the target mention span in each iteration. It assigns the span a higher jump probability of 1 in $P$, with a lower probability for the remaining entities. It also uses the entity frequency to initialize the score vector $V$:

$$V_0(e) = \frac{t_e}{T}$$

where $T$ is the sum of frequencies of all entities and $t_e$ is the frequency of entity $e$ in the corpora. After sorting, we select the top-$K$ entity set.
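The ranking step described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `pepr_top_k`, the toy graph, the damping value `alpha=0.15`, the iteration count, and the choice to put all jump probability on the target mention are assumptions made for illustration only.

```python
def pepr_top_k(adj, freq, target, k, alpha=0.15, iters=50):
    """Rank entities around `target` with a personalized PageRank-style iteration.

    adj[u]  : list of neighbours of entity u (every neighbour is also a key of adj)
    freq[u] : corpus frequency of u, used only to initialise the scores
    alpha, iters: illustrative values, not taken from the paper
    """
    nodes = sorted(adj)
    total = float(sum(freq[u] for u in nodes))
    # Frequency-based initial score vector.
    score = {u: freq[u] / total for u in nodes}
    # Assumption: all jump probability is placed on the target mention.
    jump = {u: (1.0 if u == target else 0.0) for u in nodes}
    for _ in range(iters):
        new = {u: alpha * jump[u] for u in nodes}
        for u in nodes:
            if adj[u]:
                # Spread the walking mass of u evenly over its neighbours.
                share = (1.0 - alpha) * score[u] / len(adj[u])
                for v in adj[u]:
                    new[v] += share
        score = new
    ranked = sorted((u for u in nodes if u != target),
                    key=lambda u: (-score[u], u))
    return ranked[:k]


toy_graph = {
    "virus": ["pneumonia", "syndrome"],
    "pneumonia": ["virus", "fever"],
    "syndrome": ["virus"],
    "fever": ["pneumonia"],
    "unrelated_a": ["unrelated_b"],
    "unrelated_b": ["unrelated_a"],
}
toy_freq = {name: 1 for name in toy_graph}
print(pepr_top_k(toy_graph, toy_freq, "virus", k=2))
```

In this toy graph the two entities directly linked to the target rank first, while the disconnected pair receives a score close to zero and is filtered out.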

3.3 Mention-neighbor Hybrid Attention

Besides the embeddings of neighboring entities, SMedBERT integrates the type information of medical entities to further enhance semantic representations of mention-span.

3.3.1 Neighboring Entity Type Attention

Different types of neighboring entities may have different impacts. Given a specific mention-span , we compute the neighboring entity type attention. Concretely, we calculate hidden representation of each entity type as . are neighboring entities of with the same type and .


where is the self-attentive pooling (DBLP:conf/iclr/LinFSYXZB17) to generate the mention-span representation and the is the hidden representation of tokens in mention-span trained by PLMs. is obtained by the non-linear activation function GELU (hendrycks2016gaussian) and the learnable projection matrix . is the LayerNorm function (DBLP:journals/corr/BaKH16). Then, we calculate each type attention weight using the type representation and the transformed mention-span representation :


where , and . Finally, the neighboring entity type attention weights are obtained by normalizing the attention score among all entity types .

3.3.2 Neighboring Entity Node Attention

Apart from entity type information, different neighboring entities also have different influences. Specifically, we devise the neighboring entity node attention to capture the different semantic influences from neighboring entities to the target mention span and reduce the effect of noises. We calculate the entity node attention using the mention-span representation and neighboring entities representation with entity type as:


where and are the attention weight matrices.

The representations of all neighboring entities in are aggregated to :


where , , . and are the bias vectors. is the mention-neighbor representation from the hybrid attention module.
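To make the two attention levels concrete, here is a deliberately simplified sketch. It replaces the learnable projections with plain dot products, sums neighbour embeddings to form each type representation, and scales each neighbour's score by its type weight before normalizing. All of these choices, and the name `hybrid_attention`, are assumptions for illustration rather than the paper's exact formulation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hybrid_attention(mention, neighbours):
    """mention: vector; neighbours: list of (entity_type, vector).

    Returns (aggregated_vector, node_weights).
    """
    dim = len(mention)
    # Type level: one representation per entity type (sum of its neighbours).
    type_repr = {}
    for etype, vec in neighbours:
        acc = type_repr.setdefault(etype, [0.0] * dim)
        for i, x in enumerate(vec):
            acc[i] += x
    types = sorted(type_repr)
    type_scores = [dot(mention, type_repr[t]) for t in types]
    type_w = dict(zip(types, softmax(type_scores)))
    # Node level: each neighbour's score is scaled by the weight of its type.
    scores = [type_w[t] * dot(mention, vec) / math.sqrt(dim)
              for t, vec in neighbours]
    node_w = softmax(scores)
    # Aggregate the neighbours into one mention-neighbor vector.
    agg = [sum(w * vec[i] for w, (_, vec) in zip(node_w, neighbours))
           for i in range(dim)]
    return agg, node_w


mention_vec = [1.0, 0.0]
toy_neighbours = [
    ("disease", [1.0, 0.0]),
    ("disease", [0.5, 0.5]),
    ("drug", [-1.0, 0.0]),
]
agg_vec, node_weights = hybrid_attention(mention_vec, toy_neighbours)
print(agg_vec, node_weights)
```

The point of the sketch is the coupling: a neighbour of a low-weight type is down-weighted even before node-level normalization, which is one way to read the "heterogeneous" behaviour described in the text.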

3.3.3 Gated Position Infusion

Knowledge-injected representations may divert texts from their original meanings. We further reduce knowledge noise via gated position infusion:


where , , , . is the span-level infusion representation. “” means concatenation operation. is the final knowledge-injected representation for mention . We generate the output token representation by the following (Footnote 4: We find that restricting the knowledge infusion position to tokens is helpful to improve performance.):


where , . . “” means element-wise multiplication.

3.4 Mention-neighbor Context Modeling

To fully exploit the structured semantics knowledge in KG, we further introduce two novel self-supervised pre-training tasks, namely Masked Neighbor Modeling (MNeM) and Masked Mention Modeling (MMeM).

3.4.1 Masked Neighbor Modeling

Formally, let be the relation between the mention-span and a neighboring entity :


where is the mention-span hidden features based on the tokens hidden representation . is the relation representation and is a learnable projection matrix. The goal of MNeM is to leverage the structured semantics in surrounding entities while preserving the knowledge of relations between entities. Consider the objective function of skip-gram with negative sampling (SGNS) (DBLP:journals/corr/abs-1301-3781) and the score function of TransR (DBLP:conf/aaai/LinLSLZ15):


where the in is the target word of context . is the compatibility function measuring how well the target word is fitted into the context. Inspired by SGNS, following the general energy-based framework (lecun2006tutorial), we treat mention-spans in corpora as “target words”, and neighbors of corresponding entities in KG as “contexts” to provide additional global contexts. We employ the Sampled-Softmax (DBLP:conf/acl/JeanCMB15) as the criterion for the mention-span :


where denotes the triplet , . is the negative triplets , and is negative entity sampled with detailed in Appendix B. To keep the knowledge of relations between entities, we define the compatibility function as:


where is a scale factor. Assuming the norms of both and are 1, we have:


which indicates that the proposed is equivalent to . Because needs to be calculated for each , the computation of the score function is costly. Hence, we transform part of the formula as follows:


In this way, we eliminate the computation of transforming each . Finally, to compensate for the offset introduced by the negative sampling function (DBLP:conf/acl/JeanCMB15), we complement as:


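The unit-norm step in this subsection rests on a standard identity: for two vectors of norm 1, the squared Euclidean distance equals 2 minus twice their dot product, so a distance-based score and a dot-product-based compatibility rank candidates identically. The snippet below is only a numeric check of that identity on arbitrary example vectors; the helper names are illustrative.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Two arbitrary vectors, scaled to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

lhs = squared_distance(a, b)   # distance-style score
rhs = 2.0 - 2.0 * dot(a, b)    # dot-product form
print(lhs, rhs)
```

Because the two quantities differ only by a constant and a negative factor, maximizing the dot product is the same as minimizing the distance under the unit-norm assumption.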
3.4.2 Masked Mention Modeling

In contrast to MNeM, MMeM transfers the semantic information in neighboring entities back to the masked mention .


where is the ground-truth representation of and . is the pre-trained embedding of BERT in our medical corpora. The mention-span representation obtained by our model is . For a sample , the loss of MMeM is calculated via Mean-Squared Error:


where is the set of mentions of sample .
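As a rough illustration of the MMeM criterion, the sketch below computes a mean-squared error between predicted and ground-truth mention vectors and averages it over the mentions of one sample. Averaging over both dimensions and mentions is an assumption, as are the function and variable names.

```python
def mmem_loss(predicted, target):
    """Mean-squared error over the mentions of one sample.

    predicted, target: dicts mapping a mention string to a vector.
    """
    per_mention = []
    for mention, truth in target.items():
        guess = predicted[mention]
        sq_err = sum((g - t) ** 2 for g, t in zip(guess, truth))
        per_mention.append(sq_err / len(truth))
    return sum(per_mention) / len(per_mention)


toy_target = {"肺炎": [1.0, 0.0], "发热": [0.0, 0.0]}
toy_predicted = {"肺炎": [1.0, 2.0], "发热": [0.0, 0.0]}
print(mmem_loss(toy_predicted, toy_target))  # 1.0
```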

3.5 Training Objective

In SMedBERT, the training objectives mainly consist of three parts, including the self-supervised loss proposed in previous works and the mention-neighbor context modeling losses proposed in our work. Our model can be applied to medical text pre-training directly in different languages as long as high-quality medical KGs can be obtained. The total loss is as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{EX} + \lambda_1 \mathcal{L}_{MNeM} + \lambda_2 \mathcal{L}_{MMeM}$$

where $\mathcal{L}_{EX}$ is the sum of the sentence-order prediction (SOP) (DBLP:conf/iclr/LanCGGSS20) and masked language modeling losses. $\lambda_1$ and $\lambda_2$ are the hyperparameters.

4 Experiments

4.1 Data Source

Pre-training Data. The pre-training corpora after pre-processing contain 5,937,695 text segments with 3,028,224,412 tokens (4.9 GB). The KG embeddings are trained by TransR (DBLP:conf/aaai/LinLSLZ15) on two trusted data sources, Symptom-In-Chinese from OpenKG and DXY-KG, containing 139,572 and 152,508 entities, respectively. The numbers of triplets in the two KGs are 1,007,818 and 3,764,711. The pre-training corpora and the KGs are further described in Appendix A.1.

Task Data. We use four large-scale datasets in ChineseBLUE (DBLP:journals/corr/abs-2008-10813), a benchmark of Chinese medical NLP tasks, to evaluate our model. Additionally, we test models on four datasets from real application scenarios provided by DXY company and CHIP, i.e., Named Entity Recognition (DXY-NER), Relation Extraction (DXY-RE, CHIP-RE) and Question Answering (WebMedQA (DBLP:journals/midm/HeFT19)). For other information on the downstream datasets, we refer readers to Appendix A.2.

4.2 Baselines

In this work, we compare SMedBERT with general PLMs, domain-specific PLMs and KEPLMs with knowledge embedding injected, pre-trained on our Chinese medical corpora:

General PLMs: We use three Chinese BERT-style models, namely BERT-base (DBLP:conf/naacl/DevlinCLT19), BERT-wwm (DBLP:journals/corr/abs-1906-08101) and RoBERTa (DBLP:journals/corr/abs-1907-11692). All the weights are initialized from (DBLP:conf/emnlp/CuiC000H20).

Domain-specific PLMs: As very few PLMs in the Chinese medical domain are available, we consider the following models. MC-BERT (DBLP:journals/corr/abs-2008-10813) is pre-trained over a Chinese medical corpus via masking tokens of different granularities. We also pre-train BERT using our corpora, denoted as BioBERT-zh.

KEPLMs: We employ two SOTA KEPLMs continually pre-trained on our medical corpora as our baseline models, including ERNIE-THU (DBLP:conf/acl/ZhangHLJSL19) and KnowBERT (DBLP:conf/emnlp/PetersNLSJSS19). For a fair comparison, KEPLMs that use additional resources other than the KG embedding are excluded (see Section 2), and all the baseline KEPLMs are injected with the same KG embedding.

The detailed parameter settings and training procedure are in Appendix B.

| Model | D1 | D2 | D3 |
|---|---|---|---|
| SGNS-char-med | 27.21% | 27.16% | 21.72% |
| SGNS-word-med | 24.64% | 24.95% | 20.37% |
| GLOVE-char-med | 27.24% | 27.12% | 21.91% |
| GLOVE-word-med | 24.41% | 23.89% | 20.56% |
| BERT-open | 29.79% | 29.41% | 21.83% |
| BERT-wwm-open | 29.75% | 29.55% | 21.97% |
| RoBERTa-open | 30.84% | 30.56% | 21.98% |
| MC-BERT | 30.63% | 30.34% | 22.65% |
| BioBERT-zh | 30.84% | 30.69% | 22.71% |
| ERNIE-med | 30.97% | 30.78% | 22.99% |
| KnowBERT-med | 30.95% | 30.77% | 23.07% |
| SMedBERT | 31.81% | 32.14% | 24.08% |

Table 1: Results of the unsupervised semantic similarity task. “med” refers to models continually pre-trained on medical corpora, and “open” means open-domain corpora. “char” and “word” refer to the token granularity of input samples.

| Model | cMedQANER Dev | cMedQANER Test | DXY-NER Dev | DXY-NER Test | NER Average (Test) | CHIP-RE Test | DXY-RE Dev | DXY-RE Test | RE Average (Test) |
|---|---|---|---|---|---|---|---|---|---|
| BERT-open | 80.69% | 83.12% | 79.12% | 79.03% | 81.08% | 85.86% | 94.18% | 94.13% | 90.00% |
| BERT-wwm-open | 80.52% | 83.07% | 79.48% | 79.29% | 81.18% | 86.01% | 94.35% | 94.38% | 90.20% |
| RoBERTa-open | 80.92% | 83.29% | 79.27% | 79.33% | 81.31% | 86.19% | 94.64% | 94.66% | 90.43% |
| BioBERT-zh | 80.72% | 83.38% | 79.52% | 79.45% | 81.42% | 86.12% | 94.54% | 94.64% | 90.38% |
| MC-BERT | 81.02% | 83.46% | 79.79% | 79.59% | 81.53% | 86.09% | 94.74% | 94.73% | 90.41% |
| KnowBERT-med | 81.29% | 83.75% | 80.86% | 80.44% | 82.10% | 86.27% | 95.05% | 94.97% | 90.62% |
| ERNIE-med | 81.22% | 83.87% | 80.82% | 80.87% | 82.37% | 86.25% | 94.98% | 94.91% | 90.58% |
| SMedBERT | 82.23% | 84.75% | 83.06% | 82.94% | 83.85% | 86.95% | 95.73% | 95.89% | 91.42% |

Table 2: Performance on Named Entity Recognition (NER) and Relation Extraction (RE) tasks in terms of F1. The development data of CHIP-RE is unreleased in the public dataset.

| Model | cMedQA Dev | cMedQA Test | WebMedQA Dev | WebMedQA Test | QA Average (Test) | cMedQQ Dev | cMedQQ Test | cMedNLI Dev | cMedNLI Test |
|---|---|---|---|---|---|---|---|---|---|
| BERT-open | 72.99% | 73.82% | 77.20% | 79.72% | 76.77% | 86.74% | 86.72% | 95.52% | 95.66% |
| BERT-wwm-open | 72.03% | 72.96% | 77.06% | 79.68% | 76.32% | 86.98% | 86.82% | 95.53% | 95.78% |
| RoBERTa-open | 72.22% | 73.18% | 77.18% | 79.57% | 76.38% | 87.24% | 86.97% | 95.87% | 96.11% |
| BioBERT-zh | 74.32% | 75.12% | 78.04% | 80.45% | 77.79% | 87.30% | 87.06% | 95.89% | 96.04% |
| MC-BERT | 74.40% | 74.46% | 77.85% | 80.54% | 77.50% | 87.17% | 87.01% | 95.81% | 96.06% |
| KnowBERT-med | 74.38% | 75.25% | 78.20% | 80.67% | 77.96% | 87.25% | 87.14% | 95.96% | 96.03% |
| ERNIE-med | 74.37% | 75.22% | 77.93% | 80.56% | 77.89% | 87.34% | 87.20% | 96.02% | 96.25% |
| SMedBERT | 75.06% | 76.04% | 79.26% | 81.68% | 78.86% | 88.13% | 88.09% | 96.64% | 96.88% |

Table 3: Performance on Question Answering (QA), Question Matching (QM) and Natural Language Inference (NLI) tasks. The metric of the QA task is Acc@1 and those of QM and NLI are F1.

4.3 Intrinsic Evaluation

To evaluate the semantic representation ability of SMedBERT, we design an unsupervised semantic similarity task. Specifically, we extract all entity pairs with equivalence relations in the KGs as positive pairs. For each positive pair, we use one of the entities as the query entity and the other as the positive candidate, which is used to sample other entities as negative candidates. We denote this dataset as D1. Besides, the entities in the same positive pair often have many neighbours in common. We select positive pairs with large proportions of common neighbours as D2. Additionally, to verify the ability of SMedBERT to enhance low-frequency mention representations, we extract all positive pairs with at least one low-frequency mention as D3. There are 359,358, 272,320 and 41,583 samples in total for D1, D2 and D3, respectively. We describe the details of collecting data and embedding words in Appendix C. In this experiment, we compare SMedBERT with three types of models: classical word embedding methods (SGNS (DBLP:journals/corr/abs-1301-3781), GLOVE (DBLP:conf/emnlp/PenningtonSM14)), PLMs and KEPLMs. We compute the similarity between the representations of the query entities and all the other entities, retrieving the most similar one. The evaluation metric is top-1 accuracy (Acc@1).
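The retrieval protocol can be sketched in a few lines: score every candidate against the query with cosine similarity, take the best one, and count how often it is the positive candidate. The data layout (a list of query vector, candidate vectors, and index of the positive) and the toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def acc_at_1(samples):
    """samples: list of (query_vec, candidate_vecs, index_of_positive)."""
    hits = 0
    for query, candidates, positive in samples:
        scores = [cosine(query, c) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(best == positive)
    return hits / len(samples)


toy_samples = [
    # The positive candidate (index 0) is the closest one: a hit.
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], 0),
    # A negative candidate (index 2) is closer than the positive: a miss.
    ([0.0, 1.0], [[1.0, 0.0], [0.2, 0.8], [0.1, 0.9]], 1),
]
print(acc_at_1(toy_samples))  # 0.5
```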

Experiment results are shown in Table 1. From the results, we observe that: (1) SMedBERT outperforms all baselines, especially on dataset D2 (+1.36%), where most positive pairs have many shared neighbours, demonstrating the ability of SMedBERT to utilize semantic information from the global context. (2) On dataset D3, SMedBERT improves the performance by +1.01%, indicating that our model is effective at enhancing the representation of low-frequency mentions.

4.4 Results of Downstream Tasks

We first evaluate our model on NER and RE tasks, which are closely related to entities in the input texts. Table 2 shows the performance on medical NER and RE tasks. From the results, we can observe that: (1) Compared with PLMs trained on open-domain corpora, KEPLMs with medical corpora and knowledge facts achieve better results. (2) SMedBERT improves over the strongest baseline on the two NER datasets (+0.88%, +2.07%) and on the two RE datasets (+0.68%, +0.92%). We also evaluate SMedBERT on QA, QM and NLI tasks, and the performance is shown in Table 3. We can observe that SMedBERT improves the performance consistently on these datasets (+0.90% on QA, +0.89% on QM and +0.63% on NLI). In general, it can be seen from Table 2 and Table 3 that injecting domain knowledge, especially structured semantics knowledge, improves the results.
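The NER and RE margins quoted above can be recomputed from the test-set columns of Table 2 by subtracting the best baseline score in each column from the SMedBERT score. The snippet below does only that arithmetic; the variable names are illustrative.

```python
# Test-set F1 (%) copied from Table 2, in the column order
# (cMedQANER, DXY-NER, CHIP-RE, DXY-RE).
baselines = {
    "BERT-open": (83.12, 79.03, 85.86, 94.13),
    "BERT-wwm-open": (83.07, 79.29, 86.01, 94.38),
    "RoBERTa-open": (83.29, 79.33, 86.19, 94.66),
    "BioBERT-zh": (83.38, 79.45, 86.12, 94.64),
    "MC-BERT": (83.46, 79.59, 86.09, 94.73),
    "KnowBERT-med": (83.75, 80.44, 86.27, 94.97),
    "ERNIE-med": (83.87, 80.87, 86.25, 94.91),
}
smedbert = (84.75, 82.94, 86.95, 95.89)

# Gain over the strongest baseline in each column, in percentage points.
gains = [
    round(ours - max(row[i] for row in baselines.values()), 2)
    for i, ours in enumerate(smedbert)
]
print(gains)  # [0.88, 2.07, 0.68, 0.92]
```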

4.5 Influence of Entity Hit Ratio

In this experiment, we explore the model performance in NER and RE tasks with different entity hit ratios, which control the proportions of knowledge-enhanced mention-spans in the samples. The average number of mention-spans in samples is about 40. Figure 3 illustrates the performance of SMedBERT and ERNIE-med (DBLP:conf/acl/ZhangHLJSL19). From the results, we can observe that: (1) The performance improves significantly at the beginning and then remains stable as the hit ratio increases, showing that heterogeneous knowledge is beneficial for language understanding and indicating that too many knowledge facts do not further improve model performance due to knowledge noise (DBLP:conf/aaai/LiuZ0WJD020). (2) Compared with previous approaches, our SMedBERT model improves performance greatly and is more stable.

Figure 4: The influence of different K values in results.

4.6 Influence of Neighboring Entity Number

We further evaluate the model performance under different $K$ over the test sets of DXY-NER and DXY-RE. Figure 4 shows the model results for the tested values of $K$. In our settings, SMedBERT achieves its best performance on the different tasks at an intermediate $K$: the performance first increases and then decreases as $K$ grows. This phenomenon also points to the knowledge noise problem: injecting too much knowledge from neighboring entities may hurt the performance.

4.7 Ablation Study

In Table 4, we choose three important model components for our ablation study and report the test set performance on four datasets of NER and RE tasks that are closely related to entities. Specifically, the three model components are neighboring entity type attention, the whole hybrid attention module, and mention-neighbor context modeling, which includes the two masked modeling losses for MNeM and MMeM.

From the results, we can observe that: (1) Without any one of the three mechanisms, our model still performs competitively with the strong baseline ERNIE-med (DBLP:conf/acl/ZhangHLJSL19). (2) After removing the hybrid attention module, the performance of our model declines the most, which indicates that injecting rich heterogeneous knowledge of neighboring entities is effective.

5 Conclusion

In this work, we address medical text mining tasks with the proposed structured semantics KEPLM, named SMedBERT. Accordingly, we inject the entity type semantic information of neighboring entities into the node attention mechanism via a heterogeneous feature learning process. Moreover, we treat the neighboring entity structures as additional global contexts to predict the masked candidate entities based on mention-spans and vice versa. The experimental results show significant improvements of our model on various medical NLP tasks and in the intrinsic evaluation. There are two research directions that can be further explored: (1) Injecting deeper knowledge by using “farther neighboring” entities as contexts; (2) Further enhancing Chinese medical long-tail entity semantic representation.


We would like to thank anonymous reviewers for their valuable comments. This work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000904, and Alibaba Group through Alibaba Research Intern Program.


Appendix A Data Source

A.1 Pre-training Data

A.1.1 Training Corpora

The pre-training corpora are crawled from DXY BBS (Bulletin Board System), which is a very popular Chinese social network for doctors, medical institutions, life scientists, and medical practitioners. The BBS has more than 30 channels, which contain 18 forums and 130 fine-grained groups, covering most of the medical domains. For our pre-training purpose, we crawl texts from channels about clinical medicine, pharmacology, public health and consulting. For text pre-processing, we mainly follow the methods of (DBLP:journals/corr/abs-2003-01355). Additionally, (1) we remove all URLs, HTML tags, e-mail addresses, and all tokens except characters, digits, and punctuation; (2) all documents shorter than 256 are discarded, while documents longer than 512 are cut into shorter text segments.
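A minimal sketch of the cleaning and segmentation rules just described is given below. The regular expressions are simplified, lengths are measured in characters (the text does not state the unit), and the rule that keeps only characters, digits and punctuation is omitted. The function names `clean` and `segment` are illustrative, not from the paper.

```python
import re

# Simplified, ASCII-only patterns (illustrative; real crawled text needs more care).
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)[A-Za-z0-9./?=&%_#:~+-]+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean(text):
    """Strip HTML tags, URLs and e-mail addresses, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)      # tags first, so URLs inside attributes vanish too
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def segment(text, min_len=256, max_len=512):
    """Drop documents shorter than min_len; cut longer ones into max_len pieces."""
    text = clean(text)
    if len(text) < min_len:
        return []
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


print(clean("<p>见 http://a.com/x 或 doc@b.org</p>"))  # 见 或
print([len(s) for s in segment("a" * 600)])            # [512, 88]
```

Whether a short trailing piece such as the 88-character one should be kept or dropped is a design choice the text leaves open; the sketch keeps it.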

A.1.2 Knowledge Graph

The DXY knowledge graph is constructed by extracting structured text from the DXY website, which includes information on diseases, drugs and hospitals edited by certified medical experts; thus the quality of the KG is guaranteed. The KG is mainly disease-centered, including 3,764,711 triples, 152,508 unique entities, and 44 types of relations in total. The details of Symptom-In-Chinese from OpenKG are publicly available. We finally get 26 types of entities, 274,163 unique entities, 56 types of relations, and 4,390,726 triples after the fusion of the two KGs.

A.2 Task Data

We choose the four large-scale datasets in the ChineseBLUE tasks (DBLP:journals/corr/abs-2008-10813), namely cMedQANER, cMedQQ, cMedQNLI and cMedQA, while the others are ignored due to their limited size. WebMedQA (DBLP:journals/midm/HeFT19) is a real-world Chinese medical question answering dataset and CHIP-RE dataset are collected from online health consultancy websites. Note that since both the WebMedQA and cMedQA datasets are very large and we have many baselines to compare, we randomly sample the official training, development and test sets to form correspondingly smaller versions for the experiments. DXY-NER and DXY-RE are datasets from real medical application scenarios provided by a prestigious Chinese medical company. DXY-NER contains 22 unique entity types, and DXY-RE contains 56 relation types. These two datasets are collected from the medical forum of DXY and books in the medical domain. Annotators are selected from junior and senior students with a clinical medical background. In the process of quality control, the two datasets are annotated twice by different groups of annotators. An expert with a medical background manually re-checks the quality when the annotated results are inconsistent, and performs a sampling quality check when the results are consistent. Table 5 shows the dataset sizes of our experiments.

The Dataset Size in Our Experiments

| Dataset | Train | Dev | Test | Task | Metric |
|---|---|---|---|---|---|
| cMedQANER (DBLP:journals/corr/abs-2008-10813) | 1,673 | 175 | 215 | NER | F1 |
| cMedQQ (DBLP:journals/corr/abs-2008-10813) | 16,071 | 1,793 | 1,935 | QM | F1 |
| cMedQNLI (DBLP:journals/corr/abs-2008-10813) | 80,950 | 9,065 | 9,969 | NLI | F1 |
| cMedQA (zhang2017chinese) | 186,771 | 46,600 | 46,600 | QA | Acc@1 |
| WebMedQA (DBLP:journals/midm/HeFT19) | 252,850 | 31,605 | 31,655 | QA | Acc@1 |
| CHIP-RE | 43,649 | - | 10,622 | RE | F1 |
| DXY-NER | 34,224 | 8,576 | 8,592 | NER | F1 |
| DXY-RE | 141,696 | 35,456 | 35,794 | RE | F1 |

The CHIP-RE dataset is released in CHIP 2020.

Table 5: The statistical data and metric of eight datasets used in our SMedBERT model.

Appendix B Model Settings and Training Details


=768, =200, =10, =10, =2, =4.

Model Details.

We align all mention-spans to entities in the KG by exact match, for comparison with ERNIE-THU (DBLP:conf/acl/ZhangHLJSL19). The negative sampling function is defined as , where is the sum of frequency of all mentions with the same type of . The Mention-neighbor Hybrid Attention module is inserted after the tenth transformer encoder layer to compare with KnowBERT (DBLP:conf/emnlp/PetersNLSJSS19), while we perform the Mention-neighbor Context Modeling based on the output of the BERT encoder. We use the base version of all PLMs in the experiments. The size of SMedBERT is 474MB, of which 393MB are components of BERT and the added 81MB is mostly the KG embedding. Results are averaged over 5 runs with different random seeds and the same hyper-parameters.

Training Procedure.

We strictly follow the original pre-training process and parameter settings of the other KEPLMs. We only adapt their publicly available code from English to Chinese and use the knowledge embedding trained on our medical KG. For a fair comparison, the pre-training process of SMedBERT is mostly set based on ERNIE-THU (DBLP:conf/acl/ZhangHLJSL19) without the layer-specific learning rates in KnowBERT (DBLP:conf/emnlp/PetersNLSJSS19). We only pre-train SMedBERT on the collected medical data for 1 epoch. In the pre-training process, the learning rate is set to

and the batch size is 512 with a maximum sequence length of 512. For fine-tuning, we find that the following ranges of values work well: batch size in {8,16}, learning rate (AdamW) in {, , } and number of epochs in {2,3,4}. Pre-training SMedBERT takes about 36 hours per epoch on 2 NVIDIA GeForce RTX 3090 GPUs.

Appendix C Data and Embedding of Unsupervised Semantic Similarity

Since the KG used in this paper is a directed graph, we first transform the directed “等价关系” (equivalence relation) pairs into undirected pairs and discard the duplicated pairs. For each positive pair, we use the head and the tail as the query in turn and sample the negative candidates based on the other. Specifically, we randomly select 19 negative entities that have the same type as the ground-truth entity and a Jaro-Winkler similarity (winkler1990string) greater than 0.6 with it. To construct Dataset-2, we select from all samples in Dataset-1 the positive pairs whose head and tail neighbour sets have a Jaccard Index (2010THE) of no less than 0.75 and at least 3 common elements. For Dataset-3, we count the frequency of all entity mentions in the pre-training corpora, and treat mentions with a frequency of no more than 200 as low-frequency mentions.
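The Dataset-2 selection rule (Jaccard Index of the two neighbour sets no less than 0.75, and at least 3 common elements) can be written as a small filter. The function names and toy sets below are illustrative.

```python
def jaccard(a, b):
    """Jaccard Index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def keep_for_dataset2(head_neighbours, tail_neighbours,
                      min_jaccard=0.75, min_common=3):
    """True if the pair passes both thresholds stated in the text."""
    common = set(head_neighbours) & set(tail_neighbours)
    return (len(common) >= min_common
            and jaccard(head_neighbours, tail_neighbours) >= min_jaccard)


print(keep_for_dataset2({"a", "b", "c", "d"}, {"a", "b", "c"}))       # True  (3/4, 3 common)
print(keep_for_dataset2({"a", "b"}, {"a", "b"}))                      # False (only 2 common)
print(keep_for_dataset2({"a", "b", "c", "d", "e"}, {"a", "b", "c"}))  # False (3/5 < 0.75)
```

The second case shows why both conditions are needed: two tiny identical neighbour sets have a perfect Jaccard Index but share too few elements to be informative.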

Classic Word Representation Embedding:

We train the character-level and word-level embeddings using the SGNS (DBLP:journals/corr/abs-1301-3781) and GLOVE (DBLP:conf/emnlp/PenningtonSM14) models, respectively, on our medical corpora with open-source toolkits. We average the character embeddings of all tokens in the mention to get the character-level representation. However, since some mentions are very rare in the corpora for word-level representation, we use the character-level representation as their word-level representation.

BERT-like Representation Embedding:

We extract the token hidden features of the last layer and average the representations of the input tokens except [CLS] and [SEP] tag, to get a vector for each entity.
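The pooling step can be sketched as follows, assuming the last-layer hidden states are already available as one vector per token. The function name `entity_vector` and the toy numbers are illustrative; a real pipeline would obtain the hidden states from the encoder.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def entity_vector(tokens, hidden):
    """Average the hidden vectors of all tokens except [CLS] and [SEP]."""
    kept = [vec for tok, vec in zip(tokens, hidden) if tok not in SPECIAL_TOKENS]
    if not kept:
        raise ValueError("no non-special tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


toy_tokens = ["[CLS]", "肺", "炎", "[SEP]"]
toy_hidden = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [7.0, 7.0]]
print(entity_vector(toy_tokens, toy_hidden))  # [2.0, 4.0]
```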

Similarity Measure:

We try both the inverse of the L2-distance and cosine similarity as the measure, and we find that cosine similarity always performs better. Hence, we report all experiment results under the cosine similarity metric.