A Benchmark Dataset for Understandable Medical Language Translation

by   Junyu Luo, et al.

In this paper, we introduce MedLane, a new human-annotated Medical Language translation dataset that aligns professional medical sentences with layperson-understandable expressions. The dataset contains 12,801 training samples, 1,015 validation samples, and 1,016 testing samples. We then evaluate seven approaches on the MedLane dataset: directly copying, the statistical machine translation system Moses, four neural machine translation approaches (i.e., the proposed PMBERT-MT model, Seq2Seq and its two variants), and a modified text summarization model, PointerNet. To compare the results, we use eleven metrics, including three new measures specifically designed for this task. Finally, we discuss the limitations of MedLane and the baselines, and point out possible research directions for this task.





1 Introduction

The increasing accessibility of healthcare data makes it more convenient for patients to understand their current health conditions, which may improve the effectiveness of communication between patients and clinicians. However, many notes and records are written in professional clinical jargon and abbreviations for the sake of efficiency and precision, which makes them difficult to understand for patients without medical knowledge and clinical experience. To fill this gap, clinicians have to translate the professional medical information into plain language for patients. This occupies valuable medical resources and further increases the burden on clinicians. Yet the translation work cannot be avoided, because a lack of doctor-patient communication may lead to a tense doctor-patient relationship Ha and Longnecker (2010). Therefore, there is a great need for automatically translating professional clinical language into layperson-understandable language.

Towards this aim, researchers have proposed to use expert-curated dictionaries Kandula et al. (2010); Zeng and Tse (2006); Zeng-Treitler et al. (2007) or pattern-based techniques Vydiswaran et al. (2014) to achieve automatic translation. However, creating and maintaining an up-to-date professional dictionary is not only labor-intensive but also requires the participation of domain experts. Pattern-based techniques require high-quality input data, which significantly limits their applicability. Recently, researchers have cast the medical language understanding task as a neural machine translation task Weng et al. (2019). Due to the lack of aligned sentences, the existing study treats it as unsupervised translation. However, without any labeled information and sufficient training data, it is extremely hard for unsupervised approaches to learn an accurate, usable, and reliable translation model.

To address the aforementioned challenges, in this paper, we introduce a human-annotated dataset, named MedLane, which aligns professional-to-customer sentences extracted from clinical notes in the MIMIC-III v1.4 database Johnson et al. (2016). We evaluate seven approaches on this dataset with eleven metrics to show the usability of MedLane. To sum up, our major contributions are listed as follows:

  • A New Dataset. To the best of our knowledge, MedLane is the first new, publicly available, human-annotated dataset for dealing with the understandable medical language translation task (UMLT), which consists of 12,801 training samples, 1,015 validation samples, and 1,016 testing samples.

  • Novel Measurements. Different from bilingual translation tasks, our task is to translate professional medical jargon into layperson-understandable language. We require the translated results to be not only readable but also accurate and easily understandable. Thus, we design three new yet general evaluation metrics, which can also be used for other tasks such as medical report generation.

  • Strong Baselines. We propose a straightforward but effective solution, named PMBERT-MT, which is built upon the pre-trained language model PubMedBERT Gu et al. (2020). Besides, we implement six other baselines, including directly copying, a statistical machine translation approach Moses (http://www.statmt.org/moses/), three neural machine translation approaches (i.e., Seq2Seq Bahdanau et al. (2014) and its two variants), and a modified text summarization model PointerNet Vinyals et al. (2015), for validating the usability of the MedLane dataset.

  • Comprehensive Evaluation. We conduct comprehensive experiments by running seven approaches, evaluating them on eleven metrics (eight traditional machine translation metrics and three newly designed ones), and analyzing their performance both qualitatively and quantitatively.

2 Related Work

2.0.1 Medical Language Translation

There are mainly three types of approaches for making professional medical language accessible to ordinary customers: dictionary-based, pattern-based, and neural machine translation. All of these approaches are unsupervised, mainly due to the lack of aligned training data.

The dictionary-based approaches rely on an expert-curated dictionary for transferring professional words Kandula et al. (2010); Zeng and Tse (2006); Zeng-Treitler et al. (2007). However, they are overly mechanical and cannot deal with polysemy, i.e., a term that takes different meanings depending on the context.

The pattern-based approaches aim to identify pairs of professional and consumer terms from a large corpus like Wikipedia Vydiswaran et al. (2014). Although they perform better than dictionary-based models, their performance highly depends on the quality of the input data and the existence of related corpus.

The neural machine translation models have been leveraged mainly for word embedding alignment Kang et al. (2016); Weng and Szolovits (2018), with a particular focus on unsupervised machine translation Weng et al. (2019). Compared with previous methods, neural machine translation models can generate sentences with better readability and can deal with the polysemy issue. However, it is extremely hard for unsupervised approaches to learn an accurate, usable, and reliable translation model without any labeled information and sufficient training data.

2.0.2 Medical Natural Language Processing Datasets

Below we review existing medical natural language datasets. We omit the non-English ones such as the UFAL Medical Corpus v.1.0 (https://ufal.mff.cuni.cz/ufal_medical_corpus) and only focus on the English-based datasets.

The MIMIC-III database Johnson et al. (2016) contains de-identified data from 58,976 ICU patient admissions, including several types of medical information such as demographics, medications, and clinical notes. Most existing works leverage medical codes and biosignals for disease detection Ma et al. (2018), readmission prediction Du et al. (2020), and mortality prediction Caicedo-Torres and Gutierrez (2019). Some studies use the clinical notes for automated ICD coding Li and Yu (2020); Xie and Xing (2018); Cao et al. (2020), natural language inference Shivade (2018), and unsupervised medical language translation Weng et al. (2019).

National NLP Clinical Challenges (n2c2, https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/), the original i2b2 NLP datasets, provide unstructured notes from the Research Patient Data Repository at Partners Healthcare System in Boston. The challenges include clinical trial cohort selection, adverse drug events and medication extraction, clinical concept normalization, clinical semantic textual similarity, etc., which are totally different from our task.

Though the MIMIC-III and n2c2 datasets contain clinical notes, none of them provides publicly available human-annotated and aligned data used for translating professional clinical jargon to layperson-understandable language. Thus, the MedLane dataset is unique yet necessary.

3 The MedLane Dataset

There are two key steps in creating the MedLane dataset. The first step is to recognize suitable professional medical sentences (named source sentences). The second is to have annotators translate those sentences into aligned, plain, customer-understandable sentences, which are further checked by experts. We illustrate the details of each step in the following sub-sections.

3.1 Source Sentence Collection

3.1.1 Section Selection

We extract professional clinical notes from the NOTEEVENTS table of the MIMIC-III v1.4 dataset (https://mimic.physionet.org/mimictables/noteevents/). We mainly focus on the following three sections: (1) History of present illness, (2) Brief summary of hospital course, and (3) Brief hospital course. These three sections contain thoughts and reasoning for the communication between clinicians, which are usually written with professional medical jargon. We first recognize those sections in the full clinical notes using regular expressions and then select candidate sentences.

3.1.2 Sentence Selection

Although most sentences in the three selected sections contain many professional clinical jargon words, some sentences are still written in plain language. For example, one sentence extracted from a MIMIC-III clinical note reads “She now also reports of being hunger.” Such sentences do not need to be translated again. To filter them out, we design a heuristic feature-based sentence selection algorithm.

We first tokenize each sentence into a set of words and then use a dictionary-based approach to identify medical abbreviations. We also count the number of commonly-used/top-3000 English words (https://www.ef.com/wwen/english-resources/english-vocabulary/top-3000-words/). Based on the length of the sentence, the number of medical abbreviations, and the number of top-3000 words, we can automatically select candidate sentences.

Algorithm 1 Sentence Selection Algorithm
Require: Target sentence $S$, top-3000 word set $W$, medical abbreviation set $A$
1:  Tokenize $S$ into words $[w_1, \dots, w_n]$
2:  for $i = 1$ to $n$ do
3:     if $w_i \in A$ then
4:         $c_a$ = $c_a$ + 1;
5:     end if
6:     $w_i'$ = lemmatize($w_i$);
7:     if $w_i' \notin W$ then
8:         $c_u$ = $c_u$ + 1;
9:     end if
10:  end for
11:  if $n$, $c_a$, or $c_u$ violates its predefined criterion then
12:     return False;
13:  else
14:     return True;
15:  end if

The sentence selection algorithm is summarized in Algorithm 1. For each tokenized word $w_i$, we first check whether it belongs to the medical abbreviation set $A$; $c_a$ in line 4 denotes the current number of medical abbreviations in the target sentence $S$. If $w_i$ is not a medical abbreviation, we then check whether the lemmatized word $w_i'$ belongs to the top-3000 word set $W$. $c_u$ in line 8 represents the current number of words that are neither medical abbreviations nor commonly-used words. Finally, based on the predefined criteria on the sentence length $n$, $c_a$, and $c_u$, the algorithm automatically decides whether to keep or remove the target sentence.
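The selection procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `keep_sentence`, the regular-expression tokenizer, the lowercase-only default lemmatizer, and the three thresholds `min_len`, `min_abbr`, and `max_uncommon` are all assumptions made for the example, since the concrete criteria are not given in the text.

```python
import re

def keep_sentence(sentence, common_words, abbreviations,
                  lemmatize=str.lower, min_len=3, min_abbr=1, max_uncommon=10):
    """Return True if the sentence is kept as a translation candidate.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    # Simple tokenizer (assumption): words, numbers, and single punctuation marks.
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\w\s]", sentence)
    n_abbr = 0      # number of medical abbreviations
    n_uncommon = 0  # words that are neither abbreviations nor common words
    for tok in tokens:
        if tok.lower() in abbreviations:
            n_abbr += 1
        elif tok.isalpha() and lemmatize(tok) not in common_words:
            n_uncommon += 1
    # Reject sentences that violate any of the (assumed) criteria.
    if len(tokens) < min_len or n_abbr < min_abbr or n_uncommon > max_uncommon:
        return False
    return True
```

With a toy abbreviation set such as `{"sob", "cxr"}`, the plain sentence “She now also reports of being hunger.” is rejected because it contains no abbreviation, whereas “She also had subjective SOB with CXR suggesting fluid overload.” is kept. In practice, a real lemmatizer would be passed in through the `lemmatize` argument.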

3.2 Data Annotation

After we collect a set of source sentences, the next step is to annotate them. However, annotating medical sentences is different from creating a parallel bilingual translation corpus. The medical language translation task aims to “translate” professional clinical jargon into layperson-understandable language, which stays within the same language but uses different expressions. Besides, annotating bilingual translation data focuses on readability and accuracy. Beyond those two perspectives, annotating medical sentences also considers understandability from the perspective of patients or customers. Based on the above guidance, we hire workers to translate the source sentences.

The purpose of this work is to create a benchmark for the understandable medical language translation (UMLT) task, which is used not only for training translation models but also for fairly evaluating different approaches. Thus, we set different requirements for workers when annotating the training data and the validation/testing data. In general, they annotate each source sentence in two steps. The first step is to replace the abbreviations with the full words or phrases. An abbreviation may have several full forms. For example, “TLC” has two full forms (https://medical-dictionary.thefreedictionary.com/TLC): one is “thin-layer chromatography”, and the other is “total lung capacity”. Therefore, it is important for workers to understand the context in which the abbreviation or term is used. The second step is to use simple words to replace professional medical expressions. Take the word “hematocrit” (https://www.mayoclinic.org/tests-procedures/hematocrit/about/pac-20384728) as an example, which means the ratio of the volume of red blood cells to the volume of the whole blood. The expression “the proportion of red blood cells in the blood” is much more understandable for patients than the professional clinical jargon. An example in Figure 1 illustrates the annotating procedure.

When annotating the training data, workers are asked to return the final understandable sentences, i.e., the simplified ones, which will be checked by experts again to guarantee the annotation quality. For validation and testing data, we require workers to return both rephrasing and simplifying forms for each source sentence.

Figure 1: An example of a worker annotating a source sentence in two steps, i.e., rephrasing and simplifying. In the rephrasing step, three abbreviations are replaced by full forms. In the simplifying step, the full form “nasal cannula” is replaced by “tube insertion on nose”.

Note that across the training, validation, and testing data, there is a special case in which the source sentence does not need to be translated. For example, the sentence “He had a set of surveillance blood cultures drawn last week, which were negative.” is already easy to understand. These sentences are extremely useful when training an understandable translation model because they serve as important indicators for guiding model learning. In the validation and testing data, another special case is that the rephrased sentence cannot be simplified any further. For example, for the source sentence “She also had subjective SOB with CXR suggesting fluid overload.”, the rephrased and simplified sentences are the same: “She also had subjective [shortness of breath] with [chest x-ray] suggesting fluid overload.”.

Statistic                                                   Value
# of tokens in the source sentences                         14,780
# of tokens in the target sentences                         14,278
# of overlapped tokens between source and target            12,501
Avg. length of the source sentences                         20.6
Avg. length of the target sentences                         24.0
Avg. # of abbreviations in validation and testing data      1.2
Table 1: MedLane data statistics.

3.3 Dataset Statistics

The MedLane dataset contains 12,801 sentences for training, 1,015 sentences for validation, and 1,016 sentences for testing. Table 1 shows the statistical characteristics of the MedLane dataset. Compared with traditional machine translation datasets, MedLane is small. However, our dataset is different from those datasets. First, the way of annotation is different as we discussed in the previous section. Second, there are a large number of overlapped tokens between source and target sentences, which is also different from traditional machine translation. These differences make our task unique and challenging.

4 Approaches

In this section, we present an easy and straightforward solution (PMBERT-MT) for solving the challenges of the understandable medical language translation task. Besides, six baseline approaches are introduced to evaluate the feasibility and utility of the annotated MedLane dataset.

4.1 The PMBERT-MT Model

The pre-trained model BERT Devlin et al. (2018) has shown its superior effectiveness in improving the performance of different downstream applications, such as question answering Reddy et al. (2019), natural language understanding Liu et al. (2019), and image captioning Yang et al. (2019). In the medical domain, there are several models developed based on BERT, including ClinicalBERT Alsentzer et al. (2019), BioBERT Lee et al. (2020), and PubMedBERT Gu et al. (2020). Among these three models, PubMedBERT achieves the best performance according to the BLURB leaderboard (https://microsoft.github.io/BLURB/leaderboard.html). Thus, we utilize PubMedBERT as the pre-trained model to conduct the translation task, and the designed PMBERT-MT model is shown in Figure 2.

Figure 2: Overview of the proposed PMBERT-MT model. SOS is a special token used in the PMBERT-MT model to represent the global information of a sentence.

Input & Output. Let $X = [x_1, \dots, x_n]$ denote the tokenized professional medical sentence. For each sentence, a special word SOS is added to indicate the beginning of sentence $X$. Also, an ending word EOS is appended to each ground truth sentence $Y = [y_1, \dots, y_m]$. Thus, the input of the PMBERT-MT model is [SOS, $x_1, \dots, x_n$], and the corresponding ground truth is [$y_1, \dots, y_m$, EOS].

Encoder. We first learn an embedding vector for each word in [SOS, $x_1, \dots, x_n$], i.e., $\mathbf{e}_i = \mathrm{Emb}(x_i)$, where $\mathrm{Emb}$ is the PubMedBERT embedding layer. Then, a transformer unit Vaswani et al. (2017) is employed to learn the representation of each input word using $[\mathbf{h}_0, \mathbf{h}_1, \dots, \mathbf{h}_n] = \mathrm{Transformer}([\mathbf{e}_0, \mathbf{e}_1, \dots, \mathbf{e}_n])$, where $\mathbf{e}_0$ is the embedding of SOS. $\mathbf{h}_0$ is the representation of SOS, which is also considered as the context vector of the input sentence.

Decoder. The decoder aims to generate the target sentence based on both the input source sentence and the outputs of the encoder. In particular, we utilize an attention-based LSTM model as the decoder. The decoder is trained to generate the $t$-th word $y_t$ given the global representation vector $\mathbf{h}_0$ and all the generated words from $y_1$ to $y_{t-1}$. Let $\mathbf{s}_t$ represent the hidden state output by the decoder LSTM at step $t$. We then obtain the weighted context vector $\mathbf{c}_t$ by applying the attention mechanism to the encoder representations $\mathbf{h}_0, \dots, \mathbf{h}_n$. Finally, the probability of the $t$-th word is obtained from $\mathbf{s}_t$ and $\mathbf{c}_t$ through an output layer with two learnable parameters, $\mathbf{W}$ and $\mathbf{b}$.
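To make the attention step concrete, the following sketch computes attention weights and a weighted context vector from one decoder hidden state and a list of encoder representations. It is only an illustration under stated assumptions: it uses dot-product scoring followed by a softmax, which is one common choice and not necessarily the scoring function used in PMBERT-MT, and the function names `softmax` and `attention_context` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Assumption: scores are dot products between the decoder hidden state
    and each encoder representation; the paper's scoring may differ.
    """
    scores = [sum(a * b for a, b in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder representations.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context
```

In an attention-based decoder of this kind, the context vector is typically combined with the decoder hidden state and passed through a linear layer and a softmax over the vocabulary to obtain the probability of the next word.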

4.2 Baselines

Naive Translation Approach. As we discussed in the Data Annotation Section, understandable medical language translation differs from the traditional bilingual translation task in that it aims to make professional medical jargon understandable to patients or customers. Besides, even though some sentences contain a few medical words, they may not need to be translated. Thus, directly copying the source sentence (Copy) can be considered as a naive baseline.

Statistical Machine Translation. We select the widely-used statistical machine translation (SMT) system Moses (http://www.statmt.org/moses/) as a baseline. SMT methods Koehn et al. (2007) share one principle: they learn maximum-probability models from an aligned corpus. They are less flexible, however, because they are limited by the statistical $n$-gram language model Marino et al. (2006). Thus, the learned word relations are generally limited to within five words. On the other hand, SMT models have fewer parameters to learn than advanced neural machine translation methods, which may give them an extra advantage on small datasets.

Neural Machine Translation. With the recent fast development of deep neural models and computing ability, neural machine translation has become more and more popular in the natural language processing area. The adopted PMBERT-MT model belongs to this category. Besides, we use the representative neural machine translation model Seq2Seq Bahdanau et al. (2014) as a baseline. Different from PMBERT-MT, the encoder of Seq2Seq is a bidirectional LSTM model, but the network structure of the decoder is the same.

In addition, we use two variants of Seq2Seq as baselines. One is a reduced version of Seq2Seq, denoted as Seq2Seq$^{-}$, which does not use the attention mechanism in the decoder. The other, denoted as Seq2Seq-S, evaluates the idea of sharing the embedding space between the encoder and the decoder. In other words, both use the same parameters to embed words. The remaining components of both Seq2Seq$^{-}$ and Seq2Seq-S are the same as those of Seq2Seq.

Text Summarization Model. For the medical language translation task, many words can be copied directly from the source sentences. Thus, the setting is very similar to the text summarization task Narayan et al. (2017), where the abstract is generated by selecting proper words from the original text. The difference is that for our task, some new words may need to be added. To this end, we redesign the pointer network Vinyals et al. (2015), called PointerNet, by adding a generating/referring option to the model. The model first decides which mode to use for the current output. In the referring mode, the model acts as a general pointer network, whereas in the generating mode, it acts like a normal Seq2Seq model.

5 Experiments

In this section, we conduct experiments on the MedLane dataset to evaluate the performance of different baselines. We first introduce the experimental settings for baselines, then describe evaluation metrics, and finally, analyze the experimental results.

5.1 Experimental Settings

For the statistical model Moses, we follow the training procedure listed in the User Manual and Code Guide file (http://www.statmt.org/moses/manual/manual.pdf). For the neural machine translation models and the text summarization baseline, we conduct a grid search to find the optimal parameters. As a result, for Seq2Seq, Seq2Seq$^{-}$, Seq2Seq-S, and PointerNet, the hidden size is set to 256 for both the encoder and the decoder, and the learning rate is fixed by the same search. We use Adam Kingma and Ba (2014) as the optimizer. Tokenization is performed using the NLTK word tokenizer Bird et al. (2009). Early stopping is applied by checking the BLEU score Papineni et al. (2002) on the validation set, and the training batch size is set to 30.

Since PMBERT-MT is developed based on PubMedBERT, the hidden size is the same as that of PubMedBERT, which is 768. We also use the default PubMedBERT Adam optimizer with a fixed learning rate and the warm-up method, together with the default PubMedBERT vocabulary and tokenization. In the evaluation stage, the same NLTK word tokenizer used for the baselines is applied to break the sentences into words, so that all scores are calculated on the same tokens for a fair comparison. Early stopping is applied in the same way, and the training batch size is set to 30. All models are trained on Ubuntu 16.04 with 128GB memory and an Nvidia Tesla P100 GPU.

5.2 Evaluation Metrics

5.2.1 Existing Machine Translation Metrics

BLEU Papineni et al. (2002) is an automatic algorithm for evaluating machine-translated results. BLEU focuses on the word similarity between the target and candidate sentences, which is helpful in general machine translation tasks. In our experiments, we report BLEU-1, BLEU-2, BLEU-3, BLEU-4, and the averaged BLEU score (BLEU for short). Besides, METEOR Lavie and Agarwal (2007), ROUGE-L Lin (2004), and CIDEr Vedantam et al. (2015) are also used as evaluation metrics.

Source: vascular saw the pt and did not feel that there was an acute need for an invasive procedure .
Target: vascular saw the patient and did not feel that there was an acute need for an invasive procedure .
Figure 3: Example of the failure of existing metrics.

Method             BLEU-1  BLEU-2  BLEU-3  BLEU-4  BLEU    METEOR  ROUGE-L  CIDEr   HIT     CWR     AScore
Copy               0.7898  0.7495  0.7147  0.6826  0.7342  0.4601  0.8496   5.9432  0.5400  0.7001  0.6323
Moses              0.7880  0.7130  0.6530  0.6016  0.6889  0.4237  0.8188   5.1046  0.6823  0.7543  0.6859
Seq2Seq            0.7136  0.6322  0.5969  0.5160  0.6147  0.3533  0.7609   4.1299  0.7388  0.7980  0.6648
Seq2Seq$^{-}$      0.5066  0.3315  0.2373  0.1787  0.3135  0.1859  0.4948   1.2670  0.6427  0.8367  0.4070
Seq2Seq-S          0.7180  0.6386  0.5778  0.5267  0.6153  0.3604  0.7683   4.2635  0.7331  0.7953  0.6630
PointerNet         0.6870  0.5904  0.5158  0.4541  0.5618  0.3338  0.7285   3.9458  0.6414  0.7555  0.5949
PMBERT-MT          0.8003  0.7428  0.6952  0.6531  0.7228  0.4566  0.8218   5.3293  0.7808  0.7358  0.7477
Rank (PMBERT-MT)   1       2       2       2       2       2       2        2       1       6       1
Table 2: Performance evaluation of all the baselines with eleven metrics.

5.2.2 Task-specific Metrics

Since our task is different from traditional machine translation tasks, directly applying existing evaluation metrics cannot fairly evaluate the performance of different models. Here, we use an example in Figure 3 to demonstrate the failure of existing evaluation metrics, such as BLEU score. If we directly copy the original source sentence as the answer, a very high BLEU score can be obtained, which is 0.85. However, the critical term “pt” is not translated. Without translating this term, patients or customers may not totally understand the meaning of this sentence. Thus, it is necessary to design task-specific metrics.
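The effect can be reproduced with a small n-gram overlap computation on the Figure 3 example. The sketch below is a simplified illustration, not the evaluation code used in the experiments: it computes clipped n-gram precisions for n = 1 to 4 and averages them, and it omits the brevity penalty, which is an assumption that does no harm here because both sentences have the same length. The function name `ngram_precision` is hypothetical.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate token list against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

# Copy baseline: the untranslated source is scored against the annotated target.
source = ("vascular saw the pt and did not feel that there was an acute "
          "need for an invasive procedure .").split()
target = ("vascular saw the patient and did not feel that there was an acute "
          "need for an invasive procedure .").split()

precisions = [ngram_precision(source, target, n) for n in range(1, 5)]
mean_precision = sum(precisions) / len(precisions)
```

Only one of the 19 tokens differs, so the four precisions are 18/19, 16/18, 14/17, and 12/16, and their mean is roughly 0.85, even though the single term that matters to a patient is left untranslated.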

  • Hit Ratio (HIT). A key challenge of medical language translation is to translate professional medical jargon into layperson-understandable words. Let $N_p$ denote the number of professional medical terms in a source sentence and $N_c$ be the number of correctly translated terms in the target. The HIT ratio is then $N_c / N_p$.

  • Common Word Ratio (CWR). To evaluate the simplicity of the translated sentences, we calculate the common word ratio for each output sentence. We first lemmatize each word of the output. If the lemmatized form is among the top-3000 commonly-used English words, it is a common word; otherwise, it is not. Let $N_w$ denote the number of common words in the translated sentence and $L$ represent the length of the translated sentence. The common word ratio score is $N_w / L$.

  • Aggregation Score (AScore). The quality of the translated sentences is decided not only by readability (BLEU) but also by correctness (HIT) and simplicity (CWR). Among these three perspectives, readability and correctness should be more important than simplicity. Thus, we design a new score that models all three at the same time as a weighted harmonic mean of BLEU, HIT, and CWR, where $\alpha$ and $\beta$ are parameters controlling the importance of the BLEU and HIT scores, respectively. If any of the three metrics is 0, a very small number is added to it to avoid an AScore of 0. In our experiments, $\alpha$ and $\beta$ are chosen so that BLEU receives the largest weight, followed by HIT and then CWR.
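The three task-specific metrics can be sketched as follows. This is an illustrative reading, not the reference implementation: the exact AScore formula and the values of the two weights are not reproduced here, so the weighted harmonic mean below, the default weights `alpha=2.0` and `beta=1.5` (chosen only so that BLEU outweighs HIT, which outweighs CWR), the smoothing constant `eps`, and the lowercase-only default lemmatizer are all assumptions.

```python
def hit_ratio(n_terms, n_correct):
    """Fraction of professional terms in the source that are correctly translated."""
    if n_terms <= 0:
        raise ValueError("HIT is undefined for a sentence without professional terms")
    return n_correct / n_terms

def common_word_ratio(tokens, common_words, lemmatize=str.lower):
    """Fraction of output tokens whose lemma is in the common-word set."""
    if not tokens:
        return 0.0
    n_common = sum(1 for tok in tokens if lemmatize(tok) in common_words)
    return n_common / len(tokens)

def ascore(bleu, hit, cwr, alpha=2.0, beta=1.5, eps=1e-8):
    """Weighted harmonic mean of BLEU, HIT, and CWR (assumed form and weights)."""
    # Replace exact zeros by a tiny constant so the harmonic mean stays defined.
    bleu, hit, cwr = (v if v > 0 else eps for v in (bleu, hit, cwr))
    return (alpha + beta + 1.0) / (alpha / bleu + beta / hit + 1.0 / cwr)
```

A harmonic mean equals the common value when all three inputs agree and drops sharply when any single input is small, so a system cannot reach a high aggregate by doing well on only one of the three perspectives.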

5.3 Experimental Results

Performance Analysis. Table 2 shows the experimental results of all baselines on eleven evaluation metrics. We can observe that the naive Copy approach outperforms the other approaches on seven of the eight traditional machine translation metrics (all except BLEU-1), which is in accord with our discussion in the Evaluation Metrics section. Thus, it is necessary to introduce new metrics for a fair evaluation. On the three newly designed metrics, the Copy approach performs the worst on the HIT score, and its AScore is not high, which is reasonable.

The statistical method Moses has fewer parameters in its language translation model and thus achieves stable performance across all the evaluation metrics. General neural network-based approaches, including Seq2Seq, its variants, and PointerNet, have relatively low BLEU scores. The reason is that the labeled data is insufficient for them to train a powerful translation model from scratch. PMBERT-MT is also a neural network-based method, but pre-training on a large medical language corpus significantly increases the ability of model learning. Hence, the BLEU score of PMBERT-MT is very high. In fact, the overall performance of PMBERT-MT is the best among all the baselines.

In terms of the HIT score, there is a significant gap between generation-based methods (including Seq2Seq, Seq2Seq-S, and PMBERT-MT) and the other methods. To achieve a high HIT score, accurate recognition of abbreviations is necessary. Moreover, the models should understand the context correctly, which is an advantage of neural network-based models. Another requirement is generation ability. The use of the reference mechanism probably limits the generation ability of the PointerNet model, and thus it does not achieve a high HIT score.

The CWR score reflects the simplicity of sentences in a sense. However, a higher CWR score does not necessarily mean better performance. The reason is that translating professional medical terms will inevitably result in some uncommon words. Thus, we should attach less importance to the CWR score when calculating the comprehensive rank.

The top three models in terms of AScore are PMBERT-MT, Moses, and Seq2Seq, which is reasonable. The AScore attaches the highest importance to the BLEU score, followed by the HIT and CWR scores. Using the harmonic mean penalizes models that do well on one metric at the expense of the others and thus rewards good general performance.

Source:
she was taken to ir where 3 peripheral branches of the right renal artery was embolized , and a hd line was placed under sterile procedure in the rij .
Reference 1:
she was taken to [interventional radiology] where 3 peripheral branches of the right kidney artery was embolized , and a [hemodialysis] line was placed under sterile procedure in the [right internal jugular] .
Reference 2:
she was taken to [interventional radiology] where 3 peripheral branches of the right kidney artery was embolized , and a blood filtering line was placed under sterile procedure in the right internal neck .
System outputs:
  • she was taken to interventional radiology where 3 peripheralial of the right renal artery was embolized , and a blood filtering line was placed under sterile in the right internal jugular .
  • she was taken to interventional radiology where 3 3 of the right kidney artery kidney was was , and a blood filtering was placed under cholangiography cholangiography procedure in the the
  • she was taken to the right coronary artery 3 3 , right groin artery was placed in the right femoral artery and right blood pressure of 3 .
  • she was taken to radiology where she where she the of the right right artery artery artery artery she artery artery was placed under , the and she was under
  • she was taken to today where 3 peripheral branches of the right kidney artery was embolized , and a blood filtering was placed under sterile procedure in the rij .
Table 3: An example in which PMBERT-MT outperforms the other baselines. We can observe that PMBERT-MT misses some words in the translated sentence, but all the professional terms are translated. The BLEU score of PMBERT-MT is 0.8311, and the HIT score is 1.000, which are much higher than those of the other baselines.
# cirrhosis : patient with history of alcoholic vs nash cirrhosis complicated by esophagel , gastric , and rectal varices
Reference 1:
# [chronic disease of the liver] : patient with history of alcoholic vs [non-alcoholic steatohepatitis] [chronic disease of the liver]
complicated by esophagel , gastric , and rectal varices .
Reference 2:
# [chronic disease of the liver] : patient with history of alcoholic vs liver inflammation and damage complicated by esophagel , gastric ,
and rectal varices .
cirrhosis cirrhosis : patient with history of alcoholic vs late disease disease complicated by food , , , , , , , eseseseseseseseseseseseseses .
# cirrhosis : patient with history of alcoholic vs cirrhosis cirrhosis , gastric , gastric , and , , , varices . 22
# surgical history of patient with history by , history , patient , by surgical by and by surgical tract .
# cirrhosis : patient with history of painful cell function cirrhosis complicated by , , , , , , and rectal rectal in rectal varices .
# cirrhosis : patient with history of alcoholic cirrhosis , complicated by nash esophagel , acid , and rectal and .
Table 4: A hard example that none of the baselines translates accurately.

Case Studies. We provide two examples to demonstrate the effectiveness of PMBERT-MT and its drawbacks. In Table 3, PMBERT-MT performs the best: it translates the professional medical terms correctly and in a simplified manner. However, for the hard example in Table 4, almost none of the baselines performs well. Only Moses keeps a relatively complete sentence structure. A possible reason is that the other baselines may be unfamiliar with the word “cirrhosis”. These results show that accurately translating medical jargon into layperson-understandable language is challenging.

Insight Analysis. In our dataset, sentence length varies considerably. To analyze the relationship between length and performance, we conduct the following experiment. We first divide the source sentences of the testing data into five groups based on their length and then calculate the values of BLEU, HIT, CWR, and AScore for the proposed PMBERT-MT model on each group. The results are shown in Figure 4.

Figure 4: Sentence length vs. performance.

From Figure 4, we can observe that as the sentence length increases, the values of BLEU, HIT, and AScore drop, which is in accord with traditional machine translation tasks. However, the CWR score follows a different trend from the other three metrics: it slightly increases. The reason is that long sentences may contain more common words that do not need to be translated, which raises the CWR score. These results also support the design of AScore, which assigns more weight to the BLEU and HIT scores than to the CWR score. We can also conclude that using the CWR metric as the only criterion may not reveal the true performance of the baselines.
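The grouping step of this analysis can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mean_score_by_length`, whitespace tokenization, and the equal-size split of the length-sorted sentences are all assumptions, and the per-sentence scores are expected to be computed beforehand with whichever metric is of interest (e.g., BLEU).

```python
from statistics import mean


def mean_score_by_length(sources, scores, n_groups=5):
    """Average a per-sentence score within groups of increasing source length.

    Assumptions (not from the paper): length is the number of
    whitespace-separated tokens, and groups are equal-size splits of the
    length-sorted sentences.
    """
    if len(sources) != len(scores):
        raise ValueError("sources and scores must have the same length")
    # Sort sentence indices by token count (sorted() is stable for ties).
    order = sorted(range(len(sources)), key=lambda i: len(sources[i].split()))
    results = []
    for g in range(n_groups):
        # Integer arithmetic spreads any remainder across the groups.
        start = g * len(order) // n_groups
        end = (g + 1) * len(order) // n_groups
        idx = order[start:end]
        if not idx:
            continue  # fewer sentences than groups
        lengths = [len(sources[i].split()) for i in idx]
        results.append({
            "min_len": min(lengths),
            "max_len": max(lengths),
            "mean_score": mean(scores[i] for i in idx),
        })
    return results
```

Calling the function once per metric gives one curve per metric over the same length groups, which is the kind of comparison plotted in Figure 4.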

6 Discussions

From the experimental results, we find that supervised machine translation methods generally achieve acceptable performance on this task. Using a pre-trained language model is an effective way to reduce the learning difficulty, which leads to better performance. Next, we discuss the limitations of the MedLane dataset and the baselines. We also point out possible ways to improve the performance of understandable medical language translation.

Limitations. When annotating the dataset, we removed the extremely hard instances, which contain many professional medical terms and abbreviations. In the current annotation procedure, workers need a dictionary to look up abbreviations and find their full forms. However, one abbreviation may have several full forms, and choosing the correct one may also require domain knowledge. The dataset may also contain some out-of-dictionary terms, which workers may ignore during annotation. Besides, all the sentences in the current dataset are written in lowercase. Some professional terms in lowercase may coincide with common words, which may mislead all the baselines. Finally, the readability of the training sentences needs to be further improved.

Future Work. Though the PMBERT-MT model achieves better performance, it is still not a perfect solution. To further improve the quality of translation, one possible solution is to incorporate extra knowledge, such as a medical knowledge graph, which may help models to understand the UMLT task more accurately. Another possible way is to introduce information retrieval techniques into this task for generating understandable sentences. For example, Wikipedia is a good resource for extracting plain expressions that explain professional medical jargon.
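As a toy illustration of the retrieval idea, professional terms could be looked up in a plain-language glossary and replaced in place. The sketch below is only an assumption about how such a step might look: the function name `explain_terms` is hypothetical, and the two glossary entries are hand-copied from the Reference 1 annotation in Table 4, whereas a real system would retrieve the explanations from an external resource such as Wikipedia.

```python
import re

# Toy glossary, hand-copied from the annotations in Table 4 (illustrative only).
GLOSSARY = {
    "cirrhosis": "chronic disease of the liver",
    "nash": "non-alcoholic steatohepatitis",
}


def explain_terms(sentence, glossary=GLOSSARY):
    """Replace whole-word glossary terms with bracketed plain expressions."""
    if not glossary:
        return sentence
    # Longest terms first so longer entries win over their prefixes.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")
    return pattern.sub(lambda m: "[" + glossary[m.group(1)] + "]", sentence)
```

A pure lookup of this kind cannot resolve terms that have several meanings or that coincide with common words, which is why it is shown only as a starting point.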

Since only a small portion of the sentences in the clinical notes of the MIMIC-III dataset are selected, the remaining ones still contain a lot of useful information. How to use such unlabeled data to improve performance should be studied further. There are two possible ways to use unlabeled data. The first is to pre-train on the unlabeled data and then fine-tune the current baselines on the labeled data. The other is to apply semi-supervised learning, which uses the unlabeled data directly.
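One common instance of semi-supervised learning is self-training with pseudo-labels, sketched below. The paper does not prescribe this particular method; it is shown only as an example of using unlabeled sentences directly. The `train` and `translate` callables are hypothetical interfaces that a user would supply, and the confidence threshold is an arbitrary assumption.

```python
def self_train(labeled, unlabeled, train, translate, threshold=0.9, rounds=3):
    """Pseudo-labeling loop (a sketch with hypothetical interfaces).

    labeled:   list of (source, target) pairs
    unlabeled: list of source sentences
    train:     callable(pairs) -> model
    translate: callable(model, source) -> (target, confidence in [0, 1])
    """
    pairs = list(labeled)
    pool = list(unlabeled)
    model = train(pairs)
    for _ in range(rounds):
        kept, rest = [], []
        for src in pool:
            tgt, conf = translate(model, src)
            # Keep only translations the current model is confident about.
            (kept if conf >= threshold else rest).append((src, tgt))
        if not kept:
            break  # nothing confident enough to add
        pairs.extend(kept)
        pool = [src for src, _ in rest]
        model = train(pairs)  # retrain on labeled plus pseudo-labeled pairs
    return model, pairs
```

In a medical setting the threshold matters, since confidently wrong pseudo-labels would be reinforced by each retraining round.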

From the experimental results, we can observe that the traditional statistical machine translation model performs better on some hard examples, such as the one in Table 4. Thus, combining neural machine translation models with traditional statistical models may improve the results.

7 Conclusions

In this paper, we introduce MedLane, a new human-annotated medical language translation dataset for the task of automatically translating professional medical sentences for ordinary users. We run seven baseline models: direct copying, a statistical machine translation model, four neural machine translation models, and a text summarization model. In addition, we design three new metrics according to the special requirements of medical language translation. The experimental results show that the pre-trained model PMBERT-MT is a practical solution to the challenges of the UMLT task. A success case and a failure case illustrate the translations produced by existing models. Finally, we point out the limitations of the current dataset and the baselines, and discuss research directions for further improving the performance of understandable medical language translation.

