
End-to-end Clinical Event Extraction from Chinese Electronic Health Record

by Wei Feng, et al.

Event extraction is an important task in medical text processing. Given the complexity of medical text annotation, we use an end-to-end event extraction model that enhances the formatting information of the output events. Through pre-training and fine-tuning, we extract four attribute dimensions from medical text: anatomical position, subject word, description word, and occurrence state. On the test set, the precision was 0.4511, the recall was 0.3928, and the F1 score was 0.42. The method is simple, and it won second place in the clinical finding event extraction task (Task 2) of the Seventh China Health Information Processing Conference (CHIP2021).





1 Introduction

An electronic health record (EHR) is composed of unstructured text and structured data, so information extraction is one of the key tasks in processing the unstructured text in EHRs. EHR information extraction work has mainly concentrated on the drug [wei_study_2020, baer_can_2016, zheng_medication_2015] and disease [sada_validation_2016, xu_extracting_2011, liu_use_2021] domains.

Event extraction is an important information extraction task. General event extraction is mostly based on the identification or classification of trigger words, event types, event elements, arguments, etc. [zhan_survey_2019]. Event extraction can play a key role in question answering, knowledge extraction, and knowledge graph construction [berant_modeling_2014]. Most existing approaches decompose the task into multiple sub-tasks, including trigger word extraction, attribute extraction, and the merging of sub-modules. Such architectures require high-quality annotation of the original sentences, which is time-consuming and labor-intensive in the medical domain. In this task, we applied an end-to-end generation model that outputs the event extraction information as a structured, enhanced character string, and obtained an F1 score of 0.42 in Task 2 of the 2021 China Health Information Processing Conference (CHIP2021), ranking second.

2 Related works

Traditional event extraction systems mostly used classifiers based on pattern recognition or machine learning methods, such as Monte Carlo Gibbs sampling [finkel_incorporating_2005], conditional random fields [finkel_exploiting_2004], support vector machines, and so on. With the wide application of deep learning, deep neural network models have increasingly been applied to event extraction, such as convolutional neural networks [chen_event_2015] and graph neural networks [liu_jointly_2018]. In medical text event extraction, traditional rule-based models [tian_automated_2017, nath_natural_2016, lin_tepapa_2017] were mostly used, and event extraction models based on deep learning [shi_multiple_2016] have also become popular in recent years.

Most of these studies used a step-by-step event detection paradigm, that is, first detecting the trigger words of events and then detecting the arguments (event attributes). Decomposed in this way, the two tasks are difficult to correlate; moreover, most of these models are too complex for further task processing. We therefore combine the trigger word (core word) and arguments (event attributes) of an event into one task, and use an end-to-end seq2seq paradigm [sutskever_sequence_2014] to output both at the same time.

A seq2seq model contains encoder and decoder modules. The encoder accepts the input sequence and encodes it as a hidden tensor; the decoder then generates the target sequence from this representation. On this basis, Google AI proposed the Text-to-Text Transfer Transformer (T5) [raffel_exploring_2020] model. T5 builds on the Transformer architecture [vaswani_attention_2017] to unify natural language processing tasks into end-to-end text-to-text tasks. Using stacked multi-head attention, relative positional information, and fused context semantics, the Transformer-based T5 achieves global information learning.

3 Methods

The objective of CHIP2021 Task 2 is to extract clinical finding events from Chinese electronic medical records, all of which come from real medical data. For example, from "…, the above symptoms occur repeatedly, without obvious triggers for each attack; they occur suddenly and last for several minutes…", the model should extract event attributes such as "core word: symptoms, tendency: yes, characteristics: recurrent, no triggers, sudden onset". Each event in the dataset has four attributes (core word, tendency, characteristic, and anatomy), and the characteristic and anatomy attributes may contain multiple entities per core word.

Figure 1: Overall distribution of medical record length.

The CHIP2021 Task 2 medical record dataset was used in this study. It contains 2060 medical records with 11129 events. The length distribution of the medical records is shown in Figure 1; more than 95% of the sentences are within 200 characters. We randomly selected 1854 records as the training set and used the remaining 206 as the validation set. The training set contained 10135 events and the validation set contained 994 events.
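The random split described above can be sketched as follows (the record list, seed value, and function name are placeholders of our own; the original split procedure is not published):

```python
import random

def split_records(records, n_train=1854, seed=42):
    """Randomly split the medical records into training and validation
    sets (1854 / 206 in this study). The seed value is illustrative."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    train = [records[i] for i in idx[:n_train]]
    valid = [records[i] for i in idx[n_train:]]
    return train, valid

# With 2060 records this yields the 1854 / 206 split used here.
train, valid = split_records(list(range(2060)))
```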

Figure 2: Structure of the event text generation model.

Figure 2 shows the overall structure of our model. The input is a medical sentence from an EHR; after training a seq2seq model, the output is our customized structured text. The input text is formatted as X = (x_1, …, x_n), where each x_i is an input token and n is the number of tokens in the input sentence. The goal of the model is to extract all attributes of an event, formatted as Y = (y_1, …, y_m), where y_j is an attribute token and m is the length of the attribute sequence.

models   | input                                                                           | output
baseline | The outpatient was admitted to the hospital with "postoperative rectal cancer". | cancer<p>yes<p>postoperation<p>rectum
ours     | The outpatient was admitted to the hospital with "postoperative rectal cancer". | <ent>cancer<tendency>yes<character>postoperation<anatomy>rectum

Table 1: Input/output examples for the baseline and our models.

In our model, we used special tokens corresponding to the event attributes and formatted the output as s_1 a_1 s_2 a_2 s_3 a_3 s_4 a_4, where s_1 to s_4 are custom special tokens and a_1 to a_4 are the attribute values. In this task, as shown in Table 1, the special tokens are <ent>, <tendency>, <character>, and <anatomy>, corresponding to "core word", "tendency", "characteristic", and "anatomy". In each event, a_3 and a_4 may contain multiple values, because there can be multiple entities in "characteristic" and "anatomy"; the <unk> tag is therefore used as the separator. For non-existent attributes, we defined <null> as the null tag. A model using a single generic separator tag (<p>) served as the baseline.
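The structured output format above can be sketched as a pair of helper functions (the special tokens follow Table 1; the function names and the dictionary representation are our own illustration, not the authors' code):

```python
import re

# Serialize one event into the enhanced structured string and parse it
# back. Token names follow Table 1.
SPECIAL = ["<ent>", "<tendency>", "<character>", "<anatomy>"]
KEYS = ["ent", "tendency", "character", "anatomy"]
SEP = "<unk>"    # separator for multi-valued attributes
NULL = "<null>"  # placeholder for an absent attribute

def serialize(event):
    """event: dict keyed by KEYS; multi-valued attributes are lists."""
    parts = []
    for tok, key in zip(SPECIAL, KEYS):
        val = event.get(key)
        if not val:
            parts.append(tok + NULL)
        elif isinstance(val, list):
            parts.append(tok + SEP.join(val))
        else:
            parts.append(tok + val)
    return "".join(parts)

def parse(text):
    """Inverse of serialize: split on the special tokens."""
    event = {}
    pattern = "|".join(re.escape(t) for t in SPECIAL)
    chunks = re.split(f"({pattern})", text)
    # chunks alternate: '', '<ent>', value, '<tendency>', value, ...
    for tok, val in zip(chunks[1::2], chunks[2::2]):
        vals = [] if val == NULL else val.split(SEP)
        event[KEYS[SPECIAL.index(tok)]] = vals if len(vals) != 1 else vals[0]
    return event

example = {"ent": "cancer", "tendency": "yes",
           "character": "postoperation", "anatomy": "rectum"}
target = serialize(example)
# "<ent>cancer<tendency>yes<character>postoperation<anatomy>rectum"
```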

The experiment fine-tuned the Mengzi-T5-base pre-trained model [zhang_mengzi_2021]. The vocabulary size was 32128, the number of attention heads was 12, the learning rate was 2e-5, the number of epochs was 50, the batch size was 16, the maximum input length was 256, the maximum output length was 128, and the beam search width was 3. The model checkpoint with the lowest loss on the validation set was kept.
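Collected in one place, the reported hyperparameters look as follows (the key names loosely mirror common Hugging Face `Seq2SeqTrainingArguments` fields, but the dict itself is only an illustrative summary, not the authors' configuration file):

```python
# Fine-tuning hyperparameters as reported above; names are illustrative.
FINETUNE_CONFIG = {
    "pretrained_model": "Mengzi-T5-base",
    "vocab_size": 32128,
    "num_attention_heads": 12,
    "learning_rate": 2e-5,
    "num_train_epochs": 50,
    "per_device_train_batch_size": 16,
    "max_source_length": 256,
    "max_target_length": 128,
    "num_beams": 3,
    # checkpoint selection: keep the weights with the lowest validation loss
    "metric_for_best_model": "eval_loss",
}
```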

The metrics for this task are precision (P), recall (R), and F1. Multiple events may appear in one text, and event attributes are used to calculate the metrics: all attributes of an event must be completely correct for the event to count as correct. F1 is calculated as:

F1 = 2 × P × R / (P + R)
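The strict event-level scoring described above can be sketched as follows (the function and variable names are our own; events are represented as attribute tuples, and an event counts as a true positive only if all attributes match exactly):

```python
# Strict event-level precision / recall / F1: a predicted event is
# correct only if every attribute matches a gold event exactly.
def prf1(gold_events, pred_events):
    """Each argument: a list of hashable events (e.g. attribute tuples).
    Duplicates are collapsed by the set representation."""
    gold, pred = set(gold_events), set(pred_events)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("symptom", "yes", "recurrent", "<null>")]
pred = [("symptom", "yes", "recurrent", "<null>"),
        ("cough", "yes", "<null>", "<null>")]
p, r, f1 = prf1(gold, pred)  # p = 0.5, r = 1.0, f1 ≈ 0.667
```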
4 Results

Figure 3: Training loss after each epoch.

Figure 3 shows the training loss of the models after each epoch; the models converged well by the 25th epoch. The subsequent increase in loss may be related to the learning rate. Moreover, our model with type-specific tokens has a lower loss than the baseline, indicating that type-specific tokens improve model convergence.

models   | Core words P/R/F1        | Other attributes P/R/F1  | Events P/R/F1
baseline | 0.8342 / 0.7681 / 0.7997 | 0.6342 / 0.5939 / 0.6081 | 0.5127 / 0.4720 / 0.4915
ours     | 0.8505 / 0.8047 / 0.8270 | 0.6623 / 0.6266 / 0.6440 | 0.5398 / 0.5107 / 0.5248

Table 2: Precision, recall, and F1 of both models on core words, other attributes, and complete events.

As shown in Table 2, our model with special tokens achieves better precision, recall, and F1 scores than the baseline model, which used a single uniform tag to enhance the output. The F1 score improved by about 3 points over the baseline, showing that attribute-specific special tokens help event recognition. Recall does not perform well for either model: on complete events the baseline reaches only 0.47 and our model only 0.51. This indicates that the generative models produce a large number of false negatives, and that many unrelated words are also extracted.

5 Error analysis

To analyze the causes of errors, we verified the recognition results of core words under a strict-position criterion, in which a core word must also match its position in the text. As shown in Table 3, although the strict-position F1 score of our model is 3.24% higher than that of the baseline, it is nearly 25% lower than the F1 for core words evaluated without the position constraint. This shows that the main factor limiting event extraction accuracy is that some core words are missed. After checking the prediction results, we found the main errors to be as follows.

models   | P      | R      | F1
baseline | 0.5845 | 0.5381 | 0.5604
ours     | 0.6097 | 0.5768 | 0.5928

Table 3: Results of the models on core words with strict position matching.

First, rare core words: for example, from "… coronary CTA: 1. right dominant coronary artery 2. left dominant coronary artery …", the core word "right dominant coronary artery" should be extracted. However, such words occur infrequently, so they appear with incorrect markers or are missed entirely.

Second, core words with similar contextual structures: for example, "… intermittent white phlegm, intermittent coughing of dark red blood" should yield two events: a. core word "cough up phlegm" with characteristics "intermittent, white, sticky"; b. core word "cough up blood" with characteristics "dark red, intermittent". However, the model makes omissions here: it extracts only the core word "cough up phlegm" with characteristics "dark red, white, sticky, intermittent". Because the contextual structures of the two core words are similar, the model incorrectly assigns "dark red" as a characteristic of "cough up phlegm". As a result, the two events are merged into one, which harms the extraction of both.

Third, the influence of the pre-training model: for example, "no white clay-like stool" should be extracted as "core word: stool, tendency: negative, characteristic: white clay-like", but the model extracts "core word: stool characteristics". Perhaps "stool characteristics" is a common phrase in the pre-training corpus, so when predicting the token after "stool", "characteristics" is more likely than the special token <tendency>. Because "stool" is a common core word in this dataset, this error has a large impact on performance.

From the error analysis above, there are three main reasons for missing core words, and possible solutions include: 1. For rare core words, most of which are professional vocabulary rarely seen in training, injecting medical knowledge is expected to increase the generation probability of the specific vocabulary. 2. Where the context structures of core words are similar, consider extracting core words and their attributes in separate steps, using the output of the previous step as a hint for the next step to sharpen the distinction between them. 3. For the influence of the pre-training model, consider constraining the generation process to strengthen copying content from the original text during decoding; since this may lose custom content, a normalization step could be added.
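The third suggestion, constraining decoding to content from the source sentence plus the structural tokens, can be sketched as a simple candidate filter (our own illustration; with Hugging Face `generate` a similar idea can be wired in via the `prefix_allowed_tokens_fn` hook):

```python
# Sketch of constraining decoding candidates to characters of the
# source sentence plus the custom structural tokens.
SPECIAL_TOKENS = {"<ent>", "<tendency>", "<character>", "<anatomy>",
                  "<unk>", "<null>"}

def allowed_tokens(source_text):
    """Tokens the decoder may emit: characters of the source sentence
    plus the structural special tokens."""
    return set(source_text) | SPECIAL_TOKENS

def filter_step(candidates, source_text):
    """Keep only decoding candidates permitted by the constraint."""
    allowed = allowed_tokens(source_text)
    return [c for c in candidates if c in allowed]

src = "无白陶土样大便"  # "no white clay-like stool"
print(filter_step(["大", "特", "<tendency>"], src))  # ['大', '<tendency>']
```

Note that this hard copy constraint would forbid normalized outputs that do not appear verbatim in the source, which is why the text above suggests adding a separate normalization step.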

6 Conclusion

To extract clinical events and their four attributes from clinical records, we used a generative model combined with an enhanced structured output, marking event attributes with special tokens. Because our method is simple and does not require token-level sequence labeling, it has great potential.