Unsupervised Pseudo-Labeling for Extractive Summarization on Electronic Health Records

11/20/2018 ∙ by Xiangan Liu, et al. ∙ Petuum, Inc.

Extractive summarization is very useful for physicians to better manage and digest Electronic Health Records (EHRs). However, the training of a supervised model requires disease-specific medical background and is thus very expensive. We studied how to utilize the intrinsic correlation between multiple EHRs to generate pseudo-labels and train a supervised model with no external annotation. Experiments on real-patient data validate that our model is effective in summarizing crucial disease-specific information for patients.


1 Introduction

Electronic Health Records (EHRs) are time-sensitive, patient-centered documents that make medical information available instantly and securely to authorized physicians. EHRs, however, are usually long and verbose. Physicians spend extensive time reading through content written in unstructured or semi-structured natural language to filter out irrelevant information and dig out disease-specific problem lists such as past medical history, symptoms, and prescriptions. This process becomes even more time-consuming when one patient has accumulated several such records over many years.

Previous studies [9, 13, 17] focused on how to better utilize and digest information from EHRs to enhance the efficiency of healthcare services. Summarization is one such technique, and it generally has two approaches: abstractive and extractive. Because abstractive summarization sometimes fails to capture factual details accurately, as needed in medical settings, we consider extractive summarization more suitable: it directly extracts a subset of text written by medical experts as the summary. Unsupervised extractive summarization was explored first [3, 10, 14]. Due to the recent success of neural networks, supervised approaches have become more popular for extractive summarization [1, 5, 6, 12, 18]. One obstacle to training a supervised model for extractive summarization in the medical domain, however, is the lack of labeled data, since annotations for EHRs require disease-specific medical background and can be very expensive.

In this work, we trained a supervised model, which generalizes better than unsupervised models, without any direct human annotation. We studied how to utilize the intrinsic correlation between multiple notes for a single patient to generate pseudo-labels and guide summarization, and we set out to answer the following three Research Questions (RQs):

  • RQ1: How to measure the quality of disease-specific summarization for the same patient?

  • RQ2: Based on the criterion in RQ1, how can we generate pseudo-labels that best satisfy this criterion?

  • RQ3: Given pseudo-labels in RQ2, what model architecture should be used for summarization in a medical setting?

2 Our Approach

Figure 1: The workflow of our unsupervised summarization method for EHRs. The left part shows the note-pairing process; the middle part generates the pseudo-labels, which are fed to train the neural model in the right part.

2.1 Problem Definition

Formally, we denote all EHRs recorded over time for one patient as $\mathcal{N} = \{n_1, n_2, \ldots, n_T\}$, where $n_1$ indicates the oldest note and $n_T$ the newest one. Any note $n_i$, $1 \le i \le T$, contains a sequence of sentences $(s_1, s_2, \ldots, s_M)$. Our task is to find a subset $\hat{n}_i \subseteq n_i$ that best summarizes the patient's information for a specific disease and also follows some length restrictions.

For RQ1, we observed that when physicians read and summarize clinical notes, they focus more on medical entities related to a specific disease. For example, procedure entities such as "coronary artery bypass grafting" and "valve replacement" are crucial for congestive heart failure, and lab-test entities such as "hemoglobin" and "hematocrit" are informative for diagnosing anemia. We therefore propose to summarize clinical notes so as to cover more relevant entities.

The entity set of a later note $n_j$ is denoted $E_j$. However, hundreds of entities may exist in $E_j$, and how to capture the relevant ones remains a problem. Since identifying them requires expensive medical expertise, we need a more efficient way to obtain such patterns directly from the clinical notes. We found that important information in an early health record is usually mentioned briefly again in later records, reminding physicians to pay attention to it in future treatments. Such information includes, but is not limited to, lab test results, diagnoses, and medication usage. Inspired by this, we defined a coverage score similar to Yu et al.'s study on TV series recaps [19] to evaluate the quality of a summary of $n_i$ based on one of its later records $n_j$, where $j > i$:

$\mathrm{Cov}(\mathbf{y}_i, n_j) = \sum_{e \in E_j} w_e \max_{k:\, y_{i,k}=1} \mathrm{sim}(e, s_k)$    (1)

where $e$ is one element of $E_j$ and $w_e$ is its importance, measured by its Inverse Document Frequency (IDF) score over the entire corpus. $\mathbf{y}_i$ is a binary vector indicating whether each sentence of $n_i$ is selected for the summary. $\mathrm{sim}(e, s_k)$ calculates the semantic similarity between entity $e$ and sentence $s_k$ as

$\mathrm{sim}(e, s_k) = \max_{w \in s_k} \cos(\mathbf{v}_e, \mathbf{v}_w)$    (2)

where $\mathbf{v}_e$ and $\mathbf{v}_w$ indicate the vectors that represent entity $e$ and word $w$; multi-word entities are represented by average pooling of pre-trained word embeddings, which were trained on PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) using skipgram [4]. PubMed has a vocabulary much closer to EHRs than general corpora; we used the abstracts of over 550,000 biomedical papers.
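
For concreteness, here is a minimal Python sketch of Eqs. (1)-(2). The embedding table, IDF dictionary, and tokenization are illustrative placeholders we introduce ourselves, not the authors' released code.

    import numpy as np

    def cosine(u, v):
        # Cosine similarity between two embedding vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    def sim(entity_tokens, sentence_tokens, emb):
        # Eq. (2): best cosine match between the (average-pooled) entity
        # vector and any word vector in the sentence.
        toks = [emb[t] for t in entity_tokens if t in emb]
        if not toks:
            return 0.0
        e_vec = np.mean(toks, axis=0)
        scores = [cosine(e_vec, emb[w]) for w in sentence_tokens if w in emb]
        return max(scores) if scores else 0.0

    def coverage(y, sentences, later_entities, idf, emb):
        # Eq. (1): IDF-weighted coverage of a later note's entities by the
        # sentences selected in the binary vector y.
        selected = [s for yk, s in zip(y, sentences) if yk == 1]
        total = 0.0
        for ent in later_entities:          # ent: token list for one entity
            w_e = idf.get(" ".join(ent), 1.0)
            best = max((sim(ent, s, emb) for s in selected), default=0.0)
            total += w_e * best
        return total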

2.2 Pseudo-labeling with Integer Linear Programming

Based on the definition introduced above, our target is to generate the binary label vector $\mathbf{y}_i$ for note $n_i$ using one of its later notes $n_j$, where $j > i$, which answers RQ2. To find the optimum $\mathbf{y}_i^*$ that maximizes $\mathrm{Cov}(\mathbf{y}_i, n_j)$, we used Integer Linear Programming (ILP), as shown in Figure 1. The target function for a pair of notes $n_i$ and $n_j$ is as follows:

$\mathbf{y}_i^* = \arg\max_{\mathbf{y}_i} \mathrm{Cov}(\mathbf{y}_i, n_j) \quad \text{s.t.} \quad \sum_k y_{i,k}\,|s_k| \le L$    (3)

where $|s_k|$ is the number of words in $s_k$ and $L$ is a hyperparameter for the length restriction. One notable problem is that the $\max$ in Eq. (1) is unsmooth, which makes the optimization require unaffordable computational resources in practice. To address this, we used the log-sum-exp trick, a frequently adopted smooth approximation of the max function. We also added two more length constraints, $l_{\min}$ and $l_{\max}$ on individual sentence lengths, to make the optimization faster. The final optimization problem is defined as

$\mathbf{y}_i^* = \arg\max_{\mathbf{y}_i} \sum_{e \in E_j} w_e \log \sum_k y_{i,k} \exp\big(\mathrm{sim}(e, s_k)\big)$    (4)

s.t. $\sum_k y_{i,k}\,|s_k| \le L, \quad l_{\min} \le |s_k| \le l_{\max}\ \ \forall k:\ y_{i,k} = 1$    (5)
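
The paper solves this with an ILP solver whose code is not given; the sketch below is a rough stand-in that optimizes the smoothed objective of Eqs. (4)-(5) greedily, reusing the hypothetical sim helper from the previous sketch.

    import math

    def smoothed_coverage(y, sentences, later_entities, idf, emb):
        # Log-sum-exp relaxation of the max in Eq. (1), as in Eq. (4).
        total = 0.0
        for ent in later_entities:
            z = sum(math.exp(sim(ent, s, emb))
                    for yk, s in zip(y, sentences) if yk == 1)
            if z > 0:
                total += idf.get(" ".join(ent), 1.0) * math.log(z)
        return total

    def pseudo_label(sentences, later_entities, idf, emb, budget, l_min, l_max):
        # Greedy stand-in for the ILP: repeatedly add the sentence that most
        # increases the smoothed coverage, pruning sentences outside the
        # per-sentence length window [l_min, l_max] (Eq. (5)) and stopping
        # once no sentence fits the total word budget.
        y, words = [0] * len(sentences), 0
        while True:
            best_obj, best_k = None, None
            for k, s in enumerate(sentences):
                if y[k] or not l_min <= len(s) <= l_max or words + len(s) > budget:
                    continue
                y[k] = 1
                obj = smoothed_coverage(y, sentences, later_entities, idf, emb)
                y[k] = 0
                if best_obj is None or obj > best_obj:
                    best_obj, best_k = obj, k
            if best_k is None:
                break          # nothing left that satisfies the constraints
            y[best_k] = 1
            words += len(sentences[best_k])
        return y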

2.3 Summarization Model

For RQ3, after we constructed training data with pseudo-labels $\mathbf{y}$, a supervised neural model was trained to summarize medical records. The model predicts the probability of each sentence being picked for the summary. It consists of a two-level bi-directional GRU [7]. As shown in the right part of Figure 1, the first-level Bi-GRU runs on the word level and generates sentence embeddings. Taking these as input, the second-level Bi-GRU runs on the sentence level, and the final representation of the $k$-th sentence is its hidden state $\mathbf{h}_k$.

For the output layer of the $k$-th sentence, we used a logistic function that combines several features: content, salience, novelty, and position. Salience reflects how representative the current sentence is of the entire note; novelty helps us avoid redundancy. We predict the probability of selecting the current sentence as

$P(y_k = 1) = \sigma\big(W_c \mathbf{h}_k + \mathbf{h}_k^{\top} W_s \mathbf{d} - \mathbf{h}_k^{\top} W_n \tanh(\mathbf{s}_k) + W_p \mathbf{p}_k + b\big)$    (6)

where $\mathbf{p}_k$ is the position embedding of the current sentence's index $k$, and $\mathbf{d} = \frac{1}{M} \sum_{k=1}^{M} \mathbf{h}_k$ is the representation of the entire note, with $M$ the number of sentences in the note. The novelty feature $\mathbf{s}_k$ is the weighted sum of the representations of already selected sentences, defined as $\mathbf{s}_k = \sum_{l < k} \mathbf{h}_l\, P(y_l = 1)$.

Cross-entropy loss was adopted to optimize the neural model over the pseudo-labels $y_k$ and the predictions $P(y_k = 1)$:

$\mathcal{L} = -\sum_{k=1}^{M} \big( y_k \log P(y_k = 1) + (1 - y_k) \log (1 - P(y_k = 1)) \big)$    (7)
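
Below is a compact PyTorch sketch of this architecture. It is a minimal illustration under our own assumptions (module names, mean-pooling of word-level GRU states, single-note batches); only the 200-dimension setting comes from the paper, and the scorer mirrors the SummaRuNNer-style decomposition of Eq. (6).

    import torch
    import torch.nn as nn

    class HierGRUExtractor(nn.Module):
        # Two-level Bi-GRU sentence scorer in the style of Eq. (6).
        def __init__(self, vocab_size, emb_dim=200, hid=200, max_sents=500):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.word_gru = nn.GRU(emb_dim, hid, bidirectional=True, batch_first=True)
            self.sent_gru = nn.GRU(2 * hid, hid, bidirectional=True, batch_first=True)
            d = 2 * hid
            self.content = nn.Linear(d, 1)                      # W_c h_k
            self.W_s = nn.Parameter(torch.empty(d, d))          # salience bilinear
            self.W_n = nn.Parameter(torch.empty(d, d))          # novelty bilinear
            nn.init.xavier_uniform_(self.W_s)
            nn.init.xavier_uniform_(self.W_n)
            self.pos_emb = nn.Embedding(max_sents, d)
            self.position = nn.Linear(d, 1)                     # W_p p_k
            self.bias = nn.Parameter(torch.zeros(1))

        def forward(self, note):
            # note: list of LongTensors, one tensor of word ids per sentence.
            sent_vecs = []
            for sent in note:
                out, _ = self.word_gru(self.emb(sent).unsqueeze(0))
                sent_vecs.append(out.mean(dim=1).squeeze(0))    # sentence embedding
            H, _ = self.sent_gru(torch.stack(sent_vecs).unsqueeze(0))
            H = H.squeeze(0)                                    # (M, d) states h_k
            d_vec = H.mean(dim=0)                               # note representation d
            probs, s = [], torch.zeros_like(d_vec)              # s: novelty accumulator
            for k, h in enumerate(H):
                logit = (self.content(h)
                         + h @ self.W_s @ d_vec                 # salience
                         - h @ self.W_n @ torch.tanh(s)         # novelty penalty
                         + self.position(self.pos_emb(torch.tensor(k)))
                         + self.bias)
                p = torch.sigmoid(logit)
                s = s + p * h                # weighted sum of selected states, Eq. (6)
                probs.append(p)
            return torch.cat(probs)          # P(y_k = 1) for each sentence

    # Training with the cross-entropy loss of Eq. (7):
    # loss = nn.functional.binary_cross_entropy(model(note), pseudo_labels.float())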

3 Experiments and Results

3.1 Experimental Settings

Dataset. We used the Medical Information Mart for Intensive Care III (MIMIC-III) [11] dataset to validate our approach. In order to train a disease-specific model, we extracted all admissions that contain at least one diagnostic ICD code related to heart disease; in total, 5,875 admissions from 1,958 patients were used for training. Clinical notes were exported from the "NOTEEVENTS" table of the dataset. For the test set, clinical notes from 25 admissions not in the training set were examined and labeled by a heart-disease physician with over 15 years of experience.

Note Pairing. For a note pair $(n_i, n_j)$, $j > i$, we required the time span between the two notes to be at least six months.
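
A minimal sketch of this pairing rule (the "charttime" field name follows MIMIC-III's NOTEEVENTS table; the rest of the record layout is our own assumption):

    from datetime import timedelta

    SIX_MONTHS = timedelta(days=182)

    def pair_notes(notes):
        # Yield (earlier, later) note pairs at least six months apart.
        # `notes`: dicts sorted by time, each with a 'charttime' datetime.
        for i, early in enumerate(notes):
            for late in notes[i + 1:]:
                if late["charttime"] - early["charttime"] >= SIX_MONTHS:
                    yield early, late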

Baselines. Since our approach does not require any external annotations, our baselines are all unsupervised. The first is Most-Entity (ME), which greedily picks the sentences with the most medical entities. The second is TF-IDF, which weights sentences by the sum of their words' TF-IDF scores. To avoid the duplicate information that greedy selection can introduce, we also combined Maximal Marginal Relevance (MMR) [8] with TF-IDF, denoted TF-IDF + MMR (TM); MMR subtracts the similarity between a candidate and the already selected summary sentences from its weight, as sketched below. To make the comparison fair, we constrained all methods to summarize within the same word limit.
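
For reference, here is a minimal sketch of the TM baseline's greedy MMR selection; the trade-off weight lam and the similarity function are our own assumptions, not values from the paper.

    def mmr_select(sentences, weights, sim_fn, budget, lam=0.7):
        # Greedy MMR: score a candidate by its TF-IDF weight minus its
        # maximum similarity to sentences already in the summary, then
        # pick the best until the word budget is exhausted.
        selected, words = [], 0
        candidates = set(range(len(sentences)))
        while candidates:
            def mmr(k):
                redundancy = max((sim_fn(sentences[k], sentences[j])
                                  for j in selected), default=0.0)
                return lam * weights[k] - (1 - lam) * redundancy
            best = max(candidates, key=mmr)
            candidates.remove(best)
            if words + len(sentences[best]) > budget:
                continue       # too long; try the next-best candidate
            selected.append(best)
            words += len(sentences[best])
        return [sentences[k] for k in selected]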

Metrics. ROUGE-1, ROUGE-2, and ROUGE-L [15, 16] were adopted for evaluation; they measure the recall-oriented overlap between an automatically generated summary and the reference.
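
The paper does not name its ROUGE implementation; assuming the rouge-score package (one common choice, not necessarily the authors'), the recall-oriented evaluation might look like:

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)

    def evaluate(reference, summary):
        # Returns the recall of ROUGE-1/2/L for one (reference, summary) pair.
        scores = scorer.score(reference, summary)
        return {name: s.recall for name, s in scores.items()}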

Implementation Details. CliNER [2] was used to extract medical entities from the clinical notes. For the ILP, the two length constraints $l_{\min}$ and $l_{\max}$ were fixed in advance. For the neural model, both the word embeddings and the Bi-GRU hidden states have 200 dimensions, the batch size is 16, and we used Adam as our optimizer.

3.2 Results and Discussions

Table 1 shows the evaluation results for Most-Entity, TF-IDF, TF-IDF + MMR, and our method, which achieved the best performance on all three metrics. We observe that redundancy has a high impact on performance: without MMR and the novelty feature, TF-IDF and our model, respectively, degrade significantly. We also notice that the position term improves our model's performance. This matches our expectation, since clinical notes are usually written following templates.

Methods ROUGE-1 ROUGE-2 ROUGE-L
Most-Entity (ME) 0.41 0.26 0.40
TF-IDF 0.43 0.31 0.44
TF-IDF + MMR (TM) 0.49 0.35 0.48
Ours w/o novelty 0.48 0.33 0.47
Ours w/o position 0.50 0.36 0.49
Ours 0.53 0.38 0.51
Table 1: Experimental results for summarization
Sentence | ME | TM | Ours | REF
1. Chronic systolic heart failure | × | × | ✓ | ✓
2. Clopidogrel 75 mg | × | × | × | ✓
3. Facility with copd, lifelong current tobacco abuse | × | ✓ | ✓ | ✓
4. Atrovent hfa 17 mcg/actuation aerosol sig one (1) | × | ✓ | × | ×
5. Please draw vanco level, hct, bun, creatinine on and call results to doctor. | ✓ | ✓ | × | ×
6. PT was admitted with presumed CHF exacerbation most likely secondary to the increased dietary salt intake and severe aortic stenosis. | ✓ | ✓ | ✓ | ✓
Table 2: Examples of sentences extracted for the summary by different methods. REF indicates the physician's annotation; ✓ means the sentence is selected and × means it is not.

Table 2 shows examples of sentences predicted by ME, TM, and our method, and several findings are noteworthy.

TM and ME have a strong bias towards long sentences with many entities; they failed to extract important diagnoses such as No. 1, which is short. No. 2 shows that our method also prefers long sentences for prescriptions, due to the use of the log-sum-exp trick, although it is not as biased as TM and ME. For No. 3, CliNER extracted only one entity, "copd", which led ME to make a wrong decision. For No. 4, several infrequent terms increased its TF-IDF weight; however, those terms are not very relevant to heart disease, and REF actually excludes this sentence.

Learning from pseudo-labels, our method is capable of determining which entities are more relevant to a disease, heart disease in this case. Both TM and ME falsely selected No. 5, while ours correctly considers it unimportant; this sentence is actually an instruction for the patient and not very important for physicians. No. 6 is an example where all methods worked perfectly.

4 Conclusion

In this work, we studied the problem of summarizing EHRs and explored three research questions. For RQ1, we proposed to utilize medical entities to capture the intrinsic correlation between multiple EHRs of one patient. For RQ2, we developed an optimization target and used ILP to generate pseudo-labels, which requires no external human supervision. For RQ3, those pseudo-labels let us train a supervised extractive neural model, where the RNN increases the ability to understand context and filter out irrelevant information. We also proposed and validated that adding a novelty feature to avoid duplicates and considering sentence position are significant for summarizing EHRs. Experiments showed that our method outperforms existing unsupervised baselines and has great potential for helping physicians better understand patients' medical histories, reducing costs, and improving the quality of patient care.

Acknowledgement

The authors thank Hongyang Zhang, Yaodong Yu, Jiacheng Xu, Shikun Zhang and Sean Chen for valuable help.

References

  • [1] K. Arumae and F. Liu. Reinforced extractive summarization with question-focused rewards. In Proceedings of ACL 2018, Student Research Workshop, pages 105–111, 2018.
  • [2] W. Boag, K. Wacome, T. Naumann, and A. Rumshisky. CliNER: A lightweight tool for clinical named entity recognition. AMIA Joint Summits on Clinical Research Informatics (poster), 2015.
  • [3] R. Brandow, K. Mitze, and L. F. Rau. Automatic condensation of electronic publications by sentence selection. Information Processing & Management, 31(5):675–685, 1995.
  • [4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.
  • [5] Z. Cao, W. Li, S. Li, F. Wei, and Y. Li. Attsum: Joint learning of focusing and summarization with neural attention. arXiv preprint arXiv:1604.00125, 2016.
  • [6] J. Cheng and M. Lapata. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
  • [7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • [8] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336, 1998.
  • [9] E. Ford, J. A. Carroll, H. E. Smith, D. Scott, and J. Cassell. Extracting information from the text of electronic medical records to improve case detection: a systematic review. JAMIA, 23(5):1007–1015, 2016.
  • [10] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, and X. He. Document summarization based on data reconstruction. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
  • [11] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
  • [12] M. Kågebäck, O. Mogren, N. Tahmasebi, and D. Dubhashi. Extractive summarization using continuous vector space models. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), pages 31–39, 2014.
  • [13] J. Y. Lee, H. Park, and E. Chung. Use of electronic critical care flow sheet data to predict unplanned extubation in icus. I. J. Medical Informatics, 117:6–12, 2018.
  • [14] C. Li, X. Qian, and Y. Liu. Using supervised bigram-based ILP for extractive summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1004–1013, 2013.
  • [15] C.-Y. Lin and E. H. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2003.
  • [16] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
  • [17] P. J. Liu. Learning to write notes in electronic health records. CoRR, abs/1808.02622, 2018.
  • [18] R. Nallapati, F. Zhai, and B. Zhou. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3075–3081, 2017.
  • [19] H. Yu, S. Zhang, and L.-P. Morency. Unsupervised text recap extraction for TV series. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1797–1806, 2016.