Radiology reports convey detailed observations along with the significant findings of a medical encounter. Each radiology report contains two important sections (depending on the institution, reports may also include other fields such as Background): Findings, which encompasses the radiologist's detailed observations and interpretation of the imaging study, and Impression, which summarizes the most critical findings. The Impression (usually a couple of lines and about three times shorter than the Findings) is considered the most integral part of the report Ware2017EffectiveRR, as it plays a key role in communicating critical findings to referring clinicians. Previous studies have reported that clinicians mostly read the Impression because they have limited time to review the Findings, particularly when these are lengthy or intricate Flanders2012RadiologyRA; Xie2019IntroducingIE.
In a clinical setting, generating the Impression from the Findings can be subject to errors gershanik2011critical; Brady2016ErrorAD. This is especially critical in the healthcare domain, where even a small improvement in Impression generation can improve patients' well-being. Automating impression generation in radiology reporting would save clinicians' reading time and decrease fatigue Flanders2012RadiologyRA; Kovacs2018BenefitsOI, as clinicians would only need to proofread summaries or make minor edits.
Previously, MacAvaney2019OntologyAwareCA showed that augmenting the summarizer with all ontology (i.e., clinical) terms within the Findings can noticeably improve content selection and summary generation. Our findings further suggest that radiologists select significant ontology terms, but not all such terms, to write the Impression. Following this paradigm, we hypothesize that selecting the most significant clinical terms occurring in the Findings and then incorporating them into the summarization would improve the final Impression generation. We further examine whether refining the Findings word representations according to the identified clinical terms results in improved Impression generation.
Overall, the contributions of this work are twofold: (i) We propose a novel seq2seq-based model that incorporates the salient clinical terms into the summarizer (§3.2). We treat the copying likelihood of a word as an indicator of its saliency for forming the Impression, which can be learned via a sequence tagger (§3.1); (ii) Our model achieves statistically significant improvements over competitive baselines on the publicly available MIMIC-CXR clinical dataset. To assess cross-organizational transferability, we further evaluate our model on another publicly available clinical dataset (OpenI) (§5).
2 Related Work
A few prior studies have pointed out that although seq2seq models can effectively produce readable content, they perform poorly at selecting salient content to include in the summary Gehrmann2018BottomUpAS; Lebanoff2019ScoringSS. Many attempts have been made to tackle this problem Zhou2017SelectiveEF; Lin2018GlobalEF; Hsu2018AUM; Lebanoff2018AdaptingTN; You2019ImprovingAD. For example, Zhou2017SelectiveEF used sentence representations to filter out secondary information from word representations. Our work differs in that we use ontology representations produced by an additional encoder to filter word representations. Gehrmann2018BottomUpAS used a data-efficient content selector, trained by aligning source and target, to restrict the model's attention to phrases that are likely to be copied. In contrast, we use the content selector to find the domain-knowledge alignment between source and target. Moreover, we do not focus on model attention here, but on refining word representations.
Extracting clinical findings from clinical reports has been explored previously Hassanpour2016InformationEF; Nandhakumar2017ClinicallySI. For summarizing radiology reports, Zhang2018LearningTS recently used a separate RNN to encode one section of the radiology report (the Background field). Subsequently, MacAvaney2019OntologyAwareCA extracted the clinical ontology terms within the Findings to help the model learn these useful signals by guiding the decoder during generation. Our work differs in that we hypothesize that not all ontological terms in the Findings are equally important; rather, each term has its own odds of being salient. We therefore focus on refining the Findings representations.
Our model consists of two main components: (1) a content selector that identifies the most salient ontological concepts specific to a given report, and (2) a summarization model that incorporates the ontology terms identified within the Findings. The summarizer refines the Findings word representations based on the salient ontology word representations, which are encoded by a separate encoder.
3.1 Content Selector
The content selection problem can be framed as a word-level extraction task in which the aim is to identify the words within the Findings that are likely to be copied into the Impression. We tackle this problem with a sequence-labeling approach, aligning the Findings and Impression to obtain the data required for the sequence-labeling task. To this end, let $t = \{t_1, t_2, \dots, t_n\}$ be the binary tags over the Findings terms $x = \{x_1, x_2, \dots, x_n\}$, with $n$ being the length of the Findings. We tag word $x_i$ with 1 if it meets two criteria simultaneously: (1) it is an ontology term, and (2) it is directly copied into the Impression; otherwise, we tag it with 0. At inference, we use the copying likelihood of each Findings term as a measure of its saliency.
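The tagging rule above can be illustrated with a minimal sketch. It assumes single-token, lowercased exact matching, which is a simplification (the alignment actually used may handle multi-word terms differently); the function name `make_copy_tags` and the toy ontology set standing in for RadLex are hypothetical.

```python
from typing import List, Set


def make_copy_tags(findings: List[str], impression: List[str], ontology: Set[str]) -> List[int]:
    """Tag a Findings token 1 if it is an ontology term that also occurs in the Impression, else 0."""
    impression_vocab = {tok.lower() for tok in impression}
    tags = []
    for tok in findings:
        key = tok.lower()
        # Both criteria must hold at once: ontology membership and presence in the Impression.
        tags.append(1 if key in ontology and key in impression_vocab else 0)
    return tags


# Toy example; this ontology set is made up for illustration and is not RadLex.
toy_ontology = {"lungs", "pleural", "effusion", "pneumothorax"}
findings = "the lungs are clear . small left pleural effusion . no pneumothorax".split()
impression = "small left pleural effusion".split()
tags = make_copy_tags(findings, impression, toy_ontology)
```

In this toy report only "pleural" and "effusion" receive tag 1: "lungs" and "pneumothorax" are ontology terms but are not copied, while "small" and "left" are copied but are not in the toy ontology.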
Recent studies have shown that contextualized word embeddings can improve sequence-labeling performance Devlin2019BERTPO; Peters2018DeepCW. To exploit this improvement for content selection, we train a bi-LSTM network on top of BERT embeddings with a softmax activation function. The content selector is trained by maximum likelihood estimation, i.e., to maximize the log-likelihood of the gold tags. At inference, the content selector calculates the selection probability of each token in the input sequence. Formally, let $\mathcal{C}$ be the set of ontological words that the content selector predicts to be copied into the Impression:

$$\mathcal{C} = \{\, o_i \mid o_i \in F_{\mathcal{U}}(x),\; p_{o_i} \geq \epsilon \,\}$$

where $F_{\mathcal{U}}(x)$ is a mapping function that takes in the Findings tokens $x$ and outputs word sequences from the input tokens if they appear in the ontology (i.e., RadLex; version 3.10, http://www.radlex.org/Files/radlex3.10.xlsx), and otherwise skips them. $p_{o_i}$ denotes the selection probability of ontology word $o_i$, and $\epsilon$ is the copying threshold.
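As a rough illustration of this selection step, the sketch below keeps a token only if it is an ontology term and its predicted selection probability reaches the threshold. The function name `select_salient_terms`, the toy ontology, the probabilities, and the threshold value 0.5 are all assumptions made for the example; the threshold used in the experiments is not stated here.

```python
from typing import List, Set


def select_salient_terms(tokens: List[str], probs: List[float],
                         ontology: Set[str], threshold: float) -> List[str]:
    """Return ontology tokens whose selection probability is at least `threshold`."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    selected = []
    for tok, p in zip(tokens, probs):
        # Non-ontology tokens are skipped regardless of their probability.
        if tok.lower() in ontology and p >= threshold:
            selected.append(tok)
    return selected


# Toy inputs; the probabilities would come from the trained content selector.
tokens = ["small", "left", "pleural", "effusion", "no", "pneumothorax"]
probs = [0.90, 0.80, 0.70, 0.95, 0.10, 0.20]
toy_ontology = {"pleural", "effusion", "pneumothorax"}
salient = select_salient_terms(tokens, probs, toy_ontology, threshold=0.5)
```

Here "small" is dropped despite its high probability because it is not in the toy ontology, and "pneumothorax" is dropped because its probability falls below the threshold.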
3.2 Summarization Model
We exploit two separate encoders: (1) a findings encoder that takes in the Findings, and (2) an ontology encoder that maps the significant ontological terms identified by the content selector to a fixed-size vector known as the ontology vector. The findings encoder is fed with the embeddings of the Findings words and generates word representations $h$. Then a separate encoder, called the ontology encoder, processes the ontology terms identified by the content selector and produces the associated representations $h^{o}$:

$$h = \text{Bi-LSTM}(x), \qquad h^{o} = \text{LSTM}(\mathcal{C})$$

where $x$ is the Findings text, $\mathcal{C}$ is the set of ontology terms occurring in the Findings and identified by the content selector, and $h^{o} = \{h^{o}_1, \dots, h^{o}_l\}$ is the word representations yielded by the ontology encoder. Note that $h^{o}_l$, called the ontology vector, is the last hidden state, containing summarized information of the significant ontology terms in the Findings.
3.2.2 Ontological Information Filtering
Although de facto seq2seq frameworks implicitly model the information flow from encoder to decoder, the model should benefit from explicitly modeling the selection process. To this end, we implement a filtering gate on top of the findings encoder to refine the Findings word representations according to the significant ontology terms within the Findings and produce ontology-aware word representations. Specifically, the filtering gate receives two vectors: the word hidden representation $h_i$, which carries the contextual information of word $x_i$, and the ontology vector $h^{o}_l$, which carries the overall information of the significant ontology words within the Findings. The filtering gate processes these two vectors through a linear layer with a sigmoid activation function. We then compute the ontology-aware word hidden representation $h'_i$ from the source word hidden representation $h_i$ and the associated filtering gate $g_i$:

$$g_i = \sigma\left(W_h \left[h_i ; h^{o}_l\right] + b\right), \qquad h'_i = h_i \odot g_i$$

where $W_h$ is the weight matrix, $b$ denotes the bias term, $[\cdot\,;\cdot]$ denotes concatenation, and $\odot$ denotes element-wise multiplication.
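A small numeric sketch may help make the gate concrete. It assumes that the two input vectors are concatenated before the linear layer, which is one common way to feed two vectors to a single layer and is not confirmed by the text; the function names and the toy weights are hypothetical, and plain Python lists stand in for tensors.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def filter_word(h_i: List[float], ontology_vec: List[float],
                W: List[List[float]], b: List[float]) -> List[float]:
    """Gate a word representation using the ontology vector.

    W has one row per dimension of h_i and len(h_i) + len(ontology_vec) columns.
    """
    concat = h_i + ontology_vec  # concatenation of the two input vectors (assumed)
    gate = [sigmoid(sum(w * v for w, v in zip(row, concat)) + b_k)
            for row, b_k in zip(W, b)]
    # Element-wise product: each gate value in (0, 1) scales one dimension of h_i.
    return [h * g for h, g in zip(h_i, gate)]


# With all-zero weights and bias, every gate value is sigmoid(0) = 0.5.
h_word = [2.0, -4.0]
h_onto = [1.0]
W_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_zero = [0.0, 0.0]
filtered = filter_word(h_word, h_onto, W_zero, b_zero)
```

Because each gate value lies strictly between 0 and 1, the gate can only shrink a dimension of the word representation, never amplify or flip it; this is what makes it act as a filter.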
3.2.3 Impression Decoder
We use an LSTM network as our decoder to generate the Impression iteratively. At each step, the decoder computes the current decoding state $s_t = \text{LSTM}(s_{t-1}, y_{t-1})$, where $y_{t-1}$ is the input to the decoder (human-written summary tokens at training, or previously generated tokens at inference) and $s_{t-1}$ is the previous decoder state. The decoder also computes an attention distribution $a^t$ over $h'$, with $h'$ being the ontology-aware word representations. The attention weights are then used to compute the context vector $c_t = \sum_{i=1}^{n} a^t_i h'_i$, where $n$ is the length of the Findings. Finally, the context vector and the decoder output are used to either generate the next token from the vocabulary or copy it from the Findings.
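The attention and context-vector step can be sketched as follows. The scoring function is not specified in the text, so this example assumes a plain dot product between each ontology-aware word representation and the decoder state; the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_reprs: List[List[float]], state: List[float]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one decoding step.

    Scores are dot products between each word representation and the decoder
    state (an assumption); the context vector is the attention-weighted sum
    of the word representations.
    """
    scores = [sum(a * b for a, b in zip(h_i, state)) for h_i in word_reprs]
    attn = softmax(scores)
    dim = len(word_reprs[0])
    context = [sum(attn[i] * word_reprs[i][k] for i in range(len(word_reprs)))
               for k in range(dim)]
    return attn, context


# Two identical word representations receive equal attention.
toy_reprs = [[1.0, 0.0], [1.0, 0.0]]
toy_state = [1.0, 1.0]
attn, context = attend(toy_reprs, toy_state)
```

The generate-or-copy decision described above is omitted here; in a pointer-generator style decoder it would reuse these same attention weights as the copy distribution over the Findings tokens.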
4.1 Dataset and Ontologies
MIMIC-CXR. This collection Johnson2019MIMICCXRAL is a large publicly available dataset of radiology reports. Following similar report pre-processing as done in Zhang2018LearningTS, we obtained 107,372 radiology reports.
For tokenization, we used ScispaCy Neumann2019ScispaCyFA. We randomly split the dataset into train, dev, and test sets of 80% (85,898), 10% (10,737), and 10% (10,737) of the reports, respectively.
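For reference, a random 80/10/10 partition of 107,372 items reproduces the reported split sizes. The sketch below is only one way to arrive at those counts; the function name `split_reports`, the seed, and the rounding rule (the dev and test sets take the floor of 10% and the train set takes the remainder) are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_reports(reports: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle and split into train/dev/test with roughly 80/10/10 proportions."""
    shuffled = list(reports)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_dev = n // 10
    n_test = n // 10
    n_train = n - n_dev - n_test  # train takes the remainder
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Integer ids stand in for the 107,372 pre-processed reports.
train, dev, test = split_reports(range(107372))
```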
OpenI. A public dataset from the Indiana Network for Patient Care DemnerFushman2016PreparingAC with 3,366 reports. Due to its small size, it is not suitable for training; we use it to evaluate the cross-organizational transferability of our model and the baselines.
We use RadLex, a comprehensive radiology lexicon developed by the Radiological Society of North America (RSNA), which includes 68,534 radiological terms organized in a hierarchical structure.
We compare our model against both known and state-of-the-art extractive and abstractive models.
LSA Steinberger2004LSA: An extractive vector-based model that employs Singular Value Decomposition (SVD).
NeuSum Zhou2018ND: A state-of-the-art extractive model that integrates the process of source sentence scoring and selection. We use the open code at https://github.com/magic282/NeuSum with default hyper-parameters.
Pointer-Generator (PG) See2017GetTT: An abstractive summarizer that extends seq2seq networks by adding a copy mechanism that allows for directly copying tokens from the source.
Ontology-Aware Pointer-Generator (Ont. PG) MacAvaney2019OntologyAwareCA: An extension of the PG model that first encodes all ontological concepts within the Findings, then uses the encoded vector to guide the decoder during summary decoding.
Bottom-Up Summarization (BUS) Gehrmann2018BottomUpAS: An abstractive model that uses a content selector to constrain the model's attention to source terms that have a good chance of being copied into the target. We re-implemented the BUS model.
4.3 Parameters and Training
For the content selector, we use the SciBERT model Beltagy2019SciBERTAP, which is pre-trained on biomedical text. We employ a 2-layer bi-LSTM encoder with a hidden size of 256 on top of the BERT model. The dropout is set to 0.2. We train the network to minimize the cross-entropy loss and optimize using the Adam optimizer Diedrik2014Adam with learning rate of .
For the summarization model, we extended the open-source code base of Zhang2018LearningTS (https://github.com/yuhaozhang/summarize-radiology-findings). We use a 2-layer bi-LSTM as the findings encoder and a 1-layer LSTM as both the ontology encoder and the decoder, with hidden sizes of 200 and 100, respectively. We also use 100-dimensional GloVe embeddings pretrained on a large collection of 4.5 million radiology reports Zhang2018LearningTS. We train the network to minimize the negative log-likelihood with the Adam optimizer and a learning rate of 0.001.
|Method||RG-1||RG-2||RG-L|
|Ours (this work)||53.57||40.78||51.81|
Table 1: Rouge results on MIMIC-CXR. Statistical significance was assessed with a paired t-test.
5 Results and Discussion
5.1 Experimental Results
Table 1 shows the Rouge scores of our model and the baseline models on MIMIC-CXR, with human-written Impressions as the ground truth. Our model significantly outperforms all the baselines on all Rouge metrics, with 2.9%, 2.5%, and 1.9% improvements for RG-1, RG-2, and RG-L, respectively. While NeuSum outperforms the non-neural LSA in the extractive setting, the extractive models lag considerably behind the abstractive methods, suggesting that human-written Impressions are formed by abstractively selecting information from the Findings, not merely by extracting source sentences. The comparison between Ont. PG and our model supports our hypothesis that a preliminary step of identifying significant ontological terms can substantially improve summary generation. As pointed out earlier, we define the saliency of an ontological term by its copying probability.
As expected, the BUS approach achieves the best results among the baseline models by constraining the decoder's attention to terms that are likely to be copied, but it still underperforms our model. This may suggest that the intermediate stage of refining word representations based on the ontological words leads to better performance than superficially restricting attention to the salient terms. Table 3 shows the effect of the content selector on the summarization model. For the setting without the content selector, we encode all ontology terms within the Findings. As shown, the content selector yields statistically significant improvements on RG-1 and RG-2.
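|Method||RG-1||RG-2||RG-L|
|Ours (this work)||40.88||24.44||40.37|
Table 2: Rouge results on OpenI, using the model trained on MIMIC-CXR.

|Setting||RG-1||RG-2||RG-L|
|w/o Cont. Sel.||52.47||40.11||51.39|
|w/ Cont. Sel.||53.57||40.78||51.81|
Table 3: Effect of the content selector on MIMIC-CXR. The setting without the content selector encodes all ontology terms within the Findings.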
To further evaluate the transferability of our model across organizations, we perform an evaluation on OpenI with our best model trained on MIMIC-CXR. As shown in Table 2, our model significantly outperforms the top-performing abstractive baseline model, suggesting promising cross-organizational transferability.
5.2 Expert Evaluation
While our approach achieves the best Rouge scores, we recognize the limitations of this metric for the summarization task Cohan2016RevisitingSE. To gain a better understanding of the qualities of our model, we conducted an expert human evaluation. To this end, we randomly sampled 100 system-generated Impressions with their associated gold Impressions from 100 evenly-spaced bins (sorted by our system's RG-1) of the MIMIC-CXR dataset. The Impressions were shuffled to prevent potential bias. We then asked three experts (two radiologists and one medical student) to score the given Impressions independently on a scale of 1-3 (worst to best) for three metrics: Readability (understandable or nonsense); Accuracy (fully accurate, or containing critical errors); Completeness (having all major information, or missing key points).
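The binned sampling can be sketched as below. The text does not say whether the bins are equal in size by rank or equal in width by score, nor how many items are drawn per bin; this sketch assumes equal-size rank bins with one draw per bin, and the function name `sample_one_per_bin` and the seed are hypothetical.

```python
import random
from typing import List, Sequence


def sample_one_per_bin(scores: Sequence[float], n_bins: int = 100, seed: int = 0) -> List[int]:
    """Sort item indices by score, cut them into contiguous bins, and draw one index per bin."""
    rng = random.Random(seed)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    picks = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        if start < end:  # skip empty bins when there are fewer items than bins
            picks.append(rng.choice(order[start:end]))
    return picks


# Toy scores standing in for per-report RG-1 values of the system outputs.
toy_scores = [i / 1000 for i in range(1000)]
picked = sample_one_per_bin(toy_scores)
```

Sampling across the whole score range in this way keeps the evaluated set from being dominated by either the easiest or the hardest reports.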
Figure 2 presents the human evaluation results using histograms and arrow plots, as done in MacAvaney2019OntologyAwareCA, comparing our system's Impressions with human-written Impressions. The histograms indicate the distribution of scores, and the arrows show how the scores changed between ours and the human-written ones. The tail of each arrow shows the score $s$ of the human-written Impression, and its head indicates the score $s'$ of our system's Impression. The numbers next to the tails give the count of Impressions that received a score of $s'$ from our system and $s$ from the gold, with $s, s' \in \{1, 2, 3\}$. We observed that while there is still a gap between the system-generated and human-written Impressions, over 80% of our system-generated Impressions are as good as (either tied with or better than) the associated human-written Impressions. Specifically, 73% (readability) and 71% (accuracy) of our system-generated Impressions tie with the human-written Impressions, with both achieving the full score of 3; for completeness, however, this percentage is 62%. The most likely explanation for this gap is that deciding which findings are more important (i.e., should be written into the Impression) is either subjective or highly correlated with institutional training purposes. Hence, we recognize cross-organizational evaluation of Impression completeness as a challenging task. We also evaluated the inter-rater agreement using Fleiss' Kappa Fleiss1971MeasuringNS for our system's scores and obtained 52% for readability, 47% for accuracy, and 50% for completeness, all of which are characterized as moderate agreement.
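For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the standard Fleiss' kappa computation for a fixed number of raters per item. It is not the evaluation script used for the numbers above; the function names and the toy ratings are hypothetical.

```python
from typing import List, Sequence


def to_counts(ratings: Sequence[Sequence[int]], categories: Sequence[int] = (1, 2, 3)) -> List[List[int]]:
    """Convert per-item rater scores, e.g. [3, 3, 2], into per-category counts."""
    return [[list(item).count(c) for c in categories] for item in ratings]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from an items x categories count matrix.

    Every row must sum to the same number of raters (at least 2).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


# Toy data: three raters who agree perfectly on three items with different scores.
perfect = to_counts([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
kappa_perfect = fleiss_kappa(perfect)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.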
We proposed an approach to content selection for abstractive text summarization of clinical notes. Our approach augments a standard summarization model with the significant ontological terms within the source, with content selection framed as a word-level sequence-tagging task. Intrinsic evaluations on two publicly available real-world clinical datasets show the efficacy of our model in terms of Rouge metrics. Furthermore, the extrinsic evaluation by domain experts reveals the quality of our system-generated summaries in comparison with gold summaries.
We thank Arman Cohan for his valuable comments on this work. We also thank additional domain expert evaluators: Phillip Hyuntae Kim, and Ish Talati.