A critical task in radiology practice is the generation of a free-text description, or report, based on a clinical radiograph (e.g., a chest X-ray). Providing automated support for this task has the potential to ease clinical workflows and improve both the quality and standardization of care. However, this process poses significant technical challenges. Many traditional image captioning approaches are designed to produce far shorter and less complex pieces of text than radiology reports. Further, these approaches do not capitalize on the highly templated nature of radiology reports. Additionally, generic natural language generation (NLG) methods prioritize descriptive accuracy only as a byproduct of readability, whereas providing an accurate clinical description of the radiograph is the first priority of the report. Prior works in this domain have partially addressed these issues, but significant gaps remain towards producing high-quality reports with maximal clinical efficacy.
In this work, we take steps to address these gaps through our novel automatic chest X-ray radiology report generation system. Our model hierarchically generates a sequence of unconstrained topics, using each topic to generate a sentence for the final generated report. In this way, we capitalize on the often-templated nature of radiology reports while simultaneously offering the system sufficient freedom to generate diverse, free-form reports. The system is finally tuned via reinforcement learning to optimize readability (via the CIDEr score) as well as clinical accuracy (via the concordance of CheXpert (Irvin et al., 2019) disease state labels between the ground truth and generated reports). We test this system on the MIMIC-CXR (Johnson et al., 2019) dataset, which is the largest paired image-report dataset presently available, and demonstrate that our model offers improvements on both NLG evaluation metrics (BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and ROUGE (Lin, 2004)) and clinical efficacy metrics (CheXpert concordance) over several compelling baseline models, including a re-implementation of TieNet (Wang et al., 2018), simpler neural baselines, and a retrieval-based baseline.
This work focuses on generating a clinically useful radiology report from a chest X-ray. This task has been explored before, but directly transplanting natural language generation techniques into the field only guarantees that the reports look realistic, not that their clinical content is correct. A more immediate goal for the report generation task is thus to produce accurate disease profiles that can power downstream tasks such as diagnosis. Our aim is therefore to maintain language fluency while also increasing the clinical efficacy of the generated reports.
We employ a hierarchical convolutional-recurrent neural network as the backbone of our proposed method. Reinforcement learning (RL) on a combined objective of language fluency metrics and our proposed Clinically Coherent Reward yields a model that more correctly describes disease states. Our method aims to align the disease labels of the generated report, as produced by a natural language labeler, with the labels of the ground truth report. Although this reward function is non-differentiable, it can be optimized through policy gradient learning.
2 Background & Related Work
Diagnostic radiology is the medical field of creating and evaluating radiological images (radiographs) of patients for the purposes of diagnostics. Radiologists are trained to simultaneously identify various radiological findings (e.g., diseases), according to the details of the radiograph and the patient’s clinical history, then summarize these findings and their overall impression in reports for clinical communication (Kahn Jr et al., 2009; Schwartz et al., 2011). A report typically consists of sections such as history, examination reason, findings, and impressions. As shown in Figure 1, the findings section contains a sequence of positive, negative, or uncertain mentions of either disease observations or instruments including their detailed location and severity. The impression section, by contrast, summarizes diagnoses considering all report sections above and previous studies on the patient.
Correctly identifying all abnormalities is a challenging task due to high variation, atypical cases, and the information overload inherent to some imaging modalities, such as computerized tomography (CT) scans (Rubin, 2015). This presents a strong intervention surface for machine learning techniques to help radiologists correctly identify the critical findings in a radiograph. The canonical way to communicate such findings in current practice is through the free-text report, which can either serve as a “draft” report for the radiologist to extend, or be presented directly to the physician requesting the radiological study (Schwartz et al., 2011).
2.1 AI on Radiology Data
Table 1: Summary of publicly available chest radiograph datasets.

| Dataset | Source Institution | Disease Labeling | # Images | # Reports | # Patients |
|---|---|---|---|---|---|
| Open-I | Indiana Network for Patient Care | Expert | 8,121 | 3,996 | 3,996 |
| Chest-Xray8 | National Institutes of Health | Automatic (DNorm + MetaMap) | 108,948 | 0 | 32,717 |
| PadChest | Hospital Universitario de San Juan | Expert + Automatic | | | |
| MIMIC-CXR | Beth Israel Deaconess Medical Center | Automatic | | | |
In recent years, several chest radiograph datasets, totalling almost a million X-ray images, have been made publicly available. A summary of these datasets is available in Table 1. Learning effective computational models by leveraging the information in medical images and free-text reports is an emerging field. Such a combination of image and textual data helps further improve model performance in both image annotation and automatic report generation (Litjens et al., 2017).
Schlegl et al. (2015) first proposed a weakly supervised learning approach that uses semantic descriptions in reports as labels for better classifying tissue patterns in optical coherence tomography (OCT) imaging. In the field of radiology, Shin et al. (2016) proposed a convolutional and recurrent network framework jointly trained on images and text to annotate disease, anatomy, and severity in chest X-ray images. Similarly, Moradi et al. (2018) jointly processed image and text signals to produce regions of interest over chest X-ray images. Rubin et al. (2018) trained a convolutional network to predict common thoracic diseases from chest X-ray images. Shin et al. (2015), Wang et al. (2016), and Wang et al. (2017) mined radiological reports to create disease and symptom concepts as labels. They first used Latent Dirichlet Allocation (LDA) to identify topics for clustering, then applied disease detection tools such as DNorm and MetaMap, along with several other natural language processing (NLP) tools, for downstream chest X-ray classification using a convolutional neural network. They also released the label set along with the image data.
Later on, Wang et al. (2018) used the same chest X-ray dataset to further improve the performance of disease classification and report generation from an image. For report generation, Jing et al. (2017) built a multi-task learning framework, which includes a co-attention mechanism module and a hierarchical long short-term memory (LSTM) module, for radiological image annotation and report paragraph generation. Li et al. (2018) proposed a reinforcement learning-based Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent) to learn a report generator that can decide whether to retrieve a template or generate a new sentence. Alternatively, Gale et al. (2018) generated interpretable hip fracture X-ray reports by identifying image features and filling text templates.
Finally, Hsu et al. (2018) trained the radiological image and report joint representation through unsupervised alignment of cross-modal embedding spaces for information retrieval.
2.2 Language Generation
Language generation (LG) is a staple of NLP research. LG comes up in the context of neural machine translation, summarization, question answering, image captioning, and more. In all these tasks, the challenges of generating discrete sequences that are realistic, meaningful, and linguistically correct must be met, and the field has devised a number of methods to surmount them. For many years, this was done through n-gram-based (Huang et al., 1993) or retrieval-based (Gupta and Lehal, 2010) approaches.
Within the last few years, deep learning has produced impressive results for text generation. Graves (2013) outlined best practices for RNN-based sequence generation. The following year, Sutskever et al. (2014) introduced the sequence-to-sequence paradigm for machine translation and beyond. However, Wiseman et al. (2017) demonstrated that while RNN-generated texts are often fluent, they typically fail to reach human-level quality. Alternatively, Rajeswar et al. (2017) and Fedus et al. (2018) have tried using Generative Adversarial Networks (GANs) for text generation. However, Caccia et al. (2018) observed problems with training such GANs and showed that, to date, they are unable to beat canonical sequence decoder methods.
We will also highlight some specific areas of exploration in image captioning, a specific kind of language generation which is conditioned on an image input. The canonical example of this task is realized in the Microsoft COCO (Lin et al., 2014) dataset, which presents a series of images, each annotated with five human-written captions describing the image. The task, then, is to use the image as input to generate a readable, accurate, and linguistically correct caption.
This task has received significant attention with the success of Show and Tell (Vinyals et al., 2015) and its followup Show, Attend, and Tell (Xu et al., 2015). Due to the nature of the COCO competition, other works quickly emerged showing strong results: Yao et al. (2017) used boosting methods, Lu et al. (2017) employed adaptive attention, and Rennie et al. (2017) introduced reinforcement learning as a method for fine-tuning generated text. Devlin et al. (2015) performed surprisingly well using a $k$-nearest neighbor method. They observed that since most of the true captions were simple, one-sentence scene descriptions, there was significant redundancy in the dataset.
2.3 Radiology Report Generation
Multiple recent works have explored our task of radiology report generation. Zhang et al. (2018) used a combination of extractive and abstractive techniques to summarize a radiology report’s findings to generate an impression section. Due to limited text training data, Han et al. (2018) relied on weak supervision for a Recurrent-GAN and template-based framework for MRI report generation. Gale et al. (2018) used an RNN to generate template-based text descriptions of pelvic X-rays.
More comparable to this work, Wang et al. (2018) used a CNN-RNN architecture with attention to generate reports that describe chest X-rays. Li et al. (2018) generated chest X-ray reports using reinforcement learning to tune a hierarchical decoder that chooses (for each sentence) whether to use an existing template or to generate a new sentence.
3 Methods

In this work we opt to focus on generating the findings section, as it is the most direct annotation of the radiological images. First, we introduce the hierarchical generation strategy with a CNN-RNN-RNN architecture, and later we propose novel improvements that render the generated reports more clinically aligned with the true reports. Full implementation details, including layer sizes, training details, etc., are presented in the Appendix, Section A.
3.1 Hierarchical Generation via CNN-RNN-RNN
As illustrated in Figure 2, we aim to generate a report as a sequence of sentences $z = (z_1, \ldots, z_M)$, where $M$ is the number of sentences in a report. Each sentence $z_i$ consists of a sequence of words $(w_{i,1}, \ldots, w_{i,N_i})$ drawn from a vocabulary $\mathcal{V}$, where $N_i$ is the number of words in sentence $i$.
The image is fed through the image encoder CNN to obtain a visual feature map. This feature map is then consumed by the sentence decoder RNN, which recurrently generates vectors representing the topic of each sentence. Given the visual feature map and a topic vector, a word decoder RNN generates a sequence of words along with attention maps over the visual features. This hierarchical approach is in line with Krause et al. (2017), who generate descriptive paragraphs for an image.
Image encoder CNN
The input image is passed through a CNN to obtain the activation map of the last layer before global pooling. We adjust the feature dimension by adding a fully connected layer, yielding a map of spatial image features $\{v_k\}_{k=1}^{K}$, each of dimensionality $d$, which serve as descriptive features for different spatial locations of the image. A mean visual feature $\bar{v} = \frac{1}{K}\sum_{k=1}^{K} v_k$ is obtained by averaging all local visual features.
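The pooling step above can be sketched in a few lines of plain Python (the sizes `K` and `d` here are illustrative placeholders, not the model's actual dimensions, and in the real system the features come from a CNN plus a fully connected projection):

```python
import random

# Illustrative sizes only; the true dimensions depend on the CNN backbone.
K, d = 49, 8  # K spatial locations, d feature channels
random.seed(0)
local_features = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]

def mean_visual_feature(features):
    """Average the K local feature vectors v_1..v_K into one global descriptor."""
    k = len(features)
    return [sum(column) / k for column in zip(*features)]

v_bar = mean_visual_feature(local_features)
assert len(v_bar) == d
```

The mean feature $\bar{v}$ summarizes the whole image, while the per-location features remain available for attention in the word decoder.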
Sentence decoder RNN
Given the mean visual feature $\bar{v}$, we adopt a long short-term memory (LSTM) network and model the hidden state as $(h_i, c_i) = \mathrm{LSTM}(\bar{v};\, h_{i-1}, c_{i-1})$, where $h_{i-1}$ and $c_{i-1}$ are the hidden state vector and the memory vector of the previous time step, respectively. From the hidden state $h_i$, we further generate two components, namely the topic vector $\tau_i$ and the stop signal $u_i$, as $\tau_i = W_\tau h_i + b_\tau$ and $u_i = \sigma(W_u h_i + b_u)$, where the $W$'s and $b$'s are trainable parameters and $\sigma$ is the sigmoid function. The stop signal acts as the end-of-sentence token: when $u_i$ exceeds a threshold, the sentence decoder RNN stops generating further sentences.
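A toy sketch of the stop-signal computation follows; the single-layer projection, the toy weights, and the 0.5 threshold are illustrative assumptions rather than the paper's exact parameterization:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stop_signal(hidden, w_u, b_u, threshold=0.5):
    """Project the sentence-decoder hidden state to a stop probability u;
    sentence generation halts once u exceeds the threshold."""
    logit = sum(h * w for h, w in zip(hidden, w_u)) + b_u
    u = sigmoid(logit)
    return u, u > threshold

# A strongly positive logit yields u near 1, signaling the decoder to stop.
u, stop = stop_signal([2.0, 1.0], [3.0, 0.5], 0.0)
assert stop
```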
Word decoder RNN
After decoding the sentence topics, we can decode the words given the topic vector $\tau$; for simplicity, we drop the sentence subscript, as this process applies to all sentences. We adopt the visual sentinel (Lu et al., 2017) to partially look at the visual feature map $\{v_k\}$. The hidden states and outputs are again modeled with an LSTM, generating the posterior probability $p_t$ over the vocabulary given (1) the mean visual feature $\bar{v}$, (2) the topic vector $\tau$, and (3) the embedding $E w_{t-1}$ of the previously generated word $w_{t-1}$, where $E$ is the trainable word embedding matrix. At training time, the next word $w_t$ is sampled from the distribution $p_t$, i.e., word $j$ is drawn with probability equal to the $j$-th element of $p_t$.
This formulation enables the model to look at different parts of the image while having the option of “looking away” to a sentinel vector. Note that this hierarchical encoder-decoder CNN-RNN-RNN architecture is fully differentiable.
3.2 Reinforcement Learning for Readability
As Rennie et al. (2017) showed, the automatic NLG metric CIDEr (Vedantam et al., 2015) is superior to other metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). We consider self-critical sequence training (SCST) (Rennie et al., 2017), which builds on the baseline version of the REINFORCE (Williams, 1992) algorithm, and minimize the negative expected reward as a function of the network parameters $\theta$,
$$L(\theta) = -\,\mathbb{E}_{z \sim p_\theta}\left[ r(z, z^{*}) \right],$$
where $p_\theta$ is the distribution over output sequences, and $r$ is a metric evaluation function acting as a reward function that takes a sampled report $z$ and a ground truth report $z^{*}$. The baseline in SCST is the reward $r(\hat{z}, z^{*})$ obtained with the report $\hat{z}$ greedily decoded at test time.
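The self-critical update can be sketched as follows; this is a simplified illustration of the SCST surrogate loss with scalar stand-ins for the reward and log-probability, not the paper's implementation:

```python
def scst_loss(sample_reward, greedy_reward, log_prob_sample):
    """Self-critical sequence training surrogate loss: the reward of the
    greedily decoded report serves as the baseline for the sampled report."""
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the probability of samples that beat the baseline.
    return -advantage * log_prob_sample

# A sampled report scoring above the greedy baseline is reinforced.
loss = scst_loss(sample_reward=0.8, greedy_reward=0.5, log_prob_sample=-2.0)
```

Because the baseline is the model's own greedy output, only samples that outperform the current decoding policy receive a positive learning signal.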
3.3 Novel Reward for Clinically Accurate Reinforcement Learning
One major downside of the approach outlined so far is that, in the clinical context, aiming for a good automatic metric such as CIDEr is not enough to correctly characterize the disease states. Negative judgments on diseases are critical components of the reports: through them, the radiologist indicates that the patient likely does not have the diseases that were of concern and among the reasons for the examination. Li et al. (2018) indicated that a good portion of chest X-ray reports are heavily templated, with patterns such as no pneumothorax or pleural effusion; the lungs are clear; or no focal consolidation, pneumothorax or large pleural effusion. These patterns also suggest that most patients are disease-free, and hence the signal from positive mentions of disease will be sparse.
Simply optimizing the automatic LG metrics may misguide the model to merely mention disease names rather than correctly describe their positive or negative states. For example, if the ground truth report reads no pleural effusion, the models would prefer the text mild pleural effusion over unrelated text or even an empty string, meaning intelligent optimization systems can game these metrics at the expense of clinical accuracy.
We hence propose a Clinically Coherent Reward (CCR), which uses a rule-based disease mention annotator, CheXpert (Irvin et al., 2019), to directly optimize our generated reports for clinical efficacy. CheXpert performs classification on 12 types of thoracic diseases or X-ray related diagnoses; mentions of support devices are also labeled. For each label type $t$, there are four possible labeling outcomes $\ell_t$: (1) positive, (2) negative, (3) uncertain, or (4) absent mention. This outcome can be used to model the binary positive/negative disease state $s_t$. Dropping the distribution subscripts for convenience, CCR is defined as
$$r_{\mathrm{CCR}}(z, z^{*}) = \sum_{t} \sum_{s_t \in \{+,-\}} p(s_t \mid z)\, p(s_t \mid z^{*}),$$
aiming to maximize the correlation between the distributions over disease states implied by the generated text $z$ and by the ground truth text $z^{*}$. Unfortunately, as the true diagnostic state underlying a novel report is unknown, we must make several assumptions regarding the performance of the rule-based labeler, allowing us to infer the necessary conditional probabilities $p(s_t \mid \ell_t)$.
To motivate these assumptions, first note that these diseases are universally rare, i.e., $p(s_t = +) \ll 1$. Presuming the rule-based labeler has any discriminative power, we can thus conclude that if the labeler assigns a negative or an absent label, then $p(s_t = + \mid \ell_t) < p(s_t = +)$. For sufficiently rare conditions, a reasonable assumption and simplification is therefore to take $p(s_t = + \mid \ell_t = \text{negative}) = 0$ and $p(s_t = + \mid \ell_t = \text{absent}) = 0$. We further assume that the rule-based labeler has very high precision, and thus take $p(s_t = + \mid \ell_t = \text{positive}) = 1$. Given an uncertain mention, however, the desired output probability is difficult to assess, so we treat $p(s_t = + \mid \ell_t = \text{uncertain})$ as a reward-specific hyperparameter, fixed in this work. All of these assumptions could easily be adjusted, but they perform well for us here.
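Under these assumptions, the mapping from labeler outcome to disease probability reduces to a small lookup table. The sketch below makes that explicit; the value of `P_UNCERTAIN` is a placeholder assumption, and `label_reward` is one plausible per-label reading of the correlation-style reward, not the paper's verified formula:

```python
P_UNCERTAIN = 0.5  # placeholder for the reward-specific hyperparameter

def disease_probability(outcome, p_uncertain=P_UNCERTAIN):
    """Map a rule-based labeler outcome to p(disease state = positive),
    following the simplifying assumptions in the text."""
    table = {
        "positive": 1.0,  # labeler assumed to have very high precision
        "negative": 0.0,  # rare disease plus a discriminative labeler
        "absent": 0.0,    # absent mention treated like a negative
    }
    return p_uncertain if outcome == "uncertain" else table[outcome]

def label_reward(p_gen, p_true):
    """Agreement of the two distributions over the binary disease state
    (an illustrative per-label term of a correlation-style reward)."""
    return p_gen * p_true + (1.0 - p_gen) * (1.0 - p_true)

assert label_reward(disease_probability("positive"), 1.0) == 1.0
```

A generated report that contradicts the ground truth on a label (e.g., positive vs. negative) contributes zero to this term, while agreement contributes the maximum.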
We also wish to use a baseline for the reward to reduce the variance of the policy gradient. Instead of using a single exponential moving average (EMA) over the total reward, we apply an EMA separately to each reward term $r_t$, using the baselined reward $r_t - \bar{r}_t$, where $\bar{r}_t$ is an EMA over $r_t$ updated as $\bar{r}_t \leftarrow \gamma\, \bar{r}_t + (1 - \gamma)\, r_t$.
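A minimal sketch of this per-term baseline follows; the decay value and the first-update initialization are assumptions for illustration:

```python
class PerTermEMABaseline:
    """A separate exponential-moving-average baseline per reward term,
    rather than one EMA over the total reward."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.baselines = {}

    def update(self, term, reward):
        # Initialize to the first observed reward, then decay toward new values.
        prev = self.baselines.get(term, reward)
        self.baselines[term] = self.decay * prev + (1.0 - self.decay) * reward
        return self.baselines[term]

baseline = PerTermEMABaseline(decay=0.9)
baseline.update("cider", 1.0)  # first update: baseline becomes 1.0
baseline.update("ccr", 0.2)    # tracked independently of "cider"
```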
We wish to pursue both semantic alignment and clinical coherence with the ground truth report, and thus we combine the above rewards for reinforcement learning on our neural network in a weighted fashion. Specifically, $r = r_{\mathrm{CIDEr}} + \lambda\, r_{\mathrm{CCR}}$, where $\lambda$ controls the relative importance of the two rewards.
Hence the derivative of the combined loss with respect to the network parameters $\theta$ is
$$\nabla_\theta L(\theta) \approx -\left( r(z^{s}, z^{*}) - \bar{r} \right) \nabla_\theta \log p_\theta(z^{s}),$$
where $p_\theta$ is the probability the model assigns to the output sequence. We approximate this gradient with Monte-Carlo samples $z^{s} \sim p_\theta$ and average the gradients across training examples in the batch.
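The weighted combination and the batch-averaged Monte-Carlo estimate can be sketched as follows (the helper names and the weight value are illustrative, not from the paper):

```python
def combined_reward(r_nlg, r_ccr, weight=1.0):
    """Weighted combination of the language-fluency reward (e.g., CIDEr)
    and the Clinically Coherent Reward; `weight` trades off the two."""
    return r_nlg + weight * r_ccr

def mean_advantage(rewards, baselines):
    """Monte-Carlo estimate: average the baselined rewards over a batch;
    this scalar multiplies the log-probability gradient of each sample."""
    return sum(r - b for r, b in zip(rewards, baselines)) / len(rewards)

assert combined_reward(0.5, 0.5, weight=2.0) == 1.5
```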
4 Experiments

4.1 Datasets

MIMIC-CXR is the largest radiology dataset to date, consisting of chest X-ray images and their associated reports. The images include anteroposterior (AP), posteroanterior (PA), and lateral (LL) views. We eliminate duplicated radiograph images with adjusted brightness or contrast, which are commonly produced for clinical needs. The radiological reports are parsed into sections, from which we extract the findings section. We then apply tokenization and keep tokens above a minimum occurrence count in the corpus.
Open-I is a public radiography dataset of chest X-ray images and reports collected by Indiana University. The reports are in XML format and include pre-parsed sections. We exclude the entries without a findings section. Tokenization is done similarly, but due to the relatively small size of the corpus, we use a lower minimum occurrence threshold for the vocabulary.
Both datasets are partitioned by patient into a train/validation/test ratio of 7/1/2, so that there is no patient overlap between sets. Excluded words are replaced by an “unknown” token, and the word embeddings are pretrained separately for each dataset.
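Partitioning by patient (so that no patient's studies appear in more than one split) can be sketched as follows; the shuffling scheme and seed here are illustrative assumptions:

```python
import random

def split_by_patient(patient_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Assign each unique patient wholly to train, validation, or test,
    so the same patient never appears in two splits."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    n_train = int(ratios[0] * len(unique))
    n_val = int(ratios[1] * len(unique))
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])
    return train, val, test

train, val, test = split_by_patient(["p%03d" % i for i in range(100)])
assert not (train & val or train & test or val & test)  # disjoint by patient
```

Splitting at the patient level, rather than the image level, prevents near-duplicate studies of one patient from leaking across the train/test boundary.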
4.2 Evaluation Metrics
To compare with other models including prior state-of-the-art and baselines, we adopt several different metrics that focus on different aspects ranging from a natural language perspective to clinical adequacy.
Automatic LG metrics such as CIDEr-D (Vedantam et al., 2015), ROUGE-L (Lin, 2004), and BLEU (Papineni et al., 2002) measure the statistical relation between two text sequences. One concern with such statistical measures is that, with the limited scope of $n$-grams, they cannot capture disease states: negations are common in the medical corpus, and the negation cue words can be far apart from the disease words in a sentence. As such, we also include medical abnormality detection as a metric. Specifically, we compare the CheXpert (Irvin et al., 2019) labeled annotations between the generated report and the ground truth report across the categories related to thoracic diseases and support devices, and evaluate accuracy, precision, and recall for all models.
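Since the labeler emits discrete binary labels, the clinical efficacy metrics reduce to standard confusion-matrix counts; a minimal sketch (the zero-fallback when a model makes no positive predictions mirrors the scoring convention described later):

```python
def precision_recall_accuracy(predicted, truth):
    """Binary-label precision, recall, and accuracy (1 = positive mention)."""
    tp = sum(p == 1 and t == 1 for p, t in zip(predicted, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(predicted, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(predicted, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0  # score 0 with no positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(truth)
    return precision, recall, accuracy

p, r, a = precision_recall_accuracy([1, 0, 1, 0], [1, 0, 0, 0])
```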
We compare our methods with state-of-the-art image captioning and medical report generation models, as well as some simple baselines: (a) 1-NN, in which we query a test image for the closest X-ray image in the training set (in terms of L2 distance in the image embedding space) and use the corresponding training report as the output text; (b) Show and Tell (S&T) (Vinyals et al., 2015); (c) Show, Attend, and Tell (SA&T) (Xu et al., 2015); and (d) TieNet (Wang et al., 2018). To make results comparable across models, we slightly modify the prior models to also accept a view position embedding, which encodes AP/PA/LL as a one-hot vector, to utilize the extra information available at image acquisition time. This applies to Show and Tell, Show, Attend, and Tell, and our re-implementation of TieNet, which is detailed in Appendix B because the authors did not release their code.
We observed that our model sometimes repeats findings multiple times. We therefore apply post-hoc processing that removes exact duplicate sentences from the generated reports. This improves readability but, interestingly, slightly degrades the NLG metrics.
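The post-hoc step can be implemented as a simple order-preserving filter over sentences; this is a sketch of the described processing, not the authors' code, and the period-based sentence splitting is a simplifying assumption:

```python
def remove_duplicate_sentences(report):
    """Drop exact repeats of earlier sentences while preserving order."""
    seen = set()
    kept = []
    for sentence in report.split("."):
        normalized = sentence.strip()
        if normalized and normalized.lower() not in seen:
            seen.add(normalized.lower())
            kept.append(normalized)
    return ". ".join(kept) + "." if kept else ""

text = ("there is no pleural effusion. heart size is normal. "
        "there is no pleural effusion.")
assert remove_duplicate_sentences(text) == (
    "there is no pleural effusion. heart size is normal.")
```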
Additionally, we perform several ablation studies to inspect the contribution of various components of our model. In particular, we assess
- Ours (CIDEr): uses only the CIDEr reward for reinforcement learning, as is often the case in the prior state of the art.
- Ours (CCR): uses only the Clinically Coherent Reward, without regard for aligning the natural language metrics.
- Ours (full): considers both rewards, as formulated in Section 3.3.
In order to provide some context for the metric scores, we also trained an unsupervised RNN language model, denoted Noise-RNN, which generates free text without conditioning on the input radiograph. All recurrent models, including prior works and our models, decode with beam search with a fixed beam size.
5 Results & Discussion
5.1 Quantitative Results
Natural Language Metrics
In Table 2 we show the automatic evaluation scores for the baseline models, prior works, and variants of our model on the aforementioned test sets. Ours (CIDEr), which solely optimizes the CIDEr score, achieves superior performance in terms of natural language metrics, but its clinical meaningfulness is not significantly above the majority-class baseline, in which we predict all patients to be disease-free. This phenomenon is common among all other models that do not consider the clinical alignment between the ground truth and the generated reports. On the other hand, when our full model considers both natural language and clinical coherence, it achieves the highest clinical disease annotation accuracy while still retaining decently high NLG metrics.
We also conducted an ablation study with the model variant Ours (CCR), where reinforcement learning uses only the clinical accuracy reward. It is clear that this alone does not achieve higher clinical coherence, while readability may also be sacrificed. We thus conclude that a combination of NLG metrics and a clinically sensible objective is crucial for training a model useful in clinical practice.
One thing to note is that although Noise-RNN does not depend on the image, its NLG metrics, especially ROUGE, are not far from those of models trained with supervision. We also note that MIMIC-CXR is better for training such an encoder-decoder model not just because of its larger volume of data, but also because of its higher proportion of positive disease annotations relative to Open-I. This discrepancy leads to a 156-fold increase in the number of images from diseased patients.
Clinical Efficacy Metrics
Table 3: per-label clinical efficacy metrics. Columns: Label, Count, 1-NN, S&T, SA&T, TieNet, Ours (CIDEr), Ours (CCR), Ours (full).
In Table 3 we compare the labels annotated by CheXpert, computed over all generated reports in the test set. Note that the labeling process generates discrete binary labels rather than continuous probabilities, so we cannot compute discriminative metrics such as the Area Under the Receiver Operating Characteristic (AUROC) or the Area Under the Precision-Recall Curve (AUPRC). Precision-wise, Ours (CCR) achieves the highest overall scores, including macro-average and micro-average; the runner-up is Ours (full), which additionally considers language fluency. Note that the macro-averaged metrics can be quite noisy, as each per-class metric may depend on just a few examples. Many entries in the table are zero because the corresponding models never yield positive predictions; we score these as zero to penalize such behavior. Regarding recall, we see a substantial drop in Ours (CCR) and Ours (full) as a result of optimizing for accuracy: accuracy is closely associated with precision, but over-pursuing it can harm recall. It is worth noting that the nearest-neighbor baseline 1-NN has the highest recall. This is no surprise since, as shown before (Strobelt et al.), generated sequences tend to follow corpus statistics and overly favor common words; rare combinations of tokens are easily neglected, resulting in predictions dominated by the majority classes. In language generation, unlike in image classification, we typically cannot re-weight rare occurrences.
5.2 Qualitative Results
Evaluation of Generated Reports
| Ground Truth | TieNet | Ours (full) |
|---|---|---|
| cardiomegaly is moderate. bibasilar atelectasis is mild. there is no pneumothorax. a lower cervical spinal fusion is partially visualized. healed right rib fractures are incidentally noted. | ap portable upright view of the chest. there is no focal consolidation, effusion, or pneumothorax. the cardiomediastinal silhouette is normal. imaged osseous structures are intact. | pa and lateral views of the chest. there is mild enlargement of the cardiac silhouette. there is no pleural effusion or pneumothorax. there is no acute osseous abnormalities. |
| as compared to the previous radiograph, the monitoring and support devices are unchanged. unchanged bilateral pleural effusions, with a tendency to increase, and resultant areas of atelectasis. the air collection in the bilateral soft tissues is slightly decreased. unchanged right picc line. no definite evidence of pneumothorax. | as compared to the previous radiograph, the patient has received a nasogastric tube. the course of the tube is unremarkable, the tip of the tube projects over the middle parts of the stomach. there is no evidence of complication, notably no pneumothorax. the other monitoring and support devices are constant. constant appearance of the cardiac silhouette and of the lung parenchyma. | as compared to the previous radiograph, there is no relevant change. tracheostomy tube is in place. there is a layering pleural effusions. NAME bilateral pleural effusion and compressive atelectasis at the right base. there is no pneumothorax. |
| the course of the dobbhoff feeding catheter is unremarkable, and its tip is seen projecting over the stomach. there is no evidence of complications, specifically no pneumothorax. as compared to the prior radiograph dated DATE, there has been no other significant interval change. | ap portable upright view of the chest. overlying ekg leads are present. there is no focal consolidation, effusion, or pneumothorax. the cardiomediastinal silhouette is normal. imaged osseous structures are intact. | as compared to the previous radiograph, there is no relevant change. the endotracheal tube terminates approximately 3 cm above the NAME. the endotracheal tube extends into the stomach. there is no evidence of complications, notably no pneumothorax. there is no pleural effusion or pneumothorax. |
| interval placement of a left basilar pigtail chest tube with improving aeration in the left mid to lower lung and near complete resolution of the pleural effusion. there are residual patchy opacities within the left mid and lower lung as well as at the right base favoring resolving atelectasis. no pneumothorax is appreciated on this semi upright study. heart remains stably enlarged. mediastinal contours are stably widened, although this NAME be related to portable technique and positioning. this can be better evaluated on followup imaging. no pulmonary edema. | as compared to the previous radiograph, the patient has been extubated. the nasogastric tube is in unchanged position. the lung volumes remain low. moderate cardiomegaly with minimal fluid overload but no overt pulmonary edema. no larger pleural effusions. no pneumonia. | ap upright and lateral views of the chest. there is moderate cardiomegaly. there is no pleural effusion or pneumothorax. there is no acute osseous abnormalities. |
Table 4 demonstrates the qualitative results of our full model. In general, our model generates descriptions that align with the logical flow of reports written by radiologists, which start from general information (such as views and comparison to previous studies), then positive findings, then negative findings, in the order of lungs, heart, pleura, and others. TieNet also generates reports with such a logical flow, but in slightly different orders. For cases with only negative findings, both our model and TieNet generate reasonable descriptions without significant errors. For cases with positive findings, neither TieNet nor our full model identifies all radiological findings, but our full model does identify the major finding in each demonstrated case, for example, cardiomegaly in the first case, and pleural effusion and atelectasis in the second.
A formerly practicing clinician co-author manually reviewed a larger subset of our generated reports and drew several conclusions. First, our full model tends to generate sentences related to pleural effusion, atelectasis, and cardiomegaly correctly, which is aligned with the clinical finding scores in Table 3; TieNet instead misses some positive findings in such cases. Second, there are significant issues in all generated reports, regardless of the source model, in the descriptions of supportive lines and tubes as well as lung lesions. For example, TieNet is prone to generating nasogastric tube mentions, while our model tends to mention tracheostomy or endotracheal tubes, and both models have difficulty identifying specific lines such as chest tubes or PICC lines. Similarly, neither system correctly generates sentences with positive lung parenchymal findings.
From this (small) sample, we cannot conclude whether our model or TieNet truly outperforms the other, since both present significant issues and each has strengths the other lacks. Critically, neither can describe the majority of the findings in a chest radiograph well, especially for positive cases, even though the quantitative metrics demonstrate reasonable performance. This illustrates that significant progress is still needed in this domain, perhaps building on the directions explored here, before these techniques can be deployed in a clinical environment.
Learning Meaningful Attention Maps
Attention maps are a useful tool for visualizing what a neural network attends to, as demonstrated by Rajpurkar et al. (2017). Figure 3 shows the intermediate attention maps for each word as it is generated. The model is able to roughly capture the location of the indicated disease or anatomy, but, interestingly, the attention map tends to be the complement of the actual region of interest when the disease keyword follows a negation cue word. This might indicate that the model actively looks at the rest of the image to ensure it does not miss any possible symptoms before asserting a disease-free state. This behavior has not been widely discussed before, partially because attention maps for negations are not the primary focus of typical image captioning tasks, and most attention mechanisms employed in clinical contexts have been applied to classification tasks, where negation is formulated differently.
6.1 Limitations & Future Work
Our work has several notable limitations and opportunities for future work. First and foremost, the post-processing step required to remove repeated sentences is an inelegant necessity that we aim to eliminate in future iterations of this work. Promising techniques for encouraging greater diversity exist in the NLG literature and warrant further investigation here.
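For concreteness, the post-processing we currently rely on amounts to exact-duplicate sentence removal. A minimal sketch of such a step (the function name, separator handling, and case normalization are ours, not the exact implementation):

```python
def dedup_sentences(report: str, sep: str = ".") -> str:
    """Remove exact duplicate sentences from a generated report,
    keeping the first occurrence of each sentence in order."""
    seen = set()
    kept = []
    for sent in report.split(sep):
        s = sent.strip()
        # compare case-insensitively so trivially re-cased repeats are caught
        if s and s.lower() not in seen:
            seen.add(s.lower())
            kept.append(s)
    return (sep + " ").join(kept) + (sep if kept else "")
```

A diversity-aware decoder would ideally make this filter unnecessary by never emitting the repeats in the first place.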
Second, our model operates on images in isolation, without considering whether an image is part of a series of ordered radiographs for a single patient that might be summarized together. Using all available information has the potential to improve the quality of the generated reports and merits further investigation.
Lastly, we note that though our model yields very strong CheXpert precision, its recall is much worse. Recall and precision are favored to different degrees in different clinical contexts. For screening purposes, recall (sensitivity) is the more appropriate metric, since healthy cases rarely yield positive findings. However, precision (positive predictive value) is more critical for validating a clinical impression, as is common in an ICU setting where patients receive a radiological study on the basis of strong clinical suspicion. We believe our system's poor recall is a direct result of the setup of our RL models and the CCR reward, which optimizes for accuracy and thereby inherently boosts precision; the choice of optimization objective drives these results. Depending on the intended clinical application, we could instead optimize recall at a fixed precision (R@P) via the methods described by Eban et al. (2016).
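To make the tradeoff concrete, consider per-label precision and recall over binary disease labels. A conservative generator that mentions a finding only when very confident can achieve perfect precision while missing most true positives, which is the pattern we observe. A minimal sketch (labels and values are illustrative, not from our results):

```python
def precision_recall(y_true, y_pred):
    """Precision (PPV) and recall (sensitivity) for one binary disease label,
    given ground-truth and predicted label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A model that asserts a finding only once, correctly, over four studies:
# precision is perfect (1.0) while recall is only 1/3.
p, r = precision_recall([1, 1, 1, 0], [1, 0, 0, 0])
```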
6.2 Reflections on Trends in the Field
In the course of this work, we also encountered several broader issues that arise not only in our study but in many related studies in this domain, and that warrant further thought by the community.
The CheXpert labeler used in our models is rule-based, which makes it harder to generalize to other datasets and to capture the implicit features within language patterns. CheXpert is also specialized to English, and re-coding its rules for other natural languages would require considerable work. Subsequent research may instead adopt a learning-based labeling approach to improve generalizability and to extend to corpora in other languages, for example, PadChest (Bustos et al., 2019) in Spanish.
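To illustrate why rule-based labeling is brittle, here is a toy labeler in the spirit of (but far simpler than) CheXpert: it returns positive, negated, or unmentioned for a finding based on hand-written negation cues. The cue list and function are entirely ours; real labelers handle uncertainty, synonyms, and scope far more carefully, and every rule below would need rewriting for a new language:

```python
NEGATION_CUES = ("no ", "without ", "free of ")  # toy English-only cue list

def rule_label(report: str, finding: str):
    """Toy rule-based labeler: 1 if the finding is asserted,
    0 if it appears under a negation cue, None if unmentioned."""
    for sent in report.lower().split("."):
        if finding in sent:
            return 0 if any(cue in sent for cue in NEGATION_CUES) else 1
    return None
```

A learning-based labeler would instead be trained on annotated sentences, sidestepping hand-written cues and transferring more readily across datasets and languages.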
Be Careful What You Wish For
NLG metrics are known to be only limited substitutes for a true assessment of readability (Kilickaya et al., 2016; Liu et al., 2016). For radiology reports specifically, this problem is even more profound, as prior works often use "readability" as a proxy for clinical efficacy. Additionally, these NLG evaluation metrics are easily gamed. In our results, the post-processing step of removing exact duplicates actually worsens our CIDEr score, which is the opposite of what an NLG evaluation metric should reward. Even though our proposed clinical coherence reward aims to resolve this misalignment between NLG metrics and real practice, we cannot readily judge whether our system is truly better despite its performance on paper. This is especially troubling given the increasing trend of using reinforcement learning (RL) to directly optimize such objectives, as has been done in prior work (Li et al., 2018) and as we do here. Though RL can offer marked improvements on these automatic metrics, which are currently the best the field has, how well those improvements translate to real clinical efficacy is unclear. The careful design of improved evaluation metrics, specifically for radiology report generation, should be a prime focus for the field going forward.
In this work, we develop a chest X-ray radiology report generation system which hierarchically generates topics from images, then words from topics. This structure gives the model the ability to use largely templated sentences (through the generation of similar topic vectors) while preserving its freedom to generate diverse text. The final system is also optimized with reinforcement learning for both readability (via CIDEr) and clinical correctness (via the novel Clinically Coherent Reward). Our system outperforms a variety of compelling baseline methods across readability and clinical efficacy metrics on both the MIMIC-CXR and Open-I datasets.
Dr. Marzyeh Ghassemi is partially funded by a CIFAR AI Chair at the Vector Institute, and an NSERC Discovery Grant.
- Aronson and Lang (2010) Alan R Aronson and François-Michel Lang. An overview of metamap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236, 2010.
- Bustos et al. (2019) Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. Padchest: A large chest x-ray image dataset with multi-label annotated reports. arXiv preprint arXiv:1901.07441, 2019.
- Caccia et al. (2018) Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. Language gans falling short. arXiv preprint arXiv:1811.02549, 2018.
- Demner-Fushman et al. (2015) Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304–310, 2015.
- Devlin et al. (2015) Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, and C Lawrence Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.
- Eban et al. (2016) Elad ET Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Rif A Saurous, and Gal Elidan. Scalable learning of non-decomposable objectives. arXiv preprint arXiv:1608.04802, 2016.
- Fedus et al. (2018) William Fedus, Ian Goodfellow, and Andrew M Dai. Maskgan: Better text generation via filling in the_. arXiv preprint arXiv:1801.07736, 2018.
- Gale et al. (2018) William Gale, Luke Oakden-Rayner, Gustavo Carneiro, Andrew P Bradley, and Lyle J Palmer. Producing radiologist-quality reports for interpretable artificial intelligence. arXiv preprint arXiv:1806.00340, 2018.
- Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- Gupta and Lehal (2010) Vishal Gupta and Gurpreet Singh Lehal. A survey of text summarization extractive techniques. Journal of emerging technologies in web intelligence, 2(3):258–268, 2010.
- Han et al. (2018) Zhongyi Han, Benzheng Wei, Stephanie Leung, Jonathan Chung, and Shuo Li. Towards automatic report generation in spine radiology using weakly supervised framework. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 185–193. Springer, 2018.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Strobelt and Gehrmann. Hendrik Strobelt and Sebastian Gehrmann. Giant language model test room. URL http://gltr.io/dist/index.html.
- Hsu et al. (2018) Tzu-Ming Harry Hsu, Wei-Hung Weng, Willie Boag, Matthew McDermott, and Peter Szolovits. Unsupervised multimodal representation learning across medical images and reports. arXiv preprint arXiv:1811.08615, 2018.
- Huang et al. (1993) Xuedong Huang, Fileno Alleva, Hsiao-Wuen Hon, Mei-Yuh Hwang, Kai-Fu Lee, and Ronald Rosenfeld. The sphinx-ii speech recognition system: an overview. Computer Speech & Language, 7(2):137–148, 1993.
- Iandola et al. (2014) Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014.
- Irvin et al. (2019) Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. arXiv preprint arXiv:1901.07031, 2019.
- Jing et al. (2017) Baoyu Jing, Pengtao Xie, and Eric Xing. On the automatic generation of medical imaging reports. arXiv preprint arXiv:1711.08195, 2017.
- Johnson et al. (2019) Alistair EW Johnson, Tom J Pollard, Seth Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019.
- Kahn Jr et al. (2009) Charles E Kahn Jr, Curtis P Langlotz, Elizabeth S Burnside, John A Carrino, David S Channin, David M Hovsepian, and Daniel L Rubin. Toward best practices in radiology reporting. Radiology, 252(3):852–856, 2009.
- Kilickaya et al. (2016) Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. Re-evaluating automatic metrics for image captioning. arXiv preprint arXiv:1612.07600, 2016.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krause et al. (2017) Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 317–325, 2017.
- Leaman et al. (2015) Robert Leaman, Ritu Khare, and Zhiyong Lu. Challenges in clinical natural language processing for automated disorder normalization. Journal of biomedical informatics, 57:28–37, 2015.
- Li et al. (2018) Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Hybrid retrieval-generation reinforced agent for medical image report generation. In Advances in Neural Information Processing Systems, pages 1537–1547, 2018.
- Lin (2004) Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and Larry Zitnick. Microsoft coco: Common objects in context. In ECCV. European Conference on Computer Vision, September 2014. URL https://www.microsoft.com/en-us/research/publication/microsoft-coco-common-objects-in-context/.
- Litjens et al. (2017) Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
- Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023, 2016.
- Lu et al. (2017) Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 375–383, 2017.
- Moradi et al. (2018) Mehdi Moradi, Ali Madani, Yaniv Gur, Yufan Guo, and Tanveer Syeda-Mahmood. Bimodal network architectures for automatic generation of image annotation from text. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 449–456. Springer, 2018.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
- Rajeswar et al. (2017) Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. Adversarial generation of natural language. arXiv preprint arXiv:1705.10929, 2017.
- Rajpurkar et al. (2017) Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. CheXnet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
- Rehurek and Sojka (2010) Radim Rehurek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer, 2010.
- Rennie et al. (2017) Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024, 2017.
- Rubin (2015) Geoffrey D Rubin. Lung nodule and cancer detection in ct screening. Journal of thoracic imaging, 30(2):130, 2015.
- Rubin et al. (2018) Jonathan Rubin, Deepan Sanghavi, Claire Zhao, Kathy Lee, Ashequl Qadir, and Minnan Xu-Wilson. Large scale automated reading of frontal and lateral chest x-rays using dual convolutional neural networks. arXiv preprint arXiv:1804.07839, 2018.
- Schlegl et al. (2015) Thomas Schlegl, Sebastian M Waldstein, Wolf-Dieter Vogl, Ursula Schmidt-Erfurth, and Georg Langs. Predicting semantic descriptions from medical images with convolutional neural networks. In International Conference on Information Processing in Medical Imaging, pages 437–448. Springer, 2015.
- Schwartz et al. (2011) Lawrence H Schwartz, David M Panicek, Alexandra R Berk, Yuelin Li, and Hedvig Hricak. Improving communication of diagnostic radiology findings through structured reporting. Radiology, 260(1):174–181, 2011.
- Shin et al. (2015) Hoo-Chang Shin, Le Lu, Lauren Kim, Ari Seff, Jianhua Yao, and Ronald M Summers. Interleaved text/image deep mining on a very large-scale radiology database. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1090–1099, 2015.
- Shin et al. (2016) Hoo-Chang Shin, Kirk Roberts, Le Lu, Dina Demner-Fushman, Jianhua Yao, and Ronald M Summers. Learning to read chest X-rays: Recurrent neural cascade model for automated image annotation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2497–2506, 2016.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
- Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
- Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
- Wang et al. (2016) Xiaosong Wang, Le Lu, Hoo-chang Shin, Lauren Kim, Isabella Nogues, Jianhua Yao, and Ronald Summers. Unsupervised category discovery via looped deep pseudo-task optimization using a large scale radiology image database. arXiv preprint arXiv:1603.07965, 2016.
- Wang et al. (2017) Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3462–3471. IEEE, 2017.
- Wang et al. (2018) Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M Summers. TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9049–9058, 2018.
- Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Wiseman et al. (2017) Sam Wiseman, Stuart M Shieber, and Alexander M Rush. Challenges in data-to-document generation. arXiv preprint arXiv:1707.08052, 2017.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
- Yao et al. (2017) Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision, pages 4894–4902, 2017.
- Zhang et al. (2018) Yuhao Zhang, Daisy Yi Ding, Tianpei Qian, Christopher D Manning, and Curtis P Langlotz. Learning to summarize radiology findings. arXiv preprint arXiv:1809.04698, 2018.
Appendix A Implementation Details
We briefly describe the details of our implementation in this section.
The image encoder CNN takes an input image of size . The last layer before global pooling in a DenseNet-121 is extracted, which has a dimension of , and thus and . DenseNet-121 (Iandola et al., 2014) has been shown to be state-of-the-art for classification of clinical images. The image features are then projected to dimensions with a dropout of .
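The shape bookkeeping of this step can be sketched as follows. This is illustrative only: the exact spatial and embedding dimensions are those stated above (elided here), and the function, weight matrix, and inverted-dropout formulation are ours, not the paper's exact code:

```python
import numpy as np

def project_features(feat_map, W, p_drop=0.0, rng=None):
    """Flatten a (H, W, C) pre-pooling CNN feature map into H*W region
    vectors and project each to d dimensions, with optional inverted
    dropout applied at train time."""
    H, Wd, C = feat_map.shape
    regions = feat_map.reshape(H * Wd, C) @ W          # (H*W, d) region embeddings
    if p_drop > 0.0:
        rng = rng or np.random.default_rng(0)
        mask = rng.random(regions.shape) >= p_drop
        regions = regions * mask / (1.0 - p_drop)      # inverted dropout scaling
    return regions
```

Keeping the spatial grid (rather than globally pooling it away) is what later lets the decoder attend to individual image regions.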
Since X-ray image acquisition typically records the view position, indicating the patient's posture relative to the machine, we also pass this information into the model. The view position, indicated by a one-hot vector, is embedded and concatenated with the image embedding to form the input to the downstream decoders.
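A minimal sketch of this conditioning step follows. For brevity we concatenate the one-hot vector directly rather than a learned embedding, and the view-position vocabulary shown is an illustrative assumption, not the dataset's exact set:

```python
import numpy as np

VIEW_POSITIONS = ["PA", "AP", "LATERAL"]  # illustrative view vocabulary

def decoder_input(image_emb, view):
    """Concatenate the image embedding with a one-hot view-position vector
    to form the conditioning input for the downstream decoders."""
    one_hot = np.zeros(len(VIEW_POSITIONS))
    one_hot[VIEW_POSITIONS.index(view)] = 1.0
    return np.concatenate([image_emb, one_hot])
```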
As previously mentioned, the input image embedding to the LSTM has a dimension of , as do the word embeddings and hidden layers. The word embedding matrix is pretrained with Gensim (Rehurek and Sojka, 2010) in an unsupervised manner.
We implement our model in PyTorch (Paszke et al., 2017) and train on 4 GeForce GTX TITAN X GPUs. All models are first trained with a cross-entropy loss using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of and a batch size of for epochs. Other than the pretrained weights stated above, the models are initialized randomly. Learning rates are annealed by every epochs, and we increase the probability of feeding back a sample from the posterior by every epochs. After this bootstrapping stage, we train with REINFORCE for another epochs. The initial learning rate for the second stage is and is annealed on the same schedule.
Appendix B TieNet Re-implementation
Since the implementation of TieNet (Wang et al., 2018) has not been released, we re-implement it from the descriptions provided by the original authors. The re-implementation details are described in this section.
TieNet stands for Text-Image Embedding Network. It consists of three main components: an image encoder, a sentence decoder with an attention network, and a joint learning network. It computes a globally attention-encoded text embedding from the sentence decoder's hidden states, and a saliency-weighted global average pooling of the image features using the attention network's attention maps. The two global representations are combined as the input to the joint learning network, which outputs a multi-label classification of thoracic diseases. The end products are automatic report generation for medical images and classification of thoracic diseases.
The image encoder CNN takes an image of size as input. The last two layers of ResNet-101 (He et al., 2016) are removed, since we are not classifying the image directly, and the final encoding has a size of . We also fine-tune convolutional blocks conv2 through conv4 of the image encoder during training.
We also include the view position information by concatenating the view position embedding, indicated by a one-hot vector, with the image embedding to form the input. At each decoding step, the encoded image and the previous hidden state (with a dropout of ) are used by the attention network to generate a weight for each pixel. The previously generated word and the output of the attention network are then fed to the LSTM decoder to generate the next word.
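One soft-attention step of this kind can be sketched as follows. The additive scoring form and all weight names are ours; this is the standard visual-attention pattern (Xu et al., 2015) rather than TieNet's exact parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(regions, h_prev, Wr, Wh, v):
    """One soft-attention step: score each image region (pixel feature)
    against the previous decoder hidden state, then return the
    attention-weighted context vector and the attention map."""
    scores = np.tanh(regions @ Wr + h_prev @ Wh) @ v   # (num_regions,) additive scores
    alpha = softmax(scores)                            # attention map over regions
    context = alpha @ regions                          # weighted sum of region features
    return context, alpha
```

The context vector is what gets concatenated with the previous word embedding as input to the LSTM decoder at each step.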
Joint Learning Network
TieNet proposes an additional component to automatically classify and report thoracic diseases. The joint learning network takes the hidden states and attention maps from the decoder, computes global representations of the report and the image, and combines them as the input to a fully connected layer that outputs disease labels.
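The data flow of this component can be sketched as follows. The mean poolings below are deliberate simplifications: TieNet's actual text representation uses a separate multi-head global attention, and its image representation is a saliency-weighted pooling driven by the decoder's attention maps; we use means here purely to show how the two modalities are combined:

```python
import numpy as np

def joint_representation(hidden_states, attention_maps, regions):
    """Combine text and image global representations, in the spirit of
    TieNet's joint learning network (shapes and poolings illustrative):
    hidden_states: (T, H) decoder states over T decoding steps
    attention_maps: (T, N) attention over N image regions per step
    regions: (N, C) image region features."""
    text_global = hidden_states.mean(axis=0)            # (H,) stand-in text pooling
    saliency = attention_maps.mean(axis=0)              # (N,) average attention per region
    saliency = saliency / saliency.sum()
    image_global = saliency @ regions                   # (C,) saliency-weighted pooling
    return np.concatenate([text_global, image_global])  # input to the FC classifier
```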
In the original paper, denotes the number of attention heads, which we set to ; is the hidden size for attention generation, which we set to . One key difference from the original work is that we classify the joint embeddings into the CheXpert (Irvin et al., 2019) annotated labels, and hence our class count is . The disease classification cross-entropy loss and the teacher-forcing report generation loss are combined as , where is the loss the network optimizes. The value of was not disclosed in the original work, and we use .
We implement TieNet in PyTorch (Paszke et al., 2017) and train on GeForce GTX TITAN X GPUs. The decoder is trained with a cross-entropy loss using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of and a mini-batch size of for epochs. The decoder's learning rate is decayed by a factor of if the BLEU (Papineni et al., 2002) score on the development set does not improve for consecutive epochs. The joint learning network is trained with a sigmoid binary cross-entropy loss using the Adam optimizer with a constant learning rate of .
Since we do not have access to the original implementation of TieNet, and we additionally inject view position information into the model, there may be small variations in results between the original paper and our re-implementation. We compare only the report generation component of TieNet to our model.