Clinically Accurate Chest X-Ray Report Generation

04/04/2019 ∙ by Guanxiong Liu, et al. ∙ University of Toronto ∙ MIT

The automatic generation of radiology reports given medical radiographs has significant potential to operationally and clinically improve patient care. A number of prior works have focused on this problem, employing advanced methods from computer vision and natural language generation to produce readable reports. However, these works often fail to account for the particular nuances of the radiology domain, and, in particular, the critical importance of clinical accuracy in the resulting generated reports. In this work, we present a domain-aware automatic chest X-Ray radiology report generation system which first predicts what topics will be discussed in the report, then conditionally generates sentences corresponding to these topics. The resulting system is fine-tuned using reinforcement learning, considering both readability and clinical accuracy, as assessed by the proposed Clinically Coherent Reward. We verify this system on two datasets, Open-I and MIMIC-CXR, and demonstrate that our model offers marked improvements on both language generation metrics and CheXpert assessed accuracy over a variety of competitive baselines.




1 Introduction

A critical task in radiology practice is the generation of a free-text description, or report, based on a clinical radiograph (e.g., a chest X-ray). Providing automated support for this task has the potential to ease clinical workflows and improve both the quality and standardization of care. However, this process poses significant technical challenges. Many traditional image captioning approaches are designed to produce far shorter and less complex pieces of text than radiology reports. Further, these approaches do not capitalize on the highly templated nature of radiology reports. Additionally, generic natural language generation (NLG) methods prioritize descriptive accuracy only as a byproduct of readability, whereas providing an accurate clinical description of the radiograph is the first priority of the report. Prior works in this domain have partially addressed these issues, but significant gaps remain towards producing high-quality reports with maximal clinical efficacy.

In this work, we take steps to address these gaps through our novel automatic chest X-ray radiology report generation system. Our model hierarchically generates a sequence of unconstrained topics, using each topic to generate a sentence for the final generated report. In this way, we capitalize on the often-templated nature of radiology reports while simultaneously offering the system sufficient freedom to generate diverse, free-form reports. The system is finally tuned via reinforcement learning to optimize readability (via the CIDEr score) as well as clinical accuracy (via the concordance of CheXpert (Irvin et al., 2019) disease state labels between the ground truth and generated reports). We test this system on the MIMIC-CXR (Johnson et al., 2019) dataset, which is the largest paired image-report dataset presently available, and demonstrate that our model offers improvements on both NLG evaluation metrics (BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and ROUGE (Lin, 2004)) and clinical efficacy metrics (CheXpert concordance) over several compelling baseline models, including a re-implementation of TieNet (Wang et al., 2018), simpler neural baselines, and a retrieval-based baseline.

Clinical Relevance

This work focuses on generating a clinically useful radiology report from a chest X-ray for clinicians. This task has been explored multiple times, but directly transplanting natural language generation techniques into the field only guarantees that the reports look realistic, not that their findings are correct. A more immediate focus for the report generation task is thus to produce accurate disease profiles to power downstream tasks such as diagnosis. Our goal is therefore to preserve language fluency while increasing the clinical efficacy of the generated reports.

Technical Significance

We employ a hierarchical convolutional-recurrent neural network as the backbone for our proposed method. Reinforcement learning (RL) on a combined objective of language fluency metrics and the proposed Clinically Coherent Reward yields a model that describes disease states more accurately. Our method aims to numerically align the disease labels of our generated report, as produced by a natural language labeler, with the labels from the ground truth reports. The reward function, though non-differentiable, can be optimized through policy gradient learning.

2 Background & Related Work

2.1 Radiology

Figure 1: A chest X-ray and its associated report written by a radiologist.

Radiology Practice

Diagnostic radiology is the medical field of creating and evaluating radiological images (radiographs) of patients for the purposes of diagnostics. Radiologists are trained to simultaneously identify various radiological findings (e.g., diseases), according to the details of the radiograph and the patient’s clinical history, then summarize these findings and their overall impression in reports for clinical communication (Kahn Jr et al., 2009; Schwartz et al., 2011). A report typically consists of sections such as history, examination reason, findings, and impressions. As shown in Figure 1, the findings section contains a sequence of positive, negative, or uncertain mentions of either disease observations or instruments including their detailed location and severity. The impression section, by contrast, summarizes diagnoses considering all report sections above and previous studies on the patient.

Correctly identifying all abnormalities is a challenging task due to high variation, atypical cases, and the information overload inherent to some imaging modalities, such as computerized tomography (CT) scans (Rubin, 2015). This presents a strong intervention surface for machine learning techniques to help radiologists correctly identify the critical findings from a radiograph. The canonical way to communicate such findings in current practice would be through the free-text report, which could either be used as a “draft” report for the radiologists to extend, or be presented directly to the physician requesting a radiological study (Schwartz et al., 2011).

AI on Radiology Data

Dataset | Source Institution | Disease Labeling | # Images | # Reports | # Patients
Open-I | Indiana Network for Patient Care | Expert | 8,121 | 3,996 | 3,996
ChestX-ray8 | National Institutes of Health | Automatic (DNorm + MetaMap) | 108,948 | 0 | 32,717
CheXpert | Stanford Hospital | Automatic (CheXpert labeler) | 224,316 | 0 | 65,240
PadChest | Hospital Universitario de San Juan | Expert + Automatic (Neural network) | 160,868 | 206,222 | 67,625
MIMIC-CXR | Beth Israel Deaconess Medical Center | Automatic (CheXpert labeler) | 473,057 | 206,563 | 63,478
Table 1: A description of the available chest X-ray datasets: Open-I (Demner-Fushman et al., 2015), ChestX-ray8 (Wang et al., 2017), which utilized DNorm (Leaman et al., 2015) and MetaMap (Aronson and Lang, 2010), CheXpert (Irvin et al., 2019), PadChest (Bustos et al., 2019), and MIMIC-CXR (Johnson et al., 2019).

In recent years, several chest radiograph datasets, totalling almost a million X-ray images, have been made publicly available. A summary of these datasets is available in Table 1. Learning effective computational models by leveraging the information in medical images and free-text reports is an emerging field. Such a combination of image and textual data helps further improve model performance in both image annotation and automatic report generation (Litjens et al., 2017).

Schlegl et al. (2015) first proposed a weakly supervised learning approach to utilize semantic descriptions in reports as labels for better classifying the tissue patterns in optical coherence tomography (OCT) imaging. In the field of radiology, Shin et al. (2016) proposed a convolutional and recurrent network framework that is jointly trained on images and text to annotate disease, anatomy, and severity in chest X-ray images. Similarly, Moradi et al. (2018) jointly processed image and text signals to produce regions of interest over chest X-ray images. Rubin et al. (2018) trained a convolutional network to predict common thoracic diseases given chest X-ray images. Shin et al. (2015), Wang et al. (2016), and Wang et al. (2017) mined radiological reports to create disease and symptom concepts as labels. They first used Latent Dirichlet Allocation (LDA) to identify the topics for clustering, then applied disease detection tools such as DNorm, MetaMap, and several other Natural Language Processing (NLP) tools for downstream chest X-ray classification using a convolutional neural network. They also released the label set along with the image data.

Later on, Wang et al. (2018) used the same chest X-ray dataset to further improve the performance of disease classification and report generation from an image. For report generation, Jing et al. (2017) built a multi-task learning framework, which includes a co-attention mechanism module and a hierarchical long short-term memory (LSTM) module, for radiological image annotation and report paragraph generation. Li et al. (2018) proposed a reinforcement learning-based Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent) to learn a report generator that can decide whether to retrieve a template or generate a new sentence. Alternatively, Gale et al. (2018) generated interpretable hip fracture X-ray reports by identifying image features and filling text templates.

Finally, Hsu et al. (2018) learned a joint representation of radiological images and reports through unsupervised alignment of cross-modal embedding spaces for information retrieval.

2.2 Language Generation

Language generation (LG) is a staple of NLP research. LG comes up in the context of neural machine translation, summarization, question answering, image captioning, and more. In all these tasks, the challenges of generating discrete sequences that are realistic, meaningful, and linguistically correct must be met, and the field has devised a number of methods to surmount them. For many years, this was done through n-gram-based (Huang et al., 1993) or retrieval-based (Gupta and Lehal, 2010) approaches.

Within the last few years, many have explored the very impressive results of deep learning for text generation. Graves (2013) outlined best practices for RNN-based sequence generation. The following year, Sutskever et al. (2014) introduced the sequence-to-sequence paradigm for machine translation and beyond. However, Wiseman et al. (2017) demonstrated that while RNN-generated texts are often fluent, they have typically failed to reach human-level quality. Alternatively, Rajeswar et al. (2017) and Fedus et al. (2018) have tried using Generative Adversarial Networks (GANs) for text generation. However, Caccia et al. (2018) observed problems with training GANs and showed that, to date, they are unable to beat canonical sequence decoder methods.

Image Captioning

We will also highlight some specific areas of exploration in image captioning, a specific kind of language generation which is conditioned on an image input. The canonical example of this task is realized in the Microsoft COCO (Lin et al., 2014) dataset, which presents a series of images, each annotated with five human-written captions describing the image. The task, then, is to use the image as input to generate a readable, accurate, and linguistically correct caption.

This task has received significant attention with the success of Show and Tell (Vinyals et al., 2015) and its followup Show, Attend, and Tell (Xu et al., 2015). Due to the nature of the COCO competition, other works quickly emerged showing strong results: Yao et al. (2017) used boosting methods, Lu et al. (2017) employed adaptive attention, and Rennie et al. (2017) introduced reinforcement learning as a method for fine-tuning generated text. Devlin et al. (2015) performed surprisingly well using a k-nearest neighbor method. They observed that since most of the true captions were simple, one-sentence scene descriptions, there was significant redundancy in the dataset.

2.3 Radiology Report Generation

Multiple recent works have explored our task of radiology report generation. Zhang et al. (2018) used a combination of extractive and abstractive techniques to summarize a radiology report’s findings to generate an impression section. Due to limited text training data, Han et al. (2018) relied on weak supervision for a Recurrent-GAN and template-based framework for MRI report generation. Gale et al. (2018) used an RNN to generate template-generated text descriptions of pelvic X-rays.

More comparable to this work, Wang et al. (2018) used a CNN-RNN architecture with attention to generate reports that describe chest X-rays. Li et al. (2018) generated chest X-ray reports using reinforcement learning to tune a hierarchical decoder that chooses (for each sentence) whether to use an existing template or to generate a new sentence.

3 Methods

Figure 2: The model for our proposed Clinically Coherent Reward. Images are first encoded into image embedding maps, and a sentence decoder takes the pooled embedding to recurrently generate topics for sentences. The word decoder then generates the sequence from the topic with attention on the original images. NLG reward, clinically coherent reward, or combined, can then be applied as the reward for reinforcement policy learning.

In this work we opt to focus on generating the findings section as it is the most direct annotation from the radiological images. First, we introduce the hierarchical generation strategy with a CNN-RNN-RNN architecture, and later we propose novel improvements that render the generated reports more clinically aligned with the true reports. Full implementation details, including layer sizes, training details, etc., are presented in the Appendix, Section A.

3.1 Hierarchical Generation via CNN-RNN-RNN

As illustrated in Figure 2, we aim to generate a report as a sequence of sentences $Z = (z_1, \dots, z_M)$, where $M$ is the number of sentences in a report. Each sentence consists of a sequence of words $z_i = (z_{i1}, \dots, z_{iN_i})$ with words from a vocabulary $\mathbb{V}$, where $N_i$ is the number of words in sentence $i$.

The image is fed through the image encoder CNN to obtain a visual feature map. The feature is then taken by the sentence decoder RNN to recurrently generate vectors that represent the topic for each sentence. With the visual feature map and the topic vector, a word decoder RNN tries to generate a sequence of words and attention maps of the visual features. This hierarchical approach is in line with Krause et al. (2017) where they generate descriptive paragraphs for an image.

Image encoder CNN

The input image is passed through a CNN head to obtain the last layer before global pooling. We adjust the feature dimension to $d$ by adding a fully connected layer. The resulting map $V = \{v_k\}_{k=1}^{K}$ of $K$ spatial image features, each of dimensionality $d$, provides descriptive features for different spatial locations of an image. A mean visual feature $\bar{v} = \frac{1}{K}\sum_{k=1}^{K} v_k$ is obtained by averaging all local visual features $v_k$.

Sentence decoder RNN

Given the mean visual feature $\bar{v}$, we adopt a Long Short-Term Memory (LSTM) network and model the hidden state as $(h_i, m_i) = \mathrm{LSTM}(\bar{v}; h_{i-1}, m_{i-1})$, where $h_{i-1}$ and $m_{i-1}$ are the hidden state vector and the memory vector for the previous time step, respectively. From the hidden state $h_i$, we further generate two components, namely the topic vector $\tau_i$ and the stop signal $u_i$, each computed from $h_i$ with trainable parameters ($W$'s and $b$'s); the stop signal is squashed by the sigmoid function $\sigma$. The stop signal acts as the end-of-sentence token: when it exceeds its threshold, it indicates that the sentence decoder RNN should stop generating the next sentence.
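The outer sentence loop described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `sentence_step` and `decode_words` are hypothetical stand-ins for the sentence decoder LSTM and the word decoder, and the 0.5 stop threshold is an assumed value.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def generate_report(
    mean_feature: Vector,
    sentence_step: Callable[[Vector, Vector], Tuple[Vector, Vector, float]],
    decode_words: Callable[[Vector], List[str]],
    max_sentences: int = 8,
    stop_threshold: float = 0.5,  # assumed value, not taken from the text
) -> List[List[str]]:
    """Emit one topic per sentence until the stop signal fires."""
    state: Vector = [0.0] * len(mean_feature)
    report: List[List[str]] = []
    for _ in range(max_sentences):
        # The sentence decoder yields a new state, a topic, and a stop logit.
        state, topic, stop_logit = sentence_step(mean_feature, state)
        # The word decoder turns the topic into one sentence.
        report.append(decode_words(topic))
        # A high stop signal means no further sentence is generated.
        if sigmoid(stop_logit) > stop_threshold:
            break
    return report


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_sentence_step(feature: Vector, state: Vector):
    new_state = [s + f for s, f in zip(state, feature)]
    return new_state, new_state, sum(new_state) - 2.5


def toy_decode_words(topic: Vector) -> List[str]:
    return ["finding", str(round(sum(topic)))]


toy_report = generate_report([0.5, 0.5], toy_sentence_step, toy_decode_words)
```

Keeping the stop decision at the sentence level means the word decoder never has to learn when the whole report ends, only when a single sentence ends.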

Word decoder RNN

After we decode the sentence topics, we can start to decode the words given the topic vector $\tau_i$. For simplicity, we drop the subscript $i$ as this process applies to all sentences. We adopt the visual sentinel (Lu et al., 2017) to partially look at the feature map $V$. The hidden states and outputs are again modeled with an LSTM, generating the posterior probability $p_j$ over the vocabulary with (1) the mean visual feature $\bar{v}$, (2) the topic vector $\tau$, and (3) the embedding of the previously generated word $z_{j-1}$, looked up in the trainable word embedding matrix $E$. At training time, the next word $z_j$ is sampled from this probability, that is, with the probability given by the $z_j$-th element of $p_j$.

This formulation enables the model to look at different parts on the image while having the option of “looking away” at a sentinel vector. Note that this hierarchical encoder-decoder CNN-RNN-RNN architecture is fully differentiable.
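To make the "looking away" option concrete, the sketch below computes attention weights over the spatial features plus one extra sentinel slot. It assumes the attention scores are already given (in the model they would come from learned layers); the function name and argument layout are illustrative only.

```python
import math
from typing import List, Tuple


def attend_with_sentinel(
    features: List[List[float]],
    sentinel: List[float],
    feature_scores: List[float],
    sentinel_score: float,
) -> Tuple[List[float], float]:
    """Softmax over spatial features and a sentinel; returns (context, sentinel weight)."""
    logits = feature_scores + [sentinel_score]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    candidates = features + [sentinel]
    # The context is the weighted average of the image features and the sentinel.
    context = [
        sum(w * vec[d] for w, vec in zip(weights, candidates))
        for d in range(len(sentinel))
    ]
    return context, weights[-1]
```

When the sentinel score dominates, the returned weight approaches one and the context carries almost no image information, which is the behavior one would expect for function words that do not depend on the radiograph.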

3.2 Reinforcement Learning for Readability

As Rennie et al. (2017) showed, the automatic NLG metric CIDEr (Vedantam et al., 2015) is superior to other metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). We consider the case of self-critical sequence training (SCST) (Rennie et al., 2017), which builds on the baseline version of the REINFORCE (Williams, 1992) algorithm, and minimize the negative expected reward as a function of the network parameters $\theta$, as $\mathcal{L}_{\mathrm{NLG}}(\theta) = -\mathbb{E}_{Z \sim p_\theta}\left[r_{\mathrm{NLG}}(Z, Z^*)\right]$, where $p_\theta$ is the distribution over output spaces and $r_{\mathrm{NLG}}$ is a metric evaluation function acting as a reward function that takes a sampled report $Z$ and a ground truth report $Z^*$. The baseline in SCST is replaced with the reward obtained with the greedily decoded report used at testing time.
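A minimal sketch of the self-critical objective for a single example follows. The reward function here is a toy unigram overlap, used only as a stand-in for CIDEr, and all function names are illustrative.

```python
from typing import Callable, List


def unigram_overlap_reward(hypothesis: str, reference: str) -> float:
    """Toy reward (a stand-in for CIDEr): fraction of reference words covered."""
    ref_words = set(reference.split())
    if not ref_words:
        return 0.0
    return len(ref_words & set(hypothesis.split())) / len(ref_words)


def scst_loss(
    sample_log_probs: List[float],
    sampled_report: str,
    greedy_report: str,
    reference_report: str,
    reward_fn: Callable[[str, str], float],
) -> float:
    """Self-critical loss: the greedy decode's reward serves as the baseline."""
    advantage = reward_fn(sampled_report, reference_report) - reward_fn(
        greedy_report, reference_report
    )
    # Minimizing this raises the log-probability of samples that beat greedy
    # decoding and lowers it for samples that do worse.
    return -advantage * sum(sample_log_probs)
```

In a real training loop the log-probabilities would be differentiable tensors and the advantage would be treated as a constant, so the gradient flows only through the log-probability term.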

3.3 Novel Reward for Clinically Accurate Reinforcement Learning

One major downside of the approach outlined so far is that, in the clinical context, aiming for a good automatic metric such as CIDEr is not enough to correctly characterize the disease states. Negative judgments on diseases are critical components of the reports, by which the radiologist indicates that the patient might not have the diseases that were of concern and among the reasons for the examination. Li et al. (2018) indicated that a good portion of chest X-ray reports are heavily templated in patterns such as "no pneumothorax or pleural effusion"; "the lungs are clear"; or "no focal consolidation, pneumothorax or large pleural effusion". These patterns also suggest that most patients are disease-free, hence the signal of positive mentions of disease will be sparse.

Simply optimizing the automatic LG metrics may misguide the model into merely mentioning the disease names, as opposed to correctly describing the disease states as positive or negative. For example, if the ground truth report read "no pleural effusion", the model would prefer the text "mild pleural effusion" over unrelated text or even an empty string, which means intelligent optimization systems could game these metrics at the expense of clinical accuracy.
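The effect is easy to reproduce with a simple clipped n-gram precision, the building block of BLEU-style metrics. The sketch below is a simplified stand-in, not an implementation of BLEU, CIDEr, or ROUGE themselves.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    grams: List[Tuple[str, ...]] = [
        tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
    ]
    return Counter(grams)


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "no pleural effusion"
wrong_state = ngram_precision("mild pleural effusion", reference, 1)  # 2 of 3 unigrams match
unrelated = ngram_precision("heart size is normal", reference, 1)     # no unigram matches
empty = ngram_precision("", reference, 1)                             # nothing to match
```

The clinically opposite statement scores well above both the unrelated sentence and the empty string, because the metric rewards shared disease words and cannot see that the negation was lost.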

We hence propose using a Clinically Coherent Reward (CCR), which utilizes a rule-based disease mention annotator, CheXpert (Irvin et al., 2019), to optimize our generated report for clinical efficacy directly. CheXpert performs classification on 12 types of thoracic diseases or X-ray related diagnoses. The mentions for support devices are also labeled. For each label type $t$, there are four possible outcomes for the labeling: (1) positive, (2) negative, (3) uncertain, or (4) absent mention; or, $l_t \in \{p, n, u, a\}$. This outcome can be used to model the positive/negative disease state as $s_t \in \{+, -\}$. CCR is then defined, summing over label types, as the correlation of the distributions over disease states between the generated text and the ground truth text, which training aims to maximize. Unfortunately, as the true diagnostic state of novel reports is unknown, we need to make several assumptions regarding the performance of the rule-based labeler, allowing us to infer the necessary conditional probabilities $p(s \mid l)$.

To motivate these assumptions, first note that these diseases are universally rare, or, $p(+) \ll p(-)$. Presuming the rule-based labeler has any discriminative power, we can thus conclude that if the labeler assigns a negative or an absent label ($l \in \{n, a\}$), the disease is even less likely to be present. For sufficiently rare conditions, a reasonable assumption and simplification is therefore to take $p(+ \mid n) = 0$ and $p(+ \mid a) = 0$. We further assume that the rule-based labeler has a very high precision, and thus $p(+ \mid p) = 1$. However, given an uncertain mention $l = u$, the desired output probabilities are difficult to assess. As such, we define a reward-specific hyperparameter for $p(+ \mid u)$, which is held fixed in this work. All of these assumptions could be easily adjusted, but they perform well for us here.
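The sketch below shows one way such a reward could be computed from labeler outputs. It is an illustration under stated assumptions, not the authors' code: the labeler itself is not invoked (label outcomes are passed in as plain strings), the per-label score is taken to be the agreement between the two disease-state distributions, and the value 0.5 for uncertain mentions is an assumed placeholder for the hyperparameter.

```python
from typing import Dict, Tuple

# Probability that the disease is present given each labeler outcome.
# Positive -> 1 and negative/absent -> 0 follow the assumptions in the text;
# the value for "uncertain" is a hyperparameter, and 0.5 is assumed here.
P_POSITIVE_GIVEN_LABEL: Dict[str, float] = {
    "positive": 1.0,
    "negative": 0.0,
    "absent": 0.0,
    "uncertain": 0.5,
}


def state_distribution(label: str) -> Tuple[float, float]:
    """Return (p(disease present), p(disease absent)) for one labeler outcome."""
    p_pos = P_POSITIVE_GIVEN_LABEL[label]
    return p_pos, 1.0 - p_pos


def clinically_coherent_reward(
    generated_labels: Dict[str, str], reference_labels: Dict[str, str]
) -> float:
    """Sum, over label types, the agreement between the two state distributions."""
    reward = 0.0
    for finding, ref_label in reference_labels.items():
        gen_label = generated_labels.get(finding, "absent")
        gen_pos, gen_neg = state_distribution(gen_label)
        ref_pos, ref_neg = state_distribution(ref_label)
        reward += gen_pos * ref_pos + gen_neg * ref_neg
    return reward
```

Under these assumptions, a generated report that omits a finding is treated exactly like one that negates it, which matches the simplification that negative and absent mentions both imply a disease-free state.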

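We also wish to use a baseline for the reward $r_{\mathrm{CCR}}$. Instead of using a single exponential moving average (EMA) over the total reward, we apply EMA separately to each term, so that the reward for each label type is offset by its own baseline, where each baseline is an EMA over the corresponding per-label reward, updated at every step.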

We wish to pursue both semantic alignment and clinical coherence with the ground truth report, and thus we combine the above rewards for reinforcement learning on our neural network in a weighted fashion. Specifically, the total reward is a weighted sum of $r_{\mathrm{NLG}}$ and $r_{\mathrm{CCR}}$, where the weight $\lambda$ controls the relative importance.
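The per-term baseline and the weighted combination can be sketched as follows. The class name, the momentum value, the zero initialization, and the exact form of the weighting (`nlg + ccr_weight * ccr`) are assumptions made for illustration; the text specifies only that each term has its own EMA baseline and that a single weight balances the two rewards.

```python
from typing import Dict


class PerTermEmaBaseline:
    """Keeps one exponential moving average per reward term (e.g. per label type)."""

    def __init__(self, momentum: float = 0.9) -> None:
        self.momentum = momentum  # assumed value
        self.baselines: Dict[str, float] = {}

    def advantage(self, term: str, reward: float) -> float:
        """Return reward minus the current baseline, then update the baseline."""
        baseline = self.baselines.get(term, 0.0)  # baselines start at zero (assumption)
        self.baselines[term] = self.momentum * baseline + (1.0 - self.momentum) * reward
        return reward - baseline


def combined_reward(nlg: float, ccr: float, ccr_weight: float) -> float:
    """Weighted sum of the fluency reward and the clinical reward."""
    return nlg + ccr_weight * ccr
```

Tracking a separate baseline per term keeps a frequently rewarded label from masking the learning signal of a rarely rewarded one, which a single baseline over the total would tend to do.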

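The derivative of the combined loss with respect to $\theta$ then takes the policy gradient form: each reward term, less its baseline, weights the gradient of the log-probability of the sampled outputs, where the probability is taken over the vocabulary. We can approximate this gradient with Monte-Carlo samples from $p_\theta$ and average gradients across training examples in the batch.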

4 Experiments

4.1 Datasets

In this work, we use two chest X-ray/report datasets: MIMIC-CXR (Johnson et al., 2019) and Open-I (Demner-Fushman et al., 2015).

MIMIC-CXR is the largest radiology dataset to date; per Table 1, it consists of 473,057 chest X-ray images and 206,563 reports from 63,478 patients. The images cover anteroposterior (AP), posteroanterior (PA), and lateral (LL) views. Furthermore, we eliminate duplicated radiograph images with adjusted brightness level or contrast, as they are commonly produced for clinical needs, and keep the remaining images and their reports. The radiological reports are parsed into sections, among which we extract the findings section. We then apply tokenization and keep only tokens that reach a minimum number of occurrences in the corpus.

Open-I is a public radiography dataset collected by Indiana University, consisting of chest X-ray images and reports. The reports are in XML format and include pre-parsed sections. We exclude the entries without a findings section. Tokenization is done similarly, but due to the relatively small size of the corpus, we apply a lower minimum occurrence threshold.
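Frequency-based vocabulary pruning of this kind can be sketched as below. Whitespace tokenization, lowercasing, and the `<unk>` token string are simplifying assumptions; the threshold is a parameter because it differs between the two datasets.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocabulary(reports: Iterable[str], min_count: int) -> Set[str]:
    """Keep tokens that occur at least min_count times across the corpus."""
    counts = Counter(token for report in reports for token in report.lower().split())
    return {token for token, count in counts.items() if count >= min_count}


def encode(report: str, vocabulary: Set[str], unknown: str = "<unk>") -> List[str]:
    """Replace out-of-vocabulary tokens with the unknown token."""
    return [tok if tok in vocabulary else unknown for tok in report.lower().split()]


corpus = ["no pneumothorax", "no pleural effusion", "small pleural effusion"]
vocab = build_vocabulary(corpus, min_count=2)
encoded = encode("no focal pneumothorax", vocab)
```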

Both datasets are partitioned by patient into a train/validation/test ratio of 7/1/2 so that there is no patient overlap between sets. Excluded words are replaced by an “unknown” token, and the word embeddings are pretrained separately for each dataset.
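A patient-level split can be sketched as follows. The shuffling seed and the integer rounding of the split sizes are assumptions; the point is that patients, not images, are assigned to partitions, so every study from one patient lands in the same set.

```python
import random
from typing import Dict, Iterable, Sequence, Set


def split_by_patient(
    patient_ids: Iterable[str],
    ratios: Sequence[int] = (7, 1, 2),
    seed: int = 0,  # assumed; any fixed seed gives a reproducible split
) -> Dict[str, Set[str]]:
    """Assign each unique patient to exactly one of train/validation/test."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    total = sum(ratios)
    n_train = len(unique) * ratios[0] // total
    n_val = len(unique) * ratios[1] // total
    return {
        "train": set(unique[:n_train]),
        "validation": set(unique[n_train : n_train + n_val]),
        "test": set(unique[n_train + n_val :]),
    }
```

Each image or report is then routed to the partition of its patient, which prevents near-duplicate studies of one person from leaking between training and evaluation.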

4.2 Evaluation Metrics

To compare with other models including prior state-of-the-art and baselines, we adopt several different metrics that focus on different aspects ranging from a natural language perspective to clinical adequacy.

Automatic LG metrics such as CIDEr-D (Vedantam et al., 2015), ROUGE-L (Lin, 2004), and BLEU (Papineni et al., 2002) measure the statistical relation between two text sequences. One concern with such statistical measures is that, with the limited scope of n-grams (n up to 4), we are unable to capture disease states, as negations are common in the medical corpus and the negation cue words and disease words can often be far apart in a sentence. As such, we also include medical abnormality detection as a metric. Specifically, we compare the CheXpert (Irvin et al., 2019) labeled annotations between the generated report and the ground truth report on the 14 categories related to thoracic diseases and support devices listed in Table 3. We evaluate the accuracy, precision, and recall for all models.
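As an illustration of how such label-based scores can be aggregated, the sketch below computes macro- and micro-averaged precision and recall from sets of positive labels. It assumes the standard definitions (macro averages per-label scores, micro pools the counts) and scores a label that is never predicted as zero; the function name is illustrative.

```python
from typing import Dict, List, Sequence, Set


def precision_recall_summary(
    predicted: Sequence[Set[str]], actual: Sequence[Set[str]], labels: Sequence[str]
) -> Dict[str, float]:
    """Macro and micro precision/recall over positive label mentions."""
    tp_all = fp_all = fn_all = 0
    precisions: List[float] = []
    recalls: List[float] = []
    for label in labels:
        tp = sum(1 for p, a in zip(predicted, actual) if label in p and label in a)
        fp = sum(1 for p, a in zip(predicted, actual) if label in p and label not in a)
        fn = sum(1 for p, a in zip(predicted, actual) if label not in p and label in a)
        # A label with no positive predictions (or no positives) scores zero.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    return {
        "precision_macro": sum(precisions) / len(labels),
        "precision_micro": tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0,
        "recall_macro": sum(recalls) / len(labels),
        "recall_micro": tp_all / (tp_all + fn_all) if tp_all + fn_all else 0.0,
    }
```

For example, with generated-report labels `[{"Edema"}, {"Edema"}, set()]` and ground truth labels `[{"Edema"}, set(), {"Fracture"}]`, the rare "Fracture" label pulls the macro precision down to 0.25 while the micro precision stays at 0.5.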

4.3 Models

We compare our methods with state-of-the-art image captioning and medical report generation models as well as some simple baseline models: (a) 1-NN, in which we query a test image for the closest X-ray image in the training set in terms of L2 distance in the image embedding space, and use the corresponding training report as the output text; (b) Show and Tell (S&T) (Vinyals et al., 2015); (c) Show, Attend, and Tell (SA&T) (Xu et al., 2015); and (d) TieNet (Wang et al., 2018). To allow comparable results across all models, we slightly modify previous models to also accept the view position embedding, which encodes AP/PA/LL as a one-hot vector, to utilize the extra information available at image acquisition. This applies to S&T, SA&T, and our re-implementation of TieNet, which is detailed in Appendix B because the authors did not release their code.
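The 1-NN baseline amounts to a nearest-neighbor lookup, sketched below with plain Python lists standing in for image embeddings. How the embeddings are produced is outside the scope of this sketch.

```python
import math
from typing import List, Sequence


def nearest_report(
    query_embedding: Sequence[float],
    train_embeddings: List[Sequence[float]],
    train_reports: List[str],
) -> str:
    """Return the report of the training image closest to the query in L2 distance."""
    best_index = min(
        range(len(train_embeddings)),
        key=lambda i: math.dist(query_embedding, train_embeddings[i]),
    )
    return train_reports[best_index]


retrieved = nearest_report(
    [0.9, 0.1],
    [[0.0, 1.0], [1.0, 0.0]],
    ["the lungs are clear.", "there is moderate cardiomegaly."],
)
```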

We observed our model to sometimes repeat the findings multiple times. We apply post-hoc processing where we remove exact duplicate sentences in the generated reports. This proves to improve the readability but interestingly slightly degrades NLG metrics.
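A minimal sketch of such a post-processing step is shown below. Splitting on periods is a simplifying assumption that would mishandle decimals or abbreviations; it is adequate for the lowercased, period-delimited reports shown in this paper.

```python
def remove_duplicate_sentences(report: str) -> str:
    """Drop exact repeats of a sentence, keeping the first occurrence and the order."""
    seen = set()
    kept = []
    for sentence in report.split("."):
        sentence = sentence.strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            kept.append(sentence)
    return ". ".join(kept) + "." if kept else ""


cleaned = remove_duplicate_sentences(
    "there is no pneumothorax. the lungs are clear. there is no pneumothorax."
)
```

Removing a repeated sentence shortens the report, which may be one reason overlap-based metrics drop slightly even though the text reads better.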

Additionally, we perform several ablation studies to inspect the contribution of various components of our model. In particular, we assess:

Ours (NLG)

Uses only the NLG reward for reinforcement learning, as is often the case with the prior state-of-the-art.

Ours (CCR)

Uses only the Clinically Coherent Reward and does not attempt to align the natural language metrics.

Ours (full)

Considers both rewards as formulated in Section 3.3.

In order to provide some context to the metric scores, we also trained an unsupervised RNN language model that generates free text without conditioning on input radiograph images, which we denote as Noise-RNN. All recurrent models, including prior works and our models, use beam search with the same beam size.

5 Results & Discussion

5.1 Quantitative Results

Natural Language Metrics

Dataset | Model | CIDEr | ROUGE | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Accuracy
MIMIC-CXR | Major Class | - | - | - | - | - | - | 0.828
MIMIC-CXR | Noise-RNN | 0.716 | 0.272 | 0.269 | 0.172 | 0.113 | 0.074 | 0.803
MIMIC-CXR | 1-NN | 0.755 | 0.244 | 0.305 | 0.171 | 0.098 | 0.057 | 0.818
MIMIC-CXR | S&T | 0.886 | 0.300 | 0.307 | 0.201 | 0.137 | 0.093 | 0.837
MIMIC-CXR | SA&T | 0.967 | 0.288 | 0.318 | 0.205 | 0.137 | 0.093 | 0.849
MIMIC-CXR | TieNet | 1.004 | 0.296 | 0.332 | 0.212 | 0.142 | 0.095 | 0.848
MIMIC-CXR | Ours (NLG) | 1.153 | 0.307 | 0.352 | 0.223 | 0.153 | 0.104 | 0.834
MIMIC-CXR | Ours (CCR) | 0.956 | 0.284 | 0.294 | 0.190 | 0.134 | 0.094 | 0.868
MIMIC-CXR | Ours (full) | 1.046 | 0.306 | 0.313 | 0.206 | 0.146 | 0.103 | 0.867
Open-I | Major Class | - | - | - | - | - | - | 0.911
Open-I | Noise-RNN | 0.747 | 0.291 | 0.233 | 0.130 | 0.087 | 0.061 | 0.914
Open-I | 1-NN | 0.728 | 0.201 | 0.232 | 0.116 | 0.051 | 0.018 | 0.911
Open-I | S&T | 0.926 | 0.306 | 0.265 | 0.157 | 0.105 | 0.073 | 0.915
Open-I | SA&T | 1.276 | 0.313 | 0.328 | 0.195 | 0.123 | 0.080 | 0.908
Open-I | TieNet | 1.334 | 0.311 | 0.330 | 0.194 | 0.124 | 0.081 | 0.902
Open-I | Ours (NLG) | 1.490 | 0.359 | 0.369 | 0.246 | 0.171 | 0.115 | 0.916
Open-I | Ours (CCR) | 0.707 | 0.244 | 0.162 | 0.084 | 0.055 | 0.036 | 0.917
Open-I | Ours (full) | 1.424 | 0.354 | 0.359 | 0.237 | 0.164 | 0.113 | 0.918
Table 2: Automatic Evaluation Scores. The table is divided into natural language metrics (CIDEr, ROUGE, and BLEU-1 to BLEU-4) and clinical finding accuracy scores (Accuracy). BLEU-n counts up to n-grams for evaluation, and accuracy is the averaged macro accuracy across all clinical findings. Major Class always predicts negative findings.

In Table 2 we show the automatic evaluation scores for baseline models, prior works, and variants of our models on the aforementioned test sets. Ours (NLG), which solely optimizes the CIDEr score, achieves superior performance in terms of natural language metrics, but its clinical meaningfulness is not significantly above the major class baseline, in which we predict all patients to be disease-free. This phenomenon is common among all other models that do not consider clinical alignment between the ground truth and the generated reports. On the other hand, our full model, which considers both natural language and clinical coherence, achieves the highest clinical disease annotation accuracy while still retaining decently high NLG metrics.

We also conducted an ablation study with the model variant Ours (CCR), where we use reinforcement learning on only the clinical accuracy. It is clear that we are able to achieve higher clinical coherence, though readability might be sacrificed. We thus conclude that a combination of both NLG metrics and a clinically sensible objective is crucial in training a model that is useful in clinical practice.

One thing to note is that although Noise-RNN is not dependent on the image, its NLG metrics, especially ROUGE, are not far off from those of models learned with supervision. We also note that MIMIC-CXR is better for training such an encoder-decoder model not just for its larger volume of data, but also due to its higher proportion of positive disease annotations compared with Open-I. This discrepancy leads to a 156 times increase in the number of images from diseased patients.

Clinical Efficacy Metrics

Label | Count | 1-NN | S&T | SA&T | TieNet | Ours (NLG) | Ours (CCR) | Ours (full)
MIMIC-CXR Total | 69031 | - | - | - | - | - | - | -
No Finding | 15677 | 0.432 | 0.299 | 0.349 | 0.339 | 0.339 | 0.491 | 0.405
Enlarged Cardiomediastinum | 6064 | 0.123 | 0.134 | 0.163 | 0.179 | 0.180 | 0.202 | 0.167
Cardiomegaly | 19065 | 0.440 | 0.535 | 0.438 | 0.464 | 0.000 | 0.678 | 0.704
Lung Lesion | 2447 | 0.064 | 0.333 | 0.223 | 0.000 | 0.000 | 0.000 | 0.000
Airspace Opacity | 21972 | 0.432 | 0.607 | 0.592 | 0.571 | 0.453 | 0.640 | 0.460
Edema | 6594 | 0.265 | 0.331 | 0.244 | 0.405 | 0.266 | 0.280 | 0.000
Consolidation | 2384 | 0.076 | 0.013 | 0.180 | 0.151 | 0.089 | 0.037 | 0.000
Pneumonia | 3068 | 0.065 | 0.106 | 0.091 | 0.082 | 0.075 | 0.000 | 0.400
Atelectasis | 16161 | 0.374 | 0.490 | 0.496 | 0.470 | 0.385 | 0.476 | 0.521
Pneumothorax | 2636 | 0.079 | 0.034 | 0.095 | 0.081 | 0.081 | 0.039 | 0.098
Pleural Effusion | 15283 | 0.534 | 0.550 | 0.545 | 0.735 | 0.487 | 0.683 | 0.689
Pleural Other | 1285 | 0.039 | 0.000 | 0.103 | 0.000 | 0.000 | 0.000 | 0.000
Fracture | 2617 | 0.059 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
Support Devices | 22227 | 0.534 | 0.823 | 0.847 | 0.827 | 0.794 | 0.849 | 0.880
Precision (macro) | | 0.253 | 0.304 | 0.312 | 0.307 | 0.225 | 0.313 | 0.309
Precision (micro) | | 0.383 | 0.414 | 0.430 | 0.473 | 0.419 | 0.634 | 0.586
Recall (macro) | | 0.265 | 0.173 | 0.232 | 0.220 | 0.209 | 0.126 | 0.134
Recall (micro) | | 0.400 | 0.276 | 0.367 | 0.355 | 0.360 | 0.227 | 0.237
Table 3: Clinical Finding Scores. The precision scores for each of the labels are listed and aggregated into the overall precision scores. Recall scores are shown in the last two rows. Macro denotes averaging the numbers in the table directly and micro accounts for class prevalence.

In Table 3 we compare the labels annotated by CheXpert, calculated over all generated reports for the test set. Note that the labeling process generates discrete binary labels as opposed to continuous probabilities, and as such we are unable to obtain discriminative metrics such as the Area Under the Receiver Operating Characteristic curve (AUROC) or the Area Under the Precision-Recall Curve (AUPRC). Precision-wise, Ours (CCR) achieves the highest overall scores, in both macro-average and micro-average. The runner-up is the Ours (full) model, which additionally considers language fluency. Note that the macro-averaged metrics can be quite noisy, as a per-class metric can depend on just a few examples. Many entries in the table are zeros: the corresponding models never yield positive predictions for those labels, and we record zero to penalize such behavior. Regarding recall, we see a substantial drop in Ours (CCR) and Ours (full) as a result of optimizing for accuracy. Accuracy is closely associated with precision, and overpursuing it can harm recall. It is worth noting that the nearest neighbor baseline 1-NN has the highest recall. This is no surprise since, as shown before (Hendrik Strobelt, ), generated sequences tend to follow the statistics and favor common words too much. Rare combinations of tokens in the corpus can be easily neglected, resulting in predictions of mostly major classes. Typically in language generation, we do not and cannot re-weight rare occurrences as we often can in image classification.

5.2 Qualitative Results

Evaluation of Generated Reports

Case 1
Ground Truth: cardiomegaly is moderate. bibasilar atelectasis is mild. there is no pneumothorax. a lower cervical spinal fusion is partially visualized. healed right rib fractures are incidentally noted.
TieNet: ap portable upright view of the chest. there is no focal consolidation, effusion, or pneumothorax. the cardiomediastinal silhouette is normal. imaged osseous structures are intact.
Ours (full): pa and lateral views of the chest. there is mild enlargement of the cardiac silhouette. there is no pleural effusion or pneumothorax. there is no acute osseous abnormalities.

Case 2
Ground Truth: as compared to the previous radiograph, the monitoring and support devices are unchanged. unchanged bilateral pleural effusions, with a tendency to increase, and resultant areas of atelectasis. the air collection in the bilateral soft tissues is slightly decreased. unchanged right picc line. no definite evidence of pneumothorax.
TieNet: as compared to the previous radiograph, the patient has received a nasogastric tube. the course of the tube is unremarkable, the tip of the tube projects over the middle parts of the stomach. there is no evidence of complication, notably no pneumothorax. the other monitoring and support devices are constant. constant appearance of the cardiac silhouette and of the lung parenchyma.
Ours (full): as compared to the previous radiograph, there is no relevant change. tracheostomy tube is in place. there is a layering pleural effusions. NAME bilateral pleural effusion and compressive atelectasis at the right base. there is no pneumothorax.

Case 3
Ground Truth: the course of the dobbhoff feeding catheter is unremarkable, and its tip is seen projecting over the stomach. there is no evidence of complications, specifically no pneumothorax. as compared to the prior radiograph dated DATE, there has been no other significant interval change.
TieNet: ap portable upright view of the chest. overlying ekg leads are present. there is no focal consolidation, effusion, or pneumothorax. the cardiomediastinal silhouette is normal. imaged osseous structures are intact.
Ours (full): as compared to the previous radiograph, there is no relevant change. the endotracheal tube terminates approximately 3 cm above the NAME. the endotracheal tube extends into the stomach. there is no evidence of complications, notably no pneumothorax. there is no pleural effusion or pneumothorax.

Case 4
Ground Truth: interval placement of a left basilar pigtail chest tube with improving aeration in the left mid to lower lung and near complete resolution of the pleural effusion. there are residual patchy opacities within the left mid and lower lung as well as at the right base favoring resolving atelectasis. no pneumothorax is appreciated on this semi upright study. heart remains stably enlarged. mediastinal contours are stably widened, although this NAME be related to portable technique and positioning. this can be better evaluated on followup imaging. no pulmonary edema.
TieNet: as compared to the previous radiograph, the patient has been extubated. the nasogastric tube is in unchanged position. the lung volumes remain low. moderate cardiomegaly with minimal fluid overload but no overt pulmonary edema. no larger pleural effusions. no pneumonia.
Ours (full): ap upright and lateral views of the chest. there is moderate cardiomegaly. there is no pleural effusion or pneumothorax. there is no acute osseous abnormalities.

Table 4: Sample images along with ground truth and generated reports. Note that upper case tokens are results of anonymization.

Table 4 shows qualitative results for our full model. In general, our model generates descriptions that follow the logical flow of reports written by radiologists: they start with general information (such as views and comparison with previous studies), followed by positive and then negative findings, in the order of lung, heart, pleura, and others. TieNet also generates reports with such a logical flow, but in a slightly different order. For cases with negative findings, both our model and TieNet generate reasonable descriptions without significant errors. For cases with positive findings, neither TieNet nor our full model identifies all radiological findings. Our full model does, however, identify the major finding in each demonstrated case, for example cardiomegaly in the first case, and pleural effusion and atelectasis in the second case.

A formerly practicing clinician co-author manually reviewed a larger subset of our generated reports and drew several conclusions. First, our full model tends to correctly generate sentences related to pleural effusion, atelectasis, and cardiomegaly, which is consistent with the clinical finding scores in Table 3. TieNet instead misses some positive findings in such cases. Second, there are significant issues in all generated reports, regardless of the source model, including the description of supportive lines and tubes, as well as lung lesions. For example, TieNet is prone to mention a nasogastric tube, while our model tends to mention a tracheostomy or endotracheal tube, and yet both models have difficulty identifying specific lines such as a chest tube or PICC line. Similarly, neither system correctly generates sentences describing positive lung parenchymal findings.

From this (small) sample, we cannot conclude whether our model or TieNet truly outperforms the other, since both present significant issues and each has strengths the other lacks. Critically, neither can describe the majority of the findings in the chest radiograph well, especially for positive cases, even though the quantitative metrics suggest reasonable performance. This illustrates that significant progress is still needed in this domain, perhaps building on the directions we explore here, before these techniques can be deployed in a clinical environment.

Learning Meaningful Attention Maps

Figure 3: Visualization of the generated report and image attention maps. Each underlined word is shown with its corresponding attention map in the same color. Best viewed in color.

Attention maps have been a useful tool for visualizing what a neural network is attending to, as demonstrated by Rajpurkar et al. (2017). Figure 3 shows the intermediate attention map for each word at the moment it is generated. The model roughly captures the location of the indicated disease or anatomical part. Interestingly, we also find that the attention map tends to be the complement of the actual region of interest when the disease keyword follows a negation cue word. This might indicate that the model is actively looking at the rest of the image to ensure it does not miss any possible signs before asserting a disease-free state. This behavior has not been widely discussed before, partly because attention maps for negations are not the primary focus of typical image captioning tasks, and partly because most attention mechanisms employed in a clinical context have been applied to classification tasks, where negation is formulated differently.

6 Conclusion

6.1 Limitations & Future Work

Our work has several notable limitations and opportunities for future work. First and foremost, the post-processing step required to remove repeated sentences is an ugly necessity, and we endeavor to remove it in future iterations of this work. Promising techniques exist in NLG for the inclusion of greater diversity, which warrant further investigation here.
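As an illustration of the kind of post-processing meant here, the sketch below drops exact duplicate sentences from a generated report while keeping the first occurrence of each. It assumes lowercased reports whose sentences are separated by periods, as in the samples of Table 4. The function name is hypothetical, and this is not necessarily the exact procedure used in our pipeline.

```python
from typing import List


def remove_repeated_sentences(report: str) -> str:
    """Drop exact duplicate sentences, keeping the first occurrence of each."""
    seen = set()
    kept: List[str] = []
    # Naive split on periods: good enough for this sketch, but it would also
    # split decimal numbers such as "3.5 cm".
    for sentence in report.split("."):
        sentence = sentence.strip()
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        kept.append(sentence)
    return " ".join(s + "." for s in kept)


generated = (
    "there is no pneumothorax. there is no pneumothorax. "
    "the heart is enlarged. there is no pneumothorax."
)
cleaned = remove_repeated_sentences(generated)
```

Only verbatim repeats are removed. Near-duplicates that differ by a single word are left untouched, which is one reason a generation-time remedy would be preferable.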

Secondly, our model operates using images in isolation, without consideration of whether these images are part of a series of ordered radiographs for a single patient, which might be summarized together. Using all available information has the potential to improve the quality of the generated reports, and should definitely be investigated further.

Lastly, we note that although our model yields very strong CheXpert precision, its recall is much worse. Recall and precision are favored to different degrees in different clinical contexts. For example, for screening purposes, recall (sensitivity) is the more appropriate metric, since healthy cases usually do not yield positive findings. Precision (positive predictive value), however, is much more critical for validating a clinical impression, which is common in an ICU setting where patients receive a radiological study on the basis of strong clinical suspicion. We believe that our system’s poor recall is a direct result of the setup of our RL models and the CCR reward, which optimizes for accuracy and inherently boosts precision; that is, the choice of optimization objective drives this result. Depending on the actual clinical application, we may, in turn, optimize Recall at Fixed Precision (R@P) or score via methods described by Eban et al. (2016).

6.2 Reflections on Trends in the Field

In the course of this work, we also encountered several broader issues that arise not only in our study but also in many related studies in this domain, and that warrant further thought by the community.

System Generalizability

The CheXpert labeler used in our models is rule-based, which makes it harder to generalize to other datasets and to capture implicit features of language patterns. CheXpert is also specific to English, and re-coding its rules for other natural languages would require considerable work. Subsequent research may benefit from a learning-based labeling approach, which could improve generalizability and extend to corpora in other languages, for example PadChest in Spanish.

Be Careful What You Wish For

NLG metrics are known to be only limited substitutes for a true assessment of readability (Kilickaya et al., 2016; Liu et al., 2016). For radiology reports specifically, the problem is even more pronounced, as prior works often use “readability” as a proxy for clinical efficacy. Additionally, we note that these NLG evaluation metrics are easily gamed. In our results, the post-processing step of removing exact duplicates actually worsens our CIDEr score, which is the opposite of what one would want from an NLG evaluation metric. Even though our proposed clinical coherence reward aims to resolve the misalignment between NLG metrics and real practice, we cannot clearly judge whether our system is better despite its performance on paper. This is especially troubling given the increasing trend of using reinforcement learning (RL) to directly optimize objectives, as has been done in prior work (Li et al., 2018) and as we do here. Though RL can offer marked improvements in these automatic metrics, which are currently the best the field can do, how well these improvements translate to real clinical efficacy is unclear. The careful design of improved evaluation metrics, specifically for radiology report generation, should be a prime focus for the field going forward.

6.3 Conclusion

In this work, we develop a chest X-Ray radiology report generation system which hierarchically generates topics from images, then words from topics. This structure gives the model the ability to use largely templated sentences (through the generation of similar topic vectors) while preserving its freedom to generate diverse text. The final system is also optimized with reinforcement learning for both readability (via CIDEr) and clinical correctness (via the novel Clinically Coherent Reward). Our system outperforms a variety of compelling baseline methods across readability and clinical efficacy metrics on both MIMIC-CXR and Open-I datasets.


Dr. Marzyeh Ghassemi is partially funded by a CIFAR AI Chair at the Vector Institute, and an NSERC Discovery Grant.


Appendix A Implementation Details

We briefly describe the details of our implementation in this section.


The image encoder CNN takes an input image of size . The output of the last layer before global pooling in DenseNet-121 is extracted, which has a dimension of , and thus and . DenseNet-121 (Iandola et al., 2014) has been shown to be state-of-the-art in the context of classification for clinical images. The image features are then projected to dimensions with a dropout of .

Since X-ray image acquisition typically provides the view position, which indicates the posture of the patient relative to the machine, we pass this information into the model as well. Represented as a one-hot vector, the view position embedding is concatenated with the image embedding to form the input to the later decoders.
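The concatenation can be sketched in a few lines of plain Python. The view vocabulary, the two-dimensional toy embedding, and the function names below are assumptions made for illustration. The actual set of view positions and the embedding size depend on the dataset and model configuration.

```python
from typing import List

# Hypothetical view-position vocabulary; the real set depends on the dataset.
VIEW_POSITIONS = ["AP", "PA", "LATERAL"]


def one_hot_view(view: str) -> List[float]:
    """Encode a view position as a one-hot vector (raises ValueError if unknown)."""
    index = VIEW_POSITIONS.index(view)
    return [1.0 if i == index else 0.0 for i in range(len(VIEW_POSITIONS))]


def decoder_input(image_embedding: List[float], view: str) -> List[float]:
    """Concatenate the image embedding with the one-hot view position."""
    return list(image_embedding) + one_hot_view(view)


# Toy two-dimensional image embedding, for illustration only.
example = decoder_input([0.2, -0.1], "PA")
```

The resulting vector is simply longer than the image embedding by the size of the view vocabulary, so the downstream decoder input size has to account for the extra dimensions.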


As previously mentioned, the input image embedding to the LSTM has a dimension of , and it is the same for word embeddings and hidden layer sizes. The word embedding matrix is pretrained with Gensim (Rehurek and Sojka, 2010) in an unsupervised manner.

Training Details

We implement our model in PyTorch (Paszke et al., 2017) and train on 4 GeForce GTX TITAN X GPUs. All models are first trained with cross-entropy loss with the Adam (Kingma and Ba, 2014) optimizer using an initial learning rate of and a batch size of for epochs. Other than the weights stated above, the models are initialized randomly. Learning rates are annealed by every epochs and we increase the probability of feeding back a sample from the posterior by every epochs. After this bootstrapping stage, we start training with REINFORCE for another epochs. The initial learning rate for the second stage is and is annealed on the same schedule.
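The two step-wise schedules described above can be sketched as follows. All numeric values in the example are placeholders chosen for illustration, not the settings used in our experiments. Treating the annealing as a multiplicative decay and the sampling probability as an additive increment is likewise an assumption.

```python
def annealed_lr(initial_lr: float, epoch: int, decay_factor: float, decay_every: int) -> float:
    """Step decay: multiply the learning rate by decay_factor every decay_every epochs."""
    return initial_lr * (decay_factor ** (epoch // decay_every))


def sampling_prob(epoch: int, increment: float, increase_every: int, max_prob: float = 1.0) -> float:
    """Scheduled sampling: probability of feeding back the model's own sample,
    raised by `increment` every `increase_every` epochs and capped at max_prob."""
    return min(max_prob, increment * (epoch // increase_every))


# Placeholder hyperparameters (hypothetical, for illustration only).
schedule = [
    (epoch, annealed_lr(1e-3, epoch, 0.5, 10), sampling_prob(epoch, 0.05, 10))
    for epoch in range(0, 30, 10)
]
```

With these placeholder values the learning rate halves at epochs 10 and 20, while the decoder sees its own samples with probability 0.05 and then 0.10.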

Following Rennie et al. (2017), we adopt the CIDEr-D (Vedantam et al., 2015) metric as the reward module used in . For the CCR baseline, we choose an EMA momentum . A weighting factor has been chosen to balance the scales of the rewards for our full model.
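An exponential moving average (EMA) baseline of the kind mentioned above can be sketched as follows. The momentum value, the zero initialization, and the class name are assumptions made for illustration. The baseline is subtracted from the reward to reduce the variance of the REINFORCE gradient estimate.

```python
class EMABaseline:
    """Running exponential moving average of observed rewards."""

    def __init__(self, momentum: float) -> None:
        self.momentum = momentum
        self.value = 0.0  # assumed initial baseline

    def advantage(self, reward: float) -> float:
        """Return reward minus the current baseline, then update the baseline."""
        adv = reward - self.value
        self.value = self.momentum * self.value + (1.0 - self.momentum) * reward
        return adv


# Hypothetical momentum and reward stream, for illustration only.
baseline = EMABaseline(momentum=0.9)
advantages = [baseline.advantage(r) for r in [1.0, 1.0, 0.0]]
```

As the baseline tracks the typical reward, a sample is reinforced only to the extent that it does better than recent samples, and penalized when it does worse.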

Appendix B TieNet Re-implementation

Since the implementation for TieNet (Wang et al., 2018) is not released, we re-implement it with the descriptions provided by the original authors. The re-implementation details are described in this section.


TieNet stands for Text-Image Embedding Network. It consists of three main components: image encoder, sentence decoder with Attention Network, and Joint Learning Network. It computes a global attention encoded text embedding using hidden states from a sentence decoder and saliency weighted global average pooling using attention maps from the attention network. The two global representations are combined as an input to the joint learning network. Finally, it outputs the multi-label classification of thoracic diseases. The end products are automatic report generation for medical images and classification of thoracic diseases.
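The idea of saliency weighted global average pooling can be illustrated with a simplified sketch. It sums the attention maps over decoding steps, normalizes the result into a single saliency map, and uses that map to weight the spatial features. The exact formulation in TieNet may differ, and the data layout and function name here are assumptions.

```python
from typing import List


def saliency_weighted_pooling(
    features: List[List[float]], attention_maps: List[List[float]]
) -> List[float]:
    """Pool spatial features using attention accumulated over decoding steps.

    features[n][c]       : channel c of the feature at spatial position n
    attention_maps[t][n] : attention weight on position n at decoding step t
    """
    num_positions = len(features)
    num_channels = len(features[0])
    # Accumulate attention over all decoding steps, then normalize to sum to 1.
    totals = [sum(step[n] for step in attention_maps) for n in range(num_positions)]
    norm = sum(totals)
    saliency = [t / norm for t in totals]
    # Weighted average of the features under the saliency map.
    return [
        sum(saliency[n] * features[n][c] for n in range(num_positions))
        for c in range(num_channels)
    ]


# Two spatial positions, two channels, two decoding steps (toy values).
features = [[1.0, 0.0], [0.0, 2.0]]
attention_maps = [[0.75, 0.25], [0.25, 0.75]]
pooled = saliency_weighted_pooling(features, attention_maps)
```

With uniform saliency this reduces to ordinary global average pooling. Positions that the decoder attended to more often contribute proportionally more to the pooled image representation.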


An image of size is taken by the image encoder CNN as an input. The last two layers of ResNet-101 (He et al., 2016) are removed since we are not classifying the image. The final encoding produced has a size of . We also fine-tune convolutional blocks conv2 through conv4 of our image encoder during training time.


We also include the view position information by concatenating the view position embedding with the image embedding to form the input. The view position embedding is represented by a one-hot vector. At each decoding step, the encoded image and the previous hidden state, with a dropout of , are used to generate weights for each pixel in the attention network. The previously generated word and the output from the attention network are fed to the LSTM decoder to generate the next word.

Joint Learning Network

TieNet proposed an additional component to automatically classify and report thoracic diseases. The joint learning network takes hidden states and attention maps from the decoder and computes global representations for report and images, then combines the result as the input to a fully connected layer to output disease labels.

In the original paper, indicates the number of attention heads, which we set as ; is the hidden size for attention generation, which we set as . One key difference from the original work is that we are classifying the joint embeddings into CheXpert (Irvin et al., 2019) annotated labels, and hence we have the class count . The disease classification cross-entropy loss and the teacher-forcing report generation loss are combined as , in which is the loss for which the network optimizes. However, the value was not disclosed in the original work and we use .


We implement TieNet in PyTorch (Paszke et al., 2017) and train on GeForce GTX TITAN X GPUs. The decoder is trained with cross-entropy loss with the Adam (Kingma and Ba, 2014) optimizer using an initial learning rate of and a mini-batch size of for epochs. The learning rate for the decoder is decayed by a factor of if there is no improvement in BLEU (Papineni et al., 2002) score on the development set in consecutive epochs. The joint learning network is trained with sigmoid binary cross-entropy loss with the Adam (Kingma and Ba, 2014) optimizer using a constant learning rate of .
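The plateau-based decay of the decoder learning rate can be sketched as below. The decay factor, patience, and BLEU values are placeholders for illustration, and the class is a hypothetical stand-in. PyTorch provides similar behavior through `torch.optim.lr_scheduler.ReduceLROnPlateau` with `mode="max"`.

```python
class PlateauDecay:
    """Decay the learning rate when the dev-set score stops improving."""

    def __init__(self, lr: float, factor: float, patience: int) -> None:
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, dev_score: float) -> float:
        """Record one epoch's dev score (e.g. BLEU) and return the updated learning rate."""
        if dev_score > self.best:
            self.best = dev_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                # No improvement for `patience` consecutive epochs: decay and reset.
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Placeholder hyperparameters and BLEU scores (hypothetical).
scheduler = PlateauDecay(lr=1e-3, factor=0.5, patience=2)
lrs = [scheduler.step(bleu) for bleu in [0.10, 0.12, 0.11, 0.12, 0.13]]
```

In this toy run the third and fourth epochs fail to beat the best score of 0.12, so the learning rate is halved once after the fourth epoch.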


Since we cannot access the original implementation of TieNet, and since we additionally inject view position information into the model, there may be small variations in results between the original paper and our re-implementation. We only compare the report generation part of TieNet to our model.