Automated Generation of Accurate & Fluent Medical X-ray Reports

by   Hoang T. N. Nguyen, et al.

Our paper focuses on automating the generation of medical reports from chest X-ray image inputs, a critical yet time-consuming task for radiologists. Unlike existing medical report generation efforts that tend to produce human-readable reports, we aim to generate medical reports that are both fluent and clinically accurate. This is achieved by our fully differentiable and end-to-end paradigm containing three complementary modules: taking the chest X-ray images and clinical history document of patients as inputs, our classification module produces an internal checklist of disease-related topics, referred to as enriched disease embedding; the embedding representation is then passed to our transformer-based generator, giving rise to the medical reports; meanwhile, our generator also produces the weighted embedding representation, which is fed to our interpreter to ensure consistency with respect to disease-related topics. Our approach achieved promising results on commonly-used metrics concerning language fluency and clinical accuracy. Moreover, noticeable performance gains are consistently observed when additional input information is available, such as the clinical document and extra scans of different views.




1 Introduction

Medical reports are the primary medium through which physicians communicate findings and diagnoses from the medical scans of patients. Writing them is usually laborious: typing out a medical report takes on average five to ten minutes Jing et al. (2018), and the process can also be error-prone. This has led to a surging need for automated generation of medical reports, to assist radiologists and physicians in making rapid and meaningful diagnoses. Its potential efficiency and benefits could be enormous, especially during critical situations such as COVID or a similar pandemic. Clearly, a successful medical report generation process is expected to possess two key properties: 1) clinical accuracy, to properly and correctly describe the disease and related symptoms; 2) language fluency, to produce realistic and human-readable text.

Fueled by recent progress in the closely related computer vision problem of image-based captioning Vinyals et al. (2015); Tran et al. (2020), there have been a number of research efforts in medical report generation in recent years Jing et al. (2018, 2019); Li et al. (2018, 2019); Xue et al. (2018); Yuan et al. (2019); Wang et al. (2018); Yin and others (2019); Lovelace and Mortazavi (2020); Srinivasan et al. (2020). These methods often perform reasonably well in addressing the language fluency aspect; on the other hand, as is also evidenced in our empirical evaluation, their results are notably less satisfactory in terms of clinical accuracy. We attribute this to two reasons. The first is closely tied to the textual characteristics of medical reports, which typically consist of many long sentences describing various disease-related symptoms and related topics in precise and domain-specific terms. This clearly sets the medical report generation task apart from a typical image-to-text problem such as image-based captioning. The second is the lack of full use of rich contextual information that encodes prior knowledge. This information includes, for example, the clinical document of the patient describing key clinical history and indication from doctors, and multiple scans from different 3D views, information that typically exists in abundance in practical scenarios, as in the standard X-ray benchmarks of Open-I Demner-Fushman et al. (2016) and MIMIC-CXR Johnson et al. (2019).

Figure 1: Our approach consists of three modules: a classifier that reads chest X-ray images and clinical history to produce an internal checklist of disease-related topics; a transformer-based generator to generate fluent text; an interpreter to examine and fine-tune the generated text to be consistent with the disease-related topics.

The aforementioned observations motivate us to propose a categorize-generate-interpret framework that places specific emphasis on clinical accuracy while maintaining adequate language fluency of the generated reports: a classifier module reads chest X-ray images (e.g., either single-view or multi-view images) and related documents to detect diseases and output enriched disease embedding; a transformer-based medical report generator; and a differentiable interpreter to evaluate and fine-tune the generated reports for factual correctness. The main contributions are two-fold:

  • We propose a differentiable end-to-end approach consisting of three modules (classifier-generator-interpreter), where the classifier module learns the disease feature representation via context modeling (section 3.1.3) and a disease-state aware mechanism (section 3.1.4); the generator module transforms the disease embedding into a medical report; and the interpreter module reads and fine-tunes the generated reports, enhancing the consistency between the generated reports and the classifier’s outputs.

  • Empirically, our approach is demonstrated to be competitive against many strong baselines on two widely-used benchmarks on an equal footing (i.e., without access to additional information). We also find that the clinical history of patients (prior knowledge) plays a vital role in improving the quality of the generated reports.

2 Related Work

2.1 Image-based Captioning and Medical Report Generation

Apart from some familiar topics such as disease detection Oh et al. (2020); Luo et al. (2020); Lu et al. (2020b); Rajpurkar et al. (2017); Lu et al. (2020a); Ranjan et al. (2018) and lung segmentation Eslami et al. (2020), the most related computer vision task is the emerging topic of image-based captioning, which aims at generating realistic sentences or topic-related paragraphs to summarize visual contents from images or videos Vinyals et al. (2015); Xu et al. (2015); Goyal et al. (2017); Rennie et al. (2017); Huang et al. (2019); Feng et al. (2019); Pei et al. (2019); Tran et al. (2020). Not surprisingly, recent progress in medical report generation Jing et al. (2018, 2019); Li et al. (2018, 2019); Xue et al. (2018); Yuan et al. (2019); Wang et al. (2018); Yin and others (2019); Lovelace and Mortazavi (2020); Srinivasan et al. (2020); Zhang et al. (2020); Huang et al. (2021); Gasimova et al. (2020); Singh et al. (2019); Nishino et al. (2020) has been particularly influenced by the successes in image-based captioning.

The work of Vinyals et al. (2015); Xu et al. (2015) is among the early approaches in medical report generation, where visual features are extracted by convolutional neural networks (CNNs); they are subsequently fed into recurrent neural networks (RNNs) to generate textual descriptions. To remedy the issue of inaccurate textual descriptions, a secondary task is explicitly adopted by Jing et al. (2018); Srinivasan et al. (2020) to select the top-$k$ most likely diseases to gauge report generation. The methods of Jing et al. (2019); Li et al. (2018), on the other hand, consider a reinforcement learning process to promote generating reports with correct contents. It has been noted by Jing et al. (2018, 2019); Li et al. (2018) that traditional RNNs are not well suited to generating long sentences and paragraphs Vaswani et al. (2017); Krause et al. (2017), which renders them insufficient for the medical report generation task Jing et al. (2018). This issue is relieved by either conceiving hierarchical RNN architectures Krause et al. (2017); Jing et al. (2018, 2019); Li et al. (2018); Xue et al. (2018); Yuan et al. (2019); Wang et al. (2018); Yin and others (2019), or resorting to alternative techniques, in particular the recently developed transformer architectures Vaswani et al. (2017); Srinivasan et al. (2020); Lovelace and Mortazavi (2020).

It is worth noting that most existing methods concentrate on the image-to-fluent-text aspect of the medical report generation problem; on the other hand, their results are considerably less adept at uncovering the intended disease and symptom related topics in the generated texts, the very content on which physicians base their decisions. To alleviate this issue, a graph-based approach is considered in Li et al. (2019): it starts by compiling a list of common abnormalities, then transforms them into correlated disease graphs, and categorizes medical reports into templates for paraphrasing. Its practical performance is, however, less stellar, which may be attributed to the fact that Li et al. (2019) is fundamentally based on detecting abnormalities from medical images, and thus may overlook other important information.

Figure 2: An example of our approach in action. The enriched disease embedding produced by the classification module is fed into the generation module as the initial input. Then, at each time step, the hidden state is obtained and used to predict the next output word. Finally, the interpretation module takes as input all predicted outputs to predict a checklist of disease-related topics, which is compared with the same topics output by the classification module for consistency verification.

2.2 Transformers

The transformer technique Vaswani et al. (2017) was first introduced in the context of machine translation with the purpose of expediting training and improving long-range dependency modeling. These goals are achieved by processing sequential data in parallel with an attention mechanism, consisting of a multi-head self-attention module and a feed-forward layer. By considering multi-head self-attention mechanisms, including e.g. a graph attention network Veličković et al. (2017), recent transformer-based models have shown considerable advancement in many difficult tasks, such as image generation Chen et al. (2020), story generation Radford et al. (2018), question answering, and language inference Devlin et al. (2018).

2.3 CheXpert Labeler

The CheXpert labeler Irvin et al. (2019) is a rule-based system that extracts and classifies medical reports into 14 common diseases. Each disease label is either positive, negative, uncertain, or unmentioned. This is a crucial part in building large-scale chest X-ray datasets, such as Irvin et al. (2019); Johnson et al. (2019), where an alternative manual labeling process may take years of effort. It could also be used to evaluate the clinical accuracy of a generated medical report Liu et al. (2019). Another important use of the CheXpert labeler is to facilitate the generation of medical reports. Since the rule-based CheXpert labeler is not differentiable, it is regarded as a score function estimator for reinforcement learning models Liu et al. (2019) to fine-tune the generated texts. However, reinforcement learning methods are often computationally expensive and difficult to bring to convergence in practice. As an alternative, Lovelace and Mortazavi (2020) propose an attention LSTM model and fine-tune the generated report via a differentiable Gumbel random sampling trick, with promising results.
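
To make the idea of a differentiable sampling trick concrete, the following is a minimal sketch of the commonly used Gumbel-softmax relaxation. It is an illustration only: this section does not spell out the exact formulation used by Lovelace and Mortazavi (2020), and the function name `gumbel_softmax`, the temperature value, and the toy logits are assumptions.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a soft one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [v / total for v in exps]

# Hypothetical word logits over a three-word vocabulary.
rng = random.Random(0)
sample = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
```

The output is a probability vector rather than a hard word index, so a loss computed on it can in principle be differentiated with respect to the logits; lower temperatures push the sample closer to one-hot.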

3 Our Approach

Our framework consists of a classification module, a generation module, and an interpretation module, as illustrated in Fig. 1. The classification module reads multiple chest X-ray images and extracts the global visual feature representation via a multi-view image encoder. These features are then disentangled into multiple low-dimensional visual embeddings. Meanwhile, the text encoder reads clinical documents, including, e.g., the doctor's indication, and summarizes the content into a text-summarized embedding. The visual and text-summarized embeddings are entangled via an “add & layerNorm” operation to form a contextualized embedding in terms of disease-related topics. The generation module takes our enriched disease embedding as initial input and generates text word-by-word, as shown in Fig. 2. Finally, the generated text is fed to the interpretation module for fine-tuning, to align it with the checklist of disease-related topics from the classification module. In what follows, we elaborate on these three modules in detail.

3.1 The Classification Module

3.1.1 Multi-view Image Encoder

For each medical study, which consists of $m$ chest X-ray images $\{X_i\}_{i=1}^{m}$, we extract the corresponding latent features $\{x_i\}_{i=1}^{m}$ with $x_i \in \mathbb{R}^{c}$, where $c$ is the number of features, via a shared DenseNet-121 image encoder Huang et al. (2017). Then, the multi-view latent features $x \in \mathbb{R}^{c}$ can be obtained by max-pooling across the set of $m$ latent features $\{x_i\}_{i=1}^{m}$, as proposed in Su et al. (2015). When $m = 1$, the multi-view encoder boils down to a single-image encoder.
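
The max-pooling step across views can be sketched in a few lines. This is a toy illustration with plain Python lists rather than the authors' implementation; the function name `multi_view_pool` and the feature values are hypothetical, and the per-view vectors stand in for the outputs of the shared image encoder.

```python
def multi_view_pool(view_features):
    """Element-wise maximum over the per-view feature vectors of one study."""
    if not view_features:
        raise ValueError("a study needs at least one view")
    # zip(*...) groups the same feature position across all views.
    return [max(values) for values in zip(*view_features)]

# Two hypothetical views of the same study, four features each (toy numbers).
frontal = [0.2, 0.9, 0.1, 0.4]
lateral = [0.5, 0.3, 0.1, 0.8]

pooled = multi_view_pool([frontal, lateral])  # strongest response per feature
single = multi_view_pool([frontal])           # one view: features are unchanged
```

Because the maximum is taken per feature position, the pooled vector has the same length regardless of how many views a study contains, which is why the single-view case needs no special handling.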

3.1.2 Text Encoder

Let $T$ be a text document with length $l$ consisting of word embeddings $\{w_1, w_2, \ldots, w_l\}$, where $w_i \in \mathbb{R}^{e}$ embodies the $i$-th word in the text and $e$ is the embedding dimension. We use the transformer encoder Vaswani et al. (2017) as our text feature extractor to retrieve a set of hidden states $H = \{h_1, h_2, \ldots, h_l\}$, where $h_i \in \mathbb{R}^{e}$ is the attended features of the $i$-th word to other words in the text,

$$h_i = \mathrm{Encoder}(w_i \mid w_1, w_2, \ldots, w_l). \tag{1}$$

The entire document $T$ is then summarized by $Q = \{q_1, q_2, \ldots, q_n\}$, representing $n$ disease-related topics (e.g., pneumonia or atelectasis) to be queried from the document. We refer to this retrieval process as text-summarized embedding $D_{txt} \in \mathbb{R}^{n \times e}$,

$$D_{txt} = \mathrm{Softmax}(Q H^{\top})\, H. \tag{2}$$

Here the matrix $Q \in \mathbb{R}^{n \times e}$ is formed by stacking the set of vectors $\{q_1, \ldots, q_n\}$, where each $q_i \in \mathbb{R}^{e}$ is randomly initialized, then learned via the attention process. Similarly, the matrix $H \in \mathbb{R}^{l \times e}$ is formed by $\{h_1, \ldots, h_l\}$ from Eq. (1). The term $\mathrm{Softmax}(Q H^{\top})$ is the word attention heat-map for the $n$ queried diseases in the document. The intuition here is that each disease (e.g., pneumonia) is queried from the text document $T$: we only pay attention to the most relevant words (e.g., cough or shortness of breath) in the text that are associated with that disease, with relevance measured by a vector dot-product similarity. This way, the weighted sum of these words by Eq. (2) gives the feature that summarizes the document w.r.t. the queried disease.
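
The query-based summarization described above amounts to a softmax over query-word dot products followed by a weighted sum of the word states. The sketch below illustrates this with plain Python lists; the helper names (`softmax`, `matmul`, `summarize`) and the toy numbers are assumptions, and the hidden states are given directly instead of coming from a transformer encoder.

```python
import math

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def summarize(queries, hidden):
    """Return the word attention heat-map and the summary Softmax(Q H^T) H."""
    hidden_t = [list(col) for col in zip(*hidden)]
    heat_map = [softmax(row) for row in matmul(queries, hidden_t)]
    return heat_map, matmul(heat_map, hidden)

# Toy setting: 3 word states, 2 disease queries, embedding dimension 2.
hidden = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
queries = [[2.0, 0.0], [0.0, 2.0]]
heat_map, d_txt = summarize(queries, hidden)
```

In this toy case the first query aligns with the first word and the second query with the second word, so each row of the heat-map puts nearly all its weight on one word and the summary row is close to that word's hidden state.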

3.1.3 Contextualized Disease Embedding

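The latent visual features $x$ are subsequently decoupled into low-dimensional disease representations, as illustrated in Fig. 1. They are regarded as the visual embedding $D_{img} \in \mathbb{R}^{n \times e}$, where each row is a vector $\phi_j(x) \in \mathbb{R}^{e}$, $j = 1, \ldots, n$, defined as follows:

$$\phi_j(x) = A_j^{\top} x + b_j. \tag{3}$$

Here $A_j \in \mathbb{R}^{c \times e}$ and $b_j \in \mathbb{R}^{e}$ are learnable parameters of the $j$-th disease representation, $n$ is the number of disease representations, and $e$ is the embedding dimension. Now, together with the available clinical documents, the visual embedding $D_{img}$ and the text-summarized embedding $D_{txt}$ are entangled to form contextualized disease representations $D_{fused} \in \mathbb{R}^{n \times e}$ as

$$D_{fused} = \mathrm{LayerNorm}(D_{img} + D_{txt}). \tag{4}$$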

Intuitively, the entanglement of visual and textual information allows our model to mimic the hospital workflow, to screen the disease’s visual representations conditioned on the patients’ clinical history or doctors’ indication. For example, the doctor’s indication in Fig. 1 shows cough and shortness of breath symptoms. It is reasonable for a medical doctor to request a follow-up check of the pneumonia disease. As for the radiologists receiving the doctors’ indication, they may prioritize diagnosing the presence of pneumonia and related diseases based on X-ray scans and look for specific abnormalities. As empirically shown in Table 3, the proposed contextualized disease representations bring a significant performance boost in the medical report generation task. Meanwhile, our current embedding is basically a plain mingling of heterogeneous sources of information such as disease type (i.e., disease name) and disease state (e.g., positive or negative). As shown by the ablation study in Table 3, this embedding by itself is insufficient for generating accurate medical reports. This leads us to conceive a follow-up enriched representation below.
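
As a rough illustration of the two steps in this subsection, the sketch below projects one visual feature vector into per-disease embeddings with separate linear maps and then fuses them with a text-summarized embedding through an add-and-normalize step. All names (`project`, `layer_norm`, `fuse`) and numbers are hypothetical, and the layer normalization here omits the learned gain and bias that a full implementation would normally include.

```python
import math

def project(x, weight, bias):
    """Linear map A^T x + b, where weight has one row per input feature."""
    return [sum(xi * row[k] for xi, row in zip(x, weight)) + bias[k]
            for k in range(len(bias))]

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def fuse(d_img, d_txt):
    """Row-wise add & LayerNorm of visual and text-summarized embeddings."""
    return [layer_norm([a + b for a, b in zip(img_row, txt_row)])
            for img_row, txt_row in zip(d_img, d_txt)]

# Toy setting: 3 visual features, 2 diseases, embedding dimension 2.
x = [1.0, 2.0, 3.0]
weights = [
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],  # parameters of disease 0
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # parameters of disease 1
]
biases = [[0.0, 0.0], [0.5, 0.5]]
d_img = [project(x, w, b) for w, b in zip(weights, biases)]
d_txt = [[0.0, 1.0], [0.5, 0.5]]           # stand-in text-summarized embedding
d_fused = fuse(d_img, d_txt)
```

Each disease gets its own projection of the same image features, so the text signal for a given disease only shifts that disease's row before normalization.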

3.1.4 Enriched Disease Embedding

The main idea behind enriched disease embedding is to further encode informative attributes about disease states, such as positive, negative, uncertain, or unmentioned. Formally, let $k$ be the number of states and $S \in \mathbb{R}^{k \times e}$ the state embedding. Then the confidence of classifying each disease into one of the $k$ disease states is

$$p = \mathrm{Softmax}(D_{fused} S^{\top}). \tag{5}$$

$S$ is randomly initialized, then learned via the classification of $D_{fused}$. $D_{fused}$ acts as features for the multi-label classification, and the classification loss is computed as

$$\mathcal{L}_C = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(p_{ij}), \tag{6}$$

where $y_{ij} \in \{0, 1\}$ and $p_{ij} \in [0, 1]$ are the $j$-th ground-truth and predicted values for the $i$-th disease, respectively. The state-aware embedding $D_{states} \in \mathbb{R}^{n \times e}$ is then computed as

$$D_{states} = \begin{cases} y S, & \text{during training}, \\ p S, & \text{otherwise}. \end{cases} \tag{7}$$

Here $y \in \{0, 1\}^{n \times k}$ is the one-hot ground-truth labels about the disease-related topics, whereas $p \in [0, 1]^{n \times k}$ is the predicted values. During training, the ground-truth disease states help our generator describe the diseases and related symptoms based on accurate information (teacher forcing). At test time, our generator bases its description on the predicted states.

Finally, the enriched disease embedding $D_{enriched} \in \mathbb{R}^{n \times e}$ is the composition of the state-aware disease embedding $D_{states}$ (i.e., good or bad), the disease names $D_{topics}$ (i.e., which disease/topic), and the disease representations $D_{fused}$ (i.e., severity and details of the diseases),

$$D_{enriched} = \mathrm{LayerNorm}(D_{states} + D_{topics} + D_{fused}). \tag{8}$$

Like the disease queries $Q$, $D_{topics} \in \mathbb{R}^{n \times e}$ is randomly initialized, representing diseases or topics to be generated. It is then learned in training through the medical report generation pipeline. The enriched disease embedding provides explicit and precise disease descriptions, and endows our follow-up generation module with a powerful data representation.
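
The steps of this subsection (state classification, teacher-forced state embedding, and composition) can be sketched as follows. This is a toy sketch under stated assumptions, not the authors' code: the function names and numbers are hypothetical, the composition is assumed to be a sum followed by a parameter-free layer normalization, and the embeddings that would be learned are simply fixed by hand.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def enriched_embedding(d_fused, d_topics, state_emb, labels=None):
    """Classify disease states, then compose state, topic and disease embeddings.

    labels holds one-hot ground-truth states during training (teacher forcing)
    and is None at test time, when the predicted probabilities are used instead.
    """
    state_emb_t = [list(col) for col in zip(*state_emb)]
    probs = [softmax(row) for row in matmul(d_fused, state_emb_t)]
    weights = labels if labels is not None else probs
    d_states = matmul(weights, state_emb)
    enriched = [layer_norm([s + t + f for s, t, f in zip(s_row, t_row, f_row)])
                for s_row, t_row, f_row in zip(d_states, d_topics, d_fused)]
    return probs, enriched

def classification_loss(labels, probs):
    """Cross entropy averaged over diseases."""
    total = sum(y * math.log(p)
                for y_row, p_row in zip(labels, probs)
                for y, p in zip(y_row, p_row))
    return -total / len(labels)

# Toy setting: 2 diseases, 2 states, embedding dimension 3.
state_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
d_fused = [[2.0, 0.0, 1.0], [0.0, 2.0, -1.0]]
d_topics = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]
labels = [[1, 0], [0, 1]]

probs, train_emb = enriched_embedding(d_fused, d_topics, state_emb, labels)
_, test_emb = enriched_embedding(d_fused, d_topics, state_emb)
loss = classification_loss(labels, probs)
```

The training and test embeddings differ only through the state term: with labels, each disease row receives exactly one state vector, whereas at test time it receives a probability-weighted mixture of state vectors.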

3.2 The Generation Module

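Our report generator is derived from the transformer encoder of Vaswani et al. (2017). The network is formed by stacking a masked multi-head self-attention component and a feed-forward layer on top of each other $N$ times, as illustrated in Fig. 2. The hidden state $h_i \in \mathbb{R}^{e}$ for each word position in the medical report is then computed based on the previous words and the disease embedding $D_{enriched} = \{d_1, \ldots, d_n\}$, as

$$h_i = \mathrm{Encoder}(w_i \mid w_1, w_2, \ldots, w_{i-1}, d_1, d_2, \ldots, d_n). \tag{9}$$

This is followed by predicting future words based on the hidden states $H = \{h_i\}_{i=1}^{l} \in \mathbb{R}^{l \times e}$, as

$$p_{word} = \mathrm{Softmax}(H W^{\top}). \tag{10}$$

Here $W \in \mathbb{R}^{v \times e}$ is the entire vocabulary embedding, $v$ the vocabulary size, and $l$ the document length. Let $p_{word,ij}$ denote the confidence of selecting the $j$-th word in the vocabulary $W$ for the $i$-th position in the generated medical report. The generator loss is defined as the cross entropy of the ground-truth words $y_{word}$ and predicted words $p_{word}$,

$$\mathcal{L}_G = -\frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{v} y_{word,ij} \log(p_{word,ij}). \tag{11}$$

Finally, the weighted word embedding $\hat{W} \in \mathbb{R}^{l \times e}$, also known as the generated report, is

$$\hat{W} = p_{word} W. \tag{12}$$

It is worth noting that this set-up facilitates the back-propagation of errors from the follow-up interpretation module.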
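
The reason the weighted word embedding helps back-propagation is that it replaces a hard word choice (an argmax, which has no useful gradient) with a probability-weighted mix of vocabulary embeddings. The sketch below illustrates the word prediction, the cross-entropy loss, and the weighted embedding on toy numbers; the names `generator_outputs` and `generator_loss` and all values are hypothetical, and the hidden states are given directly rather than produced by a transformer.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def generator_outputs(hidden, vocab_emb):
    """Return word probabilities and the probability-weighted word embeddings."""
    vocab_t = [list(col) for col in zip(*vocab_emb)]
    p_word = [softmax(row) for row in matmul(hidden, vocab_t)]
    weighted = matmul(p_word, vocab_emb)
    return p_word, weighted

def generator_loss(target_ids, p_word):
    """Cross entropy of the ground-truth word at each position."""
    return -sum(math.log(p_word[i][t]) for i, t in enumerate(target_ids)) / len(target_ids)

# Toy setting: vocabulary of 3 words, embedding dimension 2, report length 2.
vocab_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden = [[3.0, 0.0], [0.0, 3.0]]

p_word, weighted = generator_outputs(hidden, vocab_emb)
loss = generator_loss([0, 1], p_word)
```

Here the first position strongly prefers word 0, so its weighted embedding lies close to that word's embedding while still depending smoothly on every probability.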

3.3 The Interpretation Module

It is observed from empirical evaluations that the generated reports are often distorted in the process, such that they become inconsistent with the original output of the classification module, namely the enriched disease embedding that encodes the disease and symptom related topics. Inspired by the CycleGAN idea of Zhu et al. (2017), we consider a fully differentiable network module to estimate the checklist of disease-related topics based on the generator’s output, and to compare it with the original output of the classification module. This provides a meaningful feedback loop to regulate the generated reports, which is used to fine-tune the generated report through the word representation outputs $\hat{W}$.

Specifically, we build on top of the proposed text encoder (described in section 3.1.2) a classification network that classifies disease-related topics, as follows. First, the text encoder summarizes the current medical report $\hat{W}$, and outputs the report-summarized embedding of the queried diseases $Q$,

$$\hat{D}_{txt} = \mathrm{Softmax}(Q \hat{H}^{\top})\, \hat{H} \in \mathbb{R}^{n \times e}. \tag{13}$$

Here $\hat{H}$ is computed from the generated medical reports $\hat{W}$ using Eq. (1). Second, each report-summarized embedding $\hat{d}_i \in \mathbb{R}^{e}$ (i.e., each row of the matrix $\hat{D}_{txt}$) is classified into one of the $k$ disease-related states (i.e., positive or negative), as

$$p_{int} = \mathrm{Softmax}(\hat{D}_{txt} S^{\top}) \in \mathbb{R}^{n \times k}. \tag{14}$$

Finally, the interpreter is trained to minimize the subsequent multi-label classification loss,

$$\mathcal{L}_I = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(p_{int,ij}), \tag{15}$$

where $y_{ij} \in \{0, 1\}$ is the ground-truth disease label and $p_{int,ij} \in [0, 1]$ is the predicted disease label of the interpreter.

In fine-tuning the generated medical reports $\hat{W}$, all interpreter parameters are frozen, so that the interpreter acts as a guide that forces the word representations to stay close to what the interpreter has learned from the ground-truth medical reports. If the weighted word embedding $\hat{W}$ is different from the learned representation, leading to incorrect classification, a large loss value will be imposed in the interpretation module. This forces the generator to move toward producing a correct word representation.
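
The feedback signal described above can be illustrated with a small numeric example: a report summary that agrees with the ground-truth disease states yields a small interpreter loss, while a distorted one yields a large loss. The sketch below is a simplified stand-in, not the authors' implementation; `interpreter_loss`, the hand-set state embeddings, and the toy summaries are all hypothetical, and no gradients are computed.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def interpreter_loss(report_summary, state_emb, labels):
    """Mean cross entropy between one-hot disease states and interpreter predictions."""
    total = 0.0
    for row, label_row in zip(report_summary, labels):
        # Dot product of the summary row with each (frozen) state embedding.
        scores = [sum(a * b for a, b in zip(row, state)) for state in state_emb]
        probs = softmax(scores)
        total -= sum(y * math.log(p) for y, p in zip(label_row, probs))
    return total / len(labels)

state_emb = [[1.0, 0.0], [0.0, 1.0]]   # two states, e.g. positive and negative
labels = [[1, 0], [0, 1]]              # disease 0 positive, disease 1 negative

consistent = [[3.0, 0.0], [0.0, 3.0]]  # report summary agreeing with the labels
distorted = [[0.0, 3.0], [0.0, 3.0]]   # report describes disease 0 as negative

loss_consistent = interpreter_loss(consistent, state_emb, labels)
loss_distorted = interpreter_loss(distorted, state_emb, labels)
```

In a differentiable framework, the larger loss for the distorted summary would be back-propagated through the weighted word embedding into the generator while the interpreter's own parameters stay fixed.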

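Collectively, our model is trained in an end-to-end manner by jointly minimizing the total loss, i.e., the sum of the classification, generation, and interpretation losses,

$$\mathcal{L}_{total} = \mathcal{L}_C + \mathcal{L}_G + \mathcal{L}_I. \tag{16}$$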

| Dataset | Method | B-1 | B-2 | B-3 | B-4 | MTR | RG-L | Marks (SV, MV, AI, FT) |
|---|---|---|---|---|---|---|---|---|
| Open-I | S&T Vinyals et al. (2015) | 0.316 | 0.211 | 0.140 | 0.095 | 0.159 | 0.267 | x |
| Open-I | LRCN Donahue et al. (2015) | 0.369 | 0.229 | 0.149 | 0.099 | 0.155 | 0.278 | x |
| Open-I | SA&T Xu et al. (2015) | 0.399 | 0.251 | 0.168 | 0.118 | 0.167 | 0.323 | x |
| Open-I | Att-RK You et al. (2016) | 0.369 | 0.226 | 0.151 | 0.108 | 0.171 | 0.323 | x |
| Open-I | HRNN Yin et al. (2019) | 0.445 | 0.292 | 0.201 | 0.154 | 0.175 | 0.344 | x |
| Open-I | 1-NN Boag et al. (2020) | 0.232 | 0.116 | 0.051 | 0.018 | N/A | 0.201 | x |
| Open-I | TieNet Wang et al. (2018) | 0.330 | 0.194 | 0.124 | 0.081 | N/A | 0.311 | x |
| Open-I | Liu et al. Liu et al. (2019) | 0.359 | 0.237 | 0.164 | 0.113 | N/A | 0.354 | x x |
| Open-I | CoAtt Jing et al. (2018) | 0.455 | 0.288 | 0.205 | 0.154 | N/A | 0.369 | x |
| Open-I | HRGR-Agent Li et al. (2018) | 0.438 | 0.298 | 0.208 | 0.151 | N/A | 0.322 | x x |
| Open-I | KERP Li et al. (2019) | 0.482 | 0.325 | 0.226 | 0.162 | N/A | 0.339 | x x |
| Open-I | ReinforcedTransformer Xiong et al. (2019) | 0.350 | 0.234 | 0.143 | 0.096 | N/A | N/A | x x |
| Open-I | HRG-Transformer | 0.464 | 0.301 | 0.212 | 0.158 | N/A | N/A | x |
| Open-I | SD&C Jing et al. (2019) | 0.464 | 0.301 | 0.210 | 0.154 | N/A | 0.362 | x x |
| Open-I | Ours (SV) | 0.463 | 0.310 | 0.215 | 0.151 | 0.186 | 0.377 | x |
| Open-I | Ours (MV) | 0.476 | 0.324 | 0.228 | 0.164 | 0.192 | 0.379 | x |
| Open-I | Ours (MV+T) | 0.485 | 0.355 | 0.273 | 0.217 | 0.205 | 0.422 | x x |
| Open-I | Ours (MV+T+I) | 0.515 | 0.378 | 0.293 | 0.235 | 0.219 | 0.436 | x x x |
| MIMIC | 1-NN Boag et al. (2020) | 0.367 | 0.215 | 0.138 | 0.095 | 0.139 | 0.228 | x |
| MIMIC | SA&T Xu et al. (2015) | 0.370 | 0.240 | 0.170 | 0.128 | 0.141 | 0.310 | x |
| MIMIC | AdpAtt Lu et al. (2017) | 0.384 | 0.251 | 0.178 | 0.134 | 0.148 | 0.314 | x |
| MIMIC | Liu et al. Liu et al. (2019) | 0.313 | 0.206 | 0.146 | 0.103 | N/A | 0.306 | x x |
| MIMIC | Transformer Vaswani et al. (2017) | 0.409 | 0.268 | 0.191 | 0.144 | 0.157 | 0.318 | x |
| MIMIC | GumbelTransformer Lovelace and Mortazavi (2020) | 0.415 | 0.272 | 0.193 | 0.146 | 0.159 | 0.318 | x x |
| MIMIC | Ours (SV) | 0.447 | 0.290 | 0.200 | 0.144 | 0.186 | 0.317 | x |
| MIMIC | Ours (MV) | 0.451 | 0.292 | 0.201 | 0.144 | 0.185 | 0.320 | x |
| MIMIC | Ours (MV+T) | 0.491 | 0.357 | 0.276 | 0.223 | 0.213 | 0.389 | x x |
| MIMIC | Ours (MV+T+I) | 0.495 | 0.360 | 0.278 | 0.224 | 0.222 | 0.390 | x x x |

Table 1: Quantitative comparison of our approach and many existing methods, evaluated under different setups of Single-view (SV), Multi-view (MV), w/ clinical text (T), and interpreter (I). For a fair comparison, all methods are categorized into the following four aspects: Single-View (SV), Multi-view (MV), Additional Information (AI), and Fine-tuning of the generated reports (FT). The best results are highlighted in bold face. Different language metrics are employed: BLEU-1 to BLEU-4 (B-1 to B-4), METEOR (MTR), and ROUGE-L (RG-L).

4 Experiments

This section evaluates the medical report generation task on two fronts: the language performance and the clinical accuracy performance. Empirical evaluations are carried out on two widely-used chest X-ray datasets, MIMIC-CXR Johnson et al. (2019) and Open-I Demner-Fushman et al. (2016).

4.1 Datasets

4.1.1 MIMIC-CXR Dataset

The MIMIC-CXR dataset Johnson et al. (2019) is a large-scale dataset with 227,835 medical reports of 65,379 patients, associated with 377,110 images from multiple views: anterior-posterior (AP), posterior-anterior (PA), lateral (LA). Each study comprises multiple sections, including comparison, clinical history, indication, reasons for examination, impressions, and findings. Here we utilize the multi-view images of AP/PA/LA views, and adopt as contextual information the concatenation of the clinical history, reason for examination, and indication sections. For consistency, we follow the experimental set-up of Lovelace and Mortazavi (2020) to focus on generating text in the “findings” section as the corresponding medical report.

4.1.2 Open-I Dataset

The Open-I dataset Demner-Fushman et al. (2016) collected by the Indiana University hospital network contains 3,955 radiology studies that correspond to 7,470 frontal and lateral chest X-rays. Some radiology studies are associated with more than one chest X-ray image. Each study typically consists of impression, findings, comparison, and indication sections. Similar to the MIMIC-CXR dataset, we utilize both the multi-view chest X-ray images (frontal and lateral) and the indication section as our contextual inputs. For generating medical reports, we follow the existing literature Jing et al. (2018); Srinivasan et al. (2020) by concatenating the impression and the findings sections as the target output.

4.1.3 Important Note

The implementation details, dataset splits, preprocessing steps, generated examples, and qualitative analysis are described in the supplementary materials.

4.2 Experimental Results

4.2.1 Language Generation Performance

A comprehensive quantitative comparison of our approach and many baselines is shown in Table 1 on the two benchmarks using the widely-used language evaluation metrics: BLEU-1 to BLEU-4 Papineni and others (2002), ROUGE-L Lin (2004), and METEOR Banerjee and Lavie (2005) scores. Since all comparison methods have their own experiment setups, for a fair comparison, we further categorize these methods along four aspects: single-view (SV), multi-view (MV), access to additional information (AI) such as clinical documents, and fine-tuning (FT) applied to the generated medical reports. Experiments in Table 1 show that our models outperform the baselines in most language metrics.

With a single X-ray image as the sole input, Ours (SV) outperforms by a noticeable margin the best SOTA methods, CoAtt on Open-I and Transformer on MIMIC, respectively. We mainly attribute this to the utilization of the enriched disease embedding that explicitly incorporates the disease-related topics. With multiple X-ray images as input, Ours (MV) again outperforms the best comparison method, HRG-Transformer, on Open-I. With multiple X-ray images and additional clinical document information as input, Ours (MV+T) outperforms the comparison method KERP on Open-I. Finally, with the complete contextual information available as input, Ours (MV+T+I) outperforms all the comparison methods available on both the Open-I and MIMIC datasets.

4.2.2 Clinical Accuracy Performance

| Dataset | Method | Acc. | Macro AUC | Macro F-1 | Macro Prec. | Macro Rec. | Micro AUC | Micro F-1 | Micro Prec. | Micro Rec. |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-I | 1-NN Boag et al. (2020) | 0.911 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Open-I | S&T Vinyals et al. (2015) | 0.915 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Open-I | SA&T Xu et al. (2015) | 0.908 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Open-I | TieNet Wang et al. (2018) | 0.902 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Open-I | Liu et al. Liu et al. (2019) | 0.918 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Open-I | Ours (SV) | 0.944 | 0.595 | 0.118 | 0.125 | 0.136 | 0.857 | 0.657 | 0.651 | 0.663 |
| Open-I | Ours (MV) | 0.943 | 0.626 | 0.144 | 0.149 | 0.150 | 0.878 | 0.648 | 0.647 | 0.649 |
| Open-I | Ours (MV+T) | 0.947 | 0.671 | 0.130 | 0.192 | 0.124 | 0.873 | 0.659 | 0.687 | 0.634 |
| Open-I | Ours (MV+T+I) | 0.937 | 0.702 | 0.152 | 0.142 | 0.173 | 0.877 | 0.626 | 0.604 | 0.649 |
| MIMIC | 1-NN Boag et al. (2020) | N/A | N/A | 0.206 | 0.213 | 0.200 | N/A | 0.335 | 0.346 | 0.324 |
| MIMIC | SA&T Xu et al. (2015) | N/A | N/A | 0.101 | 0.247 | 0.119 | N/A | 0.282 | 0.364 | 0.230 |
| MIMIC | AdpAtt Lu et al. (2017) | N/A | N/A | 0.163 | 0.341 | 0.166 | N/A | 0.347 | 0.417 | 0.298 |
| MIMIC | Liu et al. Liu et al. (2019) | 0.867 | N/A | N/A | 0.309 | 0.134 | N/A | N/A | 0.586 | 0.237 |
| MIMIC | Transformer Vaswani et al. (2017) | N/A | N/A | 0.214 | 0.327 | 0.204 | N/A | 0.398 | 0.461 | 0.350 |
| MIMIC | GumbelTransformer Lovelace and Mortazavi (2020) | N/A | N/A | 0.228 | 0.333 | 0.217 | N/A | 0.411 | 0.475 | 0.361 |
| MIMIC | Ours (SV) | 0.877 | 0.743 | 0.342 | 0.357 | 0.347 | 0.857 | 0.530 | 0.533 | 0.528 |
| MIMIC | Ours (MV) | 0.880 | 0.752 | 0.347 | 0.385 | 0.347 | 0.862 | 0.533 | 0.545 | 0.522 |
| MIMIC | Ours (MV+T) | 0.890 | 0.778 | 0.407 | 0.448 | 0.399 | 0.872 | 0.578 | 0.583 | 0.574 |
| MIMIC | Ours (MV+T+I) | 0.887 | 0.784 | 0.412 | 0.432 | 0.418 | 0.874 | 0.576 | 0.567 | 0.585 |

Table 2: Quantitative comparison of clinical accuracy from the generated reports, evaluated on the 14 common CheXpert’s diseases. The best results are highlighted in bold face.

To evaluate the clinical accuracy of the generated reports, we use the LSTM CheXpert labeler Lovelace and Mortazavi (2020) as a universal measurement. We compare different methods based on accuracy, F-1, precision (prec.), and recall (rec.) metrics on 14 common diseases. Since there are 14 independent diseases, we also report the macro and micro scores. Intuitively, a high macro score means the detection of all 14 diseases is improved. Meanwhile, a high micro score implies the dominant diseases are improved (i.e., some diseases appear more frequently than others). As observed in Table 2, our clinical performance increased significantly compared to the baselines in both macro and micro scores.
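
The difference between macro and micro scores can be seen on a small made-up example with one frequent and one rare disease. The sketch below computes both F-1 variants from per-disease counts of true positives, false positives, and false negatives; the function names and the counts are hypothetical and are not taken from the experiments.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(counts):
    """counts holds one (tp, fp, fn) tuple per disease."""
    # Macro: average the per-disease scores, so every disease weighs the same.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts first, so frequent diseases dominate.
    tp, fp, fn = (sum(col) for col in zip(*counts))
    return macro, f1(tp, fp, fn)

# A frequent disease detected well and a rare disease detected poorly.
counts = [(90, 10, 10), (1, 4, 5)]
macro, micro = macro_micro_f1(counts)
```

The rare disease pulls the macro score down to about 0.54, while the micro score stays near 0.86 because the pooled counts are dominated by the frequent disease.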

Among our ablation models in Table 2, the precision and accuracy scores of our contextualized variant (MV+T) tend to be higher, whereas its other scores are lower than those of the variant with the interpreter (MV+T+I). This opposite behavior is due to the interpreter, which encourages detecting diseases and thus increases False Positives (FP). Note that in the medical context, it is usually critically important to lower the False Negatives (FN) rate, so a high recall score with a slight decrease in precision is preferred.

4.3 Ablation studies

Methods B-1 B-2 B-3 B-4 MTR RG-L
R w/o 0.400 0.253 0.175 0.127 0.166 0.362
R w/o 0.453 0.300 0.206 0.142 0.183 0.366
R w/o 0.468 0.310 0.215 0.151 0.189 0.373
R with 0.463 0.310 0.215 0.151 0.186 0.377
R + Interpreter 0.470 0.314 0.220 0.158 0.192 0.375
C w/o 0.404 0.286 0.215 0.169 0.183 0.396
C w/o 0.474 0.329 0.244 0.187 0.194 0.401
C w/o 0.470 0.337 0.257 0.204 0.212 0.408
C with 0.485 0.355 0.273 0.217 0.205 0.422
C + Interpreter 0.515 0.378 0.293 0.235 0.219 0.436
Table 3: Comparison of a regular image-to-text version (R) and a contextualized version (C) of our proposed method that utilizes clinical history, on the Open-I dataset. For each version, we evaluate the importance of each of the three components of the proposed enriched disease embedding by removing one component at a time.

4.3.1 Enriched disease embedding

We observe that the latent features extracted from the classifier are insufficient to generate robust medical reports, as shown in Table 3. In human language, a meaningful story needs three factors: the topic (i.e., what disease), the tone (i.e., is it negative or positive), and the details (i.e., the severity). However, there is no guarantee that the learned latent features have all three required elements. On the other hand, with the explicit representations (i.e., the disease names, the state-aware disease embedding, and the disease representations), all three factors are preserved. Therefore, the enriched disease embedding can generate precise and complete medical reports, leading to a substantial improvement in the language metrics.

4.3.2 Contextualized embedding

Table 3 also shows that our proposed “contextualized” version can improve the language scores over the “regular” version, which reads only images. Notably, the contextualized version is the entanglement of the chest X-ray images and the clinical history, which is crucial to improve the generated report’s quality and accommodate doctors’ practical needs. It mimics how radiologists receive requests from medical doctors and write reports to answer their questions. Hence, the generated reports are believed to be more “on point” and receive higher language scores than in the regular “image-to-text” setting.

5 Conclusion and Outlook

This paper introduces a novel three-module approach for generating medical reports from X-ray scans. Empirical results demonstrate that our approach outperforms state-of-the-art methods on widely used benchmarks under a range of evaluation metrics. Moreover, our approach is flexible: it can incorporate additional input information, which yields consistent performance gains. For future work, we plan to apply our approach to related medical report generation tasks beyond X-rays.


References

  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §4.2.1.
  • W. Boag, T. H. Hsu, M. Mcdermott, G. Berner, E. Alesentzer, and P. Szolovits (2020) Baselines for chest x-ray report generation. In NeurIPS Workshop on Machine Learning for Health, pp. 126–140. Cited by: Table 1, Table 2.
  • M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever (2020) Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691–1703. Cited by: §2.2.
  • D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald (2016) Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23 (2), pp. 304–310. Cited by: §1, §4.1.2, §4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2.
  • J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634. Cited by: Table 1.
  • M. Eslami, S. Tabarestani, S. Albarqouni, E. Adeli, N. Navab, and M. Adjouadi (2020) Image-to-images translation for multi-task organ segmentation and bone suppression in chest x-ray radiography. IEEE Transactions on Medical Imaging 39 (7), pp. 2553–2565. Cited by: §2.1.
  • Y. Feng, L. Ma, W. Liu, and J. Luo (2019) Unsupervised image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4125–4134. Cited by: §2.1.
  • A. Gasimova, G. Seegoolam, L. Chen, P. Bentley, and D. Rueckert (2020) Spatial semantic-preserving latent space learning for accelerated dwi diagnostic report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 333–342. Cited by: §2.1.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913. Cited by: §2.1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §3.1.1.
  • J. Huang, C. H. Yang, F. Liu, M. Tian, Y. Liu, T. Wu, I. Lin, K. Wang, H. Morikawa, H. Chang, et al. (2021) DeepOpht: medical report generation for retinal images via deep models and visual explanation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2442–2452. Cited by: §2.1.
  • L. Huang, W. Wang, J. Chen, and X. Wei (2019) Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4634–4643. Cited by: §2.1.
  • J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 590–597. Cited by: §2.3.
  • B. Jing, Z. Wang, and E. Xing (2019) Show, describe and conclude: on exploiting the structure information of chest X-ray reports. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6570–6580. Cited by: §1, §2.1, §2.1, Table 1.
  • B. Jing, P. Xie, and E. Xing (2018) On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2577–2586. Cited by: §1, §1, §2.1, §2.1, Table 1, §4.1.2.
  • A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Deng, R. G. Mark, and S. Horng (2019) MIMIC-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6 (1), pp. 1–8. Cited by: §1, §2.3, §4.1.1, §4.
  • J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei (2017) A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 317–325. Cited by: §2.1.
  • C. Y. Li, X. Liang, Z. Hu, and E. P. Xing (2018) Hybrid retrieval-generation reinforced agent for medical image report generation. In NeurIPS, pp. 1537–1547. Cited by: §1, §2.1, §2.1, Table 1.
  • C. Y. Li, X. Liang, Z. Hu, and E. P. Xing (2019) Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In AAAI, pp. 6666–6673. Cited by: §1, §2.1, §2.1, Table 1.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.2.1.
  • G. Liu, T. H. Hsu, M. McDermott, W. Boag, W. Weng, P. Szolovits, and M. Ghassemi (2019) Clinically accurate chest x-ray report generation. In Machine Learning for Healthcare Conference, pp. 249–269. Cited by: §2.3, Table 1, Table 2.
  • J. Lovelace and B. Mortazavi (2020) Learning to generate clinically coherent chest x-ray reports. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1235–1243. Cited by: §1, §2.1, §2.1, §2.3, Table 1, §4.1.1, §4.2.2, Table 2.
  • J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 375–383. Cited by: Table 1, Table 2.
  • Z. Lu, K. Deb, and V. N. Boddeti (2020a) Muxconv: information multiplexing in convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12044–12053. Cited by: §2.1.
  • Z. Lu, I. Whalen, Y. Dhebar, K. Deb, E. Goodman, W. Banzhaf, and V. N. Boddeti (2020b) Multi-objective evolutionary design of deep convolutional neural networks for image classification. IEEE Transactions on Evolutionary Computation. Cited by: §2.1.
  • L. Luo, L. Yu, H. Chen, Q. Liu, X. Wang, J. Xu, and P. Heng (2020) Deep mining external imperfect data for chest x-ray disease screening. IEEE Transactions on Medical Imaging 39 (11), pp. 3583–3594. Cited by: §2.1.
  • T. Nishino, R. Ozaki, Y. Momoki, T. Taniguchi, R. Kano, N. Nakano, Y. Tagawa, M. Taniguchi, T. Ohkuma, and K. Nakamura (2020) Reinforcement learning with imbalanced dataset for data-to-text medical report generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 2223–2236. Cited by: §2.1.
  • Y. Oh, S. Park, and J. C. Ye (2020) Deep learning covid-19 features on cxr using limited training data sets. IEEE Transactions on Medical Imaging 39 (8), pp. 2688–2700. Cited by: §2.1.
  • K. Papineni et al. (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. Cited by: §4.2.1.
  • W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y. Tai (2019) Memory-attended recurrent network for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8347–8356. Cited by: §2.1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §2.2.
  • P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. (2017) Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. Cited by: §2.1.
  • E. Ranjan, S. Paul, S. Kapoor, A. Kar, R. Sethuraman, and D. Sheet (2018) Jointly learning convolutional representations to compress radiological images and classify thoracic diseases in the compressed domain. In Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing, pp. 1–8. Cited by: §2.1.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §2.1.
  • S. Singh, S. Karimi, K. Ho-Shon, and L. Hamey (2019) From chest x-rays to radiology reports: a multimodal machine learning approach. In 2019 Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8. Cited by: §2.1.
  • P. Srinivasan, D. Thapar, A. Bhavsar, and A. Nigam (2020) Hierarchical x-ray report generation via pathology tags and multi head attention. In Proceedings of the Asian Conference on Computer Vision. Cited by: §1, §2.1, §2.1, §4.1.2.
  • H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pp. 945–953. Cited by: §3.1.1.
  • A. Tran, A. Mathews, and L. Xie (2020) Transform and tell: entity-aware news image captioning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13032–13042. Cited by: §1, §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §2.1, §2.2, §3.1.2, §3.2, Table 1, Table 2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.2.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164. Cited by: §1, §2.1, §2.1, Table 1, Table 2.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers (2018) Tienet: text-image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9049–9058. Cited by: §1, §2.1, §2.1, Table 1, Table 2.
  • Y. Xiong, B. Du, and P. Yan (2019) Reinforced transformer for medical image captioning. In International Workshop on Machine Learning in Medical Imaging, pp. 673–680. Cited by: Table 1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.1, §2.1, Table 1, Table 2.
  • Y. Xue, T. Xu, L. R. Long, Z. Xue, S. Antani, G. R. Thoma, and X. Huang (2018) Multimodal recurrent model with attention for automated radiology report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 457–466. Cited by: §1, §2.1, §2.1.
  • C. Yin et al. (2019) Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network. In ICDM, pp. 728–737. Cited by: §1, §2.1, §2.1.
  • C. Yin, B. Qian, J. Wei, X. Li, X. Zhang, Y. Li, and Q. Zheng (2019) Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 728–737. Cited by: Table 1.
  • Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4651–4659. Cited by: Table 1.
  • J. Yuan, H. Liao, R. Luo, and J. Luo (2019) Automatic radiology report generation based on multi-view image fusion and medical concept enrichment. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 721–729. Cited by: §1, §2.1, §2.1.
  • Y. Zhang, X. Wang, Z. Xu, Q. Yu, A. Yuille, and D. Xu (2020) When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12910–12917. Cited by: §2.1.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV. Cited by: §3.3.