
Attributed Abnormality Graph Embedding for Clinically Accurate X-Ray Report Generation

Automatic generation of medical reports from X-ray images can assist radiologists in performing the time-consuming yet important reporting task. Yet, achieving clinically accurate generated reports remains challenging. Modeling the underlying abnormalities using the knowledge graph approach has been found promising in enhancing the clinical accuracy. In this paper, we introduce a novel fine-grained knowledge graph structure called an attributed abnormality graph (ATAG). The ATAG consists of interconnected abnormality nodes and attribute nodes, allowing it to better capture the abnormality details. In contrast to the existing methods where the abnormality graph was constructed manually, we propose a methodology to automatically construct the fine-grained graph structure based on annotations, medical reports in X-ray datasets, and the RadLex radiology lexicon. We then learn the ATAG embedding using a deep model with an encoder-decoder architecture for the report generation. In particular, graph attention networks are explored to encode the relationships among the abnormalities and their attributes. A gating mechanism is adopted and integrated with various decoders for the generation. We carry out extensive experiments based on the benchmark datasets, and show that the proposed ATAG-based deep model outperforms the SOTA methods by a large margin and can improve the clinical accuracy of the generated reports.


1 Introduction

Automatic generation of medical reports from X-ray images has recently been studied with the objective to assist radiologists to perform the time-consuming and yet important reporting task. An X-ray report, as shown in Fig. 1, typically contains a paragraph with multiple sentences describing the abnormalities identified by the radiologist in the images (called findings) and a short conclusion (called impression). For the generated report to be clinically accurate, findings of abnormalities revealed in the X-ray images should be correctly reported.

Fig. 1: Illustration of extracting abnormalities and attributes from the annotated radiology report and the RadLex ontology.

In the literature, the deep encoder-decoder architecture has been found effective for medical report generation, where visual features are extracted from the input medical images using a convolutional neural network (CNN) and fed to a recurrent neural network (RNN) to generate the report [40, 9, 38, 41]. Some recent work [5, 27, 21] replaces the decoder with a Transformer-based architecture to further enhance text quality. Beyond improving the fluency and readability of the generated reports, some studies also attempt to increase the accuracy of clinical keywords using reinforcement learning [24, 14, 19]. To enhance clinical accuracy, semantic annotations [15] (see also Fig. 1) and concepts extracted from the medical reports [44] have been used to learn semantic features that assist the report generation.

Recently, the knowledge graph approach integrated into a deep model architecture [18, 45, 21, 23] has been shown to be effective in further enhancing the accuracy. Among the existing knowledge-graph based medical report generation methods, careful manual effort is commonly required to construct the abnormality graph, which inevitably results in a sub-optimal design. In addition, we notice that a medical report typically contains not only information about the observed abnormalities (e.g., “calcified granuloma”), but also their associated “attributes” (e.g., “left upper lobe” as its location). Therefore, it becomes important to represent both the abnormalities and their attributes well so that the generated reports can recover the related details. However, research work where attributed abnormalities are explicitly represented is still rare. Our conjecture is that constructing a knowledge graph with higher granularity of abnormalities and attributes is vital for enabling the generation of clinically accurate reports. Orthogonal to this direction, retrieving relevant clinical templates for rewriting is another technique used to ease the report generation task [19, 3, 18, 33, 21].

In this paper, we focus on investigating how the knowledge graph approach can be better exploited for medical report generation. We first propose the adoption of a novel fine-grained knowledge graph structure to represent the attributed abnormalities. To attain such a fine-grained ATtributed Abnormality Graph (ATAG), we propose a methodology to automatically construct it based on annotated X-ray datasets and the RadLex radiology lexicon. In ATAG, each attributed abnormality is represented using an abnormality node and an associated set of attribute nodes to model the abnormality details. The inter-related abnormalities and attributes are connected. The ATAG forms a global medical knowledge graph, and is then integrated into a deep encoder-decoder model architecture to learn its embedding with the objective of achieving highly accurate abnormality classification and high quality radiology report generation. In addition, a novel gate mechanism is designed to allow the information encoded by ATAG to be more effectively incorporated into both LSTM- and Transformer-based decoders for clinically accurate report generation. Our experimental results based on the publicly available IU-XRay dataset [7] and MIMIC-CXR dataset [16] show that the use of ATAG achieves a higher accuracy on abnormality classification. It also generates more clinically accurate reports than the SOTA methods, by a large margin, according to both the natural language generation metric scores and the medical report quality metrics. To summarize, the main contributions of this paper include:

  1. A methodology to automatically construct a novel fine-grained attributed abnormality graph (ATAG) representing the abnormalities and their associated attributes;

  2. An algorithm to learn the attributed abnormality representations of ATAG using graph attention networks;

  3. A gating mechanism to effectively integrate the ATAG with various decoders to generate the detailed radiology report with clinically accurate attributed abnormalities.

2 Related Work

The earliest efforts on generating textual output based on visual input are for automatic image captioning [39, 37, 43, 31, 1]. The image captioning task typically aims at generating one sentence to describe the objects, their attributes, and the underlying scenes revealed in the given image. Image paraphrasing [17] is a related task which focuses on generating multiple sentences (i.e., paragraphs), instead of just one sentence. The objective is to provide more detailed object-related descriptions or a long sentence with details about the main objects in the image.

2.1 Radiology report generation

Generating radiology reports is, in a sense, similar to the image paraphrasing task: it takes an X-ray image as input and outputs a text with multiple sentences. Each sentence usually focuses on one particular topic, i.e., a clinical observation, with some fine-grained supporting details revealed in the input medical image. In the literature, deep learning based methods using a CNN encoder and an RNN decoder for the report generation have been found promising [40, 9, 38, 41].

2.2 Use of semantic features/labels

To generate clinically accurate reports, many recent works make use of extra labels under the deep encoder-decoder framework. Yuan et al. [44] extracted 69 medical concepts from the medical reports using SemRep, and trained a CNN for concept classification and report generation. Jing et al. [15] generated 572 tags for the whole dataset using Medical Text Indexer (MTI), and generated reports using both the semantic features from the tags and the visual features. Alternatively, Park et al. [30] utilized 210 tags and multi-level visual features to facilitate the report generation. Siddharth et al. [3] learned to predict 14 disease labels based on the visual features and fed them to the report generation decoder. Syeda-Mahmood et al. [33] made use of more fine-grained labels (with 78 unique abnormalities and 9 attributes) generated from the reports. Miura et al. [27] first learned an image-to-text model, and then fine-tuned it by increasing the number of matched clinical entities in the generated report. All the aforementioned methods rely on either some taggers or manual effort to select the concepts and the abnormalities.

2.3 Use of knowledge graphs

There also exist methods that organize the medical concepts and abnormalities using a knowledge graph. Li et al. [18] proposed the use of an abnormality-and-attributes graph where the nodes correspond to 80 abnormality phrases (manually chosen) frequently appearing in reports and the edges are constructed according to their occurrence frequencies. A graph Transformer is proposed to dynamically transform the image-to-graph features and graph-to-text features with the attention mechanism. Zhang et al. [45] manually constructed a knowledge graph with 20 common abnormalities where the nodes are connected according to the body parts in which they appear. Liu et al. [21] further integrated this 20-abnormality knowledge graph and report templates as the prior knowledge for report generation. In this paper, we propose a methodology to automatically construct an abnormality graph with fine-grained attributes to be integrated into a deep learning model for generating reports with a higher clinical accuracy.

2.4 Aligning visual features and report contents

Some more recent research efforts try to better align the abnormal observations and the report contents. For instance, Liu et al. [22] proposed a contrastive attention mechanism to subtract the “normality” visual features from the overall visual features for the decoder to generate the depiction of observed abnormalities. You et al. [42] developed an alignment-enhanced Transformer to refine the visual features with the semantic features of disease labels by an alternative alignment mechanism. The memory mechanism has also been employed for the report generation by memorizing the visual patterns of normal/abnormal observations via external memory construction (i.e., a slot-based memory module). For instance, R2Gen [5] utilizes a memory matrix to memorize the projection between the visual patterns and language patterns, which is queried given the input image and fed to the decoder in the testing stage. Similarly, a cross-modal memory network was proposed in [4] to learn the latent features of abnormalities based on the visual features and the language features in the same latent space, aiming to effectively transform across modalities from visual to text.

3 An Overview of the Proposed Framework

Given the visual features extracted from a radiology image as input, the objective is to generate a radiology report consisting of a sequence of sentences. We first introduce a fine-grained ATtributed Abnormality Graph (ATAG) to represent the relationships of abnormalities and their associated attributes, aiming to facilitate the generation of clinically accurate reports. We then propose a methodology to automatically construct a global ATAG based on the given report corpus and the public radiology ontology. Given the ATAG structure, a Graph Attention Network [36] is adopted to learn the ATAG embedding based on the input visual features. By taking the ATAG embedding as input, the decoder generates the final report by attending to the proper attributed abnormality node embeddings when depicting different observations.

In the following sections, we first introduce the ATAG structure construction methodology (Section 4) and ATAG embedding learning algorithms (Section 5). We then present how to integrate ATAG embedding for report generation with a hierarchical attention mechanism and a gate mechanism in Section 6.

4 Abnormality Graph with Fine-grained Attributes

In an X-ray medical report, items to be reported include the names of the abnormalities and their associated details such as the corresponding anatomical part, location, status, etc. We consider that clinically accurate reports can only be generated if a more fine-grained abnormality graph representation is constructed to represent the abnormality details. To represent such a graph, we first define an abnormality graph consisting of a set of abnormality nodes and a set of edges connecting them. Two abnormality nodes are connected if they are inter-related. In addition, each abnormality node is paired with an attribute graph consisting of a set of associated attribute nodes and a set of edges connecting the attribute nodes (indicating that they are inter-related).

The attributed abnormality graph is therefore given by the abnormality graph together with the attribute graphs, each attribute graph corresponding to a distinct abnormality node.
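To make the structure concrete, the following is a minimal sketch of how such a graph could be held in memory. The class and method names (`ATAG`, `AttributeGraph`, `add_abnormality`, `connect`) are hypothetical and not taken from the paper, and the edge added at the end is purely illustrative rather than a clinical statement.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeGraph:
    """Attribute nodes (and their edges) attached to one abnormality node."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # unordered pairs of attribute names


@dataclass
class ATAG:
    """Abnormality graph plus one attribute graph per abnormality node."""
    abnormalities: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # unordered pairs of abnormality names
    attribute_graphs: dict = field(default_factory=dict)

    def add_abnormality(self, name, attributes=()):
        self.abnormalities.add(name)
        graph = self.attribute_graphs.setdefault(name, AttributeGraph())
        graph.nodes.update(attributes)

    def connect(self, a, b):
        # Only abnormalities that are already in the graph can be linked.
        if a not in self.abnormalities or b not in self.abnormalities:
            raise KeyError("both abnormalities must be added first")
        self.edges.add(frozenset((a, b)))


atag = ATAG()
atag.add_abnormality("calcified granuloma", ["left", "upper lobe"])
atag.add_abnormality("atelectasis", ["right"])
atag.connect("calcified granuloma", "atelectasis")  # illustrative edge only
```

Keeping one attribute graph per abnormality mirrors the pairing described above: the same attribute term (e.g., "right") can appear under several abnormalities without being shared between them.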

4.1 Extracting abnormalities from X-ray annotations

For the first step, we identify the set of abnormalities and their associated attributes to be included. Instead of manually identifying them as in most of the existing methods, we propose to make use of the annotations provided in the dataset. For instance, each image annotation in the IU XRay dataset typically contains terms about the abnormality and the associated descriptors (attributes). We adopt RadLex, an ontology of radiology lexicon that is also employed in the chest X-ray report annotation guidance of the IU XRay dataset, to extract i) the abnormality term if found under RadLex’s “clinical finding” category, and ii) the associated attributes if found under RadLex’s different descriptor categories. E.g., “atelectasis” is a “clinical finding” and “right” is a “location descriptor” in RadLex. For an annotation without any clinical finding term, we use “other, [anatomical-part descriptor]” to denote the abnormality. We applied this methodology to the IU X-Ray dataset and retained the terms whose occurrence frequency is larger than a given threshold. The extraction results are reported in Table I.

Dataset      Freq.   Abn.   Atr.   Max. / Min. / Avg. / Std. per Abn.
IU XRay      10      41     106    47 / 1 / 11.5 / 11.0
IU XRay      20      28     79     40 / 1 / 13.9 / 10.6
IU XRay      30      23     64     34 / 1 / 14.0 / 9.7
MIMIC CXR    500     47     209    178 / 17 / 69.1 / 38.6
MIMIC CXR    1000    35     165    142 / 19 / 69.4 / 30.9
MIMIC CXR    2000    26     129    116 / 37 / 70.0 / 22.4
TABLE I: The statistics of abnormalities and attributes appearing in the datasets. “Freq.” stands for the frequency threshold of the extracted terms. “Abn.” and “Atr.” stand for the number of extracted abnormalities and attributes.
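The annotation-based extraction of Section 4.1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `TOY_LEXICON` is a tiny hypothetical stand-in for RadLex category lookups, and the function names are invented for this example.

```python
from collections import Counter

# Hypothetical stand-in for RadLex category lookups; the real ontology is far larger.
TOY_LEXICON = {
    "atelectasis": "clinical finding",
    "opacity": "clinical finding",
    "right": "location descriptor",
    "left": "location descriptor",
    "lung": "anatomical-part descriptor",
    "mild": "severity descriptor",
}


def parse_annotation(terms):
    """Split one annotation's terms into (abnormality, attributes)."""
    findings = [t for t in terms if TOY_LEXICON.get(t) == "clinical finding"]
    attributes = [t for t in terms if TOY_LEXICON.get(t, "").endswith("descriptor")]
    if findings:
        abnormality = findings[0]
    else:
        # No clinical finding term: fall back to "other, [anatomical part]".
        parts = [t for t in attributes
                 if TOY_LEXICON[t] == "anatomical-part descriptor"]
        abnormality = "other, " + parts[0] if parts else None
    return abnormality, attributes


def extract_terms(annotations, min_freq):
    """Keep abnormalities and attributes seen more than min_freq times."""
    abn_counts, attr_counts = Counter(), Counter()
    for terms in annotations:
        abnormality, attributes = parse_annotation(terms)
        if abnormality is None:
            continue
        abn_counts[abnormality] += 1
        attr_counts.update(attributes)
    keep = lambda counts: {t for t, c in counts.items() if c > min_freq}
    return keep(abn_counts), keep(attr_counts)


annotations = [
    ["atelectasis", "right"],
    ["atelectasis", "left", "mild"],
    ["opacity", "lung", "right"],
    ["lung", "left"],
]
abnormalities, attributes = extract_terms(annotations, min_freq=1)
```

Raising `min_freq` shrinks both sets, which is the trend Table I reports for the thresholds 10, 20 and 30 on IU XRay.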

4.2 Extracting abnormalities from X-ray reports

We can also leverage larger X-ray datasets (e.g., MIMIC CXR [16]) that have been made available recently. Very often detailed annotations are not provided, since preparing ground-truth annotations is costly. Alternatively, we can make use of the X-ray reports provided in the dataset, from which information related to the abnormalities and the associated descriptors can be extracted. An example is shown in Fig. 2. We first filter out sentences with negative or inconclusive mentions using publicly available tools, i.e., the clinical entity relationship parser RadGraph [13], an information extraction schema for structuring radiology reports, and the radiology entity extractor RadLex-Annotator [26]. The clinically related terms are then extracted from the reports using the open annotation API provided by [26]. The extracted terms under the “Clinical Finding” category are used to form the abnormality nodes, and those under the “RadLex Descriptor” category are used to form the attribute nodes in ATAG. Then, a dependency parser pre-trained on the MIMIC dataset (e.g., as in RadGraph [13]) can be used to locate the attributes of the extracted abnormalities via the dependency relationships. We consider the parsed relationships “located_at” and “modify” provided by RadGraph to determine the abnormality-to-attribute association. By applying this methodology to the MIMIC CXR dataset, ATAGs of different sizes are extracted, as shown in Table I.

Fig. 2: Illustration of extracting attributed abnormality terms from free-text radiology reports using RadLex annotator and dependency parser.
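The association step described in Section 4.2 can be sketched as below. The input format (an `entities` dictionary and a list of relation triples) is a hypothetical simplification and not the actual output schema of RadGraph or the RadLex annotator; only the two relation names and the two category names come from the text.

```python
ATTRIBUTE_RELATIONS = {"located_at", "modify"}


def attach_attributes(entities, relations):
    """Map each abnormality to the attributes linked to it by a kept relation.

    entities:  {entity_id: (text, category)}  (hypothetical parser output format)
    relations: [(source_id, relation, target_id), ...]
    """
    graph = {text: set() for text, cat in entities.values()
             if cat == "Clinical Finding"}
    for source, relation, target in relations:
        if relation not in ATTRIBUTE_RELATIONS:
            continue
        pair = (entities[source], entities[target])
        # Accept either direction: the edge may point from or to the finding.
        for (f_text, f_cat), (a_text, a_cat) in (pair, pair[::-1]):
            if f_cat == "Clinical Finding" and a_cat == "RadLex Descriptor":
                graph[f_text].add(a_text)
    return graph


entities = {
    1: ("opacity", "Clinical Finding"),
    2: ("left", "RadLex Descriptor"),
    3: ("lower lobe", "RadLex Descriptor"),
    4: ("effusion", "Clinical Finding"),
}
relations = [(2, "modify", 1), (1, "located_at", 3), (4, "suggestive_of", 1)]
attributed = attach_attributes(entities, relations)
```

Relations other than the two kept ones (here the finding-to-finding link) contribute no attribute nodes.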

5 Learning Attributed Abnormality Graph Embedding

We integrate the proposed ATAG into an encoder-decoder architecture similar to [45], as shown in Fig. 3. DenseNet [11] is used to extract the visual feature for computing the ATAG embedding. Specific graph attentional layers (to be detailed) are introduced to aggregate the representations from heterogeneous nodes. The ATAG embeddings are learned with the multi-abnormality and multi-attribute classification as the learning objective.

Fig. 3: The ATAG-based deep model architecture. An illustrated example of ATAG is presented in Part I, with the process of computing ATAG embedding shown in Part II and followed by the integration of ATAG and GATE with LSTM-based or Transformer-based decoder as depicted in Part III.

Given the visual features of size extracted from the frontal and lateral chest x-ray images using the DenseNet-121 [11, 45], we initialize the abnormality node features in ATAG using a spatial attention mechanism implemented by a convolutional layer . In particular, we set up channels with filter size as such that each channel outputs as the attention weight to indicate the particular image region to be attended by node . Then, the attending visual features for is computed by concatenating attention-weighted visual features


where denotes the concatenation operation. The visual feature for the global node is computed by the global average pooling of the visual features of all the other nodes.
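As a rough illustration of the node initialization described above, the sketch below pools region features with per-node attention weights and concatenates the results. Two points are assumptions made for this example and are not confirmed by the text: the attention logits are normalized with a softmax, and the concatenation is taken over the frontal and lateral views. Function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_regions(region_features, attention_logits):
    """Attention-weighted sum of region features for one node and one view."""
    weights = softmax(attention_logits)  # assumed normalization
    dim = len(region_features[0])
    return [sum(w * f[k] for w, f in zip(weights, region_features))
            for k in range(dim)]


def node_feature(frontal, lateral, frontal_logits, lateral_logits):
    # Assumed: concatenate the attended features of the two views.
    return (attend_regions(frontal, frontal_logits)
            + attend_regions(lateral, lateral_logits))


frontal = [[1.0, 0.0], [3.0, 2.0]]   # two regions, two channels
lateral = [[4.0, 4.0]]               # one region
feature = node_feature(frontal, lateral, [0.0, 0.0], [5.0])
```

With equal logits the attended feature reduces to the mean of the region features, so the node initially sees the whole image.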

In addition, for all the abnormalities, we also define a set of intrinsic concept embeddings to encode the a-priori information for the abnormalities and attributes. The abnormality node embeddings, denoted as , are then computed based on the attending visual features and concept embeddings using the graph attentional layer [36] on , given as



is a linear projection applied to reduce the dimension of the concatenated vector of the visual feature and concept embedding back to


For each attribute graph associated with the abnormality node , the visual features are first weighted by the channel output and then the attribute attention is computed using another convolutional layer , where


The idea is to put focus on the region where the abnormality node is attending for computing the corresponding attribute embedding. Similar to modeling abnormality, we also define for all the possible attributes the corresponding set of intrinsic concept embeddings where . The embedding of the attribute graph is then computed based on the concatenated attending visual features and the attribute concept embeddings with another linear projection applied as


For abnormality classification, and

are fed to a fully-connected layer with sigmoid functions employed. We learn the ATAG embedding end-to-end by optimizing the sum of the binary cross-entropy losses weighted by

of each , where is the report set and is the set of reports with mentioned in their annotations. Given the abnormality and attribute ground-truth labels, the total classification loss is defined as:



is a trade-off parameter between the abnormality and attribute loss functions, and

. In the training process, the global node is used to predict the existence of “no finding / normal” label, and is used to predict the existence of any attribute labels associated with .
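The classification objective can be sketched as a weighted sum of binary cross-entropy terms. The exact weighting and combination are not recoverable from the text here, so two assumptions are made and flagged in the code: each label is weighted by the inverse of its report frequency, and the attribute loss is scaled by the trade-off parameter before being added to the abnormality loss.

```python
import math


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def class_weights(num_reports, reports_per_label):
    # Assumed weighting: rarer labels get larger weights (|R| / |R_i|).
    return [num_reports / max(count, 1) for count in reports_per_label]


def weighted_bce(probs, labels, weights):
    return sum(w * bce(p, y) for p, y, w in zip(probs, labels, weights))


def total_loss(abn, attr, trade_off):
    """abn and attr are (probs, labels, weights) triples; the form is assumed."""
    return weighted_bce(*abn) + trade_off * weighted_bce(*attr)


weights = class_weights(100, [50, 10])
abn = ([0.9, 0.2], [1, 0], weights)
attr = ([0.5], [1], [1.0])
loss = total_loss(abn, attr, trade_off=0.5)
```

Up-weighting rare labels is one common way to counter the strong class imbalance of abnormality annotations; other weighting schemes would fit the same skeleton.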

6 Report Generation with ATAG Embedding

After obtaining the abnormality graph embedding and the attribute graph embedding, a context vector is derived to guide the generation of the report. With the objective of adaptively aligning the specific information captured in the two embeddings while different sentences in the report are being generated, we make reference to the decoder’s hidden state and propose a hierarchical attention mechanism for computing the attributed abnormality context vector, and a gating mechanism for adapting the abnormality graph embedding and the attribute graph embedding. We will also show how the two mechanisms can be applied to LSTM- and Transformer-based decoders.

6.1 Hierarchical Attention on ATAG Embeddings

Given the decoder’s current hidden state as the query, we can compute the attention to the attribute embeddings and then aggregate them. We then further compute the attention with reference to different abnormalities to implement the hierarchical attention.

We denote the overall ATAG context vector at time step as which is computed by aggregating and using a hierarchical attention mechanism , given as


Aggregating attribute graph embedding with attention. We first compute the aggregated context vector of the attribute graphs as:


where in which and , respectively, and with and being the learnable parameters. This aggregated attribute graph embedding aims to maintain the detailed information of clinical attributes, e.g., the positions “left” and “central” together with anatomical part “subclavian” of medical device “catheter tip”.

Attributed abnormality context vector. The attributed abnormality context vector is then computed by combining the aggregated attribute context and abnormality embedding as:


where . The corresponding attention values are expected to indicate the attending abnormalities and attributes nodes for the generation of the next sentence or token. Note that the embedding of the global abnormality node is computed by global average pooling.
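The two attention levels can be sketched as follows. This is a simplified illustration: plain dot-product scoring replaces the learnable scoring function, and the abnormality embedding is combined with its attribute context by element-wise addition; both choices are assumptions, as are the function names.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, vectors):
    """Attention-weighted average of vectors, scored against the query."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    return pooled, weights


def hierarchical_context(hidden, abnormality_embs, attribute_embs):
    """Level 1 pools each abnormality's attributes; level 2 pools abnormalities."""
    combined = []
    for abn, attrs in zip(abnormality_embs, attribute_embs):
        attr_context, _ = attend(hidden, attrs)
        # Assumed combination: element-wise sum of node and attribute context.
        combined.append([a + c for a, c in zip(abn, attr_context)])
    return attend(hidden, combined)


hidden = [1.0, 0.0]
abnormality_embs = [[1.0, 0.0], [0.0, 1.0]]
attribute_embs = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]
context, abn_weights = hierarchical_context(hidden, abnormality_embs, attribute_embs)
```

In this toy input the hidden state aligns with the first abnormality, so the second-level weights favour it; those weights are what indicate which attributed abnormality the next sentence or token attends to.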

6.2 Gating Mechanism for Adaptive Graph Embeddings

The knowledge graph embedding is expected to facilitate the report generation by the decoder, and it is mostly assumed to be unchanged while generating different abnormality observations [18, 45, 21, 23]. As sentences in a radiology report are correlated, the sentences generated so far should affect the next sentence to be generated, and the embedded knowledge required for the remaining decoding may also evolve accordingly. To facilitate this dynamic decoding process, we propose a gating mechanism that allows the abnormality and attribute graph embeddings of ATAG to evolve over time during the report generation. Given the current hidden state of the decoder, the graph embedding and the context vector as inputs, the gating mechanism is denoted as:


To prevent gradient explosion due to long sequence generation, we first compute an incremental graph embedding using a two-layer fully connected neural network with a residual connection, taking the graph embedding and the context vector as the input, given as




where the activation function is the Gaussian error linear unit (GELU), and the weight matrices and bias terms are learnable. The incremental graph embedding is expected to differentiate the attended and non-attended embeddings in the graph embedding with reference to the context vector.

In addition, to allow the knowledge embedding to be refined to ease the decoder for the subsequent decoding, a forget gate and an input gate are employed to determine the present and absent graph embeddings with reference to the hidden state, given as


where , and are the learnable parameters. Then, is computed according to the forget context and input context as:


The resulting graph embedding is expected to have some of the abnormality and attribute node embeddings already mentioned in the generated contents “forgotten”. Such a refined graph embedding could progressively reduce the number of abnormality and attribute node embeddings to be attended by the decoder.
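A toy version of the gated update for a single node embedding is sketched below. It is a simplification under explicit assumptions: scalar gate weights stand in for the learnable matrices, a single GELU layer replaces the two-layer network, and the new embedding is formed LSTM-style as forget gate times old embedding plus input gate times incremental embedding. None of these details are confirmed by the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x):
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def gate_update(embedding, context, hidden, w_forget=1.0, w_input=1.0):
    """Update one node embedding; scalar weights stand in for learned matrices."""
    # Incremental embedding with a residual connection (one layer for brevity).
    increment = [e + gelu(e + c) for e, c in zip(embedding, context)]
    # Gates conditioned on the decoder hidden state.
    forget = [sigmoid(w_forget * h) for h in hidden]
    write = [sigmoid(w_input * h) for h in hidden]
    # Assumed LSTM-style combination of old and incremental embeddings.
    return [f * e + i * d
            for f, i, e, d in zip(forget, write, embedding, increment)]


updated = gate_update([1.0], [0.0], [0.0])
forgotten = gate_update([1.0], [0.0], [-50.0])
```

When the gates close (strongly negative pre-activations in this toy setting), the node embedding shrinks towards zero, which is the "forgetting" behaviour described above for content that has already been generated.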

6.3 Decoding with Hierarchical LSTM

A two-level long short-term memory (LSTM) network [8] is commonly used as the decoder for report generation, where a top-level LSTM predicts the abnormalities (topics) for each sentence, and a bottom-level LSTM generates the description of the particular abnormalities. Given the ATAG context vector, the top-level LSTM generates the sentence topic at each step as:


where the hidden state is initialized by concatenating the global average of abnormality and attribute node embeddings. The bottom level computes the hidden state for each word and generates the word sequence for the sentence as:


where each word is predicted by in the generated report where and are learnable parameters.

Integration with gating mechanism. The bottom-level LSTM generates the sequence of words guided by the topic information from the top level and the attended context vector. However, the actual generated content only covers part of the topic. This inconsistency between the two LSTM levels limits the decoder’s ability to generate complete descriptions of the detected abnormalities and attributes. In addition, each sentence is generated by attending to the proper attributed abnormality embedding among a large number of embeddings (for example, ATAG (41+106) for IU XRay has 41 abnormality nodes and 472 attribute nodes in total, where each node is associated with an embedding vector). The effectiveness of the attention module is expected to improve if the values of the relevant node embeddings are boosted.

To supplement the attended but not generated context in generating the next sentences and enhance the attention effectiveness, the graph embedding is expected to increase the chances for the corresponding graph embedding for the decoder. The structure is illustrated by Fig. 3. Thus, after decoding each sentence as Eq.(13), the generation module is followed by operating the gate mechanism to update the graph embedding as,


where is the individual context vector of . The aggregated hidden state is computed by the outputs of both sentence- and word-level LSTMs and as,


where a linear projection is applied. The sentence-level hidden state from the top level aims to maintain the information of the attended attributed abnormality embedding that is expected to be generated in the current sentence. Meanwhile, the word-level hidden state is expected to retain the detailed descriptions of particular attributes in that sentence. The updated graph embeddings are then expected to help the attention module look up the proper context embedding in the following generation. The chances of accurately describing abnormalities and attributes in the report are thus increased.

6.4 Decoding with Transformer

To generate the proper report by attending to the corresponding graph embeddings, we also integrate the proposed ATAG with the effective Transformer-based decoder. The Transformer-based decoder is built from multi-head attention (MHA) [34], which is composed of multiple parallel scaled dot-product attention modules. Given the mixed embedding of each word, formed from its word embedding and positional embedding, the decoder generates each word with multiple MHA layers. For each layer, the hidden state is computed as,


Then, is decoded by and with and as,


where and are followed by dropout [32], skip connection [10] and layer normalization [2] to alleviate the data bias. We use the last layer output to predict each word .

Integration with gating mechanism. The hidden state is generated by attending the preceding token sequence which covers multiple observations in the multiple sentences generated. Thus, evolving graph embedding by token-level would cause to “forget” both attended node embeddings in the preceding sentences and non-attended node embeddings that would be attended in the current sentence .

To evolve the graph embedding by the actually attended attributed abnormalities in the generated content, the corresponding fine-grained is first to be computed by -times recursion inside each Transformer layer. As illustrated in Fig. 3, for -th recursion in -th layer, the context vector of attributed abnormality embedding is computed as,


where , and . For -th recursion, is initialized by , and the last recursion is taken as the output for -th layer. Accordingly, the graph embedding is accordingly evolved by as,


where and are taken as the and for the following generation. In this way, the gating mechanism will be performed times to refine graph embedding in a fine-grained manner. It is expected to enhance the Transformer decoder to generate the accurate descriptions of abnormalities and associated attributes.

7 Clinical Accuracy Evaluation

To measure the clinical accuracy of the generated radiology reports in a more fine-grained manner, we propose a new metric, the Radiology Report Quality Index (RadRQI), to evaluate the accuracy of radiology-related abnormalities together with their clinical attributes.

Given a radiology report, the radiology-related terms of abnormalities and their attributes are first extracted using the RadLex ontology [26] via its open annotation API. Keywords found under RadLex’s category i) “Clinical Finding” are taken as abnormalities, and ii) “Clinical Descriptor” are taken as clinical attributes. An example is shown in Fig. 1. Next, the negation and association of each abnormality or attribute are determined using RadGraph, an entity and relation parser trained on the CheXpert [12] and MIMIC CXR [16] radiology report datasets. A clinical term is taken as i) “Positive” if labeled as “Definitely Present”, and ii) “Negative” if labeled as “Definitely Absent”. An attribute is associated with an abnormality if the “Modify” or “Located At” relation is detected between them. As a result, tuples of the form “(Abnormality, Negation, [Attribute1, Attribute2, …])” are extracted for the subsequent RadRQI score calculation. Similar to previous work [7, 5, 27], the probable (but not definite) existence of clinical findings is not considered.

By extracting the abnormalities together with their attributes from both the generated and ground truth reports, the precision, recall and F-1 measure scores are computed for each abnormality category. For each “positive” mentioned abnormality, similar to [45], the number of True Positives (TP) considers also the correct hits of the corresponding attributes,


where is the weight of the attribute term accuracy to determine its contribution in the overall TP calculation. The proposed RadRQI-F1 score aims to reflect the correctness of mentioned abnormality with associated attributes. In addition, the number of abnormality categories with non-zero F1 score, denoted as RadRQI-Hits, is also reported to show the coverage of distinct abnormality categories in the generated reports. To avoid rare abnormalities, we compute RadRQI by considering only top- abnormalities in terms of their frequencies in the datasets.
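The per-category scoring can be illustrated with the sketch below. The exact true-positive formula is not recoverable from the text here, so the form used is an assumption: a correctly mentioned abnormality earns a base credit of one minus the attribute weight, plus the attribute weight times the fraction of reference attributes that were reproduced. Function names and the default weight are hypothetical.

```python
def attribute_aware_tp(generated_attrs, reference_attrs, weight):
    """Credit for one correctly mentioned abnormality (assumed form)."""
    if not reference_attrs:
        return 1.0
    reference = set(reference_attrs)
    hit_rate = len(set(generated_attrs) & reference) / len(reference)
    return (1.0 - weight) + weight * hit_rate


def score_category(pairs, weight=0.5):
    """pairs: (generated_attrs or None, reference_attrs or None) per report.

    None means the abnormality is not positively mentioned in that report.
    """
    tp = n_generated = n_reference = 0.0
    for generated, reference in pairs:
        n_generated += generated is not None
        n_reference += reference is not None
        if generated is not None and reference is not None:
            tp += attribute_aware_tp(generated, reference, weight)
    precision = tp / n_generated if n_generated else 0.0
    recall = tp / n_reference if n_reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


pairs = [
    (["left"], ["left", "upper lobe"]),  # mentioned in both, one attribute hit
    (None, ["right"]),                   # missed by the generated report
    ([], None),                          # mentioned only in the generated report
]
precision, recall, f1 = score_category(pairs)
```

Under this sketch, a RadRQI-Hits style count would be the number of abnormality categories whose F1 is non-zero. Note that missed and spurious mentions lower recall and precision respectively, in line with the not-mentioned terms being counted.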

Note that the Medical Image Report Quality Index (MIRQI), proposed in [45], also measures the correctness of the attributes of the mentioned abnormalities in the generated reports. However, MIRQI evaluates a small set of abnormalities covering 12 disease categories (labeled by the CheXpert labeling toolkit) and takes irrelevant words (e.g., stop words like “is” and “no”) as the corresponding attributes of some abnormalities. Meanwhile, MIRQI does not consider terms which are found in the ground truth but not mentioned in the generated report, nor those found in the generated report but not mentioned in the ground truth. By ignoring those not-mentioned terms in the evaluation, it favors methods which keep generating only a few correct abnormalities while missing many others, thus producing misleading evaluation results. As illustrated in Fig. 4, by ignoring the not-mentioned terms, MIRQI yields higher precision and recall than RadRQI, which also counts the not-mentioned terms. Compared with MIRQI, which gives a high score to a partially correct generated report, RadRQI gives a more reliable evaluation of the medical term accuracy in the generated report.

Fig. 4: Illustration of MIRQI and the proposed RadRQI for calculating the precision and recall of a generated report with two positive mentions and three negative mentions of abnormalities.

8 Experiments

8.1 Datasets and Evaluation Metrics

We use two publicly available datasets, IU X-Ray [7] and MIMIC CXR [16], for performance evaluation. The statistics of the datasets are shown in Table II. For the IU X-Ray dataset, similar to [18, 45], we extract only the reports with both frontal and lateral view images, complete findings/impression sections and available annotations, resulting in 2,848 cases and 5,696 images. We tokenize all the words in the reports and filter out tokens with frequency less than three, resulting in 1,028 unique tokens. We partition the data into training/validation/test sets by 7:1:2 for five-fold cross validation. For the MIMIC CXR dataset, we apply an open source tool to extract the findings/impression sections as the target report and filter out tokens with frequency less than 10, resulting in 4,936 distinct tokens. We follow the original split, with training/validation/test sizes of 222,705 / 1,807 / 3,269. We report the average performance scores of three different runs.
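The frequency-based token filtering can be sketched as below. This is a minimal example assuming lowercased whitespace tokenization and a hypothetical `<unk>` placeholder for filtered tokens; the paper does not specify either detail.

```python
from collections import Counter

def build_vocab(reports, min_freq=3, unk="<unk>"):
    """Keep tokens that occur at least `min_freq` times across all reports.

    Whitespace tokenization and the `unk` placeholder are assumptions.
    """
    counts = Counter(tok for report in reports for tok in report.lower().split())
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    return [unk] + kept


reports = ["no acute disease", "no acute findings", "no pleural effusion"]
vocab_strict = build_vocab(reports, min_freq=3)
vocab_loose = build_vocab(reports, min_freq=2)
```

With the threshold of three used for IU X-Ray, only "no" survives in this toy corpus; lowering the threshold to two also keeps "acute".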

| Dataset | IU XRay [7] | MIMIC CXR [16] |
| --- | --- | --- |
| Image # | 7,470 | 473,057 |
| Report # | 3,996 | 206,563 |
| Case # | 3,996 | 63,478 |
| Avg. Len. | 38.3 | 53.2 |
| Avg. Sentence # | 5.8 | 5.5 |
| Avg. Sentence Len. | 6.5 | 10.8 |
TABLE II: The statistics of the datasets used in our experiments.

Regarding evaluation metrics, the AUCs of the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve are used for measuring the multi-label classification performance. The micro-average score is reported. For report quality, we adopt common natural language generation metrics such as BLEU [29], ROUGE [20] and CIDEr [35], which measure the similarity between the generated report and the ground truth. For clinical accuracy of the generated report, we adopt the clinical efficacy (CE) metrics [5] and their modifications [27, 18] to evaluate the accuracy of the presence status of a series of observations against the ground truth. We use the CheXpert labeling toolkit to label 12 different thoracic diseases together with “medical device” and “normality”. The micro-average F1 scores and the average number of classes with non-zero F1 scores, denoted as CE(Hits), are reported. We also adopt the proposed RadRQI metric to evaluate the clinical accuracy of a large number of abnormalities and attributes. We focus on the more common abnormalities and attributes, and thus measure the RadRQI scores for the top-25 and top-50 abnormalities for IU XRay and MIMIC CXR, respectively. The weight of the attribute term accuracy is set to 0.5, indicating the equal importance of abnormalities and their associated attributes.
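As an illustration of the micro-average F1 and the Hits count, the following sketch computes both from per-report label sets. The function names and the input layout are assumptions for the example; in practice the labels would come from a labeler such as the CheXpert toolkit.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; 0.0 when the class is never predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1_and_hits(pred, gold):
    """pred, gold: lists of label sets, one set of positive labels per report.

    Returns (micro-average F1, number of classes with non-zero F1).
    """
    labels = set().union(*pred, *gold)
    total_tp = total_fp = total_fn = 0
    hits = 0
    for lab in labels:
        tp = sum(1 for p, g in zip(pred, gold) if lab in p and lab in g)
        fp = sum(1 for p, g in zip(pred, gold) if lab in p and lab not in g)
        fn = sum(1 for p, g in zip(pred, gold) if lab not in p and lab in g)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
        if f1_score(tp, fp, fn) > 0:
            hits += 1
    return f1_score(total_tp, total_fp, total_fn), hits


pred = [{"Edema"}, {"Edema", "Pneumonia"}, set()]
gold = [{"Edema"}, {"Atelectasis"}, {"Pneumonia"}]
micro_f1, hits = micro_f1_and_hits(pred, gold)
```

In this toy example the pooled counts are TP = 1, FP = 2 and FN = 2, so the micro-average F1 is 1/3, and only "Edema" has a non-zero per-class F1, giving one hit.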

8.2 Baselines for Performance Comparison

We first evaluate the performance of ATAG for multi-label classification. We compare variants of ATAG with a number of baselines in which DenseNet [11] is adopted for visual feature extraction but with different encoders and different numbers of labels, denoted as DenseNet[+Encoder]([# Labels]). The encoders tested include a fully-connected layer (DenseNet) [12], a knowledge graph (DenseNet+KG) [45], the abnormality graph in ATAG (DenseNet+AG), and the ATAG with both the abnormality graph and the set of associated attribute graphs (DenseNet+ATAG). Regarding the labels, “(20)” refers to the 20 labels used in [45]. For the IU X-Ray dataset, “(41)” corresponds to the 41 abnormalities in ATAG, and “(41+106)” additionally includes 106 distinct attributes. For the MIMIC CXR dataset, “(47)” refers to the 47 abnormalities in ATAG, and “(47+209)” additionally includes 209 distinct attributes. DenseNet+KG(20) is equivalent to [45], and DenseNet+ATAG(41+106) and DenseNet+ATAG(47+209) refer to our proposed methods. All input images are resized before being fed into the DenseNet, and no further normalization pre-processing is applied.

To evaluate the effectiveness of the ATAG-based approach for report generation, we integrate ATAG with both LSTM-based and Transformer-based decoders. For the LSTM-based decoders, we test five state-of-the-art LSTM-based report generation models as the baselines: the classical CNN-RNN model WordSAT [39] with a one-level LSTM decoder, the AdaAttn model [25] with an adaptive attention module and a one-level LSTM decoder, SentSAT [44] with a two-level LSTM decoder, CoAtt [15], which adds label features to SentSAT, and SentSAT+KG [45], which utilizes a knowledge graph with 20 abnormalities. Accordingly, SentSAT+ATAG and SentSAT+ATAG+GATE refer to the basic two-level LSTM integrated with our proposed ATAG and GATE modules. For the Transformer-based decoders, we integrate ATAG with the basic Transformer alone (Trans.+ATAG) and together with the proposed GATE module (Trans.+ATAG+GATE). We compare the performance of the proposed ATAG-based Transformer models with the vanilla Transformer, the state-of-the-art image captioning model M2 Trans. [6] with a memory-enhanced Transformer encoder, and two open source report generation models, R2Gen [5] and R2Gen-CMN [4].

8.3 Experiment Settings

We adopt the DenseNet-121 pretrained on the CheXpert dataset as the visual encoder. We use the implementation in the deep graph Python library for the GAT used in our graph embedding learning. For the report generation, the dimension of the hidden states in all LSTM decoders is 512. For the Transformer-based decoders, the dimension of the hidden states, the number of heads, the number of layers and the number of loops are set to 512, 8, 2 and 3, respectively. Two-phase training is adopted, where the encoder is trained first and then fixed during the training of the decoder [45]. For IU XRay, the encoder is trained with the learning rate 1e-6 for 150 epochs, followed by the decoder with the learning rate 1e-4 for 100 epochs. The mini-batch size for the training is 8. For MIMIC CXR, the encoder and decoder are trained for 32 epochs using a mini-batch size of 16 and the learning rates 1e-6 and 1e-4, respectively.
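For reference, the settings above can be collected into a small configuration sketch. The dictionary layout and key names are hypothetical, and the MIMIC CXR entry reads "32 epochs" as applying to each training phase, which is an assumption.

```python
# Hyperparameters as stated in the text; key names are illustrative only.
DECODER_CONFIG = {
    "lstm": {"hidden_size": 512},
    "transformer": {"hidden_size": 512, "num_heads": 8, "num_layers": 2, "num_loops": 3},
}

# Two-phase training: the encoder is trained first, then frozen for the decoder.
TRAINING_CONFIG = {
    "IU XRay": {
        "batch_size": 8,
        "encoder": {"lr": 1e-6, "epochs": 150},
        "decoder": {"lr": 1e-4, "epochs": 100},
    },
    "MIMIC CXR": {
        "batch_size": 16,
        # Assumption: 32 epochs for each phase.
        "encoder": {"lr": 1e-6, "epochs": 32},
        "decoder": {"lr": 1e-4, "epochs": 32},
    },
}

def phases(dataset):
    """Yield (phase_name, settings) in training order for a dataset."""
    cfg = TRAINING_CONFIG[dataset]
    for name in ("encoder", "decoder"):
        yield name, {**cfg[name], "batch_size": cfg["batch_size"]}

iu_phases = list(phases("IU XRay"))
```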

8.4 Performance on Multi-label Classification

We report the ROC-AUC and PR-AUC scores of all the models for comparing the classification accuracy, as shown in Table III. The models trained using the abnormalities and attributes we extracted give significantly better prediction results. Also, the use of ATAG gives the best ROC-AUC and PR-AUC scores on average, implying the effectiveness of incorporating the attributed abnormality graph for the classification. Comparing the models with and without ATAG, it is clear that the attributes introduced in ATAG do lead to accuracy improvement.

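| Dataset | Metric | Model | Abn. | Atr. |
| --- | --- | --- | --- | --- |
| IU XRay | ROC-AUC (std.) | DenseNet (20) [12] | 0.740 (0.019) | - |
| IU XRay | ROC-AUC (std.) | DenseNet + KG (20) [45] | 0.728 (0.002) | - |
| IU XRay | ROC-AUC (std.) | DenseNet (41) | 0.890 (0.009) | - |
| IU XRay | ROC-AUC (std.) | DenseNet + AG (41) | 0.888 (0.003) | - |
| IU XRay | ROC-AUC (std.) | DenseNet (41+107) | 0.884 (0.012) | 0.560 (0.054) |
| IU XRay | ROC-AUC (std.) | DenseNet + ATAG (41+107) | 0.892 (0.006) | 0.686 (0.069) |
| IU XRay | PR-AUC (std.) | DenseNet (20) [12] | 0.092 (0.024) | - |
| IU XRay | PR-AUC (std.) | DenseNet + KG (20) [45] | 0.595 (0.103) | - |
| IU XRay | PR-AUC (std.) | DenseNet (41) | 0.793 (0.099) | - |
| IU XRay | PR-AUC (std.) | DenseNet + AG (41) | 0.795 (0.102) | - |
| IU XRay | PR-AUC (std.) | DenseNet (41+107) | 0.801 (0.109) | 0.530 (0.104) |
| IU XRay | PR-AUC (std.) | DenseNet + ATAG (41+107) | 0.810 (0.110) | 0.799 (0.132) |
| MIMIC CXR | ROC-AUC (std.) | DenseNet (47) | 0.897 (0.001) | - |
| MIMIC CXR | ROC-AUC (std.) | DenseNet + AG (47) | 0.916 (0.005) | - |
| MIMIC CXR | ROC-AUC (std.) | DenseNet (47+209) | 0.894 (0.003) | 0.565 (0.031) |
| MIMIC CXR | ROC-AUC (std.) | DenseNet + ATAG (47+209) | 0.907 (0.006) | 0.683 (0.058) |
| MIMIC CXR | PR-AUC (std.) | DenseNet (47) | 0.510 (0.088) | - |
| MIMIC CXR | PR-AUC (std.) | DenseNet + AG (47) | 0.519 (0.120) | - |
| MIMIC CXR | PR-AUC (std.) | DenseNet (47+209) | 0.513 (0.103) | 0.440 (0.201) |
| MIMIC CXR | PR-AUC (std.) | DenseNet + ATAG (47+209) | 0.529 (0.090) | 0.509 (0.135) |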
TABLE III: Performance on multi-label classification (AUC) over all the categories being trained. “Abn.” and “Atr.” stand for abnormality classification and attribute classification, respectively.

8.5 Performance on Report Generation

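| Dataset | Decoder | Model | CE (5) | CE (14-1) | CE (14) | CE Hits | RadRQI (5) | RadRQI (14-1) | RadRQI Top-K | RadRQI Hits | B. | R. | C. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IU XRay | LSTM | WordSAT (20) [39] | 0.085 | 0.074 | 0.175 | 4.8 | 0.019 | 0.018 | 0.024 | 3.6 | 0.262 | 0.369 | 0.317 |
| IU XRay | LSTM | SentSAT (20) [44] | 0.087 | 0.083 | 0.171 | 5.6 | 0.012 | 0.013 | 0.030 | 5.8 | 0.261 | 0.363 | 0.344 |
| IU XRay | LSTM | CoAttn (20) [15] | 0.056 | 0.068 | 0.167 | 5.2 | 0.009 | 0.013 | 0.023 | 5.0 | 0.274 | 0.365 | 0.318 |
| IU XRay | LSTM | SentSAT+KG (20) [45] | 0.061 | 0.069 | 0.173 | 4.8 | 0.012 | 0.012 | 0.024 | 3.6 | 0.275 | 0.374 | 0.351 |
| IU XRay | LSTM | WordSAT (41) [39] | 0.194 | 0.140 | 0.249 | 5.6 | 0.074 | 0.065 | 0.060 | 6.4 | 0.267 | 0.369 | 0.359 |
| IU XRay | LSTM | AdaAttn (41) [25] | 0.203 | 0.147 | 0.258 | 6.6 | 0.070 | 0.066 | 0.068 | 7.6 | 0.269 | 0.367 | 0.358 |
| IU XRay | LSTM | SentSAT (41) [44] | 0.223 | 0.164 | 0.268 | 7.6 | 0.067 | 0.064 | 0.061 | 9.8 | 0.272 | 0.362 | 0.326 |
| IU XRay | LSTM | CoAttn (41) [15] | 0.143 | 0.108 | 0.220 | 5.8 | 0.046 | 0.045 | 0.055 | 6.8 | 0.259 | 0.364 | 0.340 |
| IU XRay | LSTM | SentSAT (41+106) [44] | 0.157 | 0.123 | 0.229 | 5.6 | 0.052 | 0.048 | 0.056 | 6.4 | 0.261 | 0.357 | 0.307 |
| IU XRay | LSTM | SentSAT+AG (41) | 0.164 | 0.110 | 0.227 | 4.8 | 0.078 | 0.054 | 0.043 | 4.6 | 0.323 | 0.374 | 0.297 |
| IU XRay | LSTM | SentSAT+ATAG (41+106) | 0.190 | 0.145 | 0.244 | 7.2 | 0.062 | 0.065 | 0.069 | 10.2 | 0.255 | 0.351 | 0.356 |
| IU XRay | LSTM | SentSAT+ATAG+GATE (41+106) | 0.216 | 0.178 | 0.263 | 8.6 | 0.066 | 0.068 | 0.079 | 12.4 | 0.264 | 0.349 | 0.349 |
| IU XRay | Transformer | Transformer [34] | 0.124 | 0.112 | 0.310 | 5.0 | 0.052 | 0.038 | 0.072 | 9.0 | 0.264 | 0.357 | 0.587 |
| IU XRay | Transformer | M2 Trans. [6] | 0.130 | 0.111 | 0.205 | 8.0 | 0.029 | 0.030 | 0.040 | 8.0 | 0.255 | 0.367 | 0.313 |
| IU XRay | Transformer | R2Gen [5] | 0.115 | 0.127 | 0.289 | 9.0 | 0.040 | 0.057 | 0.071 | 10.0 | 0.251 | 0.342 | 0.461 |
| IU XRay | Transformer | R2Gen-CMN [4] | 0.098 | 0.121 | 0.290 | 8.0 | 0.034 | 0.037 | 0.056 | 10.0 | 0.294 | 0.370 | 0.681 |
| IU XRay | Transformer | Trans.+AG (41) | 0.207 | 0.179 | 0.277 | 9.6 | 0.063 | 0.054 | 0.069 | 6.4 | 0.256 | 0.357 | 0.304 |
| IU XRay | Transformer | Trans.+ATAG (41+106) | 0.184 | 0.178 | 0.262 | 10.0 | 0.050 | 0.047 | 0.072 | 13.8 | 0.246 | 0.334 | 0.334 |
| IU XRay | Transformer | Trans.+ATAG+GATE (41+106) | 0.230 | 0.207 | 0.279 | 10.2 | 0.059 | 0.057 | 0.074 | 12.6 | 0.256 | 0.341 | 0.380 |
| MIMIC CXR | LSTM | WordSAT (47) [39] | 0.326 | 0.294 | 0.290 | 10.0 | 0.109 | 0.099 | 0.174 | 17.3 | 0.160 | 0.249 | 0.082 |
| MIMIC CXR | LSTM | AdaAttn (47) [25] | 0.367 | 0.338 | 0.334 | 12.0 | 0.135 | 0.130 | 0.177 | 25.3 | 0.151 | 0.248 | 0.096 |
| MIMIC CXR | LSTM | SentSAT (47) [44] | 0.366 | 0.329 | 0.326 | 11.3 | 0.122 | 0.121 | 0.186 | 20.0 | 0.182 | 0.252 | 0.073 |
| MIMIC CXR | LSTM | CoAttn (47) [15] | 0.315 | 0.288 | 0.286 | 9.0 | 0.121 | 0.108 | 0.164 | 17.7 | 0.181 | 0.253 | 0.070 |
| MIMIC CXR | LSTM | SentSAT (47+209) [44] | 0.359 | 0.315 | 0.312 | 12.0 | 0.139 | 0.131 | 0.181 | 22.0 | 0.178 | 0.247 | 0.065 |
| MIMIC CXR | LSTM | SentSAT+AG (47) | 0.313 | 0.301 | 0.367 | 8.0 | 0.096 | 0.081 | 0.139 | 16.7 | 0.175 | 0.250 | 0.068 |
| MIMIC CXR | LSTM | SentSAT+ATAG (47+209) | 0.367 | 0.304 | 0.301 | 13.0 | 0.123 | 0.131 | 0.188 | 27.5 | 0.181 | 0.249 | 0.080 |
| MIMIC CXR | LSTM | SentSAT+ATAG+GATE (47+209) | 0.403 | 0.353 | 0.291 | 13.0 | 0.125 | 0.135 | 0.196 | 28.0 | 0.176 | 0.250 | 0.079 |
| MIMIC CXR | Transformer | Transformer [34] | 0.279 | 0.269 | 0.267 | 13.0 | 0.095 | 0.087 | 0.188 | 27.0 | 0.126 | 0.164 | 0.167 |
| MIMIC CXR | Transformer | M2 Trans. [6] | 0.440 | 0.391 | 0.385 | 13.3 | 0.140 | 0.135 | 0.231 | 35.7 | 0.159 | 0.250 | 0.100 |
| MIMIC CXR | Transformer | R2Gen [5] | 0.268 | 0.298 | 0.293 | 13.0 | 0.098 | 0.104 | 0.159 | 26.0 | 0.124 | 0.160 | 0.170 |
| MIMIC CXR | Transformer | R2Gen-CMN [4] | 0.313 | 0.309 | 0.305 | 10.0 | 0.119 | 0.108 | 0.182 | 26.0 | 0.123 | 0.163 | 0.128 |
| MIMIC CXR | Transformer | Trans.+AG (47) | 0.400 | 0.371 | 0.367 | 12.7 | 0.150 | 0.152 | 0.215 | 33.7 | 0.142 | 0.237 | 0.086 |
| MIMIC CXR | Transformer | Trans.+ATAG (47+209) | 0.442 | 0.400 | 0.395 | 14.0 | 0.158 | 0.164 | 0.258 | 40.0 | 0.151 | 0.227 | 0.166 |
| MIMIC CXR | Transformer | Trans.+ATAG+GATE (47+209) | 0.417 | 0.377 | 0.372 | 14.0 | 0.149 | 0.172 | 0.266 | 41.0 | 0.145 | 0.225 | 0.160 |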
TABLE IV: Performance comparison of report generation models evaluated by two clinical accuracy metrics and NLG metrics. “Top-K” are set to top-25 and top-50 in IU XRay and MIMIC CXR, respectively. The best scores are in bold face and the second best are underlined. “B.”, “R.” and “C.” stand for BLEU, ROUGE and CIDEr scores.

To evaluate the effectiveness of generating clinically accurate reports, we adopt the clinical efficacy metric and our proposed metric RadRQI. The clinical efficacy score essentially measures the accuracy of 14 clinical observations (No finding, Enlarged Cardiomediastinum, Cardiomegaly, Lung lesion, Lung opacity, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural effusion, Pleural other, Fracture, and Support devices) by comparing CheXpert-based labeling results obtained from the generated reports and the ground truth. The CE score reflects the correctness of the existence status of certain abnormalities and normality mentioned in the generated report. In Table IV, we report the accuracy of the 5 most represented of the 14 observations (Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural effusion), denoted as CE(5) [27]; of the 13 abnormality observations, i.e., excluding No finding, denoted as CE(14-1) [19]; and of all 14 clinical observations, denoted as CE(14) [5, 4, 28].

For models using LSTM-based decoders, we notice that ATAG+GATE can enhance them to generate more clinically accurate reports based on the CE metrics. It suggests that integrating our proposed ATAG and the gating mechanism with the LSTM decoder can enhance both accuracy and coverage of present abnormalities in the generated reports.

We also test models with Transformer-based decoders Trans.+ATAG. They obtain either the best or comparable scores of CE(5), CE(14-1) and abnormality coverage CE(Hits), indicating that a more powerful decoder can better utilize the ATAG embedding to enhance the clinical accuracy of the generated reports. In particular, we notice that Transformer can achieve a high CE(14) score for IU XRay but not for CE(14-1) and CE(Hits) scores. This is due to the fact that the model generates common sentences like “No finding” which make the accuracy of the “normality” high. Without taking into account “normality”, CE(14-1) and CE(Hits) scores drop sharply, indicating that many abnormalities are in fact missed.

We also compare methods using the RadRQI score which corresponds to the accuracy based on a larger number of abnormalities and their associated attributes. In general, the value of RadRQI score is lower than the CE score because the evaluation is more strict with the accuracy of the associated attributes also taken into consideration. Our results again show that the ATAG-based models outperform the existing methods based on the RadRQI score.

This affirms the effectiveness of ATAG-based models to generate clinically accurate reports with more fine-grained details. Also, they can cover more abnormalities in the generated reports. According to Table IV, Trans.+ATAG(+GATE) is able to detect at least five more abnormalities than the evaluated baseline models. By further contrasting different decoders integrated with ATAG, those integrated with LSTM-based decoders perform better for the IU XRay dataset (smaller scale). Meanwhile, an average 61.9% improvement can be achieved by integrating ATAG with the Transformer-based decoder for the MIMIC CXR dataset.

To evaluate the language quality of the generated reports, we adopt language quality metrics (e.g., BLEU), as reported in Table IV. We notice that the gain in clinical accuracy due to the incorporation of the proposed ATAG and GATE does not compromise the language quality. Note that we used only the vanilla Transformer as the decoder in the experiment. We anticipate that more powerful decoders (such as MemoryTrans. [5] and AlignTrans. [42]), if adopted, should be able to further enhance the overall performance.

(a) LSTM-based Model in IU XRay
(b) Transformer-based Model in IU XRay
(c) LSTM-based Model in MIMIC CXR
(d) Transformer-based Model in MIMIC CXR
Fig. 5: Performance of the baselines and our proposed approach with respect to various settings of Top-K in calculating the RadRQI score.

8.6 Sensitivity Analysis

8.6.1 Size of the ATAG

To analyze how the size of the introduced ATAG affects the clinical accuracy of the generated reports, we construct ATAGs with different numbers of abnormalities and attributes. To do so, we apply different thresholds on the occurrence frequency of the abnormalities and attributes in the dataset, as shown in Table I. Table V shows the performance of the models with ATAGs of different sizes.
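The frequency thresholding that controls the ATAG size can be sketched as follows. The function name, the separate thresholds for abnormalities and attributes, and the tuple input format are assumptions for illustration; the actual thresholds are those listed in Table I.

```python
from collections import Counter

def select_graph_nodes(tuples, min_abn_count, min_attr_count):
    """Pick abnormality and attribute nodes that occur often enough.

    tuples: iterable of (abnormality, negation, [attributes]).
    Returns (set of abnormalities, set of attributes). Illustrative only.
    """
    abn_counts, attr_counts = Counter(), Counter()
    for abnormality, _negation, attributes in tuples:
        abn_counts[abnormality] += 1
        attr_counts.update(attributes)
    abnormalities = {a for a, c in abn_counts.items() if c >= min_abn_count}
    attributes = {a for a, c in attr_counts.items() if c >= min_attr_count}
    return abnormalities, attributes


tuples = [
    ("opacity", "Positive", ["patchy", "left"]),
    ("opacity", "Positive", ["patchy"]),
    ("effusion", "Negative", []),
    ("nodule", "Positive", ["left"]),
]
small = select_graph_nodes(tuples, min_abn_count=2, min_attr_count=2)
large = select_graph_nodes(tuples, min_abn_count=1, min_attr_count=1)
```

Raising the thresholds shrinks the graph to the frequent nodes only, while lowering them admits rarer abnormalities and attributes.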

In general, a larger ATAG covers more abnormalities, so the decoder has a higher chance of generating descriptions with more abnormalities and attributes. With reference to the RadRQI(14-1) and RadRQI(TopK) columns, a larger ATAG can improve the accuracy of both the specific and the overall abnormalities with associated attributes. One possible reason is that capturing more detailed abnormalities and their relationships with the help of the ATAG allows the models to better distinguish different abnormalities.

For the smaller dataset IU XRay, the improvement from ATAG(28+79) to ATAG(41+106) is marginal, probably because the additional abnormalities and attributes are rare, so capturing them contributes little to the overall accuracy. When the dataset is large (like MIMIC CXR), the gain from increasing the ATAG size is obvious. For example, the Transformer-based decoder with ATAG(47+209) achieves an improvement of 15.0% on RadRQI(TopK). This observation is important because the setting is closer to real-world situations. We also observe that adopting LSTM-based decoders could limit the effectiveness of ATAG for large datasets. For instance, the highest RadRQI(TopK) score of SentSAT+ATAG is obtained with ATAG(35+165), and larger ATAGs in fact decrease the performance in our experiment. This also indicates that the optimal design of the ATAG depends on the decoder it is integrated with.

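| Dataset | Decoder | (Abn.# + Atr.#) | Model | CE (5) | CE (14-1) | CE (14) | CE Hits | RadRQI (5) | RadRQI (14-1) | RadRQI Top-K | RadRQI Hits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IU XRay | LSTM | (41+106) | SentSAT | 0.157 | 0.123 | 0.229 | 5.6 | 0.052 | 0.048 | 0.056 | 6.4 |
| IU XRay | LSTM | (41+106) | SentSAT+ATAG | 0.190 | 0.145 | 0.244 | 7.2 | 0.062 | 0.065 | 0.069 | 10.2 |
| IU XRay | LSTM | (41+106) | SentSAT+ATAG+GATE | 0.216 | 0.178 | 0.263 | 8.6 | 0.066 | 0.068 | 0.079 | 12.4 |
| IU XRay | LSTM | (28+79) | SentSAT | 0.160 | 0.114 | 0.224 | 5.2 | 0.043 | 0.042 | 0.052 | 7.2 |
| IU XRay | LSTM | (28+79) | SentSAT+ATAG | 0.201 | 0.164 | 0.223 | 8.8 | 0.055 | 0.058 | 0.060 | 9.2 |
| IU XRay | LSTM | (28+79) | SentSAT+ATAG+GATE | 0.208 | 0.166 | 0.244 | 8.4 | 0.049 | 0.050 | 0.074 | 10.6 |
| IU XRay | LSTM | (23+64) | SentSAT | 0.209 | 0.168 | 0.230 | 10.2 | 0.036 | 0.040 | 0.060 | 10.0 |
| IU XRay | LSTM | (23+64) | SentSAT+ATAG | 0.225 | 0.182 | 0.239 | 9.2 | 0.048 | 0.054 | 0.067 | 10.4 |
| IU XRay | LSTM | (23+64) | SentSAT+ATAG+GATE | 0.226 | 0.191 | 0.234 | 10.8 | 0.039 | 0.043 | 0.065 | 11.0 |
| IU XRay | Transformer | - | Transformer | 0.124 | 0.112 | 0.310 | 5.0 | 0.052 | 0.038 | 0.072 | 9.0 |
| IU XRay | Transformer | (41+106) | Trans.+ATAG | 0.184 | 0.178 | 0.262 | 10.0 | 0.050 | 0.047 | 0.072 | 13.8 |
| IU XRay | Transformer | (41+106) | Trans.+ATAG+GATE | 0.230 | 0.207 | 0.279 | 10.2 | 0.059 | 0.057 | 0.074 | 12.6 |
| IU XRay | Transformer | (28+79) | Trans.+ATAG | 0.198 | 0.188 | 0.267 | 10.8 | 0.045 | 0.053 | 0.075 | 13.8 |
| IU XRay | Transformer | (28+79) | Trans.+ATAG+GATE | 0.208 | 0.189 | 0.271 | 10.2 | 0.049 | 0.051 | 0.074 | 13.3 |
| IU XRay | Transformer | (23+64) | Trans.+ATAG | 0.166 | 0.160 | 0.251 | 9.8 | 0.043 | 0.045 | 0.069 | 12.6 |
| IU XRay | Transformer | (23+64) | Trans.+ATAG+GATE | 0.200 | 0.174 | 0.256 | 10.4 | 0.037 | 0.041 | 0.064 | 11.6 |
| MIMIC CXR | LSTM | (47+209) | SentSAT | 0.359 | 0.315 | 0.312 | 12.0 | 0.139 | 0.131 | 0.181 | 22.0 |
| MIMIC CXR | LSTM | (47+209) | SentSAT+ATAG | 0.367 | 0.304 | 0.301 | 13.0 | 0.123 | 0.131 | 0.188 | 27.5 |
| MIMIC CXR | LSTM | (47+209) | SentSAT+ATAG+GATE | 0.403 | 0.353 | 0.251 | 13.0 | 0.125 | 0.135 | 0.196 | 28.0 |
| MIMIC CXR | LSTM | (35+165) | SentSAT | 0.253 | 0.291 | 0.286 | 13.0 | 0.136 | 0.133 | 0.202 | 20.0 |
| MIMIC CXR | LSTM | (35+165) | SentSAT+ATAG | 0.323 | 0.267 | 0.265 | 10.0 | 0.135 | 0.149 | 0.227 | 23.3 |
| MIMIC CXR | LSTM | (35+165) | SentSAT+ATAG+GATE | 0.378 | 0.334 | 0.331 | 13.0 | 0.135 | 0.141 | 0.223 | 24.6 |
| MIMIC CXR | LSTM | (26+129) | SentSAT | 0.288 | 0.293 | 0.288 | 13.0 | 0.122 | 0.120 | 0.170 | 21.0 |
| MIMIC CXR | LSTM | (26+129) | SentSAT+ATAG | 0.292 | 0.333 | 0.328 | 13.0 | 0.110 | 0.136 | 0.189 | 27.0 |
| MIMIC CXR | LSTM | (26+129) | SentSAT+ATAG+GATE | 0.277 | 0.304 | 0.327 | 14.0 | 0.116 | 0.130 | 0.191 | 27.0 |
| MIMIC CXR | Transformer | - | Transformer | 0.279 | 0.269 | 0.267 | 13.0 | 0.095 | 0.087 | 0.188 | 30.0 |
| MIMIC CXR | Transformer | (47+209) | Trans.+ATAG | 0.419 | 0.368 | 0.363 | 14.0 | 0.164 | 0.166 | 0.228 | 36.0 |
| MIMIC CXR | Transformer | (47+209) | Trans.+ATAG+GATE | 0.417 | 0.377 | 0.372 | 14.0 | 0.149 | 0.172 | 0.266 | 41.0 |
| MIMIC CXR | Transformer | (35+165) | Trans.+ATAG | 0.408 | 0.411 | 0.400 | 14.0 | 0.140 | 0.151 | 0.235 | 34.0 |
| MIMIC CXR | Transformer | (35+165) | Trans.+ATAG+GATE | 0.422 | 0.393 | 0.387 | 13.0 | 0.149 | 0.149 | 0.243 | 35.5 |
| MIMIC CXR | Transformer | (26+129) | Trans.+ATAG | 0.400 | 0.372 | 0.368 | 14.0 | 0.139 | 0.147 | 0.219 | 37.0 |
| MIMIC CXR | Transformer | (26+129) | Trans.+ATAG+GATE | 0.410 | 0.374 | 0.369 | 12.0 | 0.153 | 0.153 | 0.221 | 36.0 |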
TABLE V: Performance comparison of report generation models with different sizes of ATAG on two clinical accuracy metrics. The best scores are in bold face and the second best are underlined. “Top-K” are set to top-25 and top-50 in IU XRay and MIMIC CXR, respectively.

8.6.2 Dataset-Specific Abnormalities and Attributes

Different datasets have their own sets of abnormalities and attributes where some are common and some specific to the corresponding dataset. To show the importance of taking care of dataset-specific abnormalities and attributes, we construct ATAG(32+101) which is composed of 32 abnormalities and 101 attributes shared between ATAG(41+106) learned for IU XRay and ATAG(47+209) learned for MIMIC CXR. We compare the performance of the generic model and the dataset-specific ones, as shown in Fig. 6.
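One way to picture the construction of the shared graph is as an intersection of the two dataset-specific graphs. The sketch below assumes each ATAG is represented as a mapping from an abnormality to its set of attributes and that attributes are intersected per shared abnormality; both the representation and the per-abnormality intersection are assumptions, since the exact procedure is not detailed here.

```python
def shared_graph(atag_a, atag_b):
    """Keep abnormalities present in both graphs, with their common attributes.

    atag_a, atag_b: dict mapping an abnormality to a set of attributes.
    This representation is hypothetical and only illustrates the idea.
    """
    shared_abnormalities = atag_a.keys() & atag_b.keys()
    return {abn: atag_a[abn] & atag_b[abn] for abn in shared_abnormalities}


iu_graph = {"opacity": {"patchy", "left"}, "granuloma": {"calcified"}}
mimic_graph = {"opacity": {"patchy", "bibasilar"}, "edema": {"mild"}}
generic = shared_graph(iu_graph, mimic_graph)
```

In this toy example only "opacity" with the attribute "patchy" survives, while "granuloma" and "edema" remain dataset-specific nodes.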

With reference to the generic model ATAG(32+101), 22.8% and 3.3% improvement on RadRQI(TopK) can be obtained by modeling the specific abnormalities and attributes based on SentSAT+ATAG(+GATE). For Trans.+ATAG(+GATE), improvement of 9.9% and 29.5% can be achieved. Modeling the dataset-specific abnormalities and attributes also improves the accuracy of certain abnormalities. We observe 15.2% and 6.1% improvement on average on CE(14-1) by integrating ATAG(+GATE) with LSTM- and Transformer-based decoders.

Note that only limited improvements are observed for the RadRQI(14-1) scores. A possible explanation is that the number of dataset-specific attributes of these abnormalities is small, so the related generated reports seldom differ, which makes the RadRQI(14-1) scores similar.

(a) IU XRay dataset.
(b) MIMIC CXR dataset.
Fig. 6: Performance of SentSAT+ATAG+GATE and Trans.+ATAG+GATE based on a common set of abnormalities or the set of abnormalities specific to each of the two datasets.

8.7 Case Study

Fig. 7 shows two cases from the IU X-Ray dataset. For each case, we visualize the ground truth report and the reports generated by baseline models and the proposed ATAG-based models.

We apply a post-processing step of removing duplicated or short sentences from the generated reports, and show the disease/abnormality keywords extracted by CheXpert labeling toolkit used by CE metrics and RadLex+RadGraph used by RadRQI from the ground truth and the generated reports accordingly.
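The post-processing step can be sketched as follows. The sentence splitting on periods, the lowercasing, and the `min_words` threshold are assumptions made for the example; the paper does not state how "short" is defined.

```python
def clean_report(text, min_words=3):
    """Drop duplicated and very short sentences from a generated report.

    Splitting on "." and the `min_words` cut-off are illustrative assumptions.
    """
    seen = set()
    kept = []
    for sentence in text.split("."):
        normalized = " ".join(sentence.lower().split())
        if len(normalized.split()) < min_words or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return ". ".join(kept) + "." if kept else ""


raw = "the heart is normal . the heart is normal . no . lungs are clear ."
cleaned = clean_report(raw)
```

Here the repeated sentence and the one-word fragment are removed, while the first occurrence of each remaining sentence is kept in its original order.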

As illustrated, we observe that integrating ATAG can generate more accurate abnormalities, while the gating mechanism is able to further increase the accuracy of associated attributes for the mentioned abnormality. The comparison also suggests that utilizing the attributed abnormality embedding is able to facilitate detecting the correct abnormalities and associated attributes.

Yet, it is also observed that some abnormalities cannot be well distinguished, for several reasons. For example, in the first case of Fig. 7, SentSAT+ATAG, Trans.+ATAG and Trans.+ATAG+GATE report “Cardiomegaly”, which could be caused by the dominant white regions in the center of the X-ray image that make it hard for the models to distinguish the heart outline from the region below. Also, when the visual pattern of “atelectasis” is detected, Trans.+ATAG+GATE generates “less severe consolidation in the right lower lobe is either pneumonia or atelectasis”, which mentions three possible abnormality observations (consolidation, pneumonia and atelectasis) that have similar patterns. This suggests that the generation model attempts to point out more potential abnormalities so that, as far as possible, the abnormal observations that are present are not missed.

Fig. 7: Illustration of reports generated by the baseline models and by models integrated with ATAG (yellow background color) and GATE (orange background color) on the IU XRay dataset. The first section (1st row) shows the ground truth reports, the second section (2nd-5th rows) the reports generated by LSTM-based models, and the third section (6th-9th rows) the reports generated by Transformer-based models. The correct abnormality and attribute terms are highlighted in green and blue. The expert-labeled annotations provided by [7] are also attached.

9 Conclusion

In this paper, we propose to automatically construct a fine-grained attributed abnormality graph (ATAG) and the corresponding embedding for representing abnormalities in X-ray images. In particular, an ATAG consists of an abnormality graph in which each node is paired with a specific attribute graph. To the best of our knowledge, this is the first attempt to construct the detailed graph structure and then the embedding automatically from annotated reports.

A hierarchical attention mechanism is proposed to aggregate the abnormality and attribute embeddings, and a gate mechanism is employed to integrate ATAG embedding into both LSTM- and Transformer-based decoder for radiology report generation. We performed comprehensive empirical evaluation on the benchmark datasets. Our experiment results show that the proposed ATAG-based deep model can improve the accuracy for both abnormality classification and radiology report generation compared to the state-of-the-art models. Potential future directions include consideration of ambiguous and potentially incorrect annotations, as well as integration of EHR data of different modalities to further achieve clinical accuracy.


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §2.
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §6.4.
  • [3] S. Biswal, C. Xiao, L. Glass, B. Westover, and J. Sun (2020) CLARA: clinical report auto-completion. In Proceedings of The 29th International World Wide Web Conference, pp. 541–550. Cited by: §1, §2.2.
  • [4] Z. Chen, Y. Shen, Y. Song, and X. Wan (2021) Cross-modal memory networks for radiology report generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5904–5914. Cited by: §2.4, §8.2, §8.5, TABLE IV.
  • [5] Z. Chen, Y. Song, T. Chang, and X. Wan (2020) Generating radiology reports via memory-driven transformer. In Proceedings of the Conference on Empirical Methods in Natural Language Processing 2020, pp. 1439–1449. Cited by: §1, §2.4, §7, §8.1, §8.2, §8.5, §8.5, TABLE IV.
  • [6] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara (2020) Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10578–10587. Cited by: §8.2, TABLE IV.
  • [7] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald (2016) Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23 (2), pp. 304–310. Cited by: §1, §7, Fig. 7, §8.1, TABLE II.
  • [8] A. Graves (2012) Long short-term memory. Supervised sequence labelling with recurrent neural networks, pp. 37–45. Cited by: §6.3.
  • [9] Z. Han, B. Wei, S. Leung, J. Chung, and S. Li (2018) Towards automatic report generation in spine radiology using weakly supervised framework. In Proceedings of the 21st International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 185–193. Cited by: §1, §2.1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §6.4.
  • [11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §5, §5, §8.2.
  • [12] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Vol. 33, pp. 590–597. Cited by: §7, §8.2, TABLE III.
  • [13] S. Jain, A. Agrawal, A. Saporta, S. Q. Truong, D. N. Duong, T. Bui, P. Chambon, Y. Zhang, M. P. Lungren, A. Y. Ng, et al. (2021) RadGraph: extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463. Cited by: §4.2.
  • [14] B. Jing, Z. Wang, and E. P. Xing (2019) Show, describe and conclude: on exploiting the structure information of chest X-ray reports. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6570–6580. Cited by: §1.
  • [15] B. Jing, P. Xie, and E. P. Xing (2018) On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 2577–2586. Cited by: §1, §2.2, §8.2, TABLE IV.
  • [16] A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Deng, R. G. Mark, and S. Horng (2019) MIMIC-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6 (1), pp. 317. Cited by: §1, §4.2, §7, §8.1, TABLE II.
  • [17] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei (2017) A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 317–325. Cited by: §2.
  • [18] C. Y. Li, X. Liang, Z. Hu, and E. P. Xing (2019) Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 6666–6673. Cited by: §1, §2.3, §6.2, §8.1, §8.1.
  • [19] C. Y. Li, X. Liang, Z. Hu, and E. P. Xing (2018) Hybrid retrieval-generation reinforced agent for medical image report generation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 1530–1540. Cited by: §1, §1, §8.5.
  • [20] C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §8.1.
  • [21] F. Liu, X. Wu, S. Ge, W. Fan, and Y. Zou (2021) Exploring and distilling posterior and prior knowledge for radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13753–13762. Cited by: §1, §1, §2.3, §6.2.
  • [22] F. Liu, C. Yin, X. Wu, S. Ge, P. Zhang, and X. Sun (2021-08) Contrastive attention for automatic chest X-ray report generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 269–280. External Links: Link, Document Cited by: §2.4.
  • [23] F. Liu, C. You, X. Wu, S. Ge, X. Sun, et al. (2021) Auto-encoding knowledge graph for unsupervised medical report generation. Advances in Neural Information Processing Systems 34. Cited by: §1, §6.2.
  • [24] G. Liu, T. H. Hsu, M. McDermott, W. Boag, W. Weng, P. Szolovits, and M. Ghassemi (2019) Clinically accurate chest X-ray report generation. In Proceedings of Machine Learning for Healthcare Conference 2019, pp. 249–269. Cited by: §1.
  • [25] J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 375–383. Cited by: §8.2, TABLE IV.
  • [26] M. Martínez-Romero, C. Jonquet, M. J. O’connor, J. Graybeal, A. Pazos, and M. Musen (2017) NCBO ontology recommender 2.0: an enhanced approach for biomedical ontology recommendation. Journal of Biomedical Semantics 8 (21). Cited by: §4.2, §7.
  • [27] Y. Miura, Y. Zhang, E. Tsai, C. Langlotz, and D. Jurafsky (2021) Improving factual completeness and consistency of image-to-text radiology report generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5288–5304. Cited by: §1, §2.2, §7, §8.1, §8.5.
  • [28] I. Najdenkoska, X. Zhen, M. Worring, and L. Shao (2021) Variational topic inference for chest x-ray report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 625–635. Cited by: §8.5.
  • [29] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §8.1.
  • [30] H. Park, K. Kim, J. Yoon, S. Park, and J. Choi (2020) Feature difference makes sense: a medical image captioning model exploiting feature difference and tag information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 95–102. Cited by: §2.2.
  • [31] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7008–7024. Cited by: §2.
  • [32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §6.4.
  • [33] T. Syeda-Mahmood, K. C. Wong, Y. Gur, J. T. Wu, A. Jadhav, S. Kashyap, A. Karargyris, A. Pillai, A. Sharma, A. B. Syed, et al. (2020) Chest x-ray report generation through fine-grained label learning. In Proceedings of the 23rd International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 561–571. Cited by: §1, §2.2.
  • [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §6.4, TABLE IV.
  • [35] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575. Cited by: §8.1.
  • [36] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §3, §5.
  • [37] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §2.
  • [38] X. Xie, Y. Xiong, P. S. Yu, K. Li, S. Zhang, and Y. Zhu (2019) Attention-based abnormal-aware fusion network for radiology report generation. In Proceedings of the 24th International Conference on Database Systems for Advanced Applications, pp. 448–452. Cited by: §1, §2.1.
  • [39] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, pp. 2048–2057. Cited by: §2, §8.2, TABLE IV.
  • [40] Y. Xue, T. Xu, L. Rodney Long, Z. Xue, S. Antani, G. R. Thoma, and X. Huang (2018) Multimodal recurrent model with attention for automated radiology report generation. In Proceedings of the 21st International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 457–466. Cited by: §1, §2.1.
  • [41] X. Yang, M. Ye, Q. You, and F. Ma (2021) Writing by memorizing: hierarchical retrieval-based medical report generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5000–5009. Cited by: §1, §2.1.
  • [42] D. You, F. Liu, S. Ge, X. Xie, J. Zhang, and X. Wu (2021) AlignTransformer: hierarchical alignment of visual regions and disease tags for medical report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 72–82. Cited by: §2.4, §8.5.
  • [43] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4651–4659. Cited by: §2.
  • [44] J. Yuan, H. Liao, R. Luo, and J. Luo (2019) Automatic radiology report generation based on multi-view image fusion and medical concept enrichment. In Proceedings of the 22nd International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 721–729. Cited by: §1, §2.2, §8.2, TABLE IV.
  • [45] Y. Zhang, X. Wang, Z. Xu, Q. Yu, A. Yuille, and D. Xu (2020) When radiology report generation meets knowledge graph. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 12910–12917. Cited by: §1, §2.3, §5, §5, §6.2, §7, §7, §8.1, §8.2, §8.2, §8.3, TABLE III, TABLE IV.