
Chest X-ray Report Generation through Fine-Grained Label Learning

by Tanveer Syeda-Mahmood, et al.

Obtaining automated preliminary read reports for common exams such as chest X-rays would expedite clinical workflows and improve operational efficiency in hospitals. However, the quality of reports generated by current automated approaches is not yet clinically acceptable, as they cannot ensure the correct detection of a broad spectrum of radiographic findings, nor describe them accurately in terms of laterality, anatomical location, severity, etc. In this work, we present a domain-aware automatic chest X-ray radiology report generation algorithm that learns fine-grained descriptions of findings from images and uses their patterns of occurrence to retrieve and customize similar reports from a large report database. We also develop an automatic labeling algorithm for assigning such descriptors to images, and build a novel deep learning network that recognizes both coarse and fine-grained descriptions of findings. The resulting report generation algorithm significantly outperforms the state of the art on established score metrics.


page 1

page 2

page 3

page 4



1 Introduction

Chest X-rays are the most common imaging modality read by radiologists in hospitals and tele-radiology practices today. With advances in artificial intelligence, there is now the promise of obtaining automated preliminary reads that can expedite clinical workflows, improve accuracy, and reduce overall costs. Current automated report generation methods are based on image captioning approaches from computer vision [21, 25] and use an encoder-decoder architecture in which a Convolutional Neural Network (CNN) encodes images into a set of semantic topics [10] or a limited set of findings [29], and a Recurrent Neural Network (RNN) decoder or a hierarchical LSTM generates the most likely sentence given the topics [7, 19, 12, 23, 14, 10]. Other approaches have leveraged template sentences to aid paraphrasing and report generation [14, 4, 13]. Recent approaches have also emphasized the role of clinical accuracy, measured loosely through clinical correlation between disease states in the objective functions [16].

Despite this progress, the quality of reports generated by current approaches is not yet clinically acceptable: they ensure neither the correct detection of a comprehensive set of findings nor the description of their clinical attributes such as laterality, anatomical location, severity, etc. The emphasis is usually more on report language generation than on the visual detection of findings. In this paper we take a new approach in which we train deep learning networks on a large number of detailed finding labels that represent an in-depth and comprehensive characterization of findings in chest X-ray images. An initial vocabulary of core finding labels was derived through a multi-year chest X-ray lexicon building effort involving several radiologists and clinical experts. The detailed finding labels were then automatically derived from the associated radiology reports through a concept detection and phrasal grouping algorithm that uses natural language analysis to attach detailed characterization modifiers to the initially identified core findings. The resulting labels with large image support were used to train a novel deep learning network based on feature pyramids. Given a new chest X-ray image, the joint occurrence of detailed finding labels is predicted as a pattern vector from the learned model and matched against a pre-assembled database of label patterns and their associated reports. Finally, the retrieved report is post-processed to remove mentioned findings whose evidence is absent from the predicted label pattern.

Figure 1: Illustration of the finer description labels for capturing the essence of reports.

2 Describing images through fine finding labels (FFL)

Consider the chest X-ray image shown in Figure 1a; its associated report is shown in Figure 1b. In order to automatically produce such sentences from analyzing images, we need image labels that cover not only the core finding, such as opacity, but also its laterality, location, size, severity, appearance, etc. Specifically, a full description of a finding can be denoted by a fine finding label (FFL) of the form

F = T | N | C | M1 | ... | Mk

where F is the FFL label, T is the finding type, N indicates a positive or negative finding (i.e., present versus absent), C is the core finding itself, and M1, ..., Mk are one or more of the possible finding modifiers. The finding types in chest X-rays are adequately covered by six major categories, namely anatomical findings, tubes and lines and their placements, external devices, viewpoint-related issues, and implied diseases associated with findings. The vocabulary of core findings and possible modifiers was semi-automatically assembled through a multi-year chest X-ray lexicon development process in which a team of 4 clinicians, including 3 radiologists, iteratively searched through best-practice literature such as the Fleischner Society guidelines [6] and everyday-use terms to expand the vocabulary by examining a large dataset of 220,000 radiology reports in a vocabulary building tool [2], addressing abbreviations, misspellings, semantic equivalence, and ontological relationships. Currently, the lexicon consists of over 11,000 unique terms covering the space of 78 core findings and 9 modifiers, and represents the largest set of core findings assembled so far. The set of modifiers associated with each core finding also depends on the finding type, and the FFL label syntax captures these differences across finding types.
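As a concrete illustration of this syntax, the FFL structure can be sketched as a small data type; the field names and string rendering here are our own, not the paper's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FFLLabel:
    """One fine finding label: finding type, polarity, core finding, modifiers.
    Field names are illustrative; the paper's own syntax is shown in Figure 1."""
    finding_type: str      # e.g. "anatomical finding", "tubes and lines"
    positive: bool         # present (True) vs. absent/negated (False)
    core_finding: str      # e.g. "alveolar consolidation"
    modifiers: tuple = ()  # e.g. ("stable", "left", "lower lobe")

    def __str__(self):
        sign = "pos" if self.positive else "neg"
        return "|".join([self.finding_type, sign, self.core_finding, *self.modifiers])

label = FFLLabel("anatomical finding", True, "alveolar consolidation", ("stable",))
```

Rendering each label as a delimited string makes labels directly comparable, which is what the pattern-matching stages later in the paper rely on.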

The FFL labels capture the essence of a report adequately, as can be seen by comparing Figure 1c with the actual report in Figure 1b. Further, if the FFL labels are similar, a similarity is also implied in the associated reports: Figures 1d-g show examples of similar reports, all characterized by similar FFL patterns. Thus, if we can infer the FFL labels from the visual appearance of findings in chest X-ray images, we can expect to generate an adequate report directly from the labels.

2.1 Extraction of FFL Labels from reports

The algorithm for extracting FFL labels from sentences in reports consists of 4 steps, namely, (a) core finding and modifier detection, (b) phrasal grouping, (c) negation sense detection, (d) pattern completion.

The vocabulary of core findings from the lexicon, together with their synonyms, was used to detect core concepts in the sentences of reports using the vocabulary-driven concept extraction algorithm described in [5]. To associate modifiers with relevant core findings, we used a natural language parser, the ESG parser [17], which performs word tokenization and morpholexical analysis to create a dependency parse tree for the words in a sentence, as shown in Figure 2. The initial grouping of words is supplied directly by the parse tree, such as the grouping of the terms 'alveolar' and 'consolidation' into the single term 'alveolar consolidation' shown in Figure 2. Further phrasal grouping is done by clustering the lemmas using the word identifiers specified in the dependency tree. For this, a connected component algorithm is run on the word positions in slots, skipping over unknowns (marked with u in tuples). This allows all modifiers present within a phrasal group containing a core finding to be automatically associated with that finding; for example, the modifier 'stable' is associated with the core finding 'alveolar consolidation' in Figure 2. Modifiers in phrasal groups that do not contain a core finding are associated with the adjacent phrasal groups that do.
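The grouping step can be sketched as follows; the toy word list, dependency edges, and function names are illustrative stand-ins, not the actual ESG parser output:

```python
# Sketch of phrasal grouping: words are grouped into phrases via connected
# components over dependency links, and modifiers that share a group with a
# core finding are attached to it.
def connected_components(n, edges):
    """Union-find over n word positions linked by dependency edges."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def associate_modifiers(words, edges, core_findings, modifiers):
    """Attach each modifier to the core finding(s) in its phrasal group."""
    assoc = {}
    for group in connected_components(len(words), edges):
        cores = [words[i] for i in group if words[i] in core_findings]
        mods = [words[i] for i in group if words[i] in modifiers]
        for c in cores:
            assoc[c] = mods
    return assoc

# "stable alveolar consolidation": the parser links 'stable' to the finding.
words = ["stable", "alveolar consolidation"]
assoc = associate_modifiers(words, [(0, 1)], {"alveolar consolidation"}, {"stable"})
```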

To determine if a core finding is positive or negative (e.g., "no pneumothorax"), we use a two-step approach that combines language structuring with vocabulary-based negation detection, as described in [5]. The negation pattern detection algorithm identifies words within the scope of negation by iteratively expanding the neighborhood of seed negation terms while traversing the dependency parse tree of a sentence.
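A minimal sketch of this scope expansion, assuming the dependency tree is available as an adjacency map over word indices; the tree, seed indices, and hop limit below are illustrative, not the exact procedure of [5]:

```python
from collections import deque

def negation_scope(tree, seeds, max_hops=2):
    """Breadth-first expansion of the neighborhood of seed negation terms.
    tree: adjacency dict {word index: [neighbor indices]} over the parse tree;
    seeds: indices of negation cue words (e.g. 'no', 'without')."""
    scope = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nb in tree.get(node, []):
            if nb not in scope:
                scope.add(nb)
                frontier.append((nb, depth + 1))
    return scope

# "no pneumothorax": 'no' (index 0) is linked to 'pneumothorax' (index 1),
# so the finding falls inside the negation scope and is marked negative.
scope = negation_scope({0: [1], 1: [0]}, seeds=[0])
```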

The last step completes the FFL pattern using a priori knowledge captured in the lexicon about the anatomical locations associated with findings, when these are not specified in the sentence itself. This is shown in Figure 2, where the term "alveoli" is inserted from the knowledge of the location of the finding 'alveolar consolidation'. Thus the final FFL label produced may carry more information than the original sentence from which it was extracted. In addition, the name of the finding may be ontologically rolled up to a core finding, as seen in Figure 1 for 'emphysema'.

Figure 2: Illustration of the dependency parse tree and phrasal grouping.

The FFL label extraction algorithm was applied to all sentences in a collection of 232,964 reports derived from the MIMIC-4 [11] and NIH [22] datasets, generating all possible FFL patterns from the Findings and Impression sections of the reports. A total of 203,938 sentences were processed, resulting in 102,135 FFL labels. By retaining only those labels with a support of at least 100 images, a total of 457 FFL labels were selected. As shown in the Results section, the label extraction process is highly accurate, so that spot-check clinical validation is sufficient for use in image labeling. Since the FFL labels were seeded by clinically selected core findings, nearly 83% of all FFL labels extracted could be mapped to their nearest counterpart in the 457-label set. Thus the set of 457 labels was found sufficient to cover a wide variety of spoken sentences and was used as the label set for building deep learning models. Of these, 78 were the original core labels (called CFL labels) given by clinicians, and the remaining were finer description labels with modifiers extracted automatically.

2.2 Learning FFL labels from images

The learning of FFL labels from chest X-rays is a fine-grained classification problem for which the single networks used for common computer vision problems may not yield the best performance, particularly since large training sets are still difficult to obtain. The work in [18] shows that concatenating ImageNet-pretrained features from different networks can improve classification on microscopic images. Following this idea, we combine the ImageNet-pretrained features from different models through the Feature Pyramid Network of [15]. This forms a multi-model feature pyramid that combines features at multiple scales. VGGNet (16 layers) [20] and ResNet (50 layers) [8] are used as the feature extractors. As natural images and chest X-rays are in different domains, low-level features are used: from the VGGNet, the feature maps with 128, 256, and 512 channels are used, concatenated with the feature maps from the ResNet of the same spatial sizes, which have 256, 512, and 1024 channels.
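The channel-wise concatenation at each pyramid level can be sketched as follows; the spatial size below is illustrative, and in the actual network the inputs come from the pretrained VGGNet and ResNet extractors:

```python
import numpy as np

def concat_pyramid_level(vgg_feat, resnet_feat):
    """Concatenate two feature maps of the same spatial size along the
    channel axis (channels-first layout), forming one pyramid level."""
    assert vgg_feat.shape[1:] == resnet_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([vgg_feat, resnet_feat], axis=0)

# e.g. one level: 128 VGG channels + 256 ResNet channels -> 384 channels
level = concat_pyramid_level(np.zeros((128, 56, 56)), np.zeros((256, 56, 56)))
```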

We propose dilated blocks to learn high-level features from the extracted ImageNet features. Each dilated block is composed of dilated convolutions for multi-scale features [27], a skip connection with identity mapping to improve convergence [9], and spatial dropout to reduce overfitting. Group normalization (16 groups) [24], whose performance is independent of the training batch size, is used with ReLU. Dilated blocks with different feature channels are cascaded with max pooling to learn more abstract features. Instead of global average pooling, second-order pooling is used, which has proven effective for fine-grained classification [28]. Second-order pooling maps the features to a higher-dimensional space where they can be more separable. Following [28], second-order pooling is implemented as a 1×1 convolution followed by global square pooling (Figure 3).
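A minimal NumPy sketch of this pooling step, with the 1×1 convolution written as a channel-mixing matrix multiply; the shapes and random inputs are illustrative, not the network's actual dimensions:

```python
import numpy as np

def second_order_pool(feat, weight):
    """Second-order pooling in the style of [28]:
    feat:   (C, H, W) feature map;
    weight: (D, C) kernel of a 1x1 convolution (a per-pixel channel mix).
    Returns a (D,) vector: the spatial average of the squared responses."""
    c, h, w = feat.shape
    mixed = weight @ feat.reshape(c, h * w)  # 1x1 convolution over all pixels
    return (mixed ** 2).mean(axis=1)         # global square pooling

rng = np.random.default_rng(0)
pooled = second_order_pool(rng.standard_normal((8, 4, 4)),
                           rng.standard_normal((16, 8)))
```

Squaring the mixed responses makes each output a second-order statistic of the input channels, which is what lifts the features into the higher-dimensional, more separable space described above.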

Figure 3: Illustration of the custom deep learning network developed for recognizing a large number of labels.

Image augmentation with rigid transformations is used to avoid overfitting. As most of an image should remain included, we limit the augmentation to small rotations and shifts (10%). The probability of an image being transformed is 80%. The Nadam optimizer is used with a learning rate of 2×10^(−…), a batch size of 48, and 20 epochs. To ensure efficient learning, we developed two instances of this network, one for the core finding labels (CFL labels) and the other for the detailed FFL labels with a support of at least 100 training images, to exploit the mutually reinforcing nature of the coarse and fine labels. Due to the variability in the size of the dataset per FFL label, the AUC per FFL label is not always a good indicator of precision at the per-image level, as it is dominated by the negative examples. To report as few irrelevant findings as possible while still detecting all critical findings within an image, we select operating points on the per-label ROC curves by optimizing the F1 score, a well-known measure of accuracy, given by

F1 = 2PR / (P + R)

where P and R are the precision and recall at a candidate operating point.
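The per-label operating-point selection can be sketched as a simple threshold sweep over predicted scores; the scores and labels below are toy values:

```python
def best_threshold(scores, labels):
    """Sweep candidate thresholds over the predicted scores and return the
    one maximizing F1 = 2*TP / (2*TP + FP + FN), i.e. 2PR / (P + R)."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(s >= t and y for s, y in zip(scores, labels))
        fp = sum(s >= t and not y for s, y in zip(scores, labels))
        fn = sum(s < t and y for s, y in zip(scores, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy example: two positives scored high, two negatives scored low.
t, f1 = best_threshold([0.9, 0.8, 0.4, 0.2], [True, True, False, False])
```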


2.3 FFL Pattern-Report Database Creation

Using the FFL label detection algorithm, we can describe a report (its relevant sections) as a binary pattern vector v = (b1, ..., bN), where bi = 1 if the i-th FFL label is present in the report and bi = 0 otherwise; here N is the number of FFL labels used in training the deep learning models. During the database creation process, we collect all reports characterized by the same binary pattern vector and rank them by the support provided by their constituent sentences. Let S(v) be the collection of reports spanned by a pattern vector v. Each report R in S(v) is ranked by n(R), the number of relevant sentences in R spanned by one or more of the FFL labels in the pattern v. The highest-ranked reports are then stored in a database as the reports associated with the binary pattern vector.
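The database construction can be sketched as follows; the pattern tuples, sentence counts, and report texts are toy values, and the function name is ours:

```python
from collections import defaultdict

def build_pattern_db(reports, top_k=1):
    """Group reports by their binary FFL pattern and keep the top_k reports
    per pattern, ranked by the number of relevant sentences the pattern's
    labels span (n(R) in the text).
    reports: iterable of (pattern_tuple, n_relevant_sentences, report_text)."""
    by_pattern = defaultdict(list)
    for pattern, n_rel, text in reports:
        by_pattern[pattern].append((n_rel, text))
    return {p: [t for _, t in sorted(rs, reverse=True)[:top_k]]
            for p, rs in by_pattern.items()}

db = build_pattern_db([
    ((1, 0, 1), 3, "report A"),
    ((1, 0, 1), 5, "report B"),  # more relevant sentences -> ranked first
    ((0, 1, 0), 2, "report C"),
])
```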

2.4 Report assembly

The overall report generation workflow is illustrated in Figure 5. An image is fed to the two deep learning networks built for the CFL and FFL patterns, and their predictions are thresholded at the operating points chosen by the image-based F1-score optimization. The resulting pattern vectors are combined into a consolidated FFL pattern vector q. The best matching reports are then derived from the semantically nearest pattern vectors in the database. The semantic distance between the query FFL bit pattern vector q and a matching pattern vector p from the database is given by

D(q, p) = Σ_i w_i · |q_i − p_i|

where w_i is the weight associated with the i-th FFL label.

Figure 4: Illustration of quality of reports generated by different methods.

A criticality rank for each core finding on a scale of 1 to 10 was supplied by the clinicians; it was normalized and used to weight the clinical importance of a finding during matching. Once the matching FFL pattern is determined, the highest-ranked report associated with that pattern, per the ranking above, is retrieved as the best matching report. Finally, we drop all sentences from the retrieved report whose evidence cannot be found in the FFL label pattern of the query, thus achieving the variety needed in returned reports per query. Although with 457 FFL labels the number of possible binary patterns would be large (2^457), due to the sparseness of 5-7 findings per report, the actual number of distinct binary patterns in the database of over 232,000 reports was only 924, corresponding to 5,246 distinct sentences in the precomputed ranked lists across all patterns. Thus the lookup per predicted pattern is a fairly trivial operation that is O(1) with indexing and takes less than 5 msec.
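The matching step can be sketched as a weighted mismatch search; reading the distance as a criticality-weighted Hamming distance over the bit patterns is our interpretation of the text, and the weights and patterns below are toy values:

```python
def semantic_distance(query, pattern, weights):
    """Criticality-weighted mismatch count between two binary FFL patterns."""
    return sum(w * abs(q - p) for q, p, w in zip(query, pattern, weights))

def nearest_pattern(query, db_patterns, weights):
    """Return the stored pattern semantically closest to the query pattern."""
    return min(db_patterns, key=lambda p: semantic_distance(query, p, weights))

weights = [1.0, 0.3, 0.8]  # normalized clinician-supplied criticality per label
best = nearest_pattern([1, 0, 1], [(1, 1, 1), (1, 0, 0), (0, 0, 1)], weights)
```

Here the extra low-criticality finding in (1, 1, 1) costs only 0.3, so it beats the pattern missing the high-criticality third finding.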

3 Results

We collected datasets from three separate sources, namely, the MIMIC [11] dataset of over 220,000 reports with associated frontal images, 2964 unique Indiana reports [3] with associated images and a set of 10,000 NIH [22] released images re-read by our team of radiologists to produce a total of 232,964 image-report pairs for our experiments.

Dataset         Labels       Train     Validate   Test     Avg. AUC   Weighted AUC
MIMIC-4 + NIH   CFL labels   249,286   35,822     70,932   0.805      0.841
MIMIC-4 + NIH   FFL labels   75,613    10,615     20,941   0.729      0.716
Table 1: Datasets and performance of the fine-grained classification models for CFL and FFL labels (the last column is the average of AUCs weighted by the number of samples per category).

Evaluating FFL label extraction accuracy: We evaluated the accuracy of FFL label extraction by noting the number of findings missed and overcalled (which includes errors in negation sense detection), as well as the correctness and completeness of the association of modifiers with the relevant core findings. The results of the evaluation on the Indiana dataset [3] by our clinicians are shown in Table 2. As can be seen, the FFL label extraction is highly accurate in terms of the coverage of findings, with around a 3% error rate mostly due to negation sense detection. Further, the association of modifiers to core findings produced by the phrasal grouping algorithm is also accurate, with over 99% precision and recall.

FFL Label Prediction from Deep Learning: The training, validation, and test image sets used for building the CFL and FFL models were drawn from the MIMIC-4 and NIH datasets, as shown in Table 1, which also reports the AUC averaged over all CFL labels and all FFL labels. In addition, using the F1-score-based optimization, the mean image-based average precision for CFL labels was 73.67%, while the recall was 70.6%.

Figure 5: Illustration of the report generation algorithm.
reports    relevant    FFL patterns   missed     overcall             incorrect association   missed
analyzed   sentences   extracted      findings   (negated findings)   of modifiers            modifiers
2964       3046        5245           0          168                  49                      11
Table 2: The accuracy of FFL label extraction from reports.

Evaluation of report generation: Due to the ontological mapping used to abstract the description of findings, the match produced by our approach is at a semantic rather than lexical level in comparison to other approaches. Figure 4 shows reports produced manually and automatically by our approach and by a comparative approach implemented from a visual attention-based captioning model [21]. We compared the performance of our algorithm with several state-of-the-art baselines from recent literature [21, 26, 13, 16, 10, 29]. These span a range of approaches, from visual attention-based captioning [21], knowledge-driven report generation [13], and clinically accurate report generation [16], to a strawman approach using a set of template sentences manually chosen by clinicians for the FFL labels instead of the nearest-report selection algorithm described earlier. Although we have tested our algorithm on a very large number of images from the combined MIMIC-NIH data, for purposes of comparison we show results on the same Indiana test dataset most commonly used by other algorithms, as reported in [10]. The resulting performance using the popular scoring metrics is shown in Table 3: our algorithm outperforms the other approaches on all the established scoring metrics.

Conclusions: We presented an explainable AI approach to semantically correct radiology report generation. The results show superior performance both because of the detailed descriptive nature of labels, and due to a statistically informed report retrieval process that ensures a semantic match.

Vis-Att [21]               0.39   0.25   0.16   0.11   0.16   0.32
MM-Att [26]                0.46   0.35   0.27   0.19   0.27   0.36
KERP [13]                  0.48   0.32   0.22   0.16   -      0.33
Template-based             0.28   0.29   0.32   0.27   0.35   0.34
Clinically Accurate [16]   0.35   0.22   0.15   0.10   -      0.45
Co-Att [10]                0.51   0.39   0.30   0.25   0.21   0.44
Jiebo Luo [29]             0.53   0.37   0.31   0.25   0.34   0.45
CFL-only                   0.49   0.39   0.36   0.32   0.48   0.52
FFL+CFL-based (ours)       0.56   0.51   0.50   0.49   0.55   0.58
Table 3: Comparative performance of report generation by various methods (columns are the established scoring metrics; higher is better).