Multi-task Learning for Chest X-ray Abnormality Classification on Noisy Labels

by Sebastian Guendel, et al.

Chest X-ray (CXR) is the most common X-ray examination performed in daily clinical practice for the diagnosis of various heart and lung abnormalities. The large amount of data to be read and reported, with 100+ studies per day for a single radiologist, poses a challenge in maintaining consistently high interpretation accuracy. In this work, we propose a method for the classification of different abnormalities based on CXR scans of the human body. The system is based on a novel multi-task deep learning architecture that in addition to the abnormality classification, supports the segmentation of the lungs and heart and classification of regions where the abnormality is located. We demonstrate that by training these tasks concurrently, one can increase the classification performance of the model. Experiments were performed on an extensive collection of 297,541 chest X-ray images from 86,876 patients, leading to a state-of-the-art performance level of 0.883 AUC on average for 12 different abnormalities. We also conducted a detailed performance analysis and compared the accuracy of our system with 3 board-certified radiologists. In this context, we highlight the high level of label noise inherent to this problem. On a reduced subset containing only cases with high confidence reference labels based on the consensus of the 3 radiologists, our system reached an average AUC of 0.945.




I Introduction

Recent developments in the deep learning community combined with the availability of large annotated datasets have enabled the training of automated systems that can reach super-human performance on a variety of classification, detection and segmentation tasks [1, 2, 3]. In different scenarios, such systems actively support humans, increasing the efficiency and accuracy of their workflow. In the medical domain, deep learning systems for image and data analysis and integration can potentially have an even greater impact, supporting the clinical workflow from patient admission to diagnosis, treatment and follow-up investigations [4, 5].

In this paper, we focus on the problem of diagnosing multiple abnormalities based on chest radiographs (CXR) of the human body. In practice, this is a challenging problem reflected in a significant inter-user variability between different radiologists [6]. The main reasons include the complex appearance of pathologies in X-ray projection images, and the large number of scans that need to be read and analyzed daily under time pressure [7]. The average time to read and report a plain film is 1.4 minutes [8].

To address this challenge, we propose a method that can support the automatic classification of 12 abnormalities visible in chest X-rays. The system is based on a novel deep learning architecture which is able to predict - in addition to classification scores of abnormalities - lung/heart masks and classes about the location of certain abnormalities. By embedding the additional learning tasks, we observe a performance gain for abnormality classification.

Since we encountered a high ratio of class label noise in the dataset, we conducted an observer study wherein two additional radiologists re-labeled a subset of the data. Based on the new annotations, strategies were applied to change the original dataset labels and to remove uncertain cases for a significant performance gain of our learned model. Furthermore, a strong correlation of most abnormalities between radiologist consensus and the remaining, certain cases can be seen.

The contributions of this paper are as follows:

  • We propose a novel multi-task deep neural network for multi-abnormality classification, spatial classification and lung/heart segmentation based on coronal chest X-ray images.

  • We demonstrate that the use of additional spatial knowledge related to pathologies or the underlying anatomical structures, i.e., heart and lungs, can significantly increase the accuracy of the abnormality classification.

  • To cope with the brightness and contrast variability of images from different sources, we propose a novel normalization technique. With this method we not only increase performance, but also significantly reduce the training time.

  • We study the inter-user variability for this problem based on the input of three board-certified radiologists.

  • Based on the annotations of the three radiologists, we create a reduced evaluation subset using different voting strategies and show that the performance of our system increases.

  • Finally, we demonstrate the high correlation between the network classification probability and the multi-reader agreement.

This paper builds on our preliminary work in X-ray abnormality detection systems [9] with the following additional contributions: (a) a multi-radiologist observer study with extensive analysis of label noise and its impact on the detection system, (b) a task-specific normalization technique to increase robustness to variability caused by different acquisition equipment or post-processing steps, (c) an extension of the multi-task network for additional prediction of segmentation tasks, and (d) the prediction of confidence scores in addition to the classification of abnormalities.

II Related Work

The publication of the ChestX-ray14 (NIH) dataset [10] has led to a series of recent publications that propose automatic systems for abnormality classification. Wang et al. [10] evaluated several state-of-the-art convolutional neural network architectures, reporting an area under the ROC curve (AUC) of 0.75 on average. Islam et al. [11] defined an ensemble of multiple state-of-the-art network architectures to increase the classification performance. Rajpurkar et al. [12] demonstrated that a common DenseNet architecture [13] can surpass the accuracy of radiologists in detecting pneumonia. In addition to a DenseNet, Yao et al. [14] implemented a long short-term memory (LSTM) model to exploit dependencies between the abnormalities. An attention-guided convolutional neural network architecture was used by Guan et al. [15] to specifically focus on the region of interest, which is provided in a second network branch as a cropped image with higher resolution. Rubin et al. [16] designed a Dual-Network to extract the image information of both frontal and lateral views. Yan et al. [17] used a DenseNet architecture integrated with "Squeeze-and-Excitation" blocks [18] to improve the performance.

Wang et al. [19] developed a new network architecture with both a classification and an attention branch, where the latter calculates activation maps with gradient-weighted class activation mapping [20], which are subsequently concatenated with the classification branch. Based on very limited location annotations of the abnormalities, Li et al. [21] trained a neural network to predict both classification and localization of the abnormalities. Liu et al. [22] designed a network architecture with two branches, similar to [15], where the second branch used a cropped image input based on existing lung masks. Yao et al. [23] defined a network architecture which can be trained on different resolutions. Rajpurkar et al. [24] used multiple radiologists to reannotate the images: one subgroup of radiologists defined the ground truth for a set on which the other subgroup and the neural network were evaluated. In this way, the performance of the radiologists and of the deep learning algorithm could be compared. Irvin et al. [25] trained on a dataset where the ground truth includes an additional uncertainty class. Different approaches were applied during training to increase the performance with the uncertainty information.

We emphasize that most of the published works report classification results by splitting the data completely randomly for training, validation and testing [14, 15, 21, 11]. With this splitting strategy, images of the same patient may be located in both the training and the testing set. For example, the ChestX-ray14 dataset has an average of 3.6 images per patient. For a fair performance evaluation, the splitting should always be performed at patient level. The official split for the ChestX-ray14 data is performed patient-wise. In our work, we use the official split. Moreover, for a valid performance comparison, the same splits should be used, since there is a significant performance variability between different test sets [9].
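A patient-level split can be sketched as follows. This is a minimal illustration, not the procedure used for the official ChestX-ray14 split: the function name `patient_wise_split`, the `image_to_patient` mapping, and the default fractions are assumptions introduced for the example.

```python
import random
from collections import defaultdict

def patient_wise_split(image_to_patient, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split image IDs into train/val/test so that no patient spans two sets.

    image_to_patient: hypothetical dict mapping image ID -> patient ID.
    fractions: illustrative train/val/test shares of the patients.
    """
    by_patient = defaultdict(list)
    for image_id, patient_id in image_to_patient.items():
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)            # deterministic order before shuffling
    random.Random(seed).shuffle(patients)

    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # All images of a patient follow that patient into exactly one set.
    return {name: [img for p in ids for img in by_patient[p]]
            for name, ids in groups.items()}
```

Because the shuffle operates on patient IDs rather than image IDs, follow-up scans of one patient can never leak from the training set into the test set.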

III Problem Definition and Methodology

Given an arbitrary anterior-posterior (AP) or posterior-anterior (PA) chest X-ray image $I$, we design a deep learning based system that outputs, for each of the $D$ considered abnormalities, the probability that it is present in the image. In addition, the system is designed to compute a probabilistic segmentation map for both lung lobes and the heart.

III-A Dataset

Fig. 2: This graph shows the number of images along with all abnormalities. The chart excludes the number of images where none of these pathologies appear.

Our data collection is composed of two different datasets: ChestX-Ray14 (NIH) [10] and PLCO [26]. These datasets differ in several aspects; Table I gives an overview. By combining both datasets, we can make use of 297,541 frontal chest X-ray images from 86,876 patients. Due to follow-up scans, there is an average of 3-4 images per patient. Therefore, patient-wise splits are used in all experiments to separate the patients into training, validation, and test sets. The PLCO dataset includes spatial information for some abnormality classes. Figure 2 shows the number of images observed to contain each abnormality. One image can also show multiple abnormalities. Additionally, the collections contain 178,319 images in which none of the mentioned abnormalities appear; these images are not counted in Figure 2.

                                 ChestX-Ray14   PLCO
Number of images                 112,120        185,421
Number of patients               30,805         56,071
Avg. image number per patient    3.6            3.3
Number of abnormalities          14             12
Image size
Spatial information of abnorm.   no             partly

TABLE I: Overview of the 2 datasets. Combining both, a new data collection with 297,541 images from 86,876 patients was created.

Note that the high imbalance of the data collection with respect to the different abnormalities represents a challenge in ensuring training stability and performance.

III-B Deep Neural Network Design

The classification branch of our multi-task network is inspired by the DenseNet architecture [13]. We adopt this network architecture with 5 dense blocks and a total of 121 convolutional layers (see Figure 1 for an overview). Each dense block consists of several dense layers, which include batch normalization, rectified linear units, and convolution. The novelty of the DenseNet is its skip connections: within a block, each layer is connected to all subsequent layers. Between consecutive dense blocks, a so-called transition layer is added, which includes batch normalization, convolution, and pooling, to reduce the dimensions.

The single grayscale input image is rescaled to a square resolution (256 or 512 pixels per side in our experiments) using bilinear interpolation, replicated to 3 channels and fed into the network. The global average pooling (GAP) layer is resized depending on the input size. The number of output units is set to the number of abnormality classes $D$. We use sigmoid activation functions for each class to map the output to the probability interval $[0, 1]$. The network is initialized with the pre-trained ImageNet model.


III-C Classification Training

A multi-label problem poses several challenges. The training process is modified such that each class can be trained individually. We create $D$ binary cross-entropy loss functions. The corresponding labels $c^{(n)} \in \{0, 1\}$ (absence or presence of the abnormality, respectively) are compared with the network output $p^{(n)}$ and the loss is measured. Because the problem is highly imbalanced, we introduce additional weight constants $w_P^{(n)}$ and $w_N^{(n)}$ into the cross-entropy function:

$$L_{Abn}^{(n)} = -\left[ w_P^{(n)} \, c^{(n)} \ln p^{(n)} + w_N^{(n)} \, \left(1 - c^{(n)}\right) \ln\left(1 - p^{(n)}\right) \right]$$

where $w_P^{(n)} = \frac{P_n + N_n}{P_n}$ and $w_N^{(n)} = \frac{P_n + N_n}{N_n}$, with $P_n$ and $N_n$ indicating the number of cases in the entire training set where the abnormality indexed by $n$ is present and missing, respectively. For all experiments, we train with 128 samples in each batch. The Adam optimizer [28] is used with an adaptive learning rate: the learning rate is reduced by a factor of 10 when the validation loss plateaus. For the PLCO dataset, 70% of the subjects were used for training, 10% for validation, and 20% for testing. For the ChestX-ray14 dataset, we use the provided ChestX-Ray14 benchmark split [10].
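The class-weighted cross-entropy described above can be sketched in a few lines. The weight form used here (total count divided by the count of present or missing cases) is an assumption made for illustration, as are the function names and the clipping constant `eps`.

```python
import math

def class_weights(num_present, num_absent):
    """Assumed weighting: total count divided by the per-outcome count."""
    total = num_present + num_absent
    return total / num_present, total / num_absent

def weighted_bce(prob, label, w_pos, w_neg, eps=1e-7):
    """Weighted binary cross-entropy for one abnormality and one image."""
    prob = min(max(prob, eps), 1.0 - eps)    # keep log() finite
    return -(w_pos * label * math.log(prob)
             + w_neg * (1 - label) * math.log(1.0 - prob))

# Example: an abnormality present in 100 of 10,000 training cases.
w_pos, w_neg = class_weights(100, 9900)
loss_missed_positive = weighted_bce(0.5, 1, w_pos, w_neg)
loss_missed_negative = weighted_bce(0.5, 0, w_pos, w_neg)
```

With such weights, an uncertain prediction on a rare positive case is penalized far more strongly than the same prediction on a frequent negative case, which counteracts the class imbalance.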

Dataset Combination: We propose to use two datasets that were acquired and annotated separately and differently. Both datasets contain several labels with the same definition, as Figure 2 shows. A major issue of abnormality labeling is the varying and overlapping definition and interpretation between radiologists and abnormalities [7]. Therefore, we treat the corresponding abnormalities of both datasets separately: the abnormalities of the ChestX-ray14 dataset and those of the PLCO dataset (14 and 12, respectively; see Table I) each receive their own output class in our network. Furthermore, for each image we only compute gradients for the labels of the dataset from which that image is derived. This strategy avoids a class categorization step beforehand and ensures that each network layer (except the last) receives information from all images.
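One way to realize the per-dataset gradient rule is to mask the loss so that only classes defined by the image's own dataset contribute. The sketch below is hypothetical: the function name, the `class_source` list, and the dataset tags "NIH" and "PLCO" are illustrative choices, and an unweighted cross-entropy is used for brevity.

```python
import math

def masked_multilabel_loss(probs, labels, source, class_source, eps=1e-7):
    """Sum binary cross-entropy over the classes owned by the image's dataset.

    probs, labels: per-class predictions and 0/1 labels for one image.
    source: dataset the image comes from, e.g. "NIH" or "PLCO".
    class_source: for each class index, the dataset that defines it.
    """
    loss = 0.0
    for p, c, owner in zip(probs, labels, class_source):
        if owner != source:
            continue                         # no loss term, hence no gradient
        p = min(max(p, eps), 1.0 - eps)
        loss -= c * math.log(p) + (1 - c) * math.log(1.0 - p)
    return loss

# Toy layout: two classes defined by NIH, one by PLCO.
class_source = ["NIH", "NIH", "PLCO"]
plco_loss = masked_multilabel_loss([0.9, 0.2, 0.5], [1, 0, 1], "PLCO", class_source)
```

Since the masked terms never enter the sum, predictions for the other dataset's classes cannot influence the update for this image, while all shared layers still see every image.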

Normalization: One challenge in processing chest radiographs is accounting for the large variability of the image appearance, depending on the acquisition source, radiation dose as well as proprietary non-linear postprocessing. In practice, one cannot systematically address this variation due to missing meta-information (e.g., unknown maximum high voltage for images from the NIH dataset). In this context, generic solutions have been proposed for the normalization of radiographs using multi-scale contrast enhancement/leveling techniques [29, 30].

For our diagnostic application, we propose to explicitly avoid altering the image appearance using one of these methods. Instead, we propose an efficient method for dynamically windowing each image, i.e., adjusting the brightness and contrast via a linear transformation of the image intensities. Given an arbitrary chest X-ray image $I$, let us denote its pixel value histogram function as $h$. Using Gaussian smoothing and median filtering, one can significantly reduce the noise of $h$ (visible as, e.g., signal spikes due to black background or white text overlay) as well as account for long function tails that affect the windowing of the image. Based on the processed function $\tilde{h}$, we determine two bounds, $b_{low}$ and $b_{high}$, which represent a tight intensity window for image $I$. The normalization then rescales the intensities of $I$ linearly with respect to this window. A visual example is shown in Figure 3.

Fig. 3: An original image of the dataset is displayed (left) where the described normalization technique is applied (right).
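The dynamic windowing idea can be illustrated with a small pure-Python sketch. Several details are assumptions rather than the method described here: the rule for picking the bounds (first and last bin whose smoothed count exceeds 5% of the peak), the filter sizes, the clipping to [0, 1], and all function names.

```python
import math
import statistics

def smooth_histogram(hist, sigma=2.0, median_width=3):
    """Median-filter, then Gaussian-smooth a histogram (list of counts)."""
    n, half = len(hist), median_width // 2
    # Median filter with zero padding removes isolated single-bin spikes.
    med = [statistics.median([hist[j] if 0 <= j < n else 0
                              for j in range(i - half, i + half + 1)])
           for i in range(n)]
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    norm = sum(kernel)
    return [sum(w * med[min(max(i + k, 0), n - 1)]
                for k, w in zip(offsets, kernel)) / norm
            for i in range(n)]

def window_bounds(hist, rel_threshold=0.05):
    """Assumed rule: first and last bin above a fraction of the smoothed peak."""
    smooth = smooth_histogram(hist)
    cutoff = rel_threshold * max(smooth)
    keep = [i for i, v in enumerate(smooth) if v > cutoff]
    return keep[0], keep[-1]

def normalize(image, levels=256):
    """Linearly map the window [low, high] to [0, 1], clipping values outside."""
    hist = [0] * levels
    for row in image:
        for v in row:
            hist[v] += 1
    low, high = window_bounds(hist)
    span = max(high - low, 1)
    return [[min(max((v - low) / span, 0.0), 1.0) for v in row] for row in image]
```

In this sketch, a black background or a white text overlay shows up as a single-bin spike that the median filter suppresses, so the window is determined by the anatomy rather than by such artifacts.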

IV Integrating Additional Knowledge

Additional spatial knowledge related to individual pathologies as well as the underlying anatomy, i.e., the heart and the lungs, can be exploited to increase the classification performance. Each individual feature described in this section leads to a moderate average performance gain. However, the combination of all features significantly improves the performance.

IV-A Lung and Heart Segmentation

First, one can focus the learning task to the heart and lung regions. The image information outside of these regions may be regarded as irrelevant for the diagnosis of these lung/heart abnormalities.

Instead of providing the predicted masks as input for the classification network, we extend the classification network with a decoder branch and predict the masks. In this way, the additional knowledge about the shape of the heart and lung lobes is integrated in an implicit way, i.e., during learning through the flow of gradients. As such, in the encoder part, the network learns features that are not only relevant for the abnormality classification, but also for the isolation/segmentation of the relevant image regions.

The DenseNet model described in subsection III-B is extended to solve the segmentation task. To this end, we add a decoder network whose input is the feature maps returned by the last dense block (see Figure 1). The decoder architecture is visualized in Figure 4. For the segmentation task, we use the mean squared error loss function:

$$L_{Seg} = \frac{1}{Z} \sum_{z=1}^{Z} \left( \hat{y}_z - y_z \right)^2$$

where $\hat{y}_z$ and $y_z$ denote the output prediction at pixel $z$ and the corresponding pixel label, and $Z$ is the number of pixels.

Fig. 4: Architecture of the decoder to predict segmentation masks. The network is connected to the classification network after the last dense block (left). The final layer predicts the lung and heart masks in 2 channels (right).

IV-B Spatial Knowledge

We propose to add additional supervision during learning using several approximate spatial labels provided with the PLCO data. For five abnormalities (Nodule, Mass, Infiltrate, Atelectasis, Hilar Abnormality) there is coarse location information available (see Table II).

No.  Region               No.  Region
1    Left lobe            6    Upper-middle part
2    Right lobe           7    Upper part
3    Lower part           8    Diffused (more attached parts)
4    Lower-middle part    9    Multiple (more independent parts)
5    Middle part

TABLE II: Spatial Class Labels for the PLCO data

The location loss is another weighted cross-entropy loss with location-specific classes. The spatial labels $s^{(m)} \in \{0, 1\}$, $m = 1, \dots, F$, where $F$ is the total number of spatial classes listed in Table II, are compared with the network prediction $q^{(m)}$ and the loss is calculated as

$$L_{Loc}^{(m)} = -\left[ w_P^{(m)} \, s^{(m)} \ln q^{(m)} + w_N^{(m)} \, \left(1 - s^{(m)}\right) \ln\left(1 - q^{(m)}\right) \right]$$

where $w_P^{(m)} = \frac{P_m + N_m}{P_m}$ and $w_N^{(m)} = \frac{P_m + N_m}{N_m}$, with $P_m$ and $N_m$ indicating, respectively, the number of presence and absence cases of spatial class $m$ in the training set. The individual localization loss is activated/deactivated dynamically: if spatial labels are not available for an abnormality, all of its spatial labels are disregarded and no gradients are computed. Otherwise, the loss is calculated as given above.

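Complete system training: The global loss used for training combines the three losses described above: the abnormality classification loss, the segmentation loss, and the location loss.

The global architecture is shown in Figure 1.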
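As a rough illustration of how the loss terms might be combined, the sketch below adds the segmentation and location terms to the classification loss only when the corresponding labels exist. The unweighted sum and both function names are assumptions; the actual weighting of the terms is not recoverable from the text.

```python
def mse_loss(pred_mask, true_mask):
    """Mean squared error over all pixels of a flattened mask."""
    return sum((p - t) ** 2 for p, t in zip(pred_mask, true_mask)) / len(true_mask)

def global_loss(abn_loss, seg_loss=None, loc_loss=None):
    """Assumed unweighted sum; a term passed as None is skipped entirely.

    Skipping a term mirrors the dynamic deactivation of the location loss
    when no spatial labels are available for an image.
    """
    return abn_loss + sum(t for t in (seg_loss, loc_loss) if t is not None)

seg = mse_loss([0.0, 1.0], [0.0, 0.0])       # one of two pixels is wrong
total = global_loss(1.0, seg_loss=seg)       # no spatial labels for this image
```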

V Experimental Results

In our experiments, we measure the performance of our system at classifying different abnormalities, at segmenting the heart/lung region and at approximately localizing pathologies within the image. As a baseline, we measure test performance scores by training only the classification part on the PLCO dataset. On the test set, this baseline reached an average AUC score of 0.859 (Table III, left).

V-A Lung and Heart Segmentation

The second column in Table III shows improved classification scores when the network was additionally trained to generate lung and heart segmentation masks.

A test image is visualized in Figure 5 (left). The probabilistic segmentation map is thresholded and overlaid on the image. In Figure 5 (right), the red mask defines the heart area and the blue mask indicates the two lungs. We did not evaluate the lung and heart segmentation quantitatively, since our focus was on the abnormality classification. The classification performance increased to 0.866 on average across the abnormalities (Table III, second column).

Fig. 5: Left: Example Chest X-ray image. Right: Predicted segmentation masks of lung lobes (blue) and heart (red).
Fig. 6: Average AUC scores over the abnormalities with spatial information: Performance without (blue) and with (red) spatial information.

Dim. Size
Features                   -       Seg.    Loc.    Loc.    Loc.+Norm.   Loc.+Norm.+Seg
Nodule                     0.810   0.815   0.830   0.832   0.831        0.881
Mass                       0.829   0.839   0.840   0.867   0.869        0.884
Granuloma                  0.884   0.886   0.887   0.887   0.893        0.912
Infiltrate                 0.865   0.863   0.864   0.877   0.882        0.891
Scaring                    0.841   0.842   0.843   0.850   0.848        0.861
Fibrosis                   0.870   0.875   0.863   0.875   0.871        0.884
Bone/Soft Tissue Lesion    0.841   0.846   0.834   0.840   0.850        0.848
Cardiac Abnormality        0.926   0.928   0.922   0.923   0.927        0.926
COPD                       0.882   0.883   0.877   0.880   0.882        0.874
Effusion                   0.909   0.938   0.927   0.932   0.949        0.939
Atelectasis                0.849   0.858   0.858   0.867   0.855        0.832
Hilar Abnormality          0.796   0.815   0.812   0.815   0.850        0.859
Mean (Location Labels)     0.830   0.838   0.841   0.852   0.857        0.869
Mean                       0.859   0.866   0.863   0.870   0.876        0.883

TABLE III: AUC classification scores for experiments tested on PLCO data
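The two mean rows of Table III can be reproduced from the per-abnormality values. The snippet below is only a consistency check on the first and last columns; the dictionary layout and the helper name `column_mean` are illustrative.

```python
# AUC values copied from Table III: (baseline column, final joint-model column)
auc = {
    "Nodule": (0.810, 0.881), "Mass": (0.829, 0.884),
    "Granuloma": (0.884, 0.912), "Infiltrate": (0.865, 0.891),
    "Scaring": (0.841, 0.861), "Fibrosis": (0.870, 0.884),
    "Bone/Soft Tissue Lesion": (0.841, 0.848),
    "Cardiac Abnormality": (0.926, 0.926), "COPD": (0.882, 0.874),
    "Effusion": (0.909, 0.939), "Atelectasis": (0.849, 0.832),
    "Hilar Abnormality": (0.796, 0.859),
}
# The five classes for which PLCO provides coarse location labels.
located = ["Nodule", "Mass", "Infiltrate", "Atelectasis", "Hilar Abnormality"]

def column_mean(column, names=None):
    """Average one table column over all classes or over a named subset."""
    names = list(auc) if names is None else names
    return sum(auc[n][column] for n in names) / len(names)

final_mean = column_mean(1)                  # compare with the "Mean" row
final_mean_located = column_mean(1, located) # compare with "Mean (Location Labels)"
gain = column_mean(1) - column_mean(0)       # overall gain of the joint model
```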

V-B Spatial Knowledge

We measured the impact of the location labels on the performance of the classification. An average improvement of 0.011 (on abnormalities supported with spatial information) could be observed, as can be seen in the third column of Table III.

A more detailed experiment in Figure 6 shows that increased patient numbers in the training set improved the test performance. The red curve indicates the average AUC performance on the abnormalities where both the spatial labels exist and localization classification during training is included. The blue one shows the average performance on the same abnormalities trained without classifying the spatial labels.

V-C All-in-One Joint Model

By including the ChestX-ray14 dataset, the average AUC score reaches 0.870. Performance values of each abnormality can be seen in Column 4 of Table III. Low-frequency classes in the PLCO dataset improved in particular: e.g., infiltrate (see Figure 2), with 1,554 images, improved by 0.013 when trained with 19,870 additional images from the ChestX-ray14 dataset.

Normalization: Including the normalization step based on dynamic windowing had a two-fold benefit. First, the training time was reduced on average 2-3 times. We hypothesize that this is because the normalization ensures that images are more aligned in terms of brightness and contrast, which in some sense simplifies the learning task. Second, this also improved the generalization of the model, and led to a performance increase to 0.876.

Finally, we upscaled the input image size to 512 in each dimension and adapted the GAP layer of the DenseNet to the larger input size. For the final network architecture we included the segmentation part. The average performance improved to 0.883 (see last column in Table III). The classification performance improved significantly on small abnormalities in particular, e.g., nodules by 0.05.

With an average AUC performance of 0.883, our network achieves the best performance reported in the community on this multi-abnormality problem for a data collection trained and evaluated on the original dataset labels with patient-wise train/test splits.

The integration of all described features led to an average performance gain from 0.859 to 0.883. Some abnormalities did not benefit, e.g., cardiac abnormality, where the performance remained constant. However, other abnormalities improved significantly, e.g., mass from 0.829 to 0.884. To assess the statistical relevance, we applied bootstrapping to all abnormality classes. Figure 7 shows the AUC scores of the baseline model (blue bars) and the joint model (red bars). The 95% confidence intervals were added to each bar (black whiskers).

Fig. 7: AUC scores (bars) and 95% confidence intervals using bootstrapping (whiskers) of all abnormalities for the baseline model and the joint model.
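A bootstrap confidence interval for the AUC, as plotted in Figure 7, can be sketched as follows. The percentile method, the number of resamples, and the function names are assumptions; the text only states that bootstrapping was used.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:                          # both classes must be present
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

If the intervals of the baseline and the joint model for one abnormality do not overlap, the improvement is unlikely to be an artifact of the particular test sample.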

V-D Correlation between Accuracy and the Number of Patients in the Training Set

A general question of any learning task is the required training set size to achieve state-of-the-art performance. In this experiment we focused on the PLCO dataset where a maximum of 70% with 39,302 patients can be used in the training process. Several additional working points are processed after patients were randomly removed from the training set. We see the increasing average performance with increasing number of patients in the training set. However, the curve plateaus at a certain point (Figure 8).

Fig. 8: AUC score over all abnormalities depending on the number of patients in the training set
Fig. 9: AUC score of certain abnormalities depending on the number of images in the training set

                          Agreement (3 readers)   Inter-rater Variability            PLCO    Rad. Majority Vote     Compl. agr.   Unc. band
                          Pos.   Neg.   Dis.      PPA     NPA     PD_PLCO   Kappa    AUC     Pos.   Neg.   AUC      AUC           AUC
Nodule                    11     464    90        0.307   0.964   0.480     0.313    0.743   31     534    0.845    0.951         0.911
Mass                      7      491    67        0.324   0.970   0.560     0.320    0.741   24     541    0.839    0.949         0.891
Granuloma                 4      503    58        0.242   0.980   0.740     0.248    0.835   15     550    0.945    1.000         0.975
Infiltrate                7      447    111       0.254   0.959   0.560     0.212    0.765   30     535    0.850    0.898         0.909
Scaring                   5      422    138       0.364   0.916   0.580     0.218    0.745   52     513    0.816    0.876         0.892
Fibrosis                  15     419    131       0.349   0.935   0.340     0.294    0.752   51     514    0.860    0.959         0.918
Bone/Soft Tissue Lesion   22     326    217       0.343   0.890   0.080     0.207    0.864   82     483    0.794    0.949         0.815
Cardiac Abnormality       17     397    151       0.435   0.898   0.240     0.310    0.918   73     492    0.927    0.963         0.947
COPD                      3      470    92        0.232   0.966   0.620     0.175    0.857   22     543    0.886    0.965         0.939
Effusion                  5      507    53        0.328   0.975   0.620     0.321    0.866   19     546    0.931    0.999         0.998
Atelectasis               5      432    128       0.323   0.932   0.440     0.208    0.799   43     522    0.795    0.908         0.718
Hilar Abnormality         4      473    88        0.239   0.968   0.660     0.198    0.718   22     543    0.842    0.924         0.893
Mean                                              0.312   0.946   0.493     0.252    0.800                 0.861    0.945         0.901

TABLE IV: Complete agreement of 3 readers - 2 Radiologists and the PLCO Labels (Left). Inter-rater variability measures of the three radiologists (Middle). Network performance evaluated on different ground truth: Single reader, majority vote, complete agreement and uncertainty band (Right).

Focusing on single abnormalities, we gained valuable insight into the difficulty of the training process. Figure 9 shows performance scores of certain abnormalities, depending on the number of images. We can see that all abnormalities exhibit a similar curve structure; however, there is substantial variation between abnormalities in performance as the number of images increases. For a detailed analysis, we consider the two boundary classes: the cardiomegaly class reached the highest performance (top curve) at all working points. In contrast, the same number of nodule images led to a substantially lower performance (bottom curve). We hypothesize that the performance difference can be explained by the abnormality characteristics: the variance of nodules is much higher, e.g., they can vary in shape, brightness, calcification, and other features [31]. Furthermore, they are much smaller and spread over the whole lung region [32]. Cardiomegaly can always be recognized in the heart area [33]. We also hypothesize that the difference stems from abnormality-dependent levels of label noise. A large number of errors are caused by "Satisfaction of Search", meaning that a finding is missed because radiologists stop searching after finding the first abnormality [34].

In conclusion, a similar curve structure can be seen for all abnormalities. Thus, for low-frequency abnormalities, e.g., mass and infiltrate, a performance gain can be expected when the training set contains more images showing that abnormality. This hypothesis is corroborated by the results obtained when including the ChestX-ray14 dataset (Table III). The largest improvement, 0.027, can be seen for mass, where the additional images are used.

VI Multi-reader Validation

Recent analysis [35] has suggested that a large proportion of labels in the ChestX-ray14 dataset are inaccurate. Thus, we conducted an observer study with two additional radiologists interpreting a subset of the data in order to (a) measure the performance of the system relative to the majority interpretation of multiple observers, and (b) define a subset of cases where multiple radiologists agreed on the interpretation.

The experimental setup follows the publication by Batliner et al. [36], in which different voting strategies were investigated to increase the classification accuracy.

VI-A Radiologist Agreement

Two board-certified radiologists with more than 15 years of experience in reading CXRs reannotated the data subset. Based on the PLCO data, we created a set containing 50 images per abnormality by random selection, avoiding more than one image per patient. Additionally, we included 50 images without any of those abnormalities. The new set, with a total of 565 images (the number results from the multi-label nature of the dataset), was read by the radiologists, who decided on the presence or absence of each abnormality based on the PLCO classes. We treated the PLCO labels as derived from one single reader. At first, we investigated the agreement of the 3 radiologists (2 radiologists + PLCO labels). In Table IV, the numbers of cases with positive agreement, negative agreement, and disagreement are listed (Columns 1-3). A high number of cases without agreement was observed.

VI-B Inter-rater Reliability

We evaluated statistical metrics to quantify consistency among the three ratings. First, we looked at the positive (PPA) and negative (NPA) predictive agreement, which relates the majority vote to the individual findings of the three radiologists. The values describe the ratio of the number of cases considered positive/negative by majority vote to the number of cases with a positive/negative finding from any of the radiologists. In Table IV (Columns 4 and 5), values are listed for each abnormality. The low PPA values indicate a low agreement on individual findings.
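The PPA and NPA definitions can be checked against the nodule row of Table IV. In the sketch below, the function name and the vote encoding (one tuple of 0/1 readings per case) are illustrative; the vote patterns are derived from the table counts (11 unanimous positives, 464 unanimous negatives, 90 disagreements, 31 majority-positive cases).

```python
def predictive_agreement(votes):
    """Return (PPA, NPA) for a list of per-case tuples of 0/1 readings."""
    n_readers = len(votes[0])
    maj_pos = sum(sum(v) > n_readers / 2 for v in votes)
    maj_neg = len(votes) - maj_pos           # valid for an odd number of readers
    any_pos = sum(any(v) for v in votes)     # at least one positive reading
    any_neg = sum(not all(v) for v in votes) # at least one negative reading
    return maj_pos / any_pos, maj_neg / any_neg

# Nodule row: 31 - 11 = 20 disagreement cases have two positive readings,
# the remaining 70 disagreement cases have one.
nodule_votes = ([(1, 1, 1)] * 11 + [(1, 1, 0)] * 20
                + [(1, 0, 0)] * 70 + [(0, 0, 0)] * 464)
ppa, npa = predictive_agreement(nodule_votes)
```

The result, 31/101 and 534/554, rounds to the 0.307 and 0.964 listed for nodule in Table IV.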

With a positive disagreement (PD) check, we analyzed the variability between the original PLCO labels and our two radiologists (Column 6). PD_PLCO is defined as the number of positive PLCO cases for which both radiologists disagreed with PLCO, divided by the total number of positive PLCO cases. Given that we included the same number of positive PLCO cases for all abnormalities, this means that in 49.3% of the cases labeled positive by the PLCO reader, both of our radiologists reported no finding.

The Fleiss’ kappa value [37] is commonly used to measure the agreement between multiple raters. According to the kappa scale by Landis et al. [38], we have a fair agreement on the average kappa score of 0.252 (Column 7).
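For reference, Fleiss' kappa can be computed directly from per-case rating counts. The sketch below uses the standard textbook formula; the function name and the input layout are illustrative, and the example counts for nodule are derived from Table IV (11 unanimous positives, 20 cases with two positive readings, 70 with one, 464 unanimous negatives).

```python
def fleiss_kappa(counts):
    """counts[i][j]: number of raters assigning case i to category j."""
    n_cases = len(counts)
    n_raters = sum(counts[0])
    # Mean observed agreement over all cases.
    p_bar = sum((sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts) / n_cases
    # Agreement expected by chance from the marginal category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_e = sum((t / (n_cases * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Columns are [positive readings, negative readings] per case.
nodule_counts = [[3, 0]] * 11 + [[2, 1]] * 20 + [[1, 2]] * 70 + [[0, 3]] * 464
kappa_nodule = fleiss_kappa(nodule_counts)
```

This yields roughly 0.313 for nodule, in line with Column 7 of Table IV. The value is low despite 475 unanimous cases because almost all of them are negative, so chance agreement is already high.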

All observed parameters indicate that individual radiologists introduce a strong bias when reading CXRs. Reliability was weak, especially when a reader reported an abnormality finding.

VI-C Network Performance on Refined Ground Truth

For the multi-reader evaluation, the annotated subset is used as test set. The rest of the data is split into training (90%) and validation (10%). Overlapping subjects between the test and training/validation set were removed. Subsequently, the model was retrained. The evaluation can be seen in Table IV, where the network performance is listed based on the PLCO label ground truth (Column 8 of Table IV).

Due to the strong variability between readers, we defined our new ground truth as a majority vote of the three radiologists. Performance scores and the distribution of positive (Pos.) and negative (Neg.) cases are listed in Columns 9-11. The significant average performance gain to 0.861 supports the hypothesis that the labels of individual annotators carry a high bias. Moreover, the model, which is trained on biased labels (PLCO only), seems to be robust against this label noise.

As a further step, we eliminated cases without complete agreement between the three radiologists. Column 12 of Table IV lists the test performance on this subset. On average, we reached an AUC of 0.945.
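The two evaluation settings described above (majority-vote ground truth and the complete-agreement subset) can be sketched as follows. The AUC is computed here with the pairwise rank formulation; the function names, scores, and votes are hypothetical and only illustrate the procedure.

```python
def auc(scores, labels):
    """AUC as the probability that a positive case scores higher than a
    negative case (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(scores, votes):
    """Return (AUC on majority vote, AUC on complete-agreement cases)."""
    majority = [int(2 * sum(v) > len(v)) for v in votes]
    unanimous = [i for i, v in enumerate(votes) if sum(v) in (0, len(v))]
    auc_majority = auc(scores, majority)
    auc_unanimous = auc([scores[i] for i in unanimous],
                        [majority[i] for i in unanimous])
    return auc_majority, auc_unanimous


# Toy network outputs and reader votes for six cases.
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
votes = [(1, 1, 1), (1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0), (0, 0, 0)]
auc_majority, auc_unanimous = evaluate(scores, votes)
print(auc_majority, auc_unanimous)
```

In the toy data, the only ranking error involves cases on which the readers were split, so restricting the evaluation to unanimous cases raises the AUC. This mirrors, in a simplified way, the effect of removing cases without complete agreement.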

With 50 positive PLCO cases per abnormality and the resulting low number of cases with complete agreement, e.g., COPD (3 of 50), we speculate that the strong label bias in the PLCO dataset derives from the use of prior knowledge.

Fig. 10: Removal of cases within the band. The threshold separates positive from negative predictions and is chosen such that the numbers of false positive and false negative cases are equal. The band widths on either side of the threshold are set such that the band contains a given percentage of true positives and a given percentage of true negatives.

User Agreement versus Algorithm Confidence: Under the hypothesis that radiologist agreement might correlate with the confidence of the algorithm, we investigated whether a band of algorithm probabilities could be defined on the validation set that separates confident true negatives, uncertain algorithm findings, and confident true positives on unseen data.

Based on the readings of multiple radiologists, we defined a confidence for each class in each image. Four different confidence categories were created:

  • High negative

  • Low negative

  • High positive

  • Low positive

The confidence category for each class and image depends on the number of positive annotations among the three radiologists. For example, an abnormality class is assigned low positive confidence when two readers reported the abnormality and one reader reported no finding.
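The mapping from reader votes to the four categories can be written down directly. The sketch below follows the column headers of Table V (0/3, 1/3, 2/3, and 3/3 positive readings); the function and constant names are hypothetical.

```python
# Number of positive readings (out of three) -> confidence category.
CATEGORIES = {
    0: "high negative",
    1: "low negative",
    2: "low positive",
    3: "high positive",
}


def confidence_category(votes):
    """Map the 0/1 votes of exactly three readers to a confidence category."""
    if len(votes) != 3 or any(v not in (0, 1) for v in votes):
        raise ValueError("expected three binary votes")
    return CATEGORIES[sum(votes)]


print(confidence_category((1, 1, 0)))  # two of three readers report a finding
```

The sign of the category equals the majority vote, and the confidence is high only when all three readers agree.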

The AUC score is based on a threshold that is shifted over the distributions of positive and negative cases. To account for high and low confidence cases, we defined an additional band over the output prediction range. Low confidence cases are likely to be located within the band and to be misclassified. All samples within the band were eliminated, and the performance was subsequently measured on the reduced set. Using the band led to the following observations:

  • The AUC performance improved significantly after removing cases within the band.

  • For most abnormalities, the majority of cases within the band were low confidence cases.

        Negative                          Positive
        High Conf.       Low Conf.        Low Conf.        High Conf.
        0/3 Positive     1/3 Positive     2/3 Positive     3/3 Positive
        B / A / %        B / A / %        B / A / %        B / A / %
Nod.    464 / 306 / 66   70 / 32 / 46     20 / 7 / 35      11 / 8 / 72
Mass    491 / 307 / 63   50 / 18 / 36     17 / 5 / 29      7 / 2 / 29
Gran.   503 / 365 / 73   47 / 27 / 57     11 / 9 / 82      4 / 4 / 100
Inf.    447 / 297 / 66   88 / 38 / 43     23 / 8 / 35      7 / 3 / 42
Scar.   422 / 274 / 65   91 / 49 / 54     47 / 23 / 49     5 / 4 / 80
Fibr.   419 / 318 / 76   95 / 35 / 37     36 / 14 / 39     15 / 11 / 73
Les.    326 / 265 / 81   157 / 114 / 72   60 / 38 / 63     22 / 17 / 77
Card.   397 / 334 / 84   95 / 53 / 56     56 / 20 / 36     17 / 9 / 53
CO.     470 / 353 / 75   73 / 32 / 44     19 / 9 / 47      3 / 1 / 33
Eff.    507 / 303 / 60   39 / 6 / 15      14 / 1 / 7       5 / 4 / 80
Atel.   432 / 284 / 66   90 / 31 / 34     38 / 7 / 18      5 / 1 / 20
Hil.    473 / 312 / 66   70 / 32 / 46     18 / 7 / 39      4 / 1 / 25
Avg.    446 / 310 / 70   80 / 39 / 49     30 / 12 / 40     9 / 5 / 56

TABLE V: Absolute values of cases Before (B) and After (A) the removal with the uncertainty band based on the defined confidence classes. The last value shows the percentage (%) of cases maintained after band removal.

The band was defined, for each abnormality, as an interval around the threshold that separates positive from negative cases. It extends by one width below the threshold and by another width above it; the sum of the two is the total width of the band. The threshold is set such that the false positive and false negative ratios are equal for each abnormality. We allowed a maximum percentage of true positives and a maximum percentage of true negatives to be inside the band, and the two widths are chosen such that this condition is fulfilled. The parameters were determined on the validation set; the evaluation was performed on the test set with the majority vote of the 3 readers as ground truth. For our experiment, we set both percentages to 20. An example scenario is illustrated in Figure 10. The band contains almost all false positive and false negative cases.
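One possible way to fit such a band is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the threshold is picked from the observed scores by minimizing the difference between false positive and false negative counts, the band is treated as an open interval, and the names `fit_band` and `in_band` as well as the toy scores are hypothetical.

```python
import math


def fit_band(scores, labels, max_tp_pct=20, max_tn_pct=20):
    """Fit (threshold, lower, upper) on validation data.

    A score s lies inside the band when lower < s < upper.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    # Assumption: a case is predicted positive when its score >= threshold.
    def fp_fn_gap(t):
        fp = sum(s >= t for s in neg)
        fn = sum(s < t for s in pos)
        return abs(fp - fn)

    threshold = min(sorted(set(scores)), key=fp_fn_gap)

    # Correctly classified cases, ordered by distance from the threshold.
    true_pos = sorted(s for s in pos if s >= threshold)
    true_neg = sorted((s for s in neg if s < threshold), reverse=True)

    def edge(ordered, max_pct, unbounded):
        if not ordered:
            return threshold
        k = int(len(ordered) * max_pct / 100)  # cases allowed inside the band
        return ordered[k] if k < len(ordered) else unbounded

    upper = edge(true_pos, max_tp_pct, math.inf)
    lower = edge(true_neg, max_tn_pct, -math.inf)
    return threshold, lower, upper


def in_band(score, lower, upper):
    return lower < score < upper


# Toy validation scores: six positive cases followed by six negative cases.
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.45, 0.05, 0.1, 0.15, 0.2, 0.3, 0.55]
labels = [1] * 6 + [0] * 6
threshold, lower, upper = fit_band(scores, labels)
kept = [s for s in scores if not in_band(s, lower, upper)]
print(threshold, lower, upper, len(kept))
```

In this toy case the band captures the single false positive and the single false negative together with one true positive and one true negative, i.e., 20% of each correctly classified group. In the setting described above, the parameters would be fitted on the validation set and `in_band` would then be applied to the test predictions.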

After eliminating the cases within the band, the average performance reached 0.901 (last column of Table IV). Thus, a significant performance gain could be achieved by disregarding these cases.

After removing the cases within the band, we analyzed the number of eliminated and maintained cases with respect to the four defined confidence classes. Table V shows the absolute numbers before (B) and after (A) elimination of cases within the band. The last value indicates the percentage (%) of cases maintained for each abnormality after reducing the set. For most abnormalities, a substantially larger share of high confidence cases than of low confidence cases was maintained after removal. For fibrosis, for example, 76% of the high confidence negative cases but only 37% of the low confidence negative cases, and 73% of the high confidence positive cases but only 39% of the low confidence positive cases, were maintained.
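The fibrosis percentages quoted above follow directly from the before/after counts in Table V. The short check below recomputes them; rounding to the nearest integer is an assumption of this sketch, and the variable names are hypothetical.

```python
# Fibrosis row of Table V: category -> (cases before, cases after band removal).
fibrosis = {
    "high negative": (419, 318),
    "low negative": (95, 35),
    "low positive": (36, 14),
    "high positive": (15, 11),
}


def retained_pct(before, after):
    """Percentage of cases maintained after removing cases inside the band."""
    return round(100 * after / before)


retained = {name: retained_pct(b, a) for name, (b, a) in fibrosis.items()}
print(retained)
```

For this row, both high confidence categories retain roughly twice the share of cases that the low confidence categories retain.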

VII Discussion

We observed that our multi-task convolutional neural network is able to correctly classify a wide range of abnormalities in chest X-ray images. While learning the abnormality classification, the system was supported by additional features, e.g., spatial information and normalization. Because the abnormalities to be classified have different characteristics, we found features that improved the AUC performance for most abnormalities. The combination of all integrated features led to a significant average performance gain.

The low agreement between multiple readers revealed the high bias between radiologists. Labels from any individual dataset should be given less confidence in the future. If available, a combination of multiple radiologists should always be considered. A majority vote of three radiologists showed significantly improved results. We hypothesize that even more radiologists are necessary for a more trustworthy and unbiased evaluation.

Evaluating only cases with complete agreement between the radiologists improved the performance further. This improvement may be due to the elimination of challenging examples. With the help of the uncertainty band, a correlation between network prediction and radiologist agreement could be determined. Using this feature, the system could return a flag (in addition to the presence/absence of an abnormality) indicating whether a case lies inside or outside the band. The reading time for cases outside the band, i.e., high confidence cases, could be reduced and spent on cases with low confidence instead.

VIII Conclusion

Overall, we developed a multi-task network for the classification of 12 different abnormalities, classification of their location, and segmentation of lung lobes and heart. The system was trained and evaluated on a data collection of 297,541 images. With additional information of lung and heart segmentations and spatial labels, as well as by using an adaptive normalization strategy, we significantly improved the abnormality classification performance to an average AUC score of 0.883 on 12 different abnormalities. An agreement study between the original labels and additional radiologist annotations indicated a large fraction of label noise. Re-labeling the original data annotations based on multi-reader voting showed a significant performance gain. By additionally removing uncertain cases we reached an average AUC performance of 0.945. Furthermore, we demonstrated that the derived confidence scores of most annotations are highly correlated with the prediction of the network. An uncertainty band integrated over the prediction range of the model filters out low confident cases for


The authors thank the National Cancer Institute (NCI) for access to their data collected by the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. The authors thank the National Institutes of Health (NIH) for access to the ChestX-ray14 collection. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by NCI or NIH.
Disclaimer: The concepts and information presented in this paper are based on research results that are not commercially available.


  • [1] W. Li, P. Cao, D. Zhao, and J. Wang, “Pulmonary nodule classification with deep convolutional neural networks on computed tomography images,” Computational and Mathematical Methods in Medicine, vol. 2016, pp. 1–7, 2016.
  • [2] F. Ghesu, B. Georgescu, Y. Zheng, S. Grbic, A. Maier, J. Hornegger, and D. Comaniciu, “Multi-scale deep reinforcement learning for real-time 3D-landmark detection in CT scans,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 176–189, 2019.
  • [3] W. Zhu, C. Liu, W. Fan, and X. Xie, “Deeplung: Deep 3D dual path nets for automated pulmonary nodule detection and classification,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 673–681.
  • [4] J. D. Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O’Donoghue, D. Visentin, G. van den Driessche, B. Lakshminarayanan, C. Meyer, F. Mackinder, S. Bouton, K. Ayoub, R. Chopra, D. King, A. Karthikesalingam, C. O. Hughes, R. Raine, J. Hughes, D. A. Sim, C. Egan, A. Tufail, H. Montgomery, D. Hassabis, G. Rees, T. Back, P. T. Khaw, M. Suleyman, J. Cornebise, P. A. Keane, and O. Ronneberger, “Clinically applicable deep learning for diagnosis and referral in retinal disease,” Nature Medicine, vol. 24, no. 9, pp. 1342–1350, 2018.
  • [5] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, P. Sundberg, H. Yee, K. Zhang, Y. Zhang, G. Flores, G. E. Duggan, J. Irvine, Q. Le, K. Litsch, A. Mossin, J. Tansuwan, D. Wang, J. Wexler, J. Wilson, D. Ludwig, S. L. Volchenboum, K. Chou, M. Pearson, S. Madabushi, N. H. Shah, A. J. Butte, M. D. Howell, C. Cui, G. S. Corrado, and J. Dean, “Scalable and accurate deep learning with electronic health records,” npj Digital Medicine, vol. 1, no. 1, 2018.
  • [6] M. A. Bruno, E. A. Walker, and H. H. Abujudeh, “Understanding and confronting our mistakes: The epidemiology of error in radiology and strategies for error reduction,” RadioGraphics, vol. 35, no. 6, pp. 1668–1676, 2015.
  • [7] A. P. Brady, “Error and discrepancy in radiology: Inevitable or avoidable?” Insights into Imaging, vol. 8, no. 1, pp. 171–182, 2016.
  • [8] H. B. Fleishon, M. Bhargavan, and C. Meghea, “Radiologists’ reading times using PACS and using films: One practice’s experience,” Academic Radiology, vol. 13, no. 4, pp. 453–460, 2006.
  • [9] S. Gündel, S. Grbic, B. Georgescu, S. Liu, A. Maier, and D. Comaniciu, “Learning to recognize abnormalities in chest X-rays with location-aware dense networks,” in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2019, pp. 757–765.
  • [10] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3462–3471.
  • [11] M. T. Islam, M. A. Aowal, A. T. Minhaz, and K. Ashraf, “Abnormality detection and localization in chest X-rays using deep convolutional neural networks,” CoRR, vol. abs/1705.09850, 2017.
  • [12] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya et al., “Chexnet: Radiologist-level pneumonia detection on chest X-rays with deep learning,” CoRR, vol. abs/1711.05225, 2017.
  • [13] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.
  • [14] L. Yao, E. Poblenz, D. Dagunts, B. Covington, D. Bernard, and K. Lyman, “Learning to diagnose from scratch by exploiting dependencies among labels,” CoRR, vol. abs/1710.10501, 2017.
  • [15] Q. Guan, Y. Huang, Z. Zhong, Z. Zheng, L. Zheng, and Y. Yang, “Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification,” CoRR, vol. abs/1801.09927, 2018.
  • [16] J. Rubin, D. Sanghavi, C. Zhao, K. Lee, A. Qadir, and M. Xu-Wilson, “Large scale automated reading of frontal and lateral chest x-rays using dual convolutional neural networks,” CoRR, vol. abs/1804.07839, 2018.
  • [17] C. Yan, J. Yao, R. Li, Z. Xu, and J. Huang, “Weakly supervised deep learning for thoracic disease classification and localization on chest X-rays,” in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2018, pp. 103–110.
  • [18] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
  • [19] H. Wang and Y. Xia, “ChestNet: A deep neural network for classification of thoracic diseases on chest radiography,” CoRR, vol. abs/1807.03058, 2018.
  • [20] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” CoRR, vol. abs/1412.7755, 2014.
  • [21] Z. Li, C. Wang, M. Han, Y. Xue, W. Wei, L. Li, and L. Fei-Fei, “Thoracic disease identification and localization with limited supervision,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8290–8299.
  • [22] H. Liu, L. Wang, Y. Nan, F. Jin, and J. Pu, “SDFN: Segmentation-based deep fusion network for thoracic disease classification in chest x-ray images,” 2018.
  • [23] L. Yao, J. Prosky, E. Poblenz, B. Covington, and K. Lyman, “Weakly supervised medical diagnosis and localization from multiple resolutions,” CoRR, vol. abs/1803.07703, 2018.
  • [24] P. Rajpurkar, J. Irvin, R. L. Ball, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. P. Langlotz, B. N. Patel, K. W. Yeom, K. Shpanskaya, F. G. Blankenberg, J. Seekins, T. J. Amrhein, D. A. Mong, S. S. Halabi, E. J. Zucker, A. Y. Ng, and M. P. Lungren, “Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists,” PLOS Medicine, vol. 15, no. 11, pp. 1–17, 11 2018.
  • [25] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya et al., “Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” in Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
  • [26] J. K. Gohagan, P. C. Prorok, R. B. Hayes, and B.-S. Kramer, “The prostate, lung, colorectal and ovarian (PLCO) cancer screening trial of the national cancer institute: History, organization, and status,” Controlled clinical trials, vol. 21, no. 6, pp. 251S–272S, 2000.
  • [27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
  • [28] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
  • [29] R. H. H. M. Philipsen, P. Maduskar, L. Hogeweg, J. Melendez, C. I. Sánchez, and B. van Ginneken, “Localized energy-based normalization of medical images: Application to chest radiography,” IEEE Transactions on Medical Imaging, vol. 34, pp. 1965–1975, 2015.
  • [30] S. Dippel, M. Stahl, R. Wiemker, and T. Blaffert, “Multiscale contrast enhancement for radiographies: Laplacian pyramid versus fast wavelet transform,” IEEE Transactions on Medical Imaging, vol. 21, no. 4, pp. 343–353, April 2002.
  • [31] N. Lingayat and M. Tarambale, “A computer based feature extraction of lung nodule in chest X-ray image,” International Journal of Bioscience, Biochemistry and Bioinformatics, vol. 3, pp. 624–629, 2013.
  • [32] M. K. Gould, J. Donington, W. R. Lynch, P. J. Mazzone, D. E. Midthun, D. P. Naidich, and R. S. Wiener, “Evaluation of individuals with pulmonary nodules: When is it lung cancer?: Diagnosis and management of lung cancer, 3rd ed: American college of chest physicians evidence-based clinical practice guidelines,” Chest, vol. 143, no. 5, Supplement, pp. e93S–e120S, 2013.
  • [33] E. Kwadwo Kwakye Brakohiapa, B. Botwe, B. Dabo Sarkodie, E. Kwesi Ofori, and J. Coleman, “Radiographic determination of cardiomegaly using cardiothoracic ratio and transverse cardiac diameter: Can one size fit all? part one,” Pan African Medical Journal, vol. 27, 2017.
  • [34] Y. W. Kim and L. T. Mansfield, “Fool me twice: Delayed diagnoses in radiology with emphasis on perpetuated errors,” vol. 202, no. 3, 2014, pp. 465–470.
  • [35] L. Oakden-Rayner, “Exploring the chestxray14 dataset: problems,” accessed: 2017-12-18.
  • [36] A. Batliner, S. Steidl, C. Hacker, E. Nöth, and H. Niemann, “Tales of Tuning - Prototyping for Automatic Classification of Emotional User States,” in Proceedings of the 9th European Conference on Speech Communication and Technology, ISCA, Ed., Bonn, 2005, pp. 489–492.
  • [37] J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological Bulletin, vol. 76, pp. 378–, 11 1971.
  • [38] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977.