With the already high and most likely increasing demand, chest radiography is today the most common examination type in radiology departments [CQC2018]. As reported by [beardmore2016radiography], the average report turnaround time for plain X-ray is about 34 hours while have a turnaround time less than hours. In case of critical findings such as pneumothorax or pleural effusion, the integration of automated detection systems into the clinical workflow could have a substantial impact on the quality of care.
Recent developments in pathology classification focused mainly on specific aspects of Deep Learning (e.g. in terms of novel network architectures). Early on, Shin et al. [Shin2016] demonstrated that a Convolutional Neural Network (CNN) combined with a recurrent part can be applied for image captioning in chest X-rays. The increased availability of annotated chest X-ray datasets like ChestX-ray14 [Wang2017] helped to accelerate the progress in the field of pathology classification, detection and localization.
In this rapidly evolving field, Li et al. [Li2018] presented a unified network architecture for pathology classification and localization, while only limited annotation is needed for the localization. Cai et al. [Cai2018IterativeAM] proposed an attention mining method for disease localization which works without localization annotation. In the work of Wang et al. [wang2018tienet], a classification and reporting method – leveraging the radiologist report in addition to the image – was presented.
In this context, only very simple pre-processing steps have been employed. Motivated by prior work in the computer vision domain, these predominantly comprise intensity normalization and re-scaling of the image to the input size of the model. In contrast, several image processing methods have been developed over the last years to support radiologists in the diagnostic process. Two well-known techniques are bone suppression and lung field detection [von2016novel, von2016decomposing]; the former artificially removes the rib cage, facilitating the detection of pathologies that appear small, and the latter standardizes the viewing appearance. Multiple studies have shown the usefulness of such image processing methods for different diseases [Li2012]. An obvious question arises: do bone suppression and lung field detection have the same beneficial effect on disease classification with CNNs?
Toward this end, we investigate how bone suppression and lung field detection can be exploited as a pre-processing step for a CNN.
Using a methodology comparable to [Gordienko2018], we apply pre-processing in three different scenarios: first, processing each image with bone suppression; second, cropping the images to the detected lung fields; and finally, combining both processing steps. In contrast to [Gordienko2018], however, we use lung field detection to crop the images to the relevant area, whereas Gordienko et al. kept the image size unchanged and set regions not belonging to the lung fields to zero. We believe cropping can increase CNN performance because it increases the effective spatial resolution available to the CNN. Furthermore, we propose a novel ensemble architecture to leverage the complementary information in the different images, similar to a radiologist's workflow. Finally, to allow a detailed assessment of the impact on specific pathologies, two expert radiologists annotated the public Indiana dataset (Open-I) with respect to eight findings.
Following the method and training setup in [baltruschat2018comparison], we pre-trained a ResNet-50 architecture with a larger input size of
on ChestX-ray14. Compared to different network architectures and training strategies, the obtained model achieved the highest average AUC value in our previous experiments. Due to the focus on eight specific pathologies, we replaced the last dense layer of the converged model with a new dense layer having eight outputs and a sigmoid activation function. Furthermore, we applied a fine-tuning step in order to adapt the model to the new image domain.
2.1 Bone Suppression
In the original Indiana images, we suppress the bones (ribs and clavicles) using the method from [von2016novel, von2016decomposing]. The method preserves the remaining details originally overlaid by the bones (see Fig. (b)). In the reported reader study, the AUC for the detection and localization of lung nodules increased for experienced human readers when they used bone-suppressed images. Machine learning may also benefit from suppressing some normal anatomy, which is what we test here.
2.2 Lung Fields
Lung fields are segmented using a foveal CNN as described in [brosch2018foveal]. It is trained on semi-automatically annotated lung fields and applied to the Indiana images. After the initial lung field detection, we apply post-processing steps to determine the final crop area. First, we identify all connected regions and compute a bounding box around the two largest regions. Thereafter, we add a small border of pixels to the top/bottom and to the left/right. Each image is cropped to its individual bounding box as a pre-processing step (see Fig. (c)). Lung field cropping has two beneficial aspects: first, it reduces the information loss caused by downscaling, and second, it acts as a geometric image normalization. We also consider a combination of both bone suppression and lung field cropping (Fig. (d)).
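The post-processing described above can be illustrated with a minimal, dependency-free sketch. It assumes the segmentation is available as a binary mask (nested lists of 0/1) and uses 4-connectivity. The function names and the `border` parameter are hypothetical; the source does not state the border width or the connectivity, so both are assumptions here.

```python
from collections import deque


def connected_regions(mask):
    """Return all foreground regions as lists of (row, col), using 4-connectivity."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def lung_crop_box(mask, border=0):
    """Box (top, bottom, left, right), end-exclusive, around the two largest regions.

    `border` is a hypothetical parameter: the border width is not given in the text.
    """
    largest = sorted(connected_regions(mask), key=len, reverse=True)[:2]
    if not largest:
        raise ValueError("mask contains no foreground pixels")
    pixels = [p for region in largest for p in region]
    rows, cols = len(mask), len(mask[0])
    top = max(min(y for y, _ in pixels) - border, 0)
    bottom = min(max(y for y, _ in pixels) + border + 1, rows)
    left = max(min(x for _, x in pixels) - border, 0)
    right = min(max(x for _, x in pixels) + border + 1, cols)
    return top, bottom, left, right


def crop(image, box):
    """Crop a nested-list image to an end-exclusive box."""
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]
```

Keeping only the two largest regions discards small spurious detections before the bounding box is computed, and the box is clamped to the image so that the added border never leaves the image area.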
In many applications, combining different predictors can lead to improved classification results, which is known as ensemble forming [Hansen1990, krogh1995neural]. Ensembling can be done in several ways and with any number of predictors. The Pearson correlation coefficient between the outputs of the predictors can be used to estimate whether their combination is likely to improve results: ensembling highly correlated predictors will likely yield a smaller improvement than ensembling predictors with lower correlation.
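As a small illustration of this criterion, the following sketch computes the Pearson correlation coefficient between the outputs of two predictors on the same set of images. The function name and the flat list representation of the predictions are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant predictions")
    return covariance / math.sqrt(var_x * var_y)
```

A value close to 1 means that the two predictors respond almost identically to the same images, so averaging them adds little new information.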
Methods for ensemble generation include averaging and majority voting as well as machine learning algorithms such as Support Vector Machines (SVMs). Since an ensemble will typically outperform an individual model, we do not only compare individual models (each trained with a specific pre-processing) to an ensemble trained on the different images. We also compare an ensemble of models trained on images without pre-processing to an ensemble of models trained on pre-processed images. To limit the complexity of the experimental setup, we focus on the averaging approach.
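The averaging approach can be sketched as follows, assuming each model returns one probability per finding for every image. The nested-list layout (models, then images, then findings) and the function name are assumptions for illustration.

```python
def average_ensemble(model_outputs):
    """Average per-finding probabilities over models.

    model_outputs[m][i][j] is the probability that model m assigns to
    finding j on image i. The result has shape [images][findings].
    """
    if not model_outputs:
        raise ValueError("at least one model is required")
    n_models = len(model_outputs)
    n_images = len(model_outputs[0])
    if any(len(outputs) != n_images for outputs in model_outputs):
        raise ValueError("all models must score the same images")
    ensemble = []
    for i in range(n_images):
        n_findings = len(model_outputs[0][i])
        ensemble.append([
            sum(outputs[i][j] for outputs in model_outputs) / n_models
            for j in range(n_findings)
        ])
    return ensemble
```

Averaging has no trainable parameters, which keeps the comparison between the two ensembles independent of any additional fitting step.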
3 Indiana Dataset
The Indiana dataset from Open-I contains 3996 studies with DICOM images [Demner-Fushman2015]. In a first step, we created a revised dataset by removing studies with no associated images or labels (i.e. the reference annotation). Next, studies that lacked either the frontal or the lateral acquisition were removed. The final dataset consists of 3125 studies. Two expert radiologists from our department reviewed all cases and determined which findings are present, using the frontal as well as the lateral acquisition. As shown in Table 1, we selected eight different findings for annotation: pleural effusion, infiltrate, congestion, atelectasis, pneumothorax, cardiomegaly, mass, and foreign object.
Inter-observer variability is common in chest X-rays. Thus, after an individual assessment of the images, all disagreements were discussed and a final consensus annotation was found. Table 1 shows the distribution of each finding. All classes except pneumothorax have more than 100 positive cases, whereas pneumothorax has only eleven. In our final evaluation, we report the results for pneumothorax but do not discuss them because of the low number of positive cases.
We re-sampled 5 times from the entire Indiana dataset for an assessment of the generalization performance [Molinaro2005]. Each time, we split the data into 70% training and 30% testing. We calculated the average loss over all re-samples to estimate the best point for generalization. Finally, our results are calculated for each split on the test set and averaged afterwards.
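The re-sampling scheme can be sketched as repeated random 70/30 splits over study indices. The function name, the fixed seed, and rounding the training size down are assumptions for this example; the source does not state how the splits were drawn.

```python
import random


def resample_splits(n_studies, n_resamples=5, train_fraction=0.7, seed=0):
    """Draw repeated random train/test splits over study indices."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    n_train = int(train_fraction * n_studies)  # assumption: round down
    splits = []
    for _ in range(n_resamples):
        indices = list(range(n_studies))
        rng.shuffle(indices)
        splits.append((sorted(indices[:n_train]), sorted(indices[n_train:])))
    return splits
```

Each split is drawn independently from the full dataset, so a study can appear in the test set of several re-samples. A metric is then computed on every test set and averaged over the re-samples.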
4 Experiments and Results
Implementation: Following the experimental setup in [baltruschat2018comparison], we employed an adapted ResNet-50, which is tailored to the X-ray domain. After replacing the dense layer, the model was fine-tuned using the Indiana dataset. For training, we sample various sized patches of the image with sizes between and of the image area. The patch aspect ratio is distributed evenly between and . In addition, each image is randomly flipped horizontally and randomly rotated between . At testing, we resize images to and use an averaged five-crop (i.e. center and all four corners) evaluation. In all experiments, we use ADAM [Kingma2015] as optimizer with default parameters for and . The learning rate is set to . During training, we reduce the learning rate by a factor of 2 when the validation loss does not improve. We use a batch size of 15 and binary cross entropy as the loss function. The models are implemented in CNTK and trained on GTX 1080Ti GPUs.
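The averaged five-crop evaluation can be sketched as follows. The image is assumed to be a nested list that has already been resized, `predict` is a hypothetical callable that maps a patch to a list of per-finding probabilities, and `crop_size` is a placeholder because the resize and crop sizes are not recoverable from the text.

```python
def five_crop_offsets(height, width, crop_size):
    """(top, left) offsets of the center crop and the four corner crops."""
    if crop_size > height or crop_size > width:
        raise ValueError("crop does not fit into the image")
    max_top, max_left = height - crop_size, width - crop_size
    return [
        (max_top // 2, max_left // 2),  # center
        (0, 0),                         # top-left
        (0, max_left),                  # top-right
        (max_top, 0),                   # bottom-left
        (max_top, max_left),            # bottom-right
    ]


def five_crop_predict(image, crop_size, predict):
    """Average the outputs of `predict` over the five crops of `image`."""
    height, width = len(image), len(image[0])
    outputs = []
    for top, left in five_crop_offsets(height, width, crop_size):
        patch = [row[left:left + crop_size] for row in image[top:top + crop_size]]
        outputs.append(predict(patch))
    n_findings = len(outputs[0])
    return [sum(o[j] for o in outputs) / len(outputs) for j in range(n_findings)]
```

Averaging over the five crops lets a model with a fixed input size see the full resized image at test time while smoothing out the effect of any single crop position.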
We perform six different experiments based on our proposed image pre-processing (Sections 2.1 and 2.2). First, we train on normal images (i.e. no pre-processing is employed), bone-suppressed images, lung-cropped images, and images combining both pre-processing steps. Second, we build an ensemble of four models trained on normal images, "EN-normal", as a baseline ensemble. Finally, we use our models trained on pre-processed images to build an ensemble, "EN-pre-processed".
Table 2: AUC result overview for all our experiments. We present results averaged over all 5 splits together with the standard deviation (SD) for each finding. Furthermore, the average (AVG) AUC over all findings is shown. We trained our model with four different input images: normal images; "BS", bone-suppressed images; "Lung", images cropped to the lung fields; and "BS+Lung", bone-suppressed images cropped to the lung fields. In addition, we formed an ensemble of models trained on normal images, "EN-normal", and an ensemble of the models trained on pre-processed images, "EN-pre-processed". Bold text emphasizes the overall highest AUC value. We excluded pneumothorax because of the low positive count.
To compare our experiments with each other, we calculate the area under the ROC curve (AUC). The reported AUC results are averaged over all re-samples and presented with the standard deviation (SD). For our ensemble experiment, we calculate the Pearson correlation coefficient between each model trained on normal images and our models trained on pre-processed images.
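A minimal sketch of this evaluation is given below. It computes the AUC of one finding through its rank interpretation (the probability that a positive case receives a higher score than a negative case, with ties counted as one half) and summarizes the per-split values by mean and SD. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the source does not say which estimator was used.

```python
import statistics


def roc_auc(labels, scores):
    """AUC for one finding from binary labels (0/1) and predicted scores."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Count correctly ordered positive/negative pairs; ties count as one half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def mean_and_sd(auc_per_split):
    """Mean and sample standard deviation of the AUC over all re-samples."""
    return statistics.mean(auc_per_split), statistics.stdev(auc_per_split)
```

The pairwise formulation is quadratic in the number of cases, which is acceptable for a dataset of this size but would be replaced by a sort-based computation for larger test sets.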
First, we look at the AUC performance of our experiments with the different pre-processed images. In all experiments, we note that five out of seven relevant classes have a high AUC of above . Two of those five, pleural effusion and cardiomegaly, even have an AUC of above . Only the classes mass and foreign object have an AUC below . Comparing the model using bone suppression to the model trained on normal images, the AUC for foreign object increased significantly from to with respect to the reported standard deviation (SD). The model trained with lung cropping has a higher AUC in all classes and often a reduced SD compared to the baseline. However, only for the class mass did the AUC increase significantly, from to . We argue that the increased spatial resolution of lung-cropped images helps the model to better detect small masses. This is in line with the observation of our radiologists, who reported an increased number of small masses. Combining both pre-processing steps results in the highest AUC for mass and increases the AUC by . We observe no significant changes for the other classes.
Second, we build two ensembles: EN-normal and EN-pre-processed. EN-normal refers to our ensemble of four models trained on images without pre-processing, whereas EN-pre-processed is an ensemble of one normal, one BS, one lung cropping, and one combined model. In Fig. 2, we report the Pearson correlation coefficients for the normal and the pre-processed ensemble. As expected, the four normal models are already highly correlated (i.e. values around 96), except for one model which seems to have converged to a different optimum. In comparison, the correlation coefficients between the models trained on pre-processed images and the models trained on normal images are lower, at only around 85. This indicates that a pre-processed ensemble can have a high impact on our results. We verify this hypothesis with the AUC results in Table 2. The pre-processed ensemble increases the AUC for mass, foreign object, and atelectasis significantly with respect to the reported SD, whereas the normal ensemble does not. Overall, the pre-processed ensemble yields the best AUC results in five of seven classes.
In this contribution, we investigated the effect of two advanced pre-processing methods for multi-label disease classification on chest radiographs: bone suppression and lung field detection. In a systematic evaluation, we showed and discussed the superior performance of models trained on pre-processed images. The best performance was achieved by a novel ensemble architecture leveraging all the information from the different pre-processing methods. Significant AUC improvements for specific classes such as foreign object and mass have been achieved, but further work is needed before clinical application. Our future work will include a detailed investigation of clinical application scenarios and the integration of disease segmentation for multi-label classification.