Heart failure (HF) is highly prevalent: according to data from Europe and the United States, its prevalence ranges from 1% to 14% of the population [dunlay2017epidemiology]. One clinical factor affecting the diagnosis of HF is cardiomegaly [dunlay2017epidemiology], a condition characterized by an enlarged heart. In clinical practice, assessment of cardiomegaly is straightforward for a human expert (radiologist) and is typically done visually. However, there are multiple clinical scenarios in which a radiologist is not available, for example in emergency care or intensive care units.
A clinically accepted quantitative measure of cardiomegaly is the cardiothoracic index (CTI) – the ratio of the heart's width to the lungs' width. In the literature, CTI is also often called the cardiothoracic ratio (CTR). CTR can be measured from chest radiographs, which constitute over half of the radiographic imaging done in clinical practice [dai2018scan].
Multiple recent studies have demonstrated promising results in assessing chest and other radiographs by applying Deep Learning (DL) [tiulpin2019multimodal, tiulpin2018automatic, wang2019grey]. These efforts indicate the possibility of reducing the amount of human effort needed for visual image assessment. Ultimately, this technology has the potential to reduce health care costs while maintaining the same quality of diagnosis [saba2019present].
DL is a methodology for learning hierarchical representations directly from data [schmidhuber2015deep]. Typically, these representations (features) are learned with respect to a task, such as image classification or segmentation. The latter classifies image pixels individually, eventually yielding the locations and boundaries of the objects within an image. DL-based image segmentation has been shown to be a core technique for assessing CTR from chest X-rays [dong2018unsupervised, li2019automatic]. However, none of the existing CTR assessment or chest X-ray segmentation methods provide an estimate of model uncertainty, which is crucial in clinical practice.
In this paper, we propose a robust Bayesian segmentation-based method for CTR estimation which predicts pixel-wise class labels together with a measure of model uncertainty. Our approach is based on a Feature Pyramid Network (FPN) [lin2017feature, seferbekov2018feature] with a ResNet-50 backbone [he2016deep] and instance normalization in the decoder [ulyanov2016instance]. For uncertainty estimation, we follow [kendall2015bayesian] and utilize Monte-Carlo (MC) dropout at test time. The proposed approach is illustrated schematically in Fig. 1.
The main contributions of this paper are:
We extend traditional DL-based methods for CTR estimation to a Bayesian neural network that predicts pixel-wise class labels together with uncertainty bounds for the segmentation masks.
In contrast to previous studies, we propose a challenging training dataset with diverse radiological findings, annotated by a radiologist.
The model evaluation is performed on three widely used public X-ray image datasets, which were re-annotated in a manner consistent with our training dataset but come from different scanners and hospitals.
To the best of our knowledge, this is the first work that uses Bayesian DL in both the chest X-ray segmentation and CTR estimation domains. Our methodology allows assessing the uncertainty of the model at test time, thereby providing clinical value in potential applications.
Finally, we publicly release the annotations and the training dataset utilized in this study. We believe that these data could establish a new, more challenging benchmark in chest X-ray segmentation.
2 Related Work
Chest X-ray Segmentation.
The studies most relevant to ours are those by Dong et al. [dong2018unsupervised], Dai et al. [dai2018scan] and Eslami et al. [eslami2019image]. They introduced adversarial training to enforce consistency between the predictions and the ground truth annotations. These studies explore the same methodology; however, while the first is mainly focused on CTR estimation and uses adversarial training for unsupervised domain adaptation (UDA), the others rather target the segmentation of chest X-rays. The methods demonstrate that better generalization to unseen data can be achieved by using adversarial training.
Besides the CTR estimation realm, there are other studies approaching the segmentation of chest X-ray images by applying DL. Arbabshirani et al. [arbabshirani2017accurate] and recent works [chen2018semantic, souza2019automatic] demonstrated remarkable performance in lung segmentation. Furthermore, Wessel et al. [wessel2019] utilized Mask R-CNN [mask_rcnn] to successfully localize, segment and label individual ribs.
From the point of view of the segmentation field in general, there exist multiple studies that use FPN as a decoder for image segmentation [gao2018end, kirillov2019panoptic, rakhlin2019breast]. In particular, the study by Seferbekov et al. [seferbekov2018feature] explores an architecture very similar to ours and appears to be the first to successfully apply a combination of an ImageNet-pretrained ResNet-50 encoder with an FPN decoder to image segmentation.
In Bayesian segmentation, we note the study by Kendall et al. [kendall2015bayesian], which introduced the use of MC dropout for uncertainty estimation in image segmentation. Furthermore, the recent study by Mukhoti and Gal [mukhoti2018evaluating] proposed a modification of DeepLab-v3+ [chen2018encoder] that achieved state-of-the-art segmentation results while also providing uncertainty estimates.
Limitations of the Existing Chest X-ray Segmentation Datasets.
Here, we also address an important issue with the existing annotations and images in the CTR assessment and chest X-ray segmentation realm. In particular, all the aforementioned DL-based CTR estimation methods have been trained on datasets that do not include the true boundaries of the anatomical structures within the chest X-rays. While this does not have a significant impact on CTR estimation per se, the absence of the true boundaries of the heart and lungs limits the scope of applications that can be built on top of automatic chest X-ray segmentation (e.g. detection of pleural effusion). Moreover, the existing datasets originate from the tuberculosis (TB) domain, which also limits reliable testing of segmentation and CTR assessment models. We argue that clinically applicable methods must be trained and tested on datasets with diverse radiological findings.
The proposed method is based on a combination of several state-of-the-art techniques for image segmentation. In particular, our approach leverages an encoder pre-trained on ImageNet [deng2009imagenet], a Feature Pyramid Network-inspired decoder [lin2017feature, seferbekov2018feature], instance normalization [ulyanov2016instance] and MC dropout at test time [kendall2015bayesian]. The architecture of the proposed approach is illustrated in Fig. 2.
We used a standard ResNet-50 pre-trained on ImageNet [deng2009imagenet]. We do not freeze the encoder during training; instead, we fine-tune it jointly with the rest of the network from the beginning. It is worth noting that our pre-trained encoder is preceded by a Batch Normalization (BN) layer that learns the mean of the dataset during training. Furthermore, we inserted dropout layers (with a fixed rate) before the second, third and fourth residual blocks of ResNet-50 (see Fig. 2).
The decoder is a standard FPN. However, similarly to Seferbekov et al. [seferbekov2018feature], we do not use intermediate supervision and per-level predictions in the feature pyramid, as is usually done for object detection [lin2017feature]. Since the sizes of the feature pyramid levels and the input image do not match, we use nearest-neighbour upsampling. In contrast to [seferbekov2018feature], we replace each BN layer in the decoder with an instance normalization (IN) layer at every level of the feature pyramid.
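To make the normalization choice concrete, the following minimal numpy sketch (our illustration, not the actual Keras layers; learnable scale and shift parameters are omitted, and the tensor shapes are hypothetical) contrasts instance and batch normalization:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization: each (sample, channel) plane is normalized
    over its own spatial dimensions, independently of the rest of the batch."""
    mean = x.mean(axis=(2, 3), keepdims=True)  # per-sample, per-channel mean
    var = x.var(axis=(2, 3), keepdims=True)    # per-sample, per-channel variance
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Batch normalization: statistics are shared across the whole mini-batch."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Unlike BN, the IN output for one image does not depend on which other
# images happen to be in the same mini-batch.
x = np.random.randn(4, 8, 16, 16)  # (batch, channels, height, width)
y = instance_norm(x)
```

This batch-independence is one motivation for IN in segmentation decoders trained with small mini-batches.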
In addition to the dropout units in the backbone, we also apply dropout after each convolutional block of the feature pyramid, as illustrated in Fig. 2. More specifically, in the decoder, dropout is used only before the upsampling layers.
As the task of segmenting chest X-ray structures is multi-label rather than multi-class, the decoder has two output planes: the first corresponds to the heart and the second to the lungs. Before the final output layer, we apply spatial dropout.
Bayesian Segmentation Framework: Monte-Carlo Dropout.
As mentioned previously, we leverage the MC-dropout technique [gal2016dropout, kendall2015bayesian]. To capture the model's uncertainty, it is necessary to estimate the posterior distribution \(p(W \mid X, Y)\) of the model's weights \(W\) given the training images \(X\) and the corresponding labels \(Y\). However, this distribution is intractable and is therefore replaced by a variational approximation \(q(W)\) [mukhoti2018evaluating]. Gal and Ghahramani [gal2016dropout] have shown that training a neural network with dropout and a standard cross-entropy loss function minimizes the Kullback-Leibler (KL) divergence \(\mathrm{KL}\left(q(W) \,\|\, p(W \mid X, Y)\right)\) between \(q(W)\) and \(p(W \mid X, Y)\), where \(q(W)\) is chosen to be a Bernoulli distribution.
In our experiments, we enabled the dropout layers in both the encoder and the FPN decoder. We then performed sampling of pixel-wise probability masks similarly to Kendall et al. [kendall2015bayesian]. Here, for every pixel of the input image \(x\), we generate a prediction at each of \(T\) MC-dropout iterations and eventually estimate the mean prediction to produce the segmentation masks \(M_h\) and \(M_l\) for the heart and the lungs, respectively.
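The test-time sampling procedure can be sketched in a few lines of numpy. The single-layer pixel-wise "classifier" below is a hypothetical stand-in for the real network; only the MC-dropout mechanics (dropout kept active at test time, T stochastic passes, averaged probabilities) mirror the method described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_forward(x, w, p=0.5):
    """One MC-dropout forward pass of a toy pixel-wise classifier.

    x: features of shape (H, W, F); w: weights of shape (F,). Dropout stays
    ACTIVE at test time: a fresh Bernoulli mask is drawn on every pass."""
    mask = rng.random(x.shape[-1]) >= p       # keep each feature with prob 1-p
    logits = (x * mask).dot(w) / (1.0 - p)    # inverted-dropout scaling
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid -> per-pixel probability

def mc_dropout_predict(x, w, T=20, threshold=0.5):
    """Average T stochastic passes and threshold the mean probability."""
    samples = np.stack([stochastic_forward(x, w) for _ in range(T)])  # (T, H, W)
    p_mean = samples.mean(axis=0)
    return p_mean >= threshold, samples       # binary mask + raw samples

x = rng.normal(size=(8, 8, 16))
w = rng.normal(size=16)
mask, samples = mc_dropout_predict(x, w, T=20)
```

In the actual pipeline, this sampling is done per output plane (heart and lungs), and the raw samples are reused for the uncertainty maps.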
Bayesian Segmentation Framework: Aleatoric and Epistemic Uncertainties.
Besides the segmentation masks, the proposed framework also produces per-pixel uncertainty estimates. As such, we computed both aleatoric and epistemic uncertainties. Briefly, the former captures the inherent noise in the data (sensor noise), while the latter reflects the model's uncertainty. Both of these uncertainties are important: aleatoric uncertainty allows one to estimate the need for improving sensor precision, and epistemic uncertainty helps assess the need for a larger training dataset [mukhoti2018evaluating]. Similarly to Mukhoti and Gal [mukhoti2018evaluating], we approximated the aleatoric uncertainty for a test example \(x\) given the training data \((X, Y)\) as the predictive entropy \(\mathbb{H}\):

\[\mathbb{H}[y \mid x, X, Y] = -\sum_{c} p(y = c \mid x, X, Y) \log p(y = c \mid x, X, Y),\]
and the epistemic uncertainty was approximated as the mutual information \(\mathbb{I}\) between the predictive distribution and the posterior over the model's weights:

\[\mathbb{I}[y, W \mid x, X, Y] = \mathbb{H}[y \mid x, X, Y] - \mathbb{E}_{p(W \mid X, Y)}\big[\mathbb{H}[y \mid x, W]\big].\]
In our experiments, we estimated both \(\mathbb{H}\) and \(\mathbb{I}\) for the heart and, similarly, for the lungs in each image. For visualization purposes, we displayed the summed entropy and mutual information masks, respectively.
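Given the MC-dropout probability samples, both uncertainty maps can be estimated directly. The numpy sketch below (a simplified illustration for a single binary class plane, with randomly generated stand-in probabilities) computes the per-pixel predictive entropy and mutual information:

```python
import numpy as np

def predictive_entropy(samples, eps=1e-12):
    """Entropy of the mean predictive distribution, H[y | x, X, Y].

    samples: MC-dropout probabilities of shape (T, H, W) for one class."""
    p = samples.mean(axis=0)
    return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

def mutual_information(samples, eps=1e-12):
    """I[y, W | x, X, Y]: entropy of the mean prediction minus the mean
    of the per-sample entropies (epistemic uncertainty)."""
    per_sample = -(samples * np.log(samples + eps)
                   + (1 - samples) * np.log(1 - samples + eps))
    return predictive_entropy(samples, eps) - per_sample.mean(axis=0)

rng = np.random.default_rng(1)
samples = rng.uniform(0.01, 0.99, size=(20, 8, 8))  # fake MC probabilities
H = predictive_entropy(samples)   # per-pixel entropy map
I = mutual_information(samples)   # per-pixel epistemic map
```

Because binary entropy is concave, the mutual information estimate is non-negative by Jensen's inequality: it is large only where the individual samples disagree.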
Following [dong2018unsupervised], we use the same method for CTR estimation. More specifically, CTR is defined as the ratio of the widest diameter of the heart to the widest diameter of the lungs, as illustrated in Fig. 3.
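As a sketch of this measurement, the snippet below (an illustrative simplification, not the exact procedure of [dong2018unsupervised]: it takes the widest horizontal extent of each predicted binary mask) computes CTR in numpy:

```python
import numpy as np

def widest_extent(mask):
    """Widest horizontal extent (in pixels) of a binary mask."""
    cols = np.where(mask.any(axis=0))[0]   # columns containing the object
    return 0 if cols.size == 0 else cols[-1] - cols[0] + 1

def ctr(heart_mask, lungs_mask):
    """Cardiothoracic ratio: widest heart diameter / widest lung diameter."""
    return widest_extent(heart_mask) / widest_extent(lungs_mask)

# Toy masks: heart spans 4 columns, lungs span 10 columns -> CTR = 0.4
heart = np.zeros((12, 12), bool); heart[5:8, 4:8] = True
lungs = np.zeros((12, 12), bool); lungs[2:10, 1:11] = True
print(ctr(heart, lungs))  # -> 0.4
```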
Training of the model.
During training, we used a combination of the binary cross-entropy (BCE) and soft-Jaccard (\(J\)) losses, as done in various other studies [seferbekov2018feature]:

\[\mathcal{L}(W) = \sum_{c \in \{h, l\}} \Big[\mathrm{BCE}\big(M_c, \hat{M}_c(x; W)\big) - \log J\big(M_c, \hat{M}_c(x; W)\big)\Big],\]

where \(W\) are the model's weights, \(x\) are the images, \(M_h\) and \(M_l\) are the ground truth masks for the heart and the lungs, respectively, and \(\hat{M}_h\) and \(\hat{M}_l\) are the corresponding predictions.
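A numpy sketch of this loss is given below. The exact weighting between the BCE and Jaccard terms is a training detail not restated here; the sketch assumes the common BCE − log J combination summed over the two output planes:

```python
import numpy as np

def bce(t, p, eps=1e-7):
    """Binary cross-entropy averaged over pixels."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def soft_jaccard(t, p, eps=1e-7):
    """Differentiable (soft) Jaccard index computed on probabilities."""
    inter = np.sum(t * p)
    union = np.sum(t) + np.sum(p) - inter
    return (inter + eps) / (union + eps)

def segmentation_loss(y_true, y_pred):
    """BCE combined with -log(soft Jaccard), summed over the two output
    planes (plane 0: heart, plane 1: lungs)."""
    total = 0.0
    for c in range(y_true.shape[-1]):
        t, p = y_true[..., c], y_pred[..., c]
        total += bce(t, p) - np.log(soft_jaccard(t, p))
    return total

# Toy ground truth: two binary planes on an 8x8 grid.
y_true = np.zeros((8, 8, 2))
y_true[2:6, 2:6, 0] = 1.0  # "heart"
y_true[1:7, 1:7, 1] = 1.0  # "lungs"
```

The -log J term directly rewards overlap with the ground truth mask, which BCE alone does not enforce for imbalanced foreground/background ratios.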
During training, we used various data augmentation techniques to improve robustness and reduce overfitting. In particular, we used random-sized cropping and rotation as our main augmentations, applied with a fixed probability per image. Noise addition, blur and sharpening were used as secondary augmentations. Finally, elastic distortions, random brightness, JPEG compression and horizontal flips were applied with a small probability to regularize the model on hard cases.
All our models were trained at the same image resolution, with fixed mini-batch size, learning rate and spatial dropout rate. Our experiments were conducted in Keras [chollet2015keras], with the Albumentations library [buslaev2018albumentations] used for data augmentation.
4 Experiments and Results
Our training set was derived from the ChestXray14 dataset [Wang_2017_CVPR]. The original dataset includes chest radiographs from a large number of patients, while the data used in this study were randomly sampled from it. For training, we used images from distinct patients; images from a separate group of patients were used for validation, and the remaining images from the held-out patients were used as a test set.
The proposed multi-label dataset has the following findings and labels (the number of samples per label is given in parentheses): Cardiomegaly (), Emphysema (), Effusion (), Hernia (), Infiltration (), Nodule (), Atelectasis (), Mass (), Pneumothorax (), Pneumonia (), Pleural thickening (), Fibrosis (), Consolidation (), Edema (). There are also samples with no findings. Such a label distribution makes our data more challenging for segmentation compared to previous studies.
It is worth mentioning that the original data were provided with labels mined from radiology reports; however, the dataset did not include any segmentation masks. Our radiologist (radiologist A) annotated the training, validation and test sets. An example of the annotations is presented in Fig. 4. Compared to the other existing datasets illustrated in the same figure, our annotations delineate the true anatomical contours, which makes the segmentation task more challenging.
4.2 Auxiliary Test Datasets
Re-annotation of the existing datasets
We evaluated our method on three auxiliary test sets, each derived from an independent public dataset (see below). The original annotations for these datasets did not include the true boundaries of the lungs where they are overlapped by the heart, or lacked annotations of the heart altogether. To evaluate the performance of our method, trained on our dataset with true lung boundaries, a practicing radiologist (radiologist B), with experience comparable to that of radiologist A, annotated randomly selected images from each of the auxiliary test datasets. The comparison between the original annotations in the test datasets and ours is presented in Fig. 4.
Japanese Society of Radiological Technology (JSRT) Set.
This dataset was first released in [shiraishi2000development] and complemented by the annotations from [van2002automatic]. It contains 247 radiographs in total: 154 with and 93 without lung nodules.
Montgomery County X-ray Set.
This dataset contains chest radiographs acquired by the Department of Health and Human Services, Montgomery County, Maryland, USA [jaeger2014two]. It includes subjects both with and without tuberculosis (TB).
Shenzhen Hospital X-ray Set.
The Shenzhen dataset is similar to the Montgomery dataset, as it includes data from subjects with and without TB. These data were acquired at Shenzhen No.3 Hospital in Shenzhen, Guangdong province, China [jaeger2014two].
4.3 Ablation Study
This section describes the ablation study conducted on deterministic models; after selecting the best deterministic model, we inserted dropout as shown in Fig. 2. We first investigated which combination of encoder and decoder is best suited to our task and subsequently analyzed different types of normalization in the decoder.
Backbone and Decoder
Recent advances demonstrate that transfer learning is helpful in image segmentation: the use of encoders pre-trained on ImageNet [deng2009imagenet] is a core technique in all current state-of-the-art segmentation networks [kirillov2019panoptic]. In our study, we evaluated multiple pre-trained backbones with state-of-the-art decoders, namely U-Net [ronneberger2015u], FPN [lin2017feature], LinkNet [linknet] and PSPNet [zhao2017pspnet] (Tab. 1).
Our experiments demonstrate that in lung segmentation, ResNet-101 and ResNet-50 with the FPN decoder yielded the best and second-best results, respectively (Tab. 1(a)). In heart segmentation, the ResNet-50 backbone outperformed ResNet-101, and these two encoders with the FPN decoder achieved the best and second-best results, respectively. Therefore, for simplicity and speed of computation, we selected ResNet-50 with FPN as our main configuration.
Normalization in the Decoder
Once the best configuration had been identified, we assessed the influence of normalization in the decoder. In particular, we investigated whether replacing batch normalization with group or instance normalization affects the performance of our model. These experiments are presented in Tab. 2. The results demonstrate that instance normalization achieves better performance than the other normalization types.
4.4 Test Set Results
Optimal Number of Monte-Carlo Samples
In our experiments, we assessed the influence of MC dropout on the performance of our segmentation and CTR estimation pipeline. As such, we computed the aggregated IoU values against the heart and lung ground truth masks. Besides, we also computed the Pearson correlation between the CTR values derived from the ground truth and predicted masks for different numbers of MC samples. These results are visualized in Fig. 5. The plot shows that the optimal number of samples on all datasets, with respect to both IoU and CTR correlation, is close to 20. We use this number for the further evaluation of the developed method.
For the optimal number of MC samples (20, according to Fig. 5), we performed a quantitative evaluation of our model on all the test datasets, namely ChestXray14, JSRT, Shenzhen and Montgomery. Besides computing the IoU, similarly to the previous paragraph, we also computed the Pearson correlation between the CTR values derived from the manually annotated masks and from the predictions. These results are presented in Tab. 3.
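Both evaluation metrics are straightforward to compute from the predicted masks and CTR values; below is a minimal numpy sketch, assuming binary masks and paired sequences of CTR measurements:

```python
import numpy as np

def iou(gt, pred):
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return inter / union if union else 1.0

def pearson_r(a, b):
    """Pearson correlation between two sequences (e.g. of CTR values)."""
    return np.corrcoef(a, b)[0, 1]

# Toy example: a predicted mask shifted by one pixel against the ground truth.
gt = np.zeros((10, 10), bool); gt[2:6, 2:6] = True
pred = np.zeros((10, 10), bool); pred[2:6, 3:7] = True
overlap = iou(gt, pred)  # 12 shared pixels / 20 in the union = 0.6
```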
In Fig. 6, we visualize examples of segmentation and uncertainty estimates (both aleatoric and epistemic). The proposed method achieves accurate results, demonstrating good overall segmentation performance. However, we note that our model does not capture the sharp corners of the lungs. Furthermore, the epistemic uncertainty maps show that our model is not confident in the bottom part of the lungs, as this region is typically very difficult to annotate and can hardly be distinguished in the images.
In this paper, we developed a novel approach for automatic segmentation of chest X-rays and assessment of CTR. Our approach is a modified FPN with a ResNet-50 backbone and MC dropout. In an extensive experimental evaluation, we found that the proposed configuration with instance normalization in the decoder yielded the best results among the investigated network configurations. Moreover, it is worth noting that, for the first time in the CTR estimation realm, we proposed to assess CTR using Bayesian deep learning.
In this paper, we focused not only on developing a state-of-the-art method for segmenting chest X-rays, but also tackled the issues of annotating these data and the availability of reliably annotated training and test sets. As such, we proposed multiple new datasets annotated by radiologists, which we plan to publicly release to facilitate further research.
Despite its advantages, this study still has some limitations. In particular, we did not experiment with training the models from scratch and relied on transfer learning. The second limitation is that we did not compare our method to state-of-the-art unsupervised domain adaptation approaches [dong2018unsupervised, chen2018semantic, eslami2019image]. However, such a comparison would require re-implementing the previously presented methods, since our annotations for all the test sets differ from those used in previously published work; we leave this for future work. Another important limitation is that the annotators of the test data differ: one radiologist (radiologist A) annotated the training and test sets derived from the ChestXray14 dataset [Wang_2017_CVPR], while another radiologist (radiologist B) annotated the images from the JSRT [shiraishi2000development], Montgomery and Shenzhen datasets [jaeger2014two]. While we believe this particular limitation has an insignificant impact on our results, we still plan to assess the inter-rater agreement between the annotators.
To conclude, this paper introduced a novel, more challenging setting for segmenting organs in chest X-rays and proposed a Bayesian modification of FPN that estimates CTR with uncertainty bounds using MC dropout. We believe that the proposed approach has multiple applications in clinical practice; for example, it could be useful for quantitative monitoring of CTR in patients with cardiomegaly in intensive care units. Another interesting application is image quality assessment, since our model predicts the aleatoric uncertainty for every test image. Finally, for the benefit of the community, we publicly release our dataset, the implementation of our method and the pre-trained models at http://will.be.placed.after.review.