Bayesian Feature Pyramid Networks for Automatic Multi-Label Segmentation of Chest X-rays and Assessment of Cardio-Thoracic Ratio

08/08/2019 ∙ by Roman Solovyev, et al.

Cardiothoracic ratio (CTR) estimated from chest radiographs is a marker indicative of cardiomegaly, the presence of which is among the criteria for heart failure diagnosis. Existing methods for automatic assessment of CTR are driven by Deep Learning-based segmentation. However, these techniques produce only point estimates of CTR, whereas clinical decision making typically takes uncertainty into account. In this paper, we propose a novel method for automatic chest X-ray segmentation and CTR assessment. In contrast to the previous art, we, for the first time, propose to estimate CTR with uncertainty bounds. Our method is based on a Deep Convolutional Neural Network with a Feature Pyramid Network (FPN) decoder. We propose two modifications of FPN: replacing batch normalization with instance normalization, and injecting dropout, which makes it possible to obtain Monte-Carlo estimates of the segmentation maps at test time. Finally, using the predicted segmentation mask samples, we estimate CTR with uncertainty. In our experiments we demonstrate that the proposed method generalizes well to three different test sets. Finally, we make the annotations produced by two radiologists for all our datasets publicly available.




1 Introduction

Heart failure (HF) is highly prevalent in different populations: its prevalence varies from 1% to 14% of the population according to data from Europe and the United States [dunlay2017epidemiology]. One clinical factor that has an impact on the diagnosis of HF is cardiomegaly [dunlay2017epidemiology], a condition in which the heart is enlarged. In clinical practice, assessment of cardiomegaly is trivial for a human expert (radiologist) and is typically done visually. However, there are multiple clinical scenarios in which a radiologist is not available, for example in emergency care or intensive care units.

A clinically accepted quantitative measure of cardiomegaly is the cardiothoracic index (CTI), the ratio of the heart's width to the lungs' width. In the literature, CTI is also often called the cardiothoracic ratio (CTR). CTR can be measured from chest radiographs, which constitute over half of the radiographic imaging done in clinical practice [dai2018scan].

Multiple recent studies have demonstrated promising results in assessing chest and other radiographs by applying Deep Learning (DL) [tiulpin2019multimodal, tiulpin2018automatic, wang2019grey]. These efforts indicate a possibility of reducing the amount of human effort needed for visual image assessments. Ultimately, this technology has potential to reduce the health care costs while keeping the same quality of diagnosis [saba2019present].

DL is a methodology of learning hierarchical representations directly from data [schmidhuber2015deep]. Typically, these representations (features) are learned with respect to the task, such as image classification or segmentation. The latter classifies image pixels individually and thereby yields the locations and boundaries of the objects within an image. DL-based image segmentation was shown to be a core technique in assessing CTR from chest X-rays [dong2018unsupervised, li2019automatic]. However, none of the existing CTR assessment or chest X-ray segmentation methods provide the model uncertainty, which is crucial in clinical practice.

In this paper, we propose a robust Bayesian segmentation-based method for CTR estimation which predicts pixel-wise class labels with a measure of model uncertainty. Our approach is based on a Feature Pyramid Network (FPN) [lin2017feature, seferbekov2018feature] with a ResNet-50 backbone [he2016deep] and instance normalization in the decoder [ulyanov2016instance]. For uncertainty estimation, we follow [kendall2015bayesian] and utilize Monte-Carlo (MC) dropout at test time. The proposed approach is illustrated schematically in Fig. 1.

The main contributions of this paper are:

  1. We extend traditional DL-based methods for CTR estimation to a Bayesian neural network that predicts pixel-wise class labels and derives uncertainty bounds from segmentation masks.

  2. Compared to all the previous studies, we propose a challenging training dataset with diverse radiological findings annotated by a radiologist.

  3. The model evaluation is performed on 3 widely-used public X-ray image datasets which were re-annotated in a similar way to our training dataset, but come from different scanners and hospitals.

  4. To the best of our knowledge, this is the first work that uses Bayesian DL in both the chest X-ray segmentation and CTR estimation domains. Our methodology makes it possible to assess the uncertainty of the model at test time, thereby providing clinical value in potential applications.

  5. Finally, we publicly release the annotations and the training dataset utilized in this study. We think that these data could set up a new, more challenging benchmark in chest X-ray segmentation.

2 Related Work

Chest X-ray Segmentation.

The studies most relevant to ours are by Dong et al. [dong2018unsupervised], by Dai et al. [dai2018scan] and also by Eslami et al. [eslami2019image]. They introduced adversarial training to enforce consistency between the predictions and the ground truth annotations. Both studies explore the same methodology, but while the former is mainly focused on CTR estimation and uses adversarial training for unsupervised domain adaptation (UDA), the latter rather targets segmentation of chest X-rays. These methods demonstrate that better generalization to unseen data can be achieved by using adversarial training.

Beyond CTR estimation, there are other studies that approach the segmentation of chest X-ray images by applying DL. Arbabshirani et al. [arbabshirani2017accurate] and recent works [chen2018semantic, souza2019automatic] demonstrated remarkable performance in lung segmentation. Furthermore, Wessel et al. [wessel2019] utilized Mask R-CNN [mask_rcnn] to successfully localize, segment and label individual ribs.

In the segmentation field in general, there exist multiple studies that use FPN as a decoder for image segmentation [gao2018end, kirillov2019panoptic, rakhlin2019breast]. In particular, the study by Seferbekov et al. [seferbekov2018feature] explores an architecture very similar to ours and seems to be the first to demonstrate a combination of an ImageNet-pretrained ResNet-50 encoder with an FPN decoder successfully applied to image segmentation.

In Bayesian segmentation, we note the study by Kendall et al. [kendall2015bayesian] that introduced the use of MC dropout for uncertainty estimation in image segmentation. Furthermore, the recent study by [mukhoti2018evaluating] proposed a modification of DeepLab-v3+ [chen2018encoder] that achieved state-of-the-art segmentation results and at the same time provided uncertainty estimates.

Limitations of the Existing Chest X-ray Segmentation Datasets.

In this paragraph, we also tackle an important issue of the existing annotations and images in the CTR assessment and chest X-ray segmentation realm. In particular, all the aforementioned DL-based CTR estimation methods have been trained on datasets that do not include the true boundaries of the anatomical structures within the chest X-rays. While this does not have a significant impact on CTR estimation in general, the absence of the true boundaries of the heart and lungs limits the scope of applications that can be built using automatic chest X-ray segmentation (e.g. detection of pleural effusion). Moreover, the existing datasets originate from the tuberculosis (TB) domain, which also limits the reliable testing of segmentation and CTR assessment models. We argue that clinically applicable methods must be trained and tested on datasets that have diverse radiological findings.

3 Method


The proposed method is based on a combination of several state-of-the-art techniques for image segmentation. In particular, our approach leverages the power of encoder pre-trained on ImageNet [deng2009imagenet], Feature Pyramid Networks-inspired decoder [lin2017feature, seferbekov2018feature], instance normalization [ulyanov2016instance] and MC dropout at test time [kendall2015bayesian]. The architecture of the proposed approach is illustrated in Fig. 2.

Figure 2: Model architecture. Here, we propose a simple modification of FPN for image segmentation. In particular, we inserted dropout before the second, third and fourth residual blocks (marked in red). Besides, we also used dropout in the FPN part of our model, before every upsampling layer in the feature pyramid (nearest neighbour). However, dropout was not used before the upsampling layers that were followed by the concatenation of the feature maps. The decoder in our model used instance normalization in the convolutional blocks (yellow) and the final upsampling layer used bilinear interpolation.


We used a standard ResNet-50 pre-trained on ImageNet [deng2009imagenet]. We do not freeze the encoder during training and merely use it as is from the beginning. It is worth noting that our pre-trained encoder is preceded by a Batch Normalization (BN) layer that learns the mean of the dataset during training. Furthermore, we inserted dropout layers with probability $p$ before the second, third and fourth residual blocks of ResNet-50 (see Fig. 2).


The decoder is a standard FPN. However, similarly to Seferbekov et al. [seferbekov2018feature], we do not use intermediate supervision and prediction at each layer of the feature pyramid as is usually done for object detection [lin2017feature]. As the sizes of the feature pyramid and the input image do not match, we use nearest-neighbour upsampling. In contrast to [seferbekov2018feature], we replace each BN layer in the decoder with an instance normalization (IN) layer at every level of the feature pyramid.
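To make the BN-to-IN replacement concrete, the following is a minimal sketch of instance normalization in plain Python. It is not the authors' Keras layer: the function name `instance_norm` is hypothetical, the learnable scale and shift are omitted, and the toy feature maps are invented for illustration. The point it shows is that each channel of a single sample is normalized with its own spatial statistics, so no batch-level statistics are involved.

```python
import math

def instance_norm(feature_maps, eps=1e-5):
    """Normalize each channel of ONE sample using its own spatial statistics.

    feature_maps: list of channels, each a list of rows (H x W floats).
    Learnable scale/shift parameters are omitted in this sketch.
    """
    normalized = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([[(v - mean) * scale for v in row] for row in channel])
    return normalized

# Two toy channels with the same pattern but very different intensity ranges.
sample = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[100.0, 300.0], [500.0, 700.0]],
]
out = instance_norm(sample)
```

After normalization both toy channels carry (almost) identical values, which illustrates why per-sample statistics can be attractive when images come from different scanners.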

In addition to dropout units in the backbone, we also apply them after each convolutional block of the feature pyramid as illustrated in Fig. 2. More specifically, the dropouts have been used in the decoder only before upsampling layers.

As the task of segmenting the chest X-ray structures is multi-label, rather than multi-class, the decoder has 2 outputs, where the first plane corresponds to the heart and the second one to the lungs. Before the final output layer we used a spatial dropout with a rate of .
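The multi-label formulation can be sketched as follows: each output plane gets its own sigmoid, so a pixel may belong to both the heart and the lungs at once, which a softmax over classes would not allow. This is an illustrative plain-Python sketch; the helper name `multilabel_probs`, the toy logits and the 0.5 threshold are assumptions, not values taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_probs(logits_heart, logits_lungs):
    """Independent per-pixel probabilities for the two output planes."""
    heart = [[sigmoid(z) for z in row] for row in logits_heart]
    lungs = [[sigmoid(z) for z in row] for row in logits_lungs]
    return heart, lungs

# One image row with two pixels (toy logits).
heart, lungs = multilabel_probs([[3.0, -3.0]], [[2.0, 2.0]])

# A pixel can be positive in both planes, e.g. lung tissue lying behind the heart.
both = [[h > 0.5 and l > 0.5 for h, l in zip(hr, lr)]
        for hr, lr in zip(heart, lungs)]
```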

Bayesian Segmentation Framework: Monte-Carlo Dropout.

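As mentioned previously, we leverage the MC-dropout technique [gal2016dropout, kendall2015bayesian]. To capture the model's uncertainty, it is necessary to estimate the posterior distribution $p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})$ of the model's weights $\mathbf{W}$ given the train images $\mathbf{X}$ and the corresponding labels $\mathbf{Y}$. However, this distribution is intractable; therefore, its variational approximation $q(\mathbf{W})$ is used instead [mukhoti2018evaluating]. Gal and Ghahramani [gal2016dropout] have shown that training a neural network with dropout and a standard cross-entropy loss function leads to minimization of the Kullback-Leibler (KL) divergence between $q(\mathbf{W})$ and $p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})$:

$$\mathrm{KL}\left(q(\mathbf{W}) \,\|\, p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})\right) = \int q(\mathbf{W}) \log \frac{q(\mathbf{W})}{p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})} \, d\mathbf{W},$$

where $q(\mathbf{W})$ is chosen to be a Bernoulli distribution.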

In our experiments, we enabled the dropout layers in both the encoder and the FPN. We then sampled pixel-wise probability masks similarly to Kendall et al. [kendall2015bayesian]. Here, for every pixel $x_{ij}$ of the input image $\mathbf{x}$, having $T$ MC-dropout iterations, we generate the prediction $p(y_{ij} \mid \mathbf{x}, \hat{\mathbf{W}}_t)$ at every MC dropout iteration $t$ and eventually estimate the mean $\frac{1}{T}\sum_{t=1}^{T} p(y_{ij} \mid \mathbf{x}, \hat{\mathbf{W}}_t)$ to produce the segmentation masks for the heart and the lungs, respectively.
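The sampling procedure can be sketched with a toy model in plain Python. This is a sketch under stated assumptions, not the authors' Keras implementation: the "network" is a single linear layer followed by a sigmoid, the dropout probability of 0.5 and the 0.5 mask threshold are illustrative choices, and all function names are hypothetical. The only element taken from the paper is the idea of keeping dropout active at test time and averaging the sampled probability maps (20 samples, the number selected in Sec. 4.4).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stochastic_forward(features, weights, drop_p, rng):
    """One forward pass with dropout kept active (inverted dropout scaling).

    A single keep/drop decision is drawn per feature and shared by all pixels.
    """
    kept = [0.0 if rng.random() < drop_p else 1.0 / (1.0 - drop_p)
            for _ in weights]
    return [sigmoid(sum(f * w * k for f, w, k in zip(pixel, weights, kept)))
            for pixel in features]

def mc_dropout_predict(features, weights, drop_p=0.5, n_samples=20, seed=0):
    """Return the per-pixel mean probability and all sampled probability maps."""
    rng = random.Random(seed)
    samples = [stochastic_forward(features, weights, drop_p, rng)
               for _ in range(n_samples)]
    mean = [sum(s[i] for s in samples) / n_samples
            for i in range(len(features))]
    return mean, samples

# Three "pixels", each described by four toy features.
features = [[2.0, 1.5, 1.0, 2.5], [-2.0, -1.0, -1.5, -2.5], [0.3, -0.2, 0.1, -0.1]]
weights = [1.0, 1.0, 1.0, 1.0]
mean_prob, samples = mc_dropout_predict(features, weights)
mask = [p > 0.5 for p in mean_prob]  # thresholded mean gives the final mask
```

The list `samples` is kept alongside the mean because the per-sample maps are what the uncertainty estimates are computed from.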

Bayesian Segmentation Framework: Aleatoric and Epistemic Uncertainties.

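Besides the segmentation masks, the proposed framework also produces uncertainty estimates per pixel. As such, we computed both aleatoric and epistemic uncertainties. Briefly, the former captures the inherent noise in the data (sensor noise) while the latter reflects the model's uncertainty. Both of these uncertainties are important: aleatoric uncertainty indicates whether the sensor precision needs to be improved, and epistemic uncertainty indicates whether a larger training dataset is needed [mukhoti2018evaluating]. Similarly to Mukhoti and Gal [mukhoti2018evaluating], we approximated the aleatoric uncertainty for the test examples given the train data as a predictive entropy $\hat{\mathbb{H}}$:

$$\hat{\mathbb{H}}[y \mid \mathbf{x}] = -\sum_{c} \left(\frac{1}{T}\sum_{t=1}^{T} p(y = c \mid \mathbf{x}, \hat{\mathbf{W}}_t)\right) \log \left(\frac{1}{T}\sum_{t=1}^{T} p(y = c \mid \mathbf{x}, \hat{\mathbf{W}}_t)\right),$$

and the epistemic uncertainty was approximated as a mutual information $\hat{\mathbb{I}}$ between the predictive distribution and the posterior over the weights of the model:

$$\hat{\mathbb{I}}[y, \mathbf{W} \mid \mathbf{x}] = \hat{\mathbb{H}}[y \mid \mathbf{x}] + \frac{1}{T}\sum_{c}\sum_{t=1}^{T} p(y = c \mid \mathbf{x}, \hat{\mathbf{W}}_t) \log p(y = c \mid \mathbf{x}, \hat{\mathbf{W}}_t),$$

where $T$ is the number of MC dropout samples, $\hat{\mathbf{W}}_t$ are the weights sampled at iteration $t$ and $c$ runs over the classes.

In our experiments, we estimated both $\hat{\mathbb{H}}$ and $\hat{\mathbb{I}}$ for each image, for the heart and, similarly, for the lungs. For visualization purposes, we displayed the summed entropy and mutual information masks, respectively.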
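The two quantities can be computed from a set of MC dropout samples as sketched below, using the standard Monte-Carlo estimators: the entropy of the mean prediction, and that entropy minus the mean per-sample entropy. Treating each organ plane as a binary problem, the function names and the toy probabilities are assumptions made for illustration.

```python
import math

def binary_entropy(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def uncertainty_maps(samples):
    """samples: T lists of per-pixel foreground probabilities from MC dropout.

    Returns (predictive_entropy, mutual_information), one value per pixel.
    """
    n_samples, n_pixels = len(samples), len(samples[0])
    entropy, mutual_info = [], []
    for i in range(n_pixels):
        probs = [s[i] for s in samples]
        h_mean = binary_entropy(sum(probs) / n_samples)              # entropy of the mean
        mean_h = sum(binary_entropy(p) for p in probs) / n_samples   # mean of the entropies
        entropy.append(h_mean)
        mutual_info.append(h_mean - mean_h)
    return entropy, mutual_info

# Pixel 0: samples agree and are confident.
# Pixel 1: samples agree but are unsure.
# Pixel 2: samples are individually confident but disagree with each other.
samples = [[0.99, 0.5, 0.99], [0.99, 0.5, 0.01], [0.99, 0.5, 0.99], [0.99, 0.5, 0.01]]
entropy, mutual_info = uncertainty_maps(samples)
```

In this toy case the mutual information is large only for the third pixel, where the sampled models disagree, while the predictive entropy is high for both the second and the third pixel.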

CTR estimation

Following [dong2018unsupervised], we used the same method for CTR estimation. More specifically, CTR is defined as the ratio of the widest diameter of the heart to the widest diameter of the lungs, as illustrated in Fig. 3.

Figure 3: Visualization of the line segments used for CTR estimation.
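A minimal sketch of how CTR could be read off a pair of binary masks is given below. It assumes that the "widest diameter" is the horizontal distance between the leftmost and rightmost foreground columns of each mask. The second function shows one possible way to turn MC dropout mask samples into a CTR estimate with a range; reporting the minimum and maximum is an assumption for illustration, as the text does not state how the bounds are derived. All names and toy masks are hypothetical.

```python
def max_horizontal_extent(mask):
    """Distance in pixels between the leftmost and rightmost foreground
    columns of a binary mask (inclusive); 0 for an empty mask."""
    cols = [x for row in mask for x, v in enumerate(row) if v]
    return (max(cols) - min(cols) + 1) if cols else 0

def cardiothoracic_ratio(heart_mask, lungs_mask):
    lungs_width = max_horizontal_extent(lungs_mask)
    if lungs_width == 0:
        raise ValueError("empty lung mask")
    return max_horizontal_extent(heart_mask) / lungs_width

def ctr_with_bounds(mask_samples):
    """mask_samples: list of (heart_mask, lungs_mask) pairs, one per MC sample.

    Returns (mean CTR, smallest sampled CTR, largest sampled CTR).
    """
    ratios = sorted(cardiothoracic_ratio(h, l) for h, l in mask_samples)
    return sum(ratios) / len(ratios), ratios[0], ratios[-1]

lungs = [[1, 1, 1, 0, 0, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1]]
heart_a = [[0, 0, 0, 1, 1, 1, 0, 0],
           [0, 0, 1, 1, 1, 1, 0, 0]]
heart_b = [[0, 0, 1, 1, 1, 1, 1, 0],
           [0, 0, 1, 1, 1, 1, 0, 0]]

ctr = cardiothoracic_ratio(heart_a, lungs)
bounds = ctr_with_bounds([(heart_a, lungs), (heart_b, lungs)])
```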

Training of the model.

During the training, we used the combination of binary cross-entropy (BCE) and soft-Jaccard loss (J) as done in various other studies [seferbekov2018feature]:


where are the model’s weights, are the images and and are the ground truth masks for the heart and the lungs, respectively.
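As an illustration of such a combined objective, the sketch below adds a binary cross-entropy term and a soft-Jaccard term for each organ plane. The exact way the two terms are combined is not recoverable from the text above, so the unweighted sum `BCE + (1 - J)` and the equal weighting of heart and lungs are assumptions, as are all function names.

```python
import math

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def soft_jaccard(probs, targets, eps=1e-7):
    """Differentiable Jaccard index computed on probabilities."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    union = sum(probs) + sum(targets) - intersection
    return (intersection + eps) / (union + eps)

def combined_loss(probs, targets):
    # ASSUMPTION: unweighted sum BCE + (1 - J); the source formula is unavailable.
    return bce(probs, targets) + (1.0 - soft_jaccard(probs, targets))

def total_loss(heart_probs, heart_gt, lungs_probs, lungs_gt):
    # ASSUMPTION: the heart and lung planes contribute equally.
    return combined_loss(heart_probs, heart_gt) + combined_loss(lungs_probs, lungs_gt)

gt = [1, 0, 1]
perfect = total_loss([1.0, 0.0, 1.0], gt, [1.0, 0.0, 1.0], gt)
poor = total_loss([0.1, 0.9, 0.1], gt, [0.1, 0.9, 0.1], gt)
```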

During the training process, we use various data augmentation techniques to improve the robustness and decrease possible overfitting. In particular, we used random-sized crop and degrees rotation as our main data augmentations (with a probability of per image). Noise addition, blur and sharpening were used as secondary augmentations ( per image). Finally, elastic distortions, random brightness, JPEG compression and horizontal flips were used with a probability of to regularize the images with hard cases.

All our models were trained with the image resolution of pixels. To train the models, we used minibatch of , learning rate of and spatial dropout rate of . Our experiments were conducted in Keras [chollet2015keras] with the Albumentations library [buslaev2018albumentations] used for data augmentation.

Figure 4: Original annotations in all the test datasets: (a) Ours, (b) JSRT, (c) Montgomery, (d) Shenzhen. In our experiments we re-annotated the JSRT, Montgomery and Shenzhen datasets in a similar fashion to our dataset, so that only the image distribution differs while the segmentation masks are annotated in exactly the same fashion.
| Encoder | Unet [unet] | FPN [lin2017feature] | LinkNet [linknet] | PSPNet [zhao2017pspnet] |
|---|---|---|---|---|
| vgg16 [Simonyan14c] | 0.906 | 0.893 | 0.905 | 0.871 |
| vgg19 [Simonyan14c] | 0.908 | 0.911 | 0.907 | 0.877 |
| ResNet18 [he2016deep] | 0.895 | 0.898 | 0.861 | 0.852 |
| ResNet34 [he2016deep] | 0.897 | 0.907 | 0.893 | 0.860 |
| ResNet50 [he2016deep] | 0.902 | 0.911 | 0.860 | 0.859 |
| ResNet101 [he2016deep] | 0.895 | 0.913 | 0.861 | 0.862 |
| ResNet152 [he2016deep] | 0.906 | 0.893 | 0.875 | 0.865 |
| SE-ResNet18 [hu2018senet] | 0.846 | 0.908 | 0.874 | 0.861 |
| SE-ResNet34 [hu2018senet] | 0.843 | 0.899 | 0.821 | 0.859 |
| SE-ResNet50 [hu2018senet] | 0.908 | 0.910 | 0.904 | 0.874 |
| DenseNet121 [huang2017densely] | 0.892 | 0.899 | 0.858 | 0.870 |
| MobileNetV1 [mobilenetv1] | 0.907 | 0.901 | 0.880 | 0.853 |
| MobileNetV2 [Sandler2018MobileNetV2IR] | 0.874 | 0.878 | 0.874 | 0.842 |

(a) IoU Lungs.

| Encoder | Unet [unet] | FPN [lin2017feature] | LinkNet [linknet] | PSPNet [zhao2017pspnet] |
|---|---|---|---|---|
| vgg16 [Simonyan14c] | 0.843 | 0.814 | 0.834 | 0.764 |
| vgg19 [Simonyan14c] | 0.838 | 0.786 | 0.808 | 0.814 |
| ResNet18 [he2016deep] | 0.805 | 0.799 | 0.797 | 0.740 |
| ResNet34 [he2016deep] | 0.731 | 0.836 | 0.766 | 0.776 |
| ResNet50 [he2016deep] | 0.822 | 0.865 | 0.773 | 0.717 |
| ResNet101 [he2016deep] | 0.805 | 0.863 | 0.799 | 0.712 |
| ResNet152 [he2016deep] | 0.820 | 0.806 | 0.814 | 0.745 |
| SE-ResNet18 [hu2018senet] | 0.714 | 0.819 | 0.762 | 0.703 |
| SE-ResNet34 [hu2018senet] | 0.750 | 0.806 | 0.668 | 0.741 |
| SE-ResNet50 [hu2018senet] | 0.848 | 0.814 | 0.839 | 0.779 |
| DenseNet121 [huang2017densely] | 0.793 | 0.849 | 0.687 | 0.781 |
| MobileNetV1 [mobilenetv1] | 0.791 | 0.787 | 0.755 | 0.654 |
| MobileNetV2 [Sandler2018MobileNetV2IR] | 0.750 | 0.766 | 0.734 | 0.661 |

(b) IoU Heart.

Table 1: Ablation study. IoU metric (higher is better) computed on the proposed dataset for Lungs (Tab. 1(a)) and Heart (Tab. 1(b)) obtained by different encoder-decoder architectures. For each decoder we indicate the best model as bold and the second best model as underscore. The best combination of Encoder+Decoder is highlighted as gray. We chose the ResNet50 model as a backbone network and the FPN decoder for further experiments. See Sec. 4.3 for more details.
| Decoder | batch (BN) [batch_norm] | group (GN) [group_norm] | instance (IN) [ulyanov2016instance] |
|---|---|---|---|
| Unet [unet] | 0.902 | 0.886 | 0.911 |
| LinkNet [linknet] | 0.860 | 0.910 | 0.915 |
| FPN [lin2017feature] | 0.911 | 0.914 | 0.916 |

(a) Lungs.

| Decoder | batch (BN) [batch_norm] | group (GN) [group_norm] | instance (IN) [ulyanov2016instance] |
|---|---|---|---|
| Unet [unet] | 0.822 | 0.822 | 0.870 |
| LinkNet [linknet] | 0.773 | 0.828 | 0.862 |
| FPN [lin2017feature] | 0.865 | 0.868 | 0.884 |

(b) Heart.

Table 2: Comparison of different normalization techniques in the decoders (IoU). Here, we did not experiment with the PSPNet [zhao2017pspnet] decoder due to its low performance in the backbone selection stage.

4 Experiments and Results

4.1 Dataset

Our training set was derived from the ChestXray14 dataset [Wang_2017_CVPR]. The original dataset included chest radiographs from patients, while the train data used in this study comprised images randomly sampled from these data. For training, we used images from distinct patients. images from patients were used for validation and the remaining images from patients were eventually used as a test set.

The proposed multi-label dataset has the following findings and labels (the number of samples for each label is given in parentheses): Cardiomegaly (), Emphysema (), Effusion (), Hernia (), Infiltration (), Nodule (), Atelectasis (), Mass (), Pneumothorax (), Pneumonia (), Pleural thickening (), Fibrosis (), Consolidation (), Edema (). There are samples with no findings. Such a label distribution makes our data more challenging for segmentation compared to the previous studies.

It is worth mentioning that the original data were provided with labels mined from the radiology reports; however, the dataset did not have any segmentation masks. Our radiologist (radiologist A) annotated the train, validation and test sets. An example of the annotations is presented in Fig. 4. Compared to the other existing datasets illustrated in the same figure, our annotations delineate the true anatomical contours, which makes the segmentation more challenging.

4.2 Auxiliary Test Datasets

Re-annotation of the existing datasets

We evaluated our method on three auxiliary test sets, each derived from an independent public dataset (see below). The original annotations for these datasets did not include the true boundaries of the lungs underlying the heart, or had missing annotations of the heart. To evaluate the performance of our method trained on our dataset with true lung boundaries, a practicing radiologist (radiologist B) with the same level of experience as radiologist A annotated random images for the test evaluation from each of the auxiliary test datasets. The comparison between the original annotations in the test datasets and our annotations is presented in Fig. 4.


The Japanese Society of Radiological Technology (JSRT) dataset was first released in [shiraishi2000development] and complemented by the annotations from [van2002automatic]. The dataset contains 247 radiographs in total: 154 radiographs with and 93 radiographs without lung nodules.

Montgomery County X-ray Set.

This dataset contains chest radiographs acquired by the Department of Health and Human Services, Montgomery County, Maryland, USA [jaeger2014two]. Out of the total amount, subjects had and subjects did not have tuberculosis (TB).

Shenzhen Hospital X-ray Set.

The Shenzhen dataset is similar to the Montgomery dataset as it includes the data from subjects with and without TB. These data were acquired at Shenzhen No.3 Hospital in Shenzhen, Guangdong province, China. The total dataset size was of images having images from patients with TB [jaeger2014two].

Figure 5: Dependency between the performance on the test sets and the number of MC dropout samples: (a) IoU per organ; (b) CTR correlations with the ground truth per dataset. 95% intervals are also highlighted and were computed using bootstrapping.

4.3 Ablation study


This section describes the ablation study conducted on deterministic models. After selecting the best deterministic model, we inserted the dropout as shown in Fig. 2. We first investigated what combination of encoder and decoder is relevant to our task and, subsequently, analyzed different types of normalization in the decoder.

Backbone and Decoder

The latest advances in image segmentation demonstrate that transfer learning is helpful for this task. As such, the use of encoders pre-trained on ImageNet [deng2009imagenet] is a core technique in all the current state-of-the-art segmentation networks [kirillov2019panoptic]. In our study, we investigated multiple pre-trained models with state-of-the-art decoders, namely U-Net [ronneberger2015u], FPN [lin2017feature], LinkNet [linknet] and PSPNet [zhao2017pspnet] (Tab. 1).

Our experiments demonstrate that in lung segmentation, ResNet101 and ResNet50 with the FPN decoder yielded the best and second best results, respectively (Tab. 1(a)). In heart segmentation, the ResNet50 backbone outperformed ResNet101, and these two encoders with the FPN decoder achieved the best and second best results, respectively (Tab. 1(b)). Therefore, for simplicity and speed of computations, we selected ResNet50 with FPN as our main configuration.

Normalization in the Decoder

Once the best configuration had been obtained, we assessed the influence of normalization in the decoder. In particular, we investigated whether replacing batch normalization with group or instance normalization has any effect on the performance of our model. These experiments are presented in Tab. 2. The results demonstrate that instance normalization achieves better performance than the other normalization types for every decoder.

| Dataset | Heart IoU | Heart Dice | Lungs IoU | Lungs Dice | Pearson's corr. |
|---|---|---|---|---|---|
| ChestXray14 | 0.87 | 0.93 | 0.92 | 0.96 | 0.87 |
| JSRT | 0.82 | 0.90 | 0.87 | 0.93 | 0.95 |
| Shenzhen | 0.84 | 0.91 | 0.87 | 0.93 | 0.97 |
| Montgomery | 0.86 | 0.92 | 0.87 | 0.93 | 0.92 |

Table 3: Quantitative results for each of the datasets. Here, we present the IoU and Dice scores for the Lungs and the Heart and also the Pearson's correlation between the ground truth CTR (computed from the manually annotated masks) and the predicted CTR (computed from the predicted segmentation masks).

4.4 Test Set Results

Optimal Number of Monte-Carlo Samples

In our experiments, we assessed the influence of MC dropout on the performance of our segmentation and CTR estimation pipeline. As such, we computed the aggregated IoU values for the heart and lungs against the ground truth masks. Besides, we also computed the Pearson's correlation between the CTR computed using the ground truth masks and the CTR computed using the predicted masks for different numbers of MC samples. These results are visualized in Fig. 5. From this plot it can be seen that the optimal number of iterations on all datasets with respect to IoU and CTR correlations is close to 20. We use this number for the further evaluation of the developed method.
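The correlation and its bootstrap interval can be sketched as follows. The percentile bootstrap used here is one common choice and is an assumption, since the text only states that bootstrapping was used; the CTR values are toy numbers invented for illustration, and the function names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def bootstrap_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson's correlation."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # correlation is undefined for a constant resample
        stats.append(pearson(bx, by))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi

# Toy CTR values (not taken from the paper).
true_ctr = [0.42, 0.45, 0.48, 0.50, 0.53, 0.55, 0.58, 0.61, 0.47, 0.52]
pred_ctr = [0.43, 0.44, 0.49, 0.51, 0.52, 0.56, 0.57, 0.62, 0.46, 0.53]
r = pearson(true_ctr, pred_ctr)
lo, hi = bootstrap_ci(true_ctr, pred_ctr)
```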

Quantitative Results

For the optimal number of MC samples (20 according to Fig. 5), we performed the quantitative evaluation of our model on all the test datasets, namely ChestXray14, JSRT, Shenzhen and Montgomery. Besides computing the IoU, similarly to the previous paragraph, we also computed the Pearson’s correlation between the CTR values for manually annotated masks and the predictions. These results are presented in Tab. 3.
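For reference, the two overlap metrics reported in Tab. 3 can be computed from a pair of binary masks as in the sketch below. The function name, the toy masks and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def iou_and_dice(pred_mask, true_mask):
    """Intersection over Union and Dice score for two binary masks."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    intersection = sum(p and t for p, t in zip(pred, true))
    union = sum(p or t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    iou = intersection / union if union else 1.0   # both masks empty
    dice = 2 * intersection / total if total else 1.0
    return iou, dice

pred = [[0, 1, 1, 0],
        [0, 1, 1, 1]]
true = [[0, 1, 1, 1],
        [0, 1, 1, 0]]
iou, dice = iou_and_dice(pred, true)
```

For a single pair of masks the two metrics are related by Dice = 2·IoU / (1 + IoU), so they rank individual predictions identically; the relation does not hold exactly for scores averaged over a dataset.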

Segmentation Examples

In Fig. 6, we visualize examples of segmentation and the uncertainty estimates (both aleatoric and epistemic). The proposed method achieves accurate results, demonstrating good segmentation performance in general. However, we note that our model does not predict the sharp corners of the lungs. Furthermore, from the epistemic uncertainty maps, it can be seen that our model is not confident in the bottom part of the lungs, which is typically very difficult to annotate since it can hardly be distinguished in the images.

Figure 6: Examples of segmentation and uncertainty estimates for each of the test datasets (random examples are shown): (a) ChestXray14, (b) JSRT, (c) Montgomery, (d) Shenzhen. Here, from left to right: original image, predicted segmentation mask, aleatoric and epistemic uncertainties, respectively.

5 Conclusion

In this paper, we developed a novel approach for automatic segmentation of chest X-rays and assessment of CTR. Our approach is a modified FPN with a ResNet50 backbone and MC dropout. In the extensive experimental evaluation, we found that the proposed configuration with instance normalization in the decoder yielded the best results compared to the other investigated network configurations. Besides, it is worth noting that, for the first time in the CTR estimation realm, we proposed to assess CTR using Bayesian deep learning.

In this paper, we focused not only on developing state-of-the-art method for segmenting the chest X-rays, but also tackled the issue of annotation of these data and the availability of reliably annotated train and test data. As such, we proposed multiple new datasets that were annotated by radiologists and we plan to publicly release these data to facilitate further research.

Despite the advantages of our proposed method, this study still has some limitations. In particular, we did not experiment with training the models from scratch and used transfer learning. The second limitation of our study is that we did not compare our method to state-of-the-art unsupervised domain adaptation approaches [dong2018unsupervised, chen2018semantic, eslami2019image]. However, this would require re-implementation of the previously presented methods, as our annotations for all the test sets differ from those used by the previously published techniques. We leave this comparison for future work. Another important limitation of our study is that the annotators of the test data differ. In particular, one radiologist (radiologist A) annotated the train and the test sets derived from the ChestXray14 dataset [Wang_2017_CVPR] and another radiologist (radiologist B) annotated the images from the JSRT [shiraishi2000development], Montgomery and Shenzhen datasets [jaeger2014two]. While we think that this particular limitation has an insignificant impact on our results, we still plan to assess the inter-rater agreement between the annotators of the data.

To conclude, this paper introduced a novel, more challenging setting for segmenting organs in chest X-rays and proposed a Bayesian modification of FPN that makes it possible to estimate the CTR with uncertainty bounds using MC-dropout. We think that the proposed approach has multiple applications in clinical practice; for example, it could be useful for quantitative monitoring of CTR for patients with cardiomegaly in intensive care units. Another interesting application is image quality assessment, since our model is able to predict the aleatoric uncertainty for every test image. Finally, for the benefit of the community, we publicly release our dataset, implementation of our method and the pre-trained models at