Medical image segmentation is the task of labeling each pixel of an object of interest in medical images. It is often a key task for clinical applications, ranging from Computer Aided Diagnosis (CADx) for lesion detection to therapy planning and guidance. Medical image segmentation helps clinicians focus on a particular area of the disease and extract detailed information for a more accurate diagnosis. The key challenges associated with medical image segmentation are the unavailability of a large number of annotated images, the lack of high-quality labeled images for training, low image quality, the lack of a standard segmentation protocol, and large variations of images among patients. The quantification of segmentation accuracy and uncertainty is essential to estimate the performance on other applications. This indicates the requirement for an automatic, generalizable, and efficient semantic image segmentation approach.
Convolutional Neural Networks (CNNs) have shown state-of-the-art performance for automated medical image segmentation. For semantic segmentation tasks, one of the earliest Deep Learning (DL) architectures trained end-to-end for pixel-wise prediction is the Fully Convolutional Network (FCN). U-Net is another popular image segmentation architecture trained end-to-end for pixel-wise prediction. The U-Net architecture consists of two parts, namely, the analysis path and the synthesis path. In the analysis path, deep features are learned, and in the synthesis path, segmentation is performed based on the learned features. Additionally, U-Net uses skip connections. A skip connection propagates dense feature maps from the analysis path to the corresponding layers in the synthesis path. In this way, spatial information is passed to the deeper layers, which yields a significantly more accurate output segmentation map. Thus, adding more layers to the U-Net will allow the network to learn more representative features, leading to better output segmentation masks.
Generalization, i.e., the ability of the model to perform on an independent dataset, and robustness, i.e., the ability of the model to perform on challenging images, are key to the development of Artificial Intelligence (AI) systems to be used in clinical trials. Therefore, it is essential to design a powerful architecture that is robust and generalizable across different biomedical applications. Models pre-trained on ImageNet have significantly improved the performance of CNN architectures. One example of such a model is VGG-19. Inspired by the success of U-Net and its variants for medical image segmentation, we propose an architecture that uses a modified U-Net and VGG-19 in the encoder part of the network. Because we use two U-Net architectures in the network, we term the architecture DoubleU-Net. The main reasons for using the VGG network are: (1) VGG-19 is a lightweight model compared to other pre-trained models, (2) the architecture of VGG-19 is similar to U-Net, making it easy to concatenate with U-Net, and (3) it allows much deeper networks that produce better output segmentation masks. Thus, we aim to improve the overall segmentation performance of the network through these architectural changes.
The main contributions of this work are:
We propose a novel architecture, DoubleU-Net, for semantic image segmentation. The proposed architecture uses two U-Net architectures in sequence, with two encoders and two decoders. The first encoder used in the network is a pre-trained VGG-19, which is trained on ImageNet. Additionally, we use Atrous Spatial Pyramid Pooling (ASPP). The rest of the architecture is built from scratch.
Experiments on multiple datasets are a prerequisite for showing that a proposed algorithm improves on other algorithms. In this respect, we have experimented on four different medical imaging datasets: two from colonoscopy, one from dermoscopy, and one from microscopy. DoubleU-Net shows better segmentation performance than the baseline algorithms on the MICCAI 2015 segmentation challenge dataset, the CVC-ClinicDB dataset, the Lesion Boundary Segmentation challenge dataset from ISIC-2018, and the 2018 Data Science Bowl challenge dataset.
An extensive evaluation of DoubleU-Net across four datasets shows a significant improvement over U-Net. Therefore, DoubleU-Net can be a new baseline for medical image segmentation tasks.
The paper is organized into seven sections. Section II provides an overview of the related work in the field of medical image segmentation. In Section III, we describe the proposed architecture. Section IV describes the experiments. Section V presents the results obtained from the experimental evaluation on different datasets. A discussion of the work is provided in Section VI. Finally, we summarize the paper and discuss future work and limitations in Section VII.
II Related Work
Among different CNN architectures, encoder-decoder networks such as FCN and its extension U-Net have gained significant popularity among semantic segmentation approaches for 2D images. Badrinarayanan et al. proposed a deep fully convolutional neural network for semantic pixel-wise segmentation that has significantly fewer parameters and produces good segmentation maps. Yu et al. proposed a new convolutional network module that specifically targets dense prediction problems. The proposed module uses dilated convolutions to systematically aggregate multi-scale contextual information, and the presented context module improved the accuracy of state-of-the-art semantic image segmentation systems.
Chen et al. proposed DeepLab to solve the segmentation problem. Later, DeepLabV3 significantly improved over the previous DeepLab versions without DenseCRF post-processing. The DeepLabV3 architecture uses a synthesis path with fewer convolutional layers, which differs from the synthesis paths of FCN and U-Net. DeepLabV3 uses skip connections between the analysis path and the synthesis path, similar to the U-Net architecture. Zhao et al. proposed an effective scene parsing network for complex scene understanding, where global pyramidal features provide an opportunity to capture additional contextual information.
Zhang et al. proposed Deep Residual U-Net, which uses residual connections to produce a better output segmentation map. Chen et al. proposed Dense-Res-Inception Net (DRINET) for medical image segmentation and compared their results with FCN, U-Net, and ResUNet. Ibtehaz et al. modified U-Net and proposed an improved MultiResUNet architecture for medical image segmentation; they compared their results with U-Net on various medical image segmentation datasets and reported higher accuracy than U-Net.
Jha et al. proposed ResUNet++, an enhanced version of the standard ResUNet that integrates additional components such as squeeze-and-excite blocks, ASPP, and attention blocks into the network. The proposed architecture uses dice loss as the loss function and produces improved output segmentation maps compared to U-Net and ResUNet on the Kvasir-SEG and CVC-ClinicDB datasets. Zhou et al. proposed UNet++, a neural network architecture for semantic and instance segmentation tasks. They improved the performance of UNet++ by alleviating the unknown network depth, redesigning the skip connections, and devising a pruning scheme for the architecture.
From the related work above, we can observe that there have been substantial efforts toward developing deep CNN architectures for the segmentation of both natural and medical images. Recently, more works have focused on developing generalizable models, which is why most researchers test their algorithms on different datasets [13, 15, 34]. The accuracy achieved is now high for both natural imaging and medical imaging [13, 34, 15]. However, AI in medicine is still an emerging field. One of the significant challenges in the medical domain is the lack of test datasets. Moreover, the obtained datasets are often imbalanced. To some extent, we can say that the performance is acceptable in the case of natural images. In medical imaging, there are many challenging images (for example, flat polyps in colonoscopy), which are usually missed during colonoscopy examination and can develop into cancer if early detection is not performed. Therefore, there is a need for a more accurate medical image segmentation approach to deal with the challenging images. Toward addressing this need, we propose the DoubleU-Net architecture, which produces efficient output segmentation masks for the challenging images.
III The DoubleU-Net Architecture
Figure 1 shows an overview of the proposed architecture. As seen from the figure, DoubleU-Net starts with a VGG-19 as the encoder sub-network, which is followed by a decoder sub-network. What distinguishes DoubleU-Net from U-Net in the first network (NETWORK 1) is the use of VGG-19 marked in yellow, ASPP marked in blue, and the decoder block marked in light green. The squeeze-and-excite block is used in the encoder of NETWORK 2 and in the decoder blocks of NETWORK 1 and NETWORK 2. An element-wise multiplication is performed between the output of NETWORK 1 and the input of the same network. The difference between DoubleU-Net and U-Net in the second network (NETWORK 2) is only the use of ASPP and the squeeze-and-excite block. All other components remain the same.
In the first network, the input image is fed to the modified U-Net, which generates a predicted mask (the intermediate mask). We then multiply the input image by this mask, and the product acts as the input for the second modified U-Net, which produces another mask (the final predicted mask). Finally, we concatenate both masks to see the qualitative difference between the intermediate mask and the final predicted mask.
We assume that the output feature map produced by the first network can still be improved by feeding in the input image and its corresponding mask again, and that concatenating the intermediate mask with the output of the second network will produce a better segmentation mask than the previous one. This is the main motivation behind using two U-Net architectures in the proposed architecture. The squeeze-and-excite block in the proposed networks reduces redundant information and passes on the most relevant information. ASPP has been a popular choice for modern segmentation architectures because it helps to extract high-resolution feature maps that lead to superior performance.
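The data flow between the two networks can be illustrated with a minimal sketch. It is not the authors' implementation: the two networks are replaced by hypothetical threshold functions (`toy_network_1`, `toy_network_2`), images are plain nested lists, and "concatenation" is modeled as returning both masks side by side. Only the order of operations (predict, multiply with the input, predict again, keep both masks) follows the description above.

```python
from typing import Callable, List

Grid = List[List[float]]  # a single-channel image or mask, stored row by row


def multiply(image: Grid, mask: Grid) -> Grid:
    """Element-wise product of an image and a mask of the same shape."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def double_stage(image: Grid,
                 network_1: Callable[[Grid], Grid],
                 network_2: Callable[[Grid], Grid]) -> List[Grid]:
    """Run two networks in sequence and return both predicted masks."""
    intermediate = network_1(image)        # mask from the first network
    gated = multiply(image, intermediate)  # input re-weighted by that mask
    final = network_2(gated)               # mask from the second network
    return [intermediate, final]           # both masks, stacked like channels


# Hypothetical stand-ins for the trained networks: simple thresholds.
def toy_network_1(x: Grid) -> Grid:
    return [[1.0 if v > 0.3 else 0.0 for v in row] for row in x]


def toy_network_2(x: Grid) -> Grid:
    return [[1.0 if v > 0.5 else 0.0 for v in row] for row in x]


image = [[0.2, 0.4], [0.6, 0.9]]
intermediate_mask, final_mask = double_stage(image, toy_network_1, toy_network_2)
```

In this toy run the second stage only sees pixels that the first stage kept, which is the effect the multiplication is meant to have.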
III-A Encoder Explanation
The first encoder in DoubleU-Net uses a pre-trained VGG-19, whereas the second encoder is built from scratch. Each encoder tries to encode the information contained in the input image. Each encoder block in the second encoder performs two convolution operations, each followed by batch normalization. Batch normalization reduces the internal covariate shift and also regularizes the model. A Rectified Linear Unit (ReLU) activation function is then applied, which introduces non-linearity into the model. This is followed by a squeeze-and-excitation block, which enhances the quality of the feature maps. After that, max-pooling is performed with a window and stride to reduce the spatial dimension of the feature maps.
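To make the role of the squeeze-and-excitation block more concrete, the following is a minimal pure-Python sketch of the general squeeze, excite, and scale steps. It is an illustration under stated assumptions, not the block used in the paper: biases are omitted, and the weight matrices `w_reduce` and `w_expand` are hypothetical hand-picked values rather than learned parameters.

```python
import math
from typing import List

Channel = List[List[float]]  # one feature-map channel, stored row by row


def squeeze_excite(features: List[Channel],
                   w_reduce: List[List[float]],
                   w_expand: List[List[float]]) -> List[Channel]:
    """Re-weight each channel by a gate computed from global channel statistics."""
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in features]
    # Excite, step 1: bottleneck dense layer followed by ReLU (no bias here).
    hidden = [max(0.0, sum(w * s for w, s in zip(weights, squeezed)))
              for weights in w_reduce]
    # Excite, step 2: dense layer back to one value per channel, then sigmoid.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(weights, hidden))))
             for weights in w_expand]
    # Scale: multiply every value of a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(features, gates)]


# Two 2x2 channels and hypothetical weights (one hidden unit).
features = [[[1.0, 1.0], [1.0, 1.0]],
            [[2.0, 2.0], [2.0, 2.0]]]
w_reduce = [[1.0, 0.0]]      # hidden unit reads only the first channel
w_expand = [[0.0], [10.0]]   # gate 0 -> sigmoid(0), gate 1 -> sigmoid(10)
scaled = squeeze_excite(features, w_reduce, w_expand)
```

With these weights the first channel is halved (its gate is 0.5) while the second is passed through almost unchanged, which shows how the block can suppress some channels and keep others.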
III-B Decoder Explanation
As shown in Figure 1, we use two decoders in the entire network, with small modifications to the decoder as compared with that of the original U-Net. Each block in the decoder performs a bilinear up-sampling on the input feature, which doubles the dimension of the input feature maps. We then concatenate the appropriate skip-connection feature maps from the encoder to the output feature maps. In the first decoder, we only use skip connections from the first encoder, but in the second decoder, we use skip connections from both encoders, which maintains the spatial resolution and enhances the quality of the output feature maps. After concatenation, we again perform two convolution operations, each of which is followed by batch normalization and then by a ReLU activation function. After that, we use a squeeze-and-excitation block. Finally, we apply a convolution layer with a sigmoid activation function, which is used to generate the mask for the corresponding modified U-Net.
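The shape bookkeeping of one decoder block can be sketched as follows. This is only an illustration: the function name, the channel counts in the example, and the assumption that the convolutions preserve height and width ("same" padding) are not taken from the paper. What it does follow from the text is that up-sampling doubles the spatial size and that skip connections are concatenated along the channel axis, with one skip in the first decoder and two in the second.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def decoder_block_shape(input_shape: Shape,
                        skip_shapes: List[Shape],
                        filters: int) -> Tuple[Shape, Shape]:
    """Return the shape after concatenation and the shape after the convolutions."""
    h, w, c = input_shape
    h, w = 2 * h, 2 * w                  # bilinear up-sampling doubles H and W
    for skip_h, skip_w, skip_c in skip_shapes:
        if (skip_h, skip_w) != (h, w):
            raise ValueError("skip connection does not match the up-sampled size")
        c += skip_c                      # concatenation stacks channels
    # Assumption: the convolutions keep H and W and emit `filters` channels.
    return (h, w, c), (h, w, filters)


# First decoder: one skip connection (hypothetical channel counts).
first = decoder_block_shape((16, 16, 512), [(32, 32, 512)], filters=256)

# Second decoder: skip connections from both encoders.
second = decoder_block_shape((16, 16, 256),
                             [(32, 32, 512), (32, 32, 256)],
                             filters=256)
```

The second call shows why the second decoder sees wider concatenated tensors than a plain U-Net decoder would at the same depth: it receives feature maps from two encoders instead of one.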
In this section, we present datasets, evaluation metrics, experiment setup and configuration, and data augmentation techniques used in all the experiments to validate the proposed framework.
To evaluate the effectiveness of DoubleU-Net, we have used four publicly available datasets from the medical domain.
Similarly, CVC-ClinicDB has been a common choice for polyp segmentation. Therefore, we use this dataset for comparison.
More information about the datasets is presented in Table I. All of the datasets are clinically relevant during diagnosis, and therefore, their segmentation can be crucial for patient outcomes.
|Dataset||No. of Images||Input size||Application|
|MICCAI 2015 segmentation challenge||808|| ||Colonoscopy|
|Lesion Boundary Segmentation challenge||2594||Variable||Dermoscopy|
|2018 Data Science Bowl Challenge|| || ||Nuclei|
IV-B Evaluation metrics
DoubleU-Net is evaluated on the basis of the Sørensen–Dice coefficient (DSC), mean Intersection over Union (mIoU), Precision, and Recall. We evaluate all of these metrics for all four datasets. However, we place more emphasis on the official evaluation metrics that were used in each challenge. For example, the official evaluation metric for the Lesion Boundary Segmentation challenge is mIoU.
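As a reference for how these four metrics relate to each other, the sketch below computes them for one pair of binary masks from the counts of true positives, false positives, and false negatives. It uses the standard definitions; the function name, the flattened-list input format, and the convention of returning 0.0 when a denominator is zero are assumptions made for this example. mIoU would then be obtained by averaging the IoU value over images.

```python
from typing import Dict, List


def segmentation_metrics(pred: List[int], truth: List[int]) -> Dict[str, float]:
    """DSC, IoU, precision, and recall for two flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)

    def ratio(num: int, den: int) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
metrics = segmentation_metrics(pred, truth)
```

For the same prediction, DSC is never lower than IoU, which is one reason the two are reported together rather than used interchangeably.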
IV-C Experiment setup and configuration
All models are implemented using the Keras framework with TensorFlow as the backend. The implementation can be found at our GitHub repository (https://github.com/DebeshJha/2020-CBMS-DoubleU-Net). We ran our experiments on a Volta 100 GPU and an Nvidia DGX-2 AI system. For all of the datasets, we used 80% of the data for training, 10% for validation, and 10% for testing. During training, we used the original image size for the smaller datasets, such as CVC-ClinicDB and the Nuclei segmentation dataset, and resized the images for the Lesion Boundary Segmentation challenge dataset to balance training time and complexity. The size of ETIS-Larib was adjusted similarly to that of CVC-ClinicDB. We use binary cross-entropy as the loss function for all the networks and the Nadam optimizer with its default parameters. For the lesion boundary segmentation dataset and the Nuclei segmentation dataset, where dice loss and the Adam optimizer performed slightly better, the batch size is set to and the learning rate to . All models are trained for epochs. Early stopping and ReduceLROnPlateau are also used.
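The 80/10/10 split described above can be sketched in a few lines. This is not the authors' script: the function name, the fixed seed, the file names, and the choice to give any remainder to the test set are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[str],
                  seed: int = 42) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle once, then cut into 80% train, 10% validation, 10% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is repeatable
    n_train = len(shuffled) * 8 // 10       # integer arithmetic avoids rounding drift
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for image/mask pairs.
names = [f"image_{i:03d}.png" for i in range(100)]
train, val, test = split_dataset(names)
```

Splitting before any augmentation, as done here on the raw file list, keeps augmented copies of one image from ending up in more than one subset.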
IV-D Data augmentation techniques
Medical datasets are challenging to obtain and annotate. Most existing datasets have only a few samples, which makes training DL models on these datasets challenging. One potential solution to the challenge of data insufficiency is to use data augmentation techniques that increase the number of samples during training. For this, we first split the dataset into training, validation, and testing sets. We then apply different data augmentation methods to each set, including center crop, random rotation, transpose, elastic transform, etc. More details about the augmentation techniques we used can be found in our GitHub repository. A single image was converted into different images; thus, in total, images including the original image. The same augmentation techniques were applied to all four datasets.
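For segmentation, every geometric augmentation has to be applied to the image and its mask in the same way. The sketch below shows this for two of the listed operations, transpose and center crop, on nested lists. It is a simplified illustration, not the pipeline from the repository: the helper names and the toy data are hypothetical, and random rotation and elastic transform are left out.

```python
from typing import List, Tuple

Grid = List[List[int]]  # a single-channel image or mask, stored row by row


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def center_crop(grid: Grid, size: int) -> Grid:
    """Keep the central size x size window."""
    top = (len(grid) - size) // 2
    left = (len(grid[0]) - size) // 2
    return [row[left:left + size] for row in grid[top:top + size]]


def augment_pair(image: Grid, mask: Grid, crop_size: int) -> List[Tuple[Grid, Grid]]:
    """Return the original pair plus transposed and center-cropped copies."""
    return [
        (image, mask),
        (transpose(image), transpose(mask)),
        (center_crop(image, crop_size), center_crop(mask, crop_size)),
    ]


# Toy 4x4 image with a 2x2 object in the middle of the mask.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
pairs = augment_pair(image, mask, crop_size=2)
```

Each returned pair stays aligned pixel for pixel, so the cropped mask still marks exactly the pixels that remain in the cropped image.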
|Mask R-CNN with Resnet101||0.7042||0.6124||-||-|
|Fully Convolutional Network||-||-||0.7732||0.8999|
|Multi-scale patch-based CNN||0.8130||-||0.7860||0.8090|
|MultiResUNet with data augmentation||-||0.8497||-||-|
|Conditional generative adversarial network||0.8848||0.8127||-||-|
In this section, we present the results and compare them with the baselines on the respective datasets. U-Net is still considered the baseline for various medical image segmentation tasks. Therefore, we compare the proposed model with U-Net by using the same data augmentation techniques as described above to demonstrate its effectiveness. We also report the results on four datasets and show qualitative results to illustrate the usefulness of DoubleU-Net. In all of the figures showing qualitative results, the images are presented in the order input, ground truth, intermediate output, and final output.
V-A Comparison on MICCAI 2015 segmentation dataset
Our quantitative results on the MICCAI 2015 segmentation challenge dataset are summarized in Table II. The experimental results show that DoubleU-Net achieved a DSC of and a mIoU of . From Table II, we can see that DoubleU-Net outperforms the baseline by in terms of DSC and in mIoU. From the table, we can also observe that a model that uses a pre-trained ImageNet network (for instance, Resnet101 or VGG-16) as a backbone achieves a higher score on cross-dataset evaluation than a network trained from scratch (see Table II). The visual results of the proposed model can be seen in Figure 2. From the visual analysis, we can observe that the final segmentation mask is better than the intermediate one. This also supports the significance of the proposed model over U-Net.
V-B Comparison on CVC-ClinicDB
DoubleU-Net is compared with U-Net and the recent works that used the same dataset for evaluation. Table III shows the results on the CVC-ClinicDB dataset. The evaluation results show that DoubleU-Net achieves a DSC of which is higher than and mIoU of , which is higher than . A careful visual analysis of the results shows that DoubleU-Net produces better segmentation masks than the intermediate network. The model performs reasonably well on challenging images such as flat and small polyps, which are usually missed during colonoscopy examinations (see Figure 3).
V-C Comparison on Lesion Boundary segmentation challenge dataset
The official evaluation metric for the challenge was mIoU. DoubleU-Net achieves a DSC of and mIoU of on this challenge dataset. From the quantitative results comparison (see Table IV), we can see that DoubleU-Net outperforms U-Net by an approximate margin of , and MultiResUNet by an approximate margin of in terms of mIoU on the Lesion Boundary Segmentation challenge dataset from ISIC-2018. Figure 4 shows the qualitative results. From the figure, we can see that both the intermediate output and the final output produced by the network perform well on all types of lesions, from small to medium to large. However, a careful analysis shows that the final output produced by the network is better than the intermediate one.
V-D Comparison on 2018 Data Science Bowl challenge dataset
Table V and Figure 5 present the quantitative and qualitative results on the 2018 Data Science Bowl challenge dataset. We have compared our work with UNet++. Our method produced a DSC of , which is higher than the method proposed by Zhou et al., and a mIoU comparable to U-Net and UNet++ that use Resnet101 as the backbone model. UNet++ has been used as a strong baseline for result comparison over various image segmentation tasks. Therefore, DoubleU-Net sets a new baseline for semantic image segmentation tasks.
|Modality||U-Net (DSC)||DoubleU-Net (DSC)||Overall Improvement|
|Colonoscopy (MICCAI 2015)|
|Microscopy (2018 Data Science Bowl)|
Table VI shows the DSC comparison of U-Net and DoubleU-Net. From the table, we can see that DoubleU-Net performs reasonably well as compared to U-Net on all the presented datasets. For the CVC-ClinicDB dataset, the performance of U-Net is competitive. However, for the MICCAI 2015 segmentation challenge dataset and the 2018 Data Science Bowl, DoubleU-Net has a significant DSC improvement of and respectively. Additionally, the MICCAI 2015 segmentation challenge dataset gives us the opportunity to study cross-dataset generalizability, which is critical in the medical domain. The generalization test showed that DoubleU-Net outperforms its competitors (see Table II). From the table, we observe that a model pre-trained on ImageNet performs much better on the cross-dataset test than a model trained from scratch. We trained U-Net on the CVC-ClinicDB dataset, where it is competitive with DoubleU-Net when tested on the same dataset (see Table III). The same model was then tested on the ETIS-Larib dataset. However, the performance of U-Net was poor compared to that of DoubleU-Net (see Table II). This suggests that DoubleU-Net is more generalizable and can be used for cross-dataset tests across different domains.
From the qualitative results, we can see that DoubleU-Net is capable of producing better segmentation masks even for the challenging images. This can be observed in Figure 2 and Figure 3. Moreover, Figure 4 and Figure 5 show that the model produces high-quality segmentation masks for the Lesion Boundary Segmentation challenge dataset and the 2018 Data Science Bowl challenge dataset. The overall qualitative results show that the model performs well on different multi-organ and multi-centered medical image segmentation datasets. Thus, the above results suggest the robustness of the proposed model.
From the above experiments, we observed that transfer learning from a pre-trained ImageNet network significantly improves the results on every dataset, which helps compensate for the lack of sufficient training data. The qualitative and quantitative results suggest using DoubleU-Net as a baseline for result comparisons on the four medical image segmentation datasets.
In this paper, we have proposed a novel CNN architecture called DoubleU-Net. The DoubleU-Net has five main components, namely two U-Net networks, VGG-19, a squeeze-and-excite block and ASPP. The performance of DoubleU-Net is significantly better when compared with the baselines and U-Net on all four datasets.
Moreover, the proposed architecture is flexible, which makes it possible to integrate other CNN blocks into DoubleU-Net. We believe that the segmentation results can be improved by further integrating different CNN blocks and by the use of post-processing techniques such as conditional random fields and Otsu thresholding.
In the future, we plan to research building one model for different medical image segmentation tasks and to focus on simplifying the architecture while retaining its ability to produce high-quality segmentation masks. A limitation of DoubleU-Net is that it uses more parameters than U-Net, which leads to an increase in training time. Future research should focus more on designing simplified architectures with fewer parameters while maintaining this ability.
This work is funded in part by the Research Council of Norway, project number 263248 (Privaton). The computations in this paper were performed on equipment provided by the Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053.
- TensorFlow: a system for large-scale machine learning. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 265–283.
- (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495.
- (2020) A multi-scale patch-based deep learning system for polyp segmentation. In Advanced Computing and Systems for Security, pp. 109–119.
- (2015) WM-DOVA maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics 43, pp. 99–111.
- (2017) Fully convolutional neural networks for polyp segmentation in colonoscopy. In Medical Imaging 2017: Computer-Aided Diagnosis, Vol. 10134, pp. 101340F1 – 101340F1.
- (2018) DRINet for medical image segmentation. IEEE Transactions on Medical Imaging 37 (11), pp. 2453–2462.
- (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
- (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
- (2015) Keras.
- (2018) Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In IEEE International Symposium on Biomedical Imaging (ISBI), pp. 168–172.
- (2009) ImageNet: a large-scale hierarchical image database. In , pp. 248–255.
- (2018) Squeeze-and-excitation networks. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141.
- (2020) MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks 121, pp. 74–87.
- (2020) Kvasir-SEG: a segmented polyp dataset. In International Conference on Multimedia Modeling (MMM), pp. 451–462.
- (2019) ResUNet++: an advanced architecture for medical image segmentation. In Proceedings of IEEE International Symposium on Multimedia (ISM), pp. 225–2255.
- (2015) GPSSI: Gaussian process for sampling segmentations of images. In Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 38–46.
- (2017) Colorectal polyp segmentation using a fully convolutional neural network. In Proceedings of International Congress on Image and Signal Processing, BioMedical Engineering and Informatics, pp. 1–5.
- (2017) A survey on deep learning in medical image analysis. Medical Image Analysis (MedIA) 42, pp. 60–88.
- (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
- Colorectal segmentation using multiple encoder-decoder network in colonoscopy images. In Proceedings of International Conference on Artificial Intelligence and Knowledge Engineering, pp. 208–211.
- (2019) Polyp segmentation using generative adversarial network. In International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 7201–7204.
- (2019) Polyp detection and segmentation using Mask R-CNN: does a deeper feature extractor CNN always perform better?. In Proceedings of International Symposium on Medical Information and Communication Technology (ISMICT), pp. 1–6.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241.
- (2020) Robust medical instrument segmentation challenge 2019. arXiv preprint arXiv:2003.10299.
- (2014) Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery 9 (2), pp. 283–293.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2020) An extensive study on cross-dataset bias and evaluation metrics interpretation for machine learning applied to gastrointestinal tract abnormality classification. arXiv preprint arXiv:2005.03912.
- (2018) The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5, pp. 180161.
- (2018) Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy. Nature Biomedical Engineering 2 (10), pp. 741–748.
- (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
- (2018) Road extraction by deep residual U-Net. Geoscience and Remote Sensing Letters 15 (5), pp. 749–753.
- (2013) An overview of interactive medical image segmentation. Annals of the BMVA 2013 (7), pp. 1–22.
- (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890.
- (2019) UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Transactions on Medical Imaging.