Image segmentation is the process of delineating regions of interest in an image and is one of the primary tasks in medical imaging. Identifying these regions enables numerous applications in the medical domain, including: segmentation of glands in histology images, which is an indicator of cancer severity; segmentation of the optic cup and disc in retinal fundus images, which is used in glaucoma screening; segmentation of lung nodules in chest computed tomography, which aids physicians in differentiating malignant lesions from benign ones; and segmentation of polyps in colonoscopy images, which helps in diagnosing cancer in its early stages [Litjens et al.(2017)Litjens, Kooi, Bejnordi, Setio, Ciompi, Ghafoorian, van der Laak, van Ginneken, and Sánchez].
Traditional approaches to image segmentation include active contours [Kass et al.(1988)Kass, Witkin, and Terzopoulos], normalized cuts [Shi and Malik(2000)] and random walks [Grady(2006)]. Recently, fully convolutional networks (FCNs) such as U-Net [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] have been shown to be highly suitable for semantic segmentation of medical images across almost all modalities [Ker et al.(2018)Ker, Wang, Rao, and Lim] and to achieve better results than the traditional methods. Current research using U-Net mainly follows two approaches: the first modifies the U-Net architecture by adding residual, dense and multi-scale blocks [Yu et al.(2017)Yu, Chen, Dou, Qin, and Heng, Shankaranarayana et al.(2017)Shankaranarayana, Ram, Mitra, and Sivaprakasam]; the second modifies the loss function by adding Dice and Jaccard coefficients to the standard cross-entropy while keeping U-Net as the base network [Milletari et al.(2016)Milletari, Navab, and Ahmadi].
These approaches have provided a significant improvement in segmentation. However, medical images require segmentations of greater precision. One reason for poor precision is that the regions of interest usually occupy a small area in medical images, resulting in severe foreground-background class imbalance. In addition, the network lacks knowledge of an object's shape because spatial information is lost in the encoder through max-pooling, which results in irregular segmentation. Recently, multiple works [Sarker et al.(2018)Sarker, Rashwan, Akram, Banu, Saleh, Singh, Chowdhury, Abdulwahab, Romani, Radeva, and Puig, Yan et al.(2018)Yan, Yang, and Cheng] have addressed these issues, but such networks are not capable of explicitly learning the spatial information. To this end, we propose a novel architecture which is capable of learning both class and spatial information explicitly through a joint learning framework.
Two related works are of particular interest: 1) The network DCAN, proposed by [Chen et al.(2016)Chen, Qi, Yu, and Heng], provided better segmentation results with the help of contours. 2) The network proposed by [Tan et al.(2018)Tan, Zhao, Yan, Li, Metaxas, and Zhan] showed improvement over DCAN with the help of distance maps. Both methods use FCNs with a single encoder block and two decoder blocks, where one decoder is dedicated to the segmentation task and the other to the auxiliary task. However, these networks have complicated architectures and a large number of parameters, leading to longer training and inference times while also requiring more compute resources.
In this paper, as the main contribution, we propose a minimalistic deep network for the task of joint shape learning and segmentation. The proposed architecture has significantly fewer parameters while maintaining the performance of, and in many cases even outperforming, the previous methods. We also explore numerous ways in which the spatial information can be incorporated and study their effects on performance. We conduct multiple experiments on two different kinds of medical images, namely optic disc and cup segmentation from retinal color fundus images and polyp segmentation from endoscopic images, and report state-of-the-art results.
In this section, we first present the novel end-to-end multi-task architecture for improving semantic segmentation, which is capable of exploiting spatial and structural information along with the class information while keeping the number of parameters low. We then present the ways in which structural information was obtained to aid the network. Next, we explain how the network was trained to learn the class information and the spatial information using different loss functions.
2.1 Network Architecture
The proposed architecture is shown in Figure 1. The architecture is an FCN consisting of two components. The first component of the network is similar to U-Net [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox]. The second component consists of parallel convolutional blocks for multi-task learning.
The first component has an encoder-decoder architecture, with the encoder providing a contracting path and the decoder providing an expansive path. The encoder consists of repeated applications of convolutions with a kernel size of 3x3 and stride 1, each followed by a rectified linear unit (ReLU) activation and 2x2 max pooling with stride 2 for downsampling. Each such step doubles the number of feature maps and halves their spatial dimensions. The final encoder stage uses 4x4 max pooling, with the other elements remaining the same. In the decoder path, the feature map is first upsampled by a factor of 4, followed by repeated upsampling by a factor of 2. Each feature map in the decoder is concatenated with the corresponding feature map from the encoder, which helps in retaining features from different scales.
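The resolution and channel bookkeeping of the contracting and expansive paths can be traced with a small sketch. The input size of 256, the base width of 32 channels, the four stages, and a pooling stride equal to the pooling window are illustrative assumptions, not values the text states for the encoder.

```python
def encoder_shapes(size=256, base_channels=32, num_stages=4):
    """Return (channels, spatial_size) after each encoder stage.

    Every stage has twice the channels of the previous one. Pooling is 2x2
    except for the last stage, which uses 4x4 as described in the text.
    base_channels and num_stages are assumed values for illustration.
    """
    shapes = []
    channels = base_channels
    for stage in range(num_stages):
        pool = 4 if stage == num_stages - 1 else 2
        size //= pool  # assumes the pooling stride equals the window size
        shapes.append((channels, size))
        channels *= 2
    return shapes


def decoder_sizes(bottleneck_size, num_stages=4):
    """Return the spatial size after each decoder stage (x4 first, then x2)."""
    sizes = []
    size = bottleneck_size
    for stage in range(num_stages):
        size *= 4 if stage == 0 else 2
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(encoder_shapes())   # [(32, 128), (64, 64), (128, 32), (256, 8)]
    print(decoder_sizes(8))   # [32, 64, 128, 256]
```

Under these assumptions the decoder sizes 32, 64 and 128 line up with encoder feature maps of the same resolution, which is what makes the concatenation possible.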
As shown in Figure 1, the top path in the second component is the classification branch responsible for estimating the segmentation mask, while the bottom path is used for the auxiliary task, which estimates either the contour map as a classification task or the distance map as a regression task. For mask and contour estimation, a 3x3 convolution is applied to obtain 2 feature maps, while for the distance map the same convolution is applied to obtain 1 feature map.
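A minimal NumPy sketch of the two parallel heads follows. It applies separate 3x3 convolutions to one shared set of decoder features. Zero padding, the absence of a bias term, the random weights and the 16x16 feature size are assumptions made for illustration; the models themselves are implemented in PyTorch.

```python
import numpy as np


def conv3x3(features, weights):
    """3x3 cross-correlation with zero padding and stride 1 (no bias).

    features: (C_in, H, W), weights: (C_out, C_in, 3, 3) -> (C_out, H, W)
    """
    padded = np.pad(features, ((0, 0), (1, 1), (1, 1)))
    # windows has shape (C_in, H, W, 3, 3): one 3x3 patch per output pixel
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", windows, weights)


def softmax(logits):
    """Softmax over the channel axis (axis 0)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def parallel_heads(features, w_mask, w_aux):
    """Apply the mask head and the auxiliary head to the same decoder features."""
    mask_prob = softmax(conv3x3(features, w_mask))  # 2 maps -> class probabilities
    aux_out = conv3x3(features, w_aux)              # 2 maps (contour) or 1 map (distance)
    return mask_prob, aux_out


rng = np.random.default_rng(0)
features = rng.standard_normal((32, 16, 16))        # 32 decoder feature maps
w_mask = rng.standard_normal((2, 32, 3, 3)) * 0.01
w_contour = rng.standard_normal((2, 32, 3, 3)) * 0.01
w_distance = rng.standard_normal((1, 32, 3, 3)) * 0.01

mask_prob, contour_logits = parallel_heads(features, w_mask, w_contour)
_, distance_raw = parallel_heads(features, w_mask, w_distance)
print(mask_prob.shape, contour_logits.shape, distance_raw.shape)
```

Both heads read the same features, so the auxiliary task adds only one small convolution on top of the shared decoder.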
2.2 Capturing Structural Information
We explored multiple techniques to capture spatial and structural information. We harness the spatial information that is implicitly present in the ground truth segmentation masks in two ways: using the contours obtained from the segmentation maps and using the Euclidean distance transforms computed from the segmentation maps. To obtain the contour map C, we first extract the boundaries of connected components from the ground truth segmentation maps, which are subsequently dilated using a disk filter of a fixed radius. We also explore various kinds of distance transform maps. Using a distance map allows us to assign a value to each pixel in an image relative to the nearest boundary of the segmentation map. This alleviates the pixel-wise class imbalance which arises in the segmentation maps. Thus, for an image, we assign values $\{d_i\}_{i=1}^{N}$ to all the pixels, with $N$ being the total number of pixels and $d_i$ being the distance of pixel $i$ to the closest boundary of the mask. We propose to use three kinds of distance maps based on the values assigned to the pixels:
- For the first case, D1, we assign positive distances to all points outside the boundary and zero to all points inside the boundary, i.e. the mask region.
- For the second case, D2, we assign positive distances to the points inside and outside the boundary and zero to the pixels on the boundary.
- For the third case, D3, we assign positive distances to the points outside the boundary, negative distances to the points inside the boundary, and zero to the points on the boundary.
Figure 2 is a visualization of the distance maps in 2D and 3D form. We show that the choice of distance map is also an important factor which affects the model performance.
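The contour map and the three distance-map variants can be sketched from a binary ground truth mask as follows. This is a brute-force NumPy illustration, not the authors' preprocessing code: the 4-connected boundary definition, the function names and the default `radius` are assumptions, and a library distance transform would be the practical choice for full-size images.

```python
import numpy as np


def boundary_of(mask):
    """Foreground pixels that have at least one 4-connected background neighbour."""
    m = mask.astype(bool)
    p = np.pad(m, 1, constant_values=False)
    neighbours_fg = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return m & ~neighbours_fg


def distance_to_boundary(mask):
    """Euclidean distance from every pixel to the nearest boundary pixel.

    Brute force, O(pixels x boundary pixels); assumes a non-empty mask.
    """
    ys, xs = np.nonzero(boundary_of(mask))
    yy, xx = np.indices(mask.shape)
    sq = (yy[..., None] - ys) ** 2 + (xx[..., None] - xs) ** 2
    return np.sqrt(sq.min(axis=-1))


def contour_map(mask, radius=1.0):
    """Boundary dilated by a disk: all pixels within `radius` of the boundary.

    `radius` is a hypothetical value chosen for illustration.
    """
    return distance_to_boundary(mask) <= radius


def distance_maps(mask):
    """Return the three distance-map variants (D1, D2, D3) for a binary mask."""
    m = mask.astype(bool)
    d = distance_to_boundary(m)
    d1 = np.where(m, 0.0, d)   # positive outside, zero inside the mask
    d2 = d                     # positive on both sides, zero on the boundary
    d3 = np.where(m, -d, d)    # positive outside, negative inside
    return d1, d2, d3


mask = np.zeros((7, 7), dtype=np.uint8)
mask[2:5, 2:5] = 1
d1, d2, d3 = distance_maps(mask)
print(d1[3, 3], d2[3, 3], d3[3, 3])   # centre pixel: 0.0 1.0 -1.0
```

D3 changes sign across the boundary rather than folding at it, so it varies smoothly through the region where D1 and D2 have a kink.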
2.3 Loss function
The mask prediction is a classification task, and Negative Log Likelihood (NLL) is used as its loss function. The mask prediction is regularized by either the contour or the distance map learning task. For the classification task of contour prediction, NLL is used as the loss function. For the regression task of distance map prediction, Mean Square Error (MSE) is used as the loss function. The combined loss functions for the mask-contour pair and the mask-distance pair are formulated below.
2.3.1 Contour constraint:
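The loss term for using the contour map as a constraint is given by

$$L_{total} = \lambda_1 L_{mask} + \lambda_2 L_{contour} \tag{1}$$

$$L_{mask} = -\sum_{x \in \Omega} \log p_{mask}(x; l_{mask}(x)) \tag{2}$$

$$L_{contour} = -\sum_{x \in \Omega} \log p_{contour}(x; l_{contour}(x)) \tag{3}$$

$L_{mask}$ and $L_{contour}$ denote the pixel-wise classification errors, and $\lambda_1$ and $\lambda_2$ weight the two terms. $x$ is the pixel position in image space $\Omega$. $p_{mask}(x; l_{mask})$ denotes the predicted probability for the true label $l_{mask}$ after the softmax activation function. $p_{contour}(x; l_{contour})$ denotes the predicted probability for the true label $l_{contour}$ after the softmax activation function.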
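A minimal NumPy sketch of the contour-constrained loss is given here, assuming two-class logits of shape (K, H, W) and integer label maps of shape (H, W). The weights `lam1` and `lam2` and the summation over pixels (rather than averaging) are illustrative assumptions.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis (axis 0)."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))


def nll_loss(logits, labels):
    """Sum over pixels of -log p(x; l(x)).

    logits: (K, H, W) raw scores, labels: (H, W) integer class indices.
    """
    logp = log_softmax(logits)
    picked = np.take_along_axis(logp, labels[None], axis=0)[0]
    return -picked.sum()


def contour_constrained_loss(mask_logits, mask_labels,
                             contour_logits, contour_labels,
                             lam1=1.0, lam2=1.0):
    """Weighted sum of the mask NLL and the contour NLL (weights are assumed)."""
    return (lam1 * nll_loss(mask_logits, mask_labels)
            + lam2 * nll_loss(contour_logits, contour_labels))


logits = np.zeros((2, 2, 2))             # uniform prediction: p = 0.5 everywhere
labels = np.zeros((2, 2), dtype=np.int64)
print(contour_constrained_loss(logits, labels, logits, labels))  # 8 * ln(2)
```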
2.3.2 Distance constraint:
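The loss term for using the distance map as a constraint is given by

$$L_{total} = \lambda_1 L_{mask} + \lambda_2 L_{distance} \tag{4}$$

$$L_{distance} = \frac{1}{|\Omega|} \sum_{x \in \Omega} \left( \hat{D}(x) - D(x) \right)^2 \tag{5}$$

$L_{mask}$ is from equation 2 and $L_{distance}$ denotes the pixel-wise mean square error; $\lambda_1$ and $\lambda_2$ weight the two terms. $\hat{D}(x)$ is the estimated distance map after the sigmoid activation function, while $D(x)$ is the ground truth.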
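The distance term can be sketched in NumPy as a sigmoid followed by a mean square error. Because the sigmoid bounds the prediction to (0, 1), this sketch assumes the ground truth distance map has been rescaled to that range; the text does not state how the target is normalized, so treat that as an assumption.

```python
import numpy as np


def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def distance_loss(raw_prediction, target):
    """Pixel-wise mean square error between sigmoid(raw_prediction) and target.

    raw_prediction: (H, W) output of the distance head before activation.
    target: (H, W) ground truth distance map, assumed rescaled to [0, 1].
    """
    estimate = sigmoid(raw_prediction)
    return np.mean((estimate - target) ** 2)


raw = np.zeros((4, 4))                            # sigmoid(0) = 0.5 at every pixel
print(distance_loss(raw, np.full((4, 4), 0.5)))   # 0.0
print(distance_loss(raw, np.zeros((4, 4))))       # 0.25
```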
3 Experiments and Results
3.1 Dataset and Pre-processing
We use the ORIGA dataset [Zhang et al.(2010)Zhang, Yin, Liu, Wong, Tan, Lee, Cheng, and Wong] for the task of optic disc and cup segmentation. The dataset contains color fundus images along with pixel-wise annotations for the optic disc and the cup.
We obtain the final segmentation map by thresholding the output probabilities, similar to the work in [Fu et al.(2018)Fu, Cheng, Xu, Wong, Liu, and Cao]. We then fit an ellipse to the segmentation outputs for both cup and disc.
We also use the polyp segmentation dataset from MICCAI 2018 Gastrointestinal Image ANalysis (GIANA) [Vázquez et al.(2017)Vázquez, Bernal, Sánchez, Fernández-Esparrach, López, Romero, Drozdzal, and Courville] for evaluating the models, because polyps have large variations in shape. The dataset consists of 912 images with ground truth masks. The dataset is randomly shuffled and split into 70% for training and 30% for testing. The images are center-cropped to square dimensions and resized to 256×256 before use.
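The shuffle-and-split and center-crop steps can be sketched as below. The seed, the rounding rule and the function names are assumptions; the text gives only the 70/30 ratio, so the exact image counts printed here are a consequence of this sketch rather than reported figures. Resizing to 256×256 is left out because it depends on the image library used.

```python
import numpy as np


def split_indices(num_images, train_fraction=0.7, seed=0):
    """Shuffle image indices and split them into train and test subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_images)
    num_train = int(round(train_fraction * num_images))
    return order[:num_train], order[num_train:]


def center_crop_square(image):
    """Crop the largest centered square from an (H, W) or (H, W, C) array."""
    height, width = image.shape[:2]
    side = min(height, width)
    top = (height - side) // 2
    left = (width - side) // 2
    return image[top:top + side, left:left + side]


train_idx, test_idx = split_indices(912)
print(len(train_idx), len(test_idx))
print(center_crop_square(np.zeros((300, 400, 3))).shape)
```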
3.2 Implementation Details
The models are implemented in PyTorch [Paszke et al.(2017)Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer]. Each model is trained for 150 epochs with the Adam optimizer, a learning rate of 1e-4 and a batch size of 4. All experiments are conducted on an NVIDIA GeForce GTX 1060 with 6GB of VRAM.
3.3 Results and Discussion
The metrics used for evaluating the performance of the network are Jaccard and Dice, which are explained in Appendix A. The abbreviations used in this section are Encoder (Enc), Decoder (Dec), Mask (M), Contour (C), Distance (D) and parallel convolution block after U-Net (Conv). The results of the proposed networks (1Enc 1Dec + Conv MC and 1Enc 1Dec + Conv MD) are compared with the following combinations of networks and loss functions:
- A network (1Enc 1Dec M) [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] with a single encoder and a single decoder, with NLL as the loss function for mask estimation.
- A network (1Enc 2Dec MC) [Chen et al.(2016)Chen, Qi, Yu, and Heng] with a single encoder and two decoders, with NLL as the loss function for both mask and contour estimation.
- A network (1Enc 2Dec MD) [Tan et al.(2018)Tan, Zhao, Yan, Li, Metaxas, and Zhan] with a single encoder and two decoders, with NLL as the loss function for mask estimation and MSE as the loss function for distance estimation.
|Architecture||Cup Dice||Cup Jaccard||Disc Dice||Disc Jaccard||Polyp Dice||Polyp Jaccard|
|1Enc 1Dec M [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox]||0.8655||0.7712||0.9586||0.9215||0.8125||0.7323|
|1Enc 2Dec MC [Chen et al.(2016)Chen, Qi, Yu, and Heng]||0.8715||0.7803||0.9646||0.9324||0.8151||0.7391|
|1Enc 2Dec MD [Tan et al.(2018)Tan, Zhao, Yan, Li, Metaxas, and Zhan]||0.8723||0.7807||0.9665||0.9358||0.8283||0.7482|
|1Enc 1Dec + Conv MC (Ours)||0.8717||0.7798||0.9643||0.9318||0.8152||0.7383|
|1Enc 1Dec + Conv MD (Ours)||0.8721||0.7805||0.9662||0.9348||0.8291||0.7514|
|Architecture||Running time (ms)||No. of parameters|
|1Enc 1Dec M [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox]||1.3131||7844256|
|1Enc 2Dec MC [Chen et al.(2016)Chen, Qi, Yu, and Heng]||1.8677||10978272|
|1Enc 2Dec MD [Tan et al.(2018)Tan, Zhao, Yan, Li, Metaxas, and Zhan]||1.8531||10977984|
|1Enc 1Dec + Conv MC (Ours)||1.3384||7844832|
|1Enc 1Dec + Conv MD (Ours)||1.3235||7844544|
From Table 1, it can be seen that the networks 1Enc 2Dec MC and 1Enc 1Dec + Conv MC have similar results for cup, disc and polyp segmentation. Likewise, the networks 1Enc 2Dec MD and 1Enc 1Dec + Conv MD have nearly equal results for cup, disc and polyp segmentation. This shows that a single decoder with two convolution paths achieves results equivalent to those of a network with two parallel decoders, which indicates that a single decoder is in itself sufficient to reconstruct the mask, contour and distance features from the encoder representation. This reasoning can be supported by visualizing the 32 decoder feature maps obtained before the parallel convolution blocks. From Figure 3, it can be seen that feature maps 1 and 16 approximate the mask and the contour, respectively. Similarly, the distance maps can be obtained through a linear combination of feature maps. The remaining feature maps can be viewed in Appendix B. This supports our claim that only a few convolutions are required after the final decoder layer to obtain the contour as well as the distance maps.
Also, because a single decoder is used, the decoder parameters are halved compared to the networks with two decoders. The parallel convolution paths added at the end of the decoder have very little effect on the number of parameters. From Table 2, it can be seen that our proposed networks 1Enc 1Dec + Conv MC and 1Enc 1Dec + Conv MD have nearly the same number of parameters as 1Enc 1Dec M (U-Net). Table 2 also shows that the proposed networks have about 29% fewer parameters in total than 1Enc 2Dec MC and 1Enc 2Dec MD (roughly 7.84 million versus 10.98 million).
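The small parameter overhead of the parallel convolution paths can be checked against Table 2 with simple arithmetic. The sketch assumes each auxiliary head is a bias-free 3x3 convolution over the 32 decoder feature maps; the bias-free part is an inference from the table values, not something the text states.

```python
def conv_params(in_channels, out_channels, kernel_size=3, bias=False):
    """Number of learnable parameters in a 2D convolution layer."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


UNET_PARAMS = 7_844_256        # 1Enc 1Dec M, from Table 2
TWO_DECODER_MC = 10_978_272    # 1Enc 2Dec MC, from Table 2

contour_head = conv_params(32, 2)    # 3x3 conv, 32 -> 2 maps: 576 parameters
distance_head = conv_params(32, 1)   # 3x3 conv, 32 -> 1 map: 288 parameters

ours_mc = UNET_PARAMS + contour_head     # 7_844_832, matches Table 2
ours_md = UNET_PARAMS + distance_head    # 7_844_544, matches Table 2
reduction_mc = 1 - ours_mc / TWO_DECODER_MC

print(ours_mc, ours_md, f"{100 * reduction_mc:.1f}% fewer than 1Enc 2Dec MC")
```

Under this assumption the auxiliary head accounts for fewer than 600 of roughly 7.8 million parameters.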
The running time is the average time taken by the network to process a single image. The running time depends on the number of parameters: a network with more parameters generally has a longer running time than a network with fewer parameters. From Table 2, it can be seen that our proposed networks 1Enc 1Dec + Conv MC and 1Enc 1Dec + Conv MD have a running time nearly equal to that of 1Enc 1Dec M (U-Net). The proposed networks also show a speed-up of nearly 1.4× compared to 1Enc 2Dec MC and 1Enc 2Dec MD.
Some of the results obtained using our best network (1Enc 1Dec + Conv MD) are shown in Figure 4. In the figure, the first row depicts segmentation results on polyp test data, while the second row depicts segmentation results on cup and disc test data. In the images, the red contour denotes the ground truth and the yellow contour denotes the predicted output.
The networks 1Enc 1Dec + Conv MC and 1Enc 1Dec + Conv MD output contour and distance maps, respectively, along with the masks. These contour and distance maps help in regularizing the segmentation results. In Figures 5 and 6, the predicted masks, contour maps and distance maps are compared with the corresponding ground truth. From Figure 5, it can be seen that the predicted masks are region-filled versions of the estimated contours. A similar effect can be seen in Figure 6, where the masks are contained by the predicted distance maps. Since the distance map is obtained by regression, the prediction is not pixel-level accurate, but it is very close to the ground truth distance map. This shows how well distance maps act as regularizers.
The difficulty of accurate segmentation is attributed to variability in shape, texture, size, and color. In terms of shape, polyps have higher variability than the cup and disc, and the cup has higher variability than the disc. Because of this, the disc has the highest Dice and Jaccard scores compared to both cup and polyp, as can be verified in Table 1.
Therefore, polyp and cup segmentation are the better choices for evaluating the effect of the distance map. Table 3 shows the results of using the D1, D2 and D3 distance maps as constraints for cup, disc, and polyp. For the disc, there is not much difference in the scores, while for the cup there is a slight improvement when using D3 over the others. For polyps, however, using D3 shows considerable improvement over the others. In Figure 7, results obtained using the three distance maps as regularizers are shown and compared with the ground truth. The mask obtained with D3 as a regularizer is smoother and more accurate than the others.
An intuitive explanation for this observation could be that the network performs better in the absence of discontinuities and in the presence of smooth variations; as seen in Figure 2, the distance map D3 is smoother than the distance maps D1 and D2, which could be the reason for its superior performance.
In this paper, we proposed a deep multi-task network for the joint task of segmentation and shape learning. The network was shown to perform comparably to, and in certain cases better than, the previously proposed state-of-the-art FCNs, with the advantage of having fewer parameters and thereby requiring less time for training and inference. We also explored different ways in which spatial information can be incorporated and showed the impact of different distance maps on the segmentation tasks. A useful direction for future work would be to explore ways of learning shape information other than the contour map or the distance map.
- [Chen et al.(2016)Chen, Qi, Yu, and Heng] H. Chen, X. Qi, L. Yu, and P. Heng. DCAN: Deep contour-aware networks for accurate gland segmentation. doi: 10.1109/CVPR.2016.273.
- [Fu et al.(2018)Fu, Cheng, Xu, Wong, Liu, and Cao] Huazhu Fu, Jun Cheng, Yanwu Xu, Damon Wing Kee Wong, Jiang Liu, and Xiaochun Cao. Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. IEEE Transactions on Medical Imaging, 2018.
- [Grady(2006)] Leo Grady. Random walks for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 28(11):1768–1783, 2006.
- [Kass et al.(1988)Kass, Witkin, and Terzopoulos] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. International journal of computer vision, 1(4):321–331, 1988.
- [Ker et al.(2018)Ker, Wang, Rao, and Lim] Justin Ker, Lipo Wang, Jai Rao, and Tchoyoson Lim. Deep learning applications in medical image analysis. IEEE Access, 6:9375–9389, 2018.
- [Litjens et al.(2017)Litjens, Kooi, Bejnordi, Setio, Ciompi, Ghafoorian, van der Laak, van Ginneken, and Sánchez] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A. W. M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, Dec 2017. ISSN 1361-8415. doi: 10.1016/j.media.2017.07.005.
- [Milletari et al.(2016)Milletari, Navab, and Ahmadi] F. Milletari, N. Navab, and S. Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571, Oct 2016. doi: 10.1109/3DV.2016.79.
- [Paszke et al.(2017)Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- [Sarker et al.(2018)Sarker, Rashwan, Akram, Banu, Saleh, Singh, Chowdhury, Abdulwahab, Romani, Radeva, and Puig] Md. Mostafa Kamal Sarker, Hatem A. Rashwan, Farhan Akram, Syeda Furruka Banu, Adel Saleh, Vivek Kumar Singh, Forhad U. H. Chowdhury, Saddam Abdulwahab, Santiago Romani, Petia Radeva, and Domenec Puig. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, pages 21–29, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00934-2.
- [Shankaranarayana et al.(2017)Shankaranarayana, Ram, Mitra, and Sivaprakasam] Sharath M Shankaranarayana, Keerthi Ram, Kaushik Mitra, and Mohanasankar Sivaprakasam. Joint optic disc and cup segmentation using fully convolutional and adversarial networks. In Fetal, Infant and Ophthalmic Medical Image Analysis, pages 168–176. Springer, 2017.
- [Shi and Malik(2000)] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
- [Tan et al.(2018)Tan, Zhao, Yan, Li, Metaxas, and Zhan] C. Tan, L. Zhao, Z. Yan, K. Li, D. Metaxas, and Y. Zhan. Deep multi-task and task-specific feature learning network for robust shape preserved organ segmentation. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 1221–1224, April 2018. doi: 10.1109/ISBI.2018.8363791.
- [Vázquez et al.(2017)Vázquez, Bernal, Sánchez, Fernández-Esparrach, López, Romero, Drozdzal, and Courville] David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, and Aaron Courville. A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of healthcare engineering, 2017, 2017.
- [Yan et al.(2018)Yan, Yang, and Cheng] Zengqiang Yan, Xin Yang, and Kwang-Ting Tim Cheng. A deep model with shape-preserving loss for gland instance segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, pages 138–146, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00934-2.
- [Yu et al.(2017)Yu, Chen, Dou, Qin, and Heng] L. Yu, H. Chen, Q. Dou, J. Qin, and P. Heng. Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE Transactions on Medical Imaging, 36(4):994–1004, April 2017. ISSN 0278-0062. doi: 10.1109/TMI.2016.2642839.
- [Zhang et al.(2010)Zhang, Yin, Liu, Wong, Tan, Lee, Cheng, and Wong] Zhuo Zhang, Feng Shou Yin, Jiang Liu, Wing Kee Wong, Ngan Meng Tan, Beng Hai Lee, Jun Cheng, and Tien Yin Wong. Origa-light: An online retinal fundus image database for glaucoma analysis and research. In Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE, pages 3065–3068. IEEE, 2010.
Appendix A Evaluation metrics
Jaccard index (also known as intersection over union, IoU) is defined as the size of the intersection divided by the size of the union of the sample sets, and it is calculated as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A corresponds to the output of the method and B to the actual ground truth.
Dice similarity score is a statistic also used for comparing the similarity of two samples. It is calculated as follows:

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where X and Y correspond, respectively, to the output of the method and the ground truth image.
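Both metrics can be computed directly from binary masks, as in this minimal NumPy sketch. It uses the standard set definitions, intersection over union for Jaccard and twice the intersection over the sum of sizes for Dice. Returning 1.0 when both masks are empty is a convention assumed here, not one taken from the text.

```python
import numpy as np


def jaccard(prediction, ground_truth):
    """Jaccard index |A & B| / |A | B| for two binary masks."""
    a = np.asarray(prediction, dtype=bool)
    b = np.asarray(ground_truth, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return np.logical_and(a, b).sum() / union


def dice(prediction, ground_truth):
    """Dice score 2 |X & Y| / (|X| + |Y|) for two binary masks."""
    x = np.asarray(prediction, dtype=bool)
    y = np.asarray(ground_truth, dtype=bool)
    total = x.sum() + y.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(x, y).sum() / total


pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
print(jaccard(pred, truth), dice(pred, truth))  # one shared pixel: 1/3 and 0.5
```

For a single pair of masks the two scores are related by Dice = 2J / (1 + J), so Dice is never lower than Jaccard.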