ClamNet: Using contrastive learning with variable depth Unets for medical image segmentation

06/10/2022
by   Samayan Bhattacharya, et al.

Unets have become the standard method for semantic segmentation of medical images, along with fully convolutional networks (FCNs). Unet++ was introduced as a variant of Unet in order to solve some of the problems facing Unets and FCNs. Unet++ provided a built-in ensemble of Unets of variable depth, eliminating the need for professionals to estimate the most suitable depth for a task. While Unet and all its variants, including Unet++, aimed to provide networks that could train well without requiring large quantities of annotated data, none of them attempted to eliminate the need for pixel-wise annotated data altogether. Obtaining such data for each disease to be diagnosed comes at a high cost; hence such data is scarce. In this paper we use contrastive learning to train Unet++ for semantic segmentation of medical images from various sources, including magnetic resonance imaging (MRI) and computed tomography (CT), without the need for pixel-wise annotations. We describe the architecture of the proposed model and the training method used. This is still a work in progress, so we abstain from including results in this paper. The results and the trained model will be made available upon publication or in subsequent versions of this paper on arXiv.


1 Introduction

Figure 1: Different stages of image transformation by the proposed method from input chest X-ray image (on the left) to output segmentation (on the right)

Image segmentation is highly important for medical images, as it allows human professionals to focus on the more challenging cases while leaving routine cases to the automated system. As opposed to classification, segmentation provides some insight into how the system analyzes the image, and is hence more reliable for applications at scale.

Encoder-decoder networks are widely used for semantic segmentation of images [1, 2, 3, 4, 5]. Unet [75] introduced skip connections between the encoder and the decoder. These improved the performance of the network by allowing the convolutional layers in the decoder to access the fine-grained feature maps from the shallow layers of the encoder. However, the appropriate depth of the network for a particular application had to be determined empirically or heuristically, leading to wasted resources and suboptimal results. The standard way to address this issue was to train networks of different depths and take an ensemble of them at inference time [6, 7, 8]. However, this practice was inefficient when deploying at scale [9, 10]. Another issue was that feature maps had to be of the same size to be combined, yet there was no theoretical guarantee that this was the most efficient combination.

Unet++ [75] was introduced with a built-in ensemble of Unets of different depths and the ability to combine feature maps of different sizes, enabling it to perform better than classical Unets. It achieved this by including additional convolutional units between the encoder and decoder parts of the standard Unet architecture. This also allowed the model to be pruned to improve efficiency without adversely affecting performance. However, it still relied heavily on pixel-wise annotated data for training. Such annotations are tedious to obtain, at the cost of many man-hours and considerable financial expenditure. Hence, large pixel-wise annotated datasets of medical images are hard to come by. Datasets with image-level labels are much more easily available; however, these are used for classification rather than semantic segmentation. Classification networks work as a black box, and insight into which part of the image the network focused on while assigning a class can only be gained with tools like Grad-CAM [76]. Such tools analyze the gradient map of a particular convolutional layer and indicate the region of interest. For medical images, the region of interest may or may not be restricted to the abnormality we are looking for when diagnosing an underlying medical condition. For example, when diagnosing pneumonia from CT images, a patch might occur in different parts of the lung, and hence the whole lung would be the region of interest for the network. This does not indicate whether there is a patch or where it is.

Contrastive learning [73] is commonly used for self-supervised learning. It works by ensuring that pixels with the same label produce the same output in two identical models, while pixels with different labels produce different outputs. In this paper, we use contrastive learning in an unsupervised setting.

In this paper we propose to use contrastive learning to train a pair of Unet++ models using labelled images without pixel-wise annotations. For this purpose, we use a probabilistic loss term. We apply our model to different types of medical images, including X-ray, magnetic resonance imaging (MRI) and computed tomography (CT). We also test our model on the diagnosis of different medical conditions, including pneumonia, carcinoma and trauma. Preliminary results are promising, and we will present a comprehensive study upon publication of our work or in subsequent versions of this paper on arXiv.

2 Related work

2.1 Self-supervised contrastive learning

This approach works by contrasting the outputs of two identical models for positive pairs (pairs of inputs belonging to the same class) and negative pairs. In recent years, several approaches have used contrastive loss [73] for learning representations of visual information [41, 42, 43, 44, 45, 46, 48]. These approaches use data augmentation techniques to generate positive pairs and contrast them with other instances. Some approaches use hard negative mining techniques [54, 55]. Others use data banks to store representations, as performance is seen to improve with the number of negative pairs [44, 53, 46]. Some approaches have extended contrastive learning to semantic segmentation by classifying individual pixels [47, 49, 50, 51, 52], using large datasets of pixel-wise annotated images.

2.2 Semantic segmentation

Convolutional Neural Networks (CNNs) have long been used for segmentation of images [56, 57]. Such models have undergone incremental improvements by using more convolutional layers in deeper networks [58, 59, 60, 61, 62, 63, 64, 65, 66]. An alternative technique is pretraining the model on a large dataset, for example the ImageNet classification dataset, or using weak supervision in the form of unlabeled images [4, 11, 12, 13, 14, 15], bounding boxes [16, 17, 18, 19], image-level labels [1, 22, 23, 18], points [2], or scribbles [20, 21].

Another approach learns pixel relations using region-based loss functions [70, 71]. A region mutual information (MI) loss is used to maximize the MI between patch label distributions [71]. A pairwise affinity loss, based on the KL divergence between the predicted class probabilities, is used by [70]. The works [67, 68, 69, 74] performed instance and semantic segmentation using metric learning based on similarity and dissimilarity of pairs. The work [72] trained the feature extractor of a semantic segmentation model by maximizing the log-likelihood of the extracted features under several vMF distribution models. At inference time, they used K-means clustering to segment the pixel features and K-nearest neighbours to assign labels to the resulting segments.

2.3 Unet++

Since Unet++ was first introduced by [35], it has been used in multiple ways. [36, 37, 38, 39] used it as a baseline, while [29, 30, 31, 32, 33, 34] proposed models inspired by the Unet++ architecture. It was used for natural image segmentation by [26], for biomedical image segmentation by [24, 25], and for satellite image segmentation by [27, 28]. It has also been used for contact map prediction by [15].

3 Proposed model

Figure 2: Proposed model architecture. The shape resembles a clam. The name might be altered at a later stage of our work.

3.1 Unet++ architecture

We use a classical Unet++ architecture with five levels of convolutional layers. The input image is of size 256x256. Each subsequent feature map has double the number of channels of the level above it, while its height and width are halved by using stride 2 for the second of the two convolutions at each level. The kernel size is larger at the lower levels, allowing more context information to be captured. The intermediate convolutions follow the architecture of the original Unet++.

As an alteration of the above approach, we try having two subsequent layers with the same feature-map dimensions, applying this change at randomly selected levels. Initial results indicate that this approach gives better results than a deeper Unet++. Both variants are illustrated in the sketch below.
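As a concrete illustration, the following minimal PyTorch sketch builds one encoder level as described above, including the optional same-dimension variant. The kernel-size schedule [3, 3, 5, 5, 7] and the ReLU activations are illustrative assumptions; only the channel doubling and the stride-2 second convolution are fixed by the description above.

```python
# Minimal sketch of one encoder level of the proposed Unet++ (Sec. 3.1).
# Assumed for illustration: kernel-size schedule and ReLU activations.
import torch.nn as nn

def encoder_level(in_ch: int, depth: int, same_dim_repeat: bool = False) -> nn.Sequential:
    k = [3, 3, 5, 5, 7][depth]            # larger kernels at lower levels (assumed schedule)
    stride = 1 if same_dim_repeat else 2  # variant: repeat a level without downsampling
    out_ch = in_ch if same_dim_repeat else in_ch * 2  # double channels when downsampling
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2),  # halves H and W
        nn.ReLU(inplace=True),
    )
```

With a 256x256 input and a stride-2 convolution at every level, the five levels produce feature maps of side 128, 64, 32, 16 and 8.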

3.2 Preprocessing

In this study we use labelled images with image-wise annotations instead of pixel-wise annotations. The images are first passed to an image segmentation network trained to detect the organ of interest. Segmenting organs is much easier owing to the ready availability of data: humans have few organs, but each organ has many diseases. After segmentation, the image is cropped so that it contains minimal background. This step is helpful for the probabilistic loss function described later in this paper. Each image is then resized to 256x256 pixels.
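A minimal sketch of this preprocessing pipeline is given below. The organ segmentation network (here the hypothetical `organ_seg_net`), the 0.5 mask threshold and bilinear resizing are illustrative assumptions.

```python
# Minimal sketch of the preprocessing step (Sec. 3.2). `organ_seg_net` is a
# hypothetical pretrained organ segmentation network returning per-pixel
# organ probabilities; the threshold and interpolation mode are assumptions.
import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, organ_seg_net: torch.nn.Module) -> torch.Tensor:
    """image: (1, 1, H, W) scan; returns a (1, 1, 256, 256) organ-centred crop."""
    with torch.no_grad():
        mask = organ_seg_net(image).squeeze() > 0.5  # binary organ mask, shape (H, W)
    ys, xs = torch.where(mask)
    if len(ys) == 0:
        crop = image  # no organ detected: fall back to the full image
    else:
        # Crop to the tight bounding box of the organ, removing background so
        # that a disease marker is more likely to lie within any given region.
        crop = image[..., int(ys.min()):int(ys.max()) + 1,
                     int(xs.min()):int(xs.max()) + 1]
    return F.interpolate(crop, size=(256, 256), mode="bilinear", align_corners=False)
```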

3.3 Contrastive training

We take two identical models, as indicated above. These are trained simultaneously on positive and negative pairs of slices of the image. The target is for both networks to produce the same output for positive pairs and different outputs for negative pairs.

A positive pair is a pair of slices belonging to the same class label. We generate a positive pair by (i) blurring the image (such that the resultant image has a resolution between 90 and 100 percent of the original) and (ii) applying stretch and compression distortions (such that the resultant image has dimensions between 80 and 100 percent of the original), followed by resizing or padding the image back to its original size. Apart from these, we also generate positive pairs by taking slices from images labelled as normal (negative for the medical condition to be diagnosed).
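The sketch below generates one such positive pair from a single slice. Gaussian blurring as the blur operator, distortion along one axis only, and bilinear resizing back to the original size are illustrative assumptions.

```python
# Minimal sketch of positive-pair generation (Sec. 3.3). Assumed: Gaussian
# blur, vertical-only stretch/compression, bilinear resizing.
import random
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def positive_pair(slice_: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """slice_: (1, 1, H, W). Returns two transformed views forming a positive pair."""
    h, w = slice_.shape[-2:]
    blurred = gaussian_blur(slice_, kernel_size=3)   # (i) mild blur
    scale = random.uniform(0.8, 1.0)                 # (ii) 80-100% of original size
    squeezed = F.interpolate(slice_, size=(int(h * scale), w),
                             mode="bilinear", align_corners=False)
    restored = F.interpolate(squeezed, size=(h, w),  # resize back to the original size
                             mode="bilinear", align_corners=False)
    return blurred, restored
```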

Negative pairs are pairs of slices that belong to different label classes. These are generated by taking one slice from a positive image and one from a negative image. Since the whole of a positive image does not contain the features of the medical condition to be diagnosed, we use the probabilistic loss function described in the next section.
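Since we do not fix the exact form of the contrastive objective here, the sketch below uses the classical margin-based formulation of Hadsell et al. [73] on the flattened outputs of the two identical networks; the margin value is an illustrative assumption. The slice-level weighting is introduced in the next section.

```python
# Minimal sketch of the slice-level contrastive objective, using the
# margin-based loss of [73] as one concrete choice. out_a and out_b are the
# outputs of the two identical Unet++ models for a pair of slices.
import torch
import torch.nn.functional as F

def contrastive_loss(out_a: torch.Tensor, out_b: torch.Tensor,
                     is_positive: bool, margin: float = 1.0) -> torch.Tensor:
    d = F.pairwise_distance(out_a.flatten(1), out_b.flatten(1))  # per-pair distance
    if is_positive:
        return (d ** 2).mean()  # pull the outputs for positive pairs together
    return (torch.clamp(margin - d, min=0.0) ** 2).mean()  # push negative pairs apart
```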

3.4 Probabilistic loss function

Since we are trying to produce pixel-wise annotations by training the model on image-wise annotated images, there is no way to be sure which part of the image contains markers of the medical condition to be diagnosed. However, we increase the probability of the marker being in a given region of the image by segmenting the organ that is expected to be affected and removing the background. We then take into account the probability of a marker being present in a given region of the image while calculating the loss.

Mathematically, the hybrid loss (a combination of a cross-entropy term and a Dice term, as in [75]) is given as

$$\mathcal{L}(Y, P) = -\frac{1}{N}\sum_{c}\sum_{i=1}^{N}\left(\frac{1}{2}\, y_{c,i}\log p_{c,i} + \frac{2\, y_{c,i}\, p_{c,i}}{y_{c,i} + p_{c,i}}\right) \qquad (1)$$

where $y_{c,i}$ represents the target label and $p_{c,i}$ the predicted probability for class $c$ and pixel $i$, with $i \in \{1, \dots, N\}$ and $N$ representing the number of pixels in one batch.

The total loss of the model is the weighted sum of the individual hybrid losses, given by $\mathcal{L}_{total} = \sum_{s=1}^{d} w_s\, \mathcal{L}_s$, where $d$ is the number of slices in the batch and $w_s$ is the weight, i.e., the probability of slice $s$ containing the marker for the medical condition to be diagnosed. If this probability is not known for a dataset, it is treated as a hyperparameter to be tuned.
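A minimal sketch of this weighted objective is given below, using the hybrid loss of Eq. (1) per slice; the tensor shapes and the numerical epsilon are illustrative assumptions, and the Dice term is computed per slice rather than per pixel.

```python
# Minimal sketch of the probabilistic loss (Sec. 3.4): the per-slice hybrid
# of cross-entropy and Dice from Eq. (1), weighted by the probability w_s
# that slice s contains the disease marker.
import torch

def probabilistic_loss(preds: torch.Tensor, targets: torch.Tensor,
                       weights: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """preds, targets: (d, C, H, W) probabilities/labels; weights: (d,)."""
    p = preds.flatten(1)   # (d, C*H*W) predicted probabilities
    y = targets.flatten(1) # (d, C*H*W) target labels
    ce = (y * torch.log(p + eps)).mean(dim=1)  # cross-entropy term of Eq. (1)
    dice = (2 * (p * y).sum(dim=1) + eps) / ((p + y).sum(dim=1) + eps)  # soft Dice term
    hybrid = -(0.5 * ce + dice)      # per-slice hybrid loss, following Eq. (1)
    return (weights * hybrid).sum()  # weighted sum over the d slices in the batch
```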

4 Conclusion

In this paper we proposed a novel method for using contrastive learning, in an unsupervised setting, to train a Unet++ architecture. We are in the process of trying out different methods and testing them on different datasets. This paper is intended to provide an overview of the method, and we make no claims about its performance here. We plan to provide a comprehensive report on our final model, the datasets used to train it, and a comparative study of our model against other commonly used architectures, upon publication of our work or in subsequent versions of this paper on arXiv.

References

  • [1] Shen, D., Wu, G. and Suk, H.I., 2017. Deep learning in medical image analysis. Annual Review of Biomedical Engineering, 19, p.221.

  • [2] Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B. and Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical image analysis, 42, pp.60-88.
  • [3] Chartrand, G., Cheng, P.M., Vorontsov, E., Drozdzal, M., Turcotte, S. and Pal, C.J., Deep Learning: A Primer for Radiologists. RadioGraphics [Internet]. 2017; 37 (7): 2113–31.
  • [4] Falk, T., Mai, D., Bensch, R., Çiçek, Ö., Abdulkadir, A., Marrakchi, Y., Böhm, A., Deubner, J., Jäckel, Z., Seiwald, K. and Dovzhenko, A., 2019. U-Net: deep learning for cell counting, detection, and morphometry. Nature methods, 16(1), pp.67-70.
  • [5] Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J.N., Wu, Z. and Ding, X., 2020. Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation. Medical Image Analysis, 63, p.101693.
  • [6] Dietterich, T.G., 2000, June. Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer, Berlin, Heidelberg.
  • [7] Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D. and Summers, R.M., 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging, 35(5), pp.1285-1298.

  • [8] Ciompi, F., de Hoop, B., van Riel, S.J., Chung, K., Scholten, E.T., Oudkerk, M., de Jong, P.A., Prokop, M. and van Ginneken, B., 2015. Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box. Medical image analysis, 26(1), pp.195-202.
  • [9] Bengio, Y., 2009. Learning deep architectures for AI. Now Publishers Inc.
  • [10] Zhang, Y. and Yang, Q., 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering.
  • [11] Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J. and Murphy, K., 2018. Progressive neural architecture search. In Proceedings of the European conference on computer vision (ECCV) (pp. 19-34).
  • [12] Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).

  • [13] Chaurasia, A. and Culurciello, E., 2017, December. Linknet: Exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE Visual Communications and Image Processing (VCIP) (pp. 1-4). IEEE.
  • [14] Yu, F., Wang, D., Shelhamer, E. and Darrell, T., 2018. Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2403-2412).
  • [15] Shenoy, A., 2019. Feature optimization of contact map predictions based on inter-residue distances and U-Net++ architecture.
  • [16] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).
  • [17] Meyer, M.G., Hayenga, J.W., Neumann, T., Katdare, R., Presley, C., Steinhauer, D.E., Bell, T.M., Lancaster, C.A. and Nelson, A.C., 2015. The Cell‐CT 3‐dimensional cell imaging technology platform enables the detection of lung cancer using the noninvasive LuCED sputum test. Cancer cytopathology, 123(9), pp.512-523.
  • [18] Zhao, H., Qi, X., Shen, X., Shi, J. and Jia, J., 2018. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European conference on computer vision (ECCV) (pp. 405-420).
  • [19] Jiang, J., Hu, Y.C., Liu, C.J., Halpenny, D., Hellmann, M.D., Deasy, J.O., Mageras, G. and Veeraraghavan, H., 2018. Multiple resolution residually connected feature streams for automatic lung tumor segmentation from CT images. IEEE transactions on medical imaging, 38(1), pp.134-144.

  • [20] Simonyan, K. and Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [21] Zhu, Q., Du, B., Turkbey, B., Choyke, P.L. and Yan, P., 2017, May. Deeply-supervised CNN for prostate segmentation. In 2017 international joint conference on neural networks (IJCNN) (pp. 178-184). IEEE.
  • [22] Dou, Q., Yu, L., Chen, H., Jin, Y., Yang, X., Qin, J. and Heng, P.A., 2017. 3D deeply supervised network for automated segmentation of volumetric medical images. Medical image analysis, 41, pp.40-54.
  • [23] Kistler, M., Bonaretti, S., Pfahrer, M., Niklaus, R. and Büchler, P., 2013. The virtual skeleton database: an open access repository for biomedical research and collaboration. Journal of medical Internet research, 15(11), p.e2930.
  • [24] Zyuzin, V., Sergey, P., Mukhtarov, A., Chumarnaya, T., Solovyova, O., Bobkova, A. and Myasnikov, V., 2018, May. Identification of the left ventricle endocardial border on two-dimensional ultrasound images using the convolutional neural network Unet. In 2018 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT) (pp. 76-78). IEEE.
  • [25] Cui, H., Liu, X. and Huang, N., 2019, October. Pulmonary vessel segmentation based on orthogonal fused u-net++ of chest CT images. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 293-300). Springer, Cham.
  • [26] Sun, K., Xiao, B., Liu, D. and Wang, J., 2019. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5693-5703).

  • [27] Peng, D., Zhang, Y. and Guan, H., 2019. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sensing, 11(11), p.1382.
  • [28] Zhang, Y., Gong, W., Sun, J. and Li, W., 2019. Web-Net: A novel nest networks with ultra-hierarchical sampling for building extraction from aerial imageries. Remote Sensing, 11(16), p.1897.
  • [29] Zhang, J., Jin, Y., Xu, J., Xu, X. and Zhang, Y., 2018. Mdu-net: Multi-scale densely connected u-net for biomedical image segmentation. arXiv preprint arXiv:1812.00352.
  • [30] Chen, F., Ding, Y., Wu, Z., Wu, D. and Wen, J., 2018, December. An improved framework called Du++ applied to brain tumor segmentation. In 2018 15th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP) (pp. 85-88). IEEE.
  • [31] Zhou, C., Chen, S., Ding, C. and Tao, D., 2018, September. Learning contextual and attentive information for brain tumor segmentation. In International MICCAI brainlesion workshop (pp. 497-507). Springer, Cham.
  • [32] Wu, S., Wang, Z., Liu, C., Zhu, C., Wu, S. and Xiao, K., 2019, July. Automatical segmentation of pelvic organs after hysterectomy by using dilated convolution u-net++. In 2019 IEEE 19th International Conference on Software Quality, Reliability and Security Companion (QRS-C) (pp. 362-367). IEEE.
  • [33] Song, T., Meng, F., Rodriguez-Paton, A., Li, P., Zheng, P. and Wang, X., 2019. U-next: A novel convolution neural network with an aggregation u-net architecture for gallstone segmentation in ct images. IEEE Access, 7, pp.166823-166832.
  • [34] Yang, C. and Gao, F., 2019, October. EDA-Net: dense aggregation of deep and shallow information achieves quantitative photoacoustic blood oxygenation imaging deep in human breast. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 246-254). Springer, Cham.
  • [35] Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N. and Liang, J., 2018. Unet++: A nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support (pp. 3-11). Springer, Cham.
  • [36] Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W. and Wang, J., 2019. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514.
  • [37] Fang, Y., Chen, C., Yuan, Y. and Tong, K.Y., 2019, October. Selective feature aggregation network with area-boundary constraints for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 302-310). Springer, Cham.
  • [38] Fang, J., Zhang, Y., Xie, K., Yuan, S. and Chen, Q., 2019, October. An improved mpb-cnn segmentation method for edema area and neurosensory retinal detachment in sd-oct images. In International Workshop on Ophthalmic Medical Image Analysis (pp. 130-138). Springer, Cham.
  • [39] Meng, C., Sun, K., Guan, S., Wang, Q., Zong, R. and Liu, L., 2020. Multiscale dense convolutional neural network for DSA cerebrovascular segmentation. Neurocomputing, 373, pp.123-134.
  • [40] Hadsell, R., Chopra, S. and LeCun, Y., 2006, June. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06) (Vol. 2, pp. 1735-1742). IEEE.
  • [41] Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020, November. A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). PMLR.
  • [42] Chen, T., Kornblith, S., Swersky, K., Norouzi, M. and Hinton, G.E., 2020. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems, 33, pp.22243-22255.
  • [43] Dosovitskiy, A., Springenberg, J.T., Riedmiller, M. and Brox, T., 2014. Discriminative unsupervised feature learning with convolutional neural networks. Advances in neural information processing systems, 27.
  • [44] He, K., Fan, H., Wu, Y., Xie, S. and Girshick, R., 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729-9738).
  • [45] Li, J., Zhou, P., Xiong, C. and Hoi, S.C., 2020. Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966.
  • [46] Wu, Z., Xiong, Y., Yu, S.X. and Lin, D., 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3733-3742).
  • [47] Chaitanya, K., Erdil, E., Karani, N. and Konukoglu, E., 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. Advances in Neural Information Processing Systems, 33, pp.12546-12558.
  • [48] Ye, M., Zhang, X., Yuen, P.C. and Chang, S.F., 2019. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6210-6219).
  • [49] O Pinheiro, P.O., Almahairi, A., Benmalek, R., Golemo, F. and Courville, A.C., 2020. Unsupervised learning of dense visual representations. Advances in Neural Information Processing Systems, 33, pp.4489-4500.

  • [50] Wang, X., Zhang, R., Shen, C., Kong, T. and Li, L., 2021. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3024-3033).
  • [51] Ding, J., Xie, E., Xu, H., Jiang, C., Li, Z., Luo, P. and Xia, G.S., 2021. Unsupervised pretraining for object detection by patch reidentification. arXiv preprint arXiv:2103.04814.
  • [52] Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S. and Hu, H., 2021. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16684-16693).
  • [53] Tian, Y., Krishnan, D. and Isola, P., 2020, August. Contrastive multiview coding. In European conference on computer vision (pp. 776-794). Springer, Cham.
  • [54] Kalantidis, Y., Sariyildiz, M.B., Pion, N., Weinzaepfel, P. and Larlus, D., 2020. Hard negative mixing for contrastive learning. Advances in Neural Information Processing Systems, 33, pp.21798-21809.
  • [55] Robinson, J., Chuang, C.Y., Sra, S. and Jegelka, S., 2020. Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592.
  • [56] Farabet, C., Couprie, C., Najman, L. and LeCun, Y., 2012. Learning hierarchical features for scene labeling. IEEE transactions on pattern analysis and machine intelligence, 35(8), pp.1915-1929.
  • [57] Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).
  • [58] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K. and Yuille, A.L., 2014. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062.
  • [59] Chen, L.C., Papandreou, G., Schroff, F. and Adam, H., 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
  • [60] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F. and Adam, H., 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 801-818).
  • [61] Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z. and Lu, H., 2019. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3146-3154).
  • [62] Vemulapalli, R., Tuzel, O., Liu, M.Y. and Chellapa, R., 2016. Gaussian conditional random field network for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3224-3233).
  • [63] Yuan, Y., Chen, X. and Wang, J., 2020, August. Object-contextual representations for semantic segmentation. In European conference on computer vision (pp. 173-190). Springer, Cham.
  • [64] Yuan, Y., Huang, L., Guo, J., Zhang, C., Chen, X. and Wang, J., 2018. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916.
  • [65] Zhang, H., Zhang, H., Wang, C. and Xie, J., 2019. Co-occurrent features in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 548-557).
  • [66] Zhao, H., Shi, J., Qi, X., Wang, X. and Jia, J., 2017. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2881-2890).
  • [67] De Brabandere, B., Neven, D. and Van Gool, L., 2017. Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551.
  • [68] Fathi, A., Wojna, Z., Rathod, V., Wang, P., Song, H.O., Guadarrama, S. and Murphy, K.P., 2017. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277.
  • [69] Fathi, A., Wojna, Z., Rathod, V., Wang, P., Song, H.O., Guadarrama, S. and Murphy, K.P., 2017. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277.
  • [70] Ke, T.W., Hwang, J.J., Liu, Z. and Yu, S.X., 2018. Adaptive affinity fields for semantic segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 587-602).
  • [71] Zhao, S., Wang, Y., Yang, Z. and Cai, D., 2019. Region mutual information loss for semantic segmentation. Advances in Neural Information Processing Systems, 32.
  • [72] Hwang, J.J., Yu, S.X., Shi, J., Collins, M.D., Yang, T.J., Zhang, X. and Chen, L.C., 2019. Segsort: Segmentation by discriminative sorting of segments. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7334-7344).
  • [73] Hadsell, R., Chopra, S. and LeCun, Y., 2006, June. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06) (Vol. 2, pp. 1735-1742). IEEE.
  • [74] Kong, S. and Fowlkes, C.C., 2018. Recurrent pixel embedding for instance grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9018-9028).
  • [75] Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N. and Liang, J., 2019. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE transactions on medical imaging, 39(6), pp.1856-1867.
  • [76] Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D. and Batra, D., 2016. Grad-CAM: Why did you say that?. arXiv preprint arXiv:1611.07450.