Pairwise Relation Learning for Semi-supervised Gland Segmentation

08/06/2020 ∙ by Yutong Xie, et al.

Accurate and automated gland segmentation on histology tissue images is an essential but challenging task in the computer-aided diagnosis of adenocarcinoma. Despite their prevalence, deep learning models typically require a large number of densely annotated training images, which are difficult to obtain because annotating histology images demands extensive labor and expert knowledge. In this paper, we propose the pairwise relation-based semi-supervised (PRS^2) model for gland segmentation on histology images. This model consists of a segmentation network (S-Net) and a pairwise relation network (PR-Net). The S-Net is trained on labeled data for segmentation, and the PR-Net is trained on both labeled and unlabeled data in an unsupervised way to enhance its image representation ability by exploiting the semantic consistency between each pair of images in the feature space. Since both networks share their encoders, the image representation ability learned by PR-Net can be transferred to S-Net to improve its segmentation performance. We also design the object-level Dice loss to address the issues caused by touching glands and combine it with two other loss functions for S-Net. We evaluated our model against six recent methods on the GlaS dataset and three recent methods on the CRAG dataset. Our results not only demonstrate the effectiveness of the proposed PR-Net and object-level Dice loss, but also indicate that our PRS^2 model achieves state-of-the-art gland segmentation performance on both benchmarks.




1 Introduction

Quantitative measurement of glands on histology tissue images is an effective means to assist pathologists in diagnosing the malignancy of adenocarcinoma [6]. Manual annotation of glands requires specialized knowledge and intense concentration, and is often time-consuming. Automated gland segmentation avoids many of these issues and provides pathologists with an unprecedented ability to reliably characterise and quantify glands. Although it has been increasingly studied to improve accuracy, efficiency and objectivity [7, 19, 16], this task remains challenging mainly due to (1) inadequate training data with pixel-wise dense annotations and (2) small gaps and adhesive edges between adjacent glands.

Currently, most available gland segmentation methods are based on deep convolutional neural networks (DCNNs) [7, 19, 2, 16, 18, 13]. Chen et al. [2] presented a deep contour-aware network that harnesses multi-scale features to separate glands from the background and also employs the complementary information of contours to delineate each gland. Qu et al. [13] proposed a full-resolution convolutional neural network to improve gland localization and introduced a variance-constrained cross-entropy loss to improve the shape similarity of glands. Yan et al. [19] developed a shape-aware adversarial learning model for simultaneous gland segmentation and contour detection. Although these DCNN-based methods outperform earlier solutions, their performance depends heavily on a substantial number of training images with pixel-wise labels, which are difficult to obtain due to the tremendous effort and cost of densely annotating histology images.

To alleviate the burden of data annotation, semi-supervised segmentation models have been developed to jointly use labeled and unlabeled data for co-training [17]. Recent semi-supervised learning (SSL) methods are usually based on consistency regularization [12]. Specifically, unlabeled data are exploited according to the smoothness assumption that certain perturbations of an input should not significantly change the prediction [12, 9, 4, 10, 20]. Nevertheless, these methods only measure the consistency between different perturbations of the same input image. In fact, different images may contain the same kind of foreground objects (e.g., glands), and the objects on two images may share consistent representations in the feature space as long as they have the same semantic label. We argue that such pairwise consistency should be exploited to learn generalized feature representations from unlabeled data in an unsupervised way.

In this paper, we propose the pairwise relation-based semi-supervised (PRS^2) model for gland segmentation on histology tissue images. This model is composed of a supervised segmentation network (S-Net) and an unsupervised pairwise relation network (PR-Net). The PR-Net is trained to boost its ability to learn both semantic consistency and image representation by exploiting the semantic consistency between each pair of images in the feature space. Since the encoders of S-Net and PR-Net share parameters, the ability learned by PR-Net can be transferred to S-Net to improve its segmentation performance. Meanwhile, we employ the object-level Dice loss to impose additional constraints on each individual gland, and thus address the issues caused by touching glands. The object-level Dice was originally proposed in [14] as a performance metric, not as a loss function. We transform it into a loss and combine it with the pixel-level cross-entropy loss and the global-level Dice loss to form a multi-level loss for S-Net. We evaluate the proposed PRS^2 model on the GlaS Challenge dataset and the CRAG dataset and achieve superior performance over several recently published gland segmentation models.

The contributions include: (1) proposing the pairwise relation interaction to exploit the semantic consistency between each pair of images in the feature space, enabling the model to learn semantic consistency and image representation in an unsupervised way; (2) transforming the object-level Dice evaluation metric into a loss and employing it to address the issues caused by touching glands; and (3) constructing the PRS^2 model that achieves state-of-the-art gland segmentation performance on two benchmarks.

2 Method

The proposed PRS^2 model has two major modules: the S-Net for supervised gland segmentation and the PR-Net for unsupervised semantic relation learning (see Fig. 1). Let the labeled training set with $N_L$ images be denoted by $X_L$, the unlabeled training set with $N_U$ images be denoted by $X_U$, and the whole training image set be denoted by $X = X_L \cup X_U$. The pipeline of this model can be summarized in two steps. First, the S-Net is trained on $X_L$ for initialization. Since the encoders of both networks share the same architecture and parameters, the encoder of PR-Net is also initialized in this step. Then, both the S-Net and the PR-Net are jointly fine-tuned on $X$ with the parameter-sharing mechanism.

Figure 1: Diagram of the proposed PRS^2 model.

S-Net. We use the DeepLabv3+ model [3] pretrained on the PASCAL VOC 2012 dataset [5] as S-Net. To adapt DeepLabv3+ to our task, we replace its last, task-specific convolutional layer with a convolutional layer that has two output neurons, one for glands and one for background. The weights of this layer are randomly initialized, and its activation is set to the softmax function.

We design the following multi-level segmentation loss for S-Net:

$$\mathcal{L}_{seg} = \mathcal{L}_{ce} + \mathcal{L}_{dice} + \mathcal{L}_{obj}$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss that optimizes pixel-level accuracy, $\mathcal{L}_{dice}$ is the Dice loss that optimizes the overlap between the prediction and ground truth, and $\mathcal{L}_{obj}$ is the object-level Dice loss. Combining the first two losses is common in medical image segmentation applications and has achieved remarkable success [15, 21]. However, gland segmentation requires not only segmenting the glands from the background, but also separating each individual gland from the others. The latter requirement is quite challenging due to the existence of touching glands. To address this challenge, we propose the object-level Dice loss as follows

$$\mathcal{L}_{obj} = 1 - \frac{1}{2}\left[\frac{1}{N_S}\sum_{i=1}^{N_S} \mathrm{Dice}(S_i, G_i) + \frac{1}{N_G}\sum_{j=1}^{N_G} \mathrm{Dice}(\tilde{G}_j, \tilde{S}_j)\right]$$

where $S_i$ is the $i$th segmented gland, $G_i$ is the ground truth gland that maximally overlaps $S_i$, $\tilde{G}_j$ is the $j$th ground truth gland, and $\tilde{S}_j$ is the segmented gland that maximally overlaps $\tilde{G}_j$. $N_G$ and $N_S$ denote the total number of ground truth glands and segmented glands for an input image, respectively. In this definition, the first term measures how well each segmented gland overlaps its corresponding ground truth, whereas the second term measures how well each ground truth gland overlaps its corresponding segmented gland. This loss function considers the instance-level discrepancy between a segmentation result and its ground truth, and thus helps S-Net learn more discriminative feature representations for gland segmentation.
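The effect of the object-level term on touching glands can be illustrated with a minimal sketch. It works on hard instance label maps (nested lists of integers, 0 for background) rather than on differentiable network outputs, and it assumes that the per-gland Dice scores are averaged uniformly within each of the two terms. The function names are hypothetical and chosen only for this illustration.

```python
def dice(a, b):
    """Dice coefficient between two sets of pixel coordinates."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def instances(label_map):
    """Group pixel coordinates by instance id (0 is background)."""
    objs = {}
    for r, row in enumerate(label_map):
        for c, k in enumerate(row):
            if k != 0:
                objs.setdefault(k, set()).add((r, c))
    return list(objs.values())


def best_match(obj, candidates):
    """Candidate with the largest pixel overlap; empty set if none overlaps."""
    best = max(candidates, key=lambda cand: len(obj & cand), default=set())
    return best if obj & best else set()


def one_sided(objs, others):
    """Mean Dice between each object and its maximally overlapping partner."""
    if not objs:
        return 1.0 if not others else 0.0
    return sum(dice(o, best_match(o, others)) for o in objs) / len(objs)


def object_dice_loss(seg_map, gt_map):
    """1 - average of the segmentation-to-truth and truth-to-segmentation terms."""
    seg, gt = instances(seg_map), instances(gt_map)
    return 1.0 - 0.5 * (one_sided(seg, gt) + one_sided(gt, seg))


# Two touching ground truth glands that the prediction merges into one object.
gt_map = [[1, 1, 2, 2],
          [1, 1, 2, 2]]
merged = [[1, 1, 1, 1],
          [1, 1, 1, 1]]
print(object_dice_loss(gt_map, gt_map))  # perfect instance separation
print(object_dice_loss(merged, gt_map))  # merged glands are penalized
```

In this toy case every foreground pixel is predicted correctly, so a pixel-level or global Dice criterion sees no error, while the object-level term still penalizes the merged prediction.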

PR-Net. The PR-Net exploits the semantic consistency between each pair of images for unsupervised pairwise relation learning. It is composed of three modules: (1) an image pair input layer, (2) an encoder for feature extraction, and (3) a pairwise relation module (PRM). The input layer accepts a pair of images ($x_a$, $x_b$), which are randomly sampled from the whole training set $X$, as input. The encoder shares its architecture and parameters with the encoder of S-Net (i.e., the modified aligned Xception), and its output can be formally expressed as

$$f_a = E(x_a; \theta), \quad f_b = E(x_b; \theta), \quad f_a, f_b \in \mathbb{R}^{C \times H \times W}$$

where $E$ denotes the encoder, $\theta$ denotes the parameters of the encoder, and $C$, $H$, and $W$ denote respectively the number of channels, height, and width of the encoded feature representation.

The PRM is proposed to highlight the targets that belong to the same semantic class but are located on two different images. To this end, we first calculate the consistency relation matrix from $f_a$ to $f_b$ as follows

$$R_{a \to b} = \mathrm{softmax}\left(r(f_a)^{\top} r(f_b)\right) \in \mathbb{R}^{HW \times HW}$$

where $r(\cdot)$ represents a reshape function which collapses the $H$ and $W$ dimensions into a single dimension with $HW$ elements, and the softmax function normalizes the elements in the second dimension. The entry $R_{a \to b}^{(i,j)}$ measures the consistency of the $i$th flattened ‘pixel’ (in the feature representation space) of $f_a$ to the $j$th ‘pixel’ of $f_b$, where a larger $R_{a \to b}^{(i,j)}$ indicates a higher semantic consistency between these two ‘pixels’.

Next, we perform a matrix multiplication between $r(f_b)$ and $R_{a \to b}^{\top}$ to obtain the attention map of $f_a$, formulated as

$$A_a = r^{-1}\left(r(f_b)\, R_{a \to b}^{\top}\right)$$

where $r^{-1}(\cdot)$ is the reverse operation of $r(\cdot)$. Each element in $A_a$ can be considered as a weighted sum of $f_b$ over all positions, where the weights are determined by $R_{a \to b}$. Finally, we add $A_a$ to the feature map $f_a$ via an element-wise summation to obtain the target-highlighted feature map $\hat{f}_a$, shown as follows

$$\hat{f}_a = f_a + A_a$$

Similarly, $\hat{f}_b$ can be calculated as

$$\hat{f}_b = f_b + r^{-1}\left(r(f_a)\, R_{b \to a}^{\top}\right), \quad R_{b \to a} = \mathrm{softmax}\left(r(f_b)^{\top} r(f_a)\right)$$

Both target-highlighted feature maps $\hat{f}_a$ and $\hat{f}_b$ carry the consistency relation information between $f_a$ and $f_b$, and thus can serve as the targets of PR-Net to force the model to increase the semantic consistency for any pair of image feature maps. Hence, the loss function of PR-Net can be expressed as

$$\mathcal{L}_{PR} = \ell\left(\sigma(f_a), \sigma(\hat{f}_a)\right) + \ell\left(\sigma(f_b), \sigma(\hat{f}_b)\right)$$

where $\ell$ is the smooth $\ell_1$ loss, and $\sigma$ is the sigmoid function. Both $\hat{f}_a$ and $\hat{f}_b$, serving as the target signals, are excluded from back-propagation in each iteration. We also randomly select a pair of images and visualize the channel-wise sums of their $f$ and $\hat{f}$ in Fig. 2 to show the superiority of $\hat{f}$.

Figure 2: A pair of images and the corresponding channel-wise sums of $f$ and $\hat{f}$
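The pairwise relation computation described above can be sketched without any deep learning framework. The following dependency-free example operates on feature maps that are already flattened to C channels of H*W values each. It assumes that the relation scores are plain dot products across channels, that the smooth L1 loss switches from quadratic to linear at 1, and that each feature map is compared with its own target-highlighted version; these choices and all function names are assumptions made for illustration. In a real training framework the targets would additionally be detached from the computation graph.

```python
import math


def softmax_rows(scores):
    """Normalize each row to sum to one (softmax over the second dimension)."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def relation_matrix(fa, fb):
    """R[i][j]: consistency of position i in fa to position j in fb."""
    n_a, n_b = len(fa[0]), len(fb[0])
    scores = [[sum(ca[i] * cb[j] for ca, cb in zip(fa, fb)) for j in range(n_b)]
              for i in range(n_a)]
    return softmax_rows(scores)


def highlight(fa, fb):
    """Return fa plus its attention map, a relation-weighted sum of fb."""
    rel = relation_matrix(fa, fb)
    attention = [[sum(w * v for w, v in zip(rel_row, channel)) for rel_row in rel]
                 for channel in fb]
    return [[x + a for x, a in zip(ch_f, ch_a)] for ch_f, ch_a in zip(fa, attention)]


def sigmoid_map(f):
    """Apply the sigmoid function to every element of a flattened feature map."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in channel] for channel in f]


def smooth_l1(pred, target):
    """Mean smooth L1 distance (quadratic below 1, linear above)."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(0.5 * d * d if d < 1.0 else d - 0.5 for d in diffs) / len(diffs)


def pr_loss(fa, fb):
    """Pull each feature map towards its fixed target-highlighted version."""
    target_a, target_b = highlight(fa, fb), highlight(fb, fa)  # treated as constants
    return (smooth_l1(sigmoid_map(fa), sigmoid_map(target_a))
            + smooth_l1(sigmoid_map(fb), sigmoid_map(target_b)))


# Toy features with C = 2 channels and H*W = 2 positions.
fa = [[1.0, 2.0], [0.0, -1.0]]
fb = [[0.5, -0.5], [1.5, 0.0]]
print(relation_matrix(fa, fb))
print(pr_loss(fa, fb))
```

Each row of the relation matrix sums to one, so every position of the attention map is a convex combination of the other image's features at all positions.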

Optimization of the PRS^2 model. The total loss of the proposed PRS^2 model is defined as the weighted sum of the multi-level segmentation loss and the unsupervised semantic consistency loss such that

$$\mathcal{L} = \mathcal{L}_{seg} + \lambda\, \mathcal{L}_{PR}$$

where $\mathcal{L}_{seg}$ is the multi-level segmentation loss of S-Net, $\mathcal{L}_{PR}$ is the unsupervised semantic consistency loss of PR-Net, and $\lambda$ is a weighting factor that controls the contribution of the unsupervised loss. We adopt the Adam algorithm [11] with batch sizes of 5 and 10 to train S-Net and PR-Net, respectively, and also set aside 20% of the training set as a validation set to monitor the performance of both networks. The initial learning rate is set to 1e-4 in the initialization step and 5e-5 in the fine-tuning step.

3 Experiments and Results

Materials. We adopted the 2015 MICCAI Gland Segmentation (GlaS) challenge dataset [14] and the colorectal adenocarcinoma gland (CRAG) dataset [7, 1] to evaluate the proposed PRS^2 model. The GlaS dataset contains 85 training and 80 test images (60 in Part A; 20 in Part B). The CRAG dataset has 173 training and 40 test images. When evaluating PRS^2 on the GlaS test set, the CRAG training set was used as unlabeled training data, and vice versa.

Evaluation Metrics. On the GlaS dataset, three metrics officially suggested by the GlaS Challenge [14] were calculated to assess the segmentation performance: the object-level Dice (Obj-D), which represents the accuracy of delineating each individual gland; the object-level F1 score (Obj-F), which evaluates the accuracy of detecting each gland; and the object-level Hausdorff distance (Obj-H), which measures the shape similarity between each segmented gland and its ground truth. In addition, all competing segmentation models were ranked according to each of these three metrics, and the sum of the three ranking scores was calculated to measure the overall performance of each model. Note that a lower ranking score indicates better segmentation performance.

Implementation Details. In the training stage, we followed the suggestion in [16] to randomly crop patches from each training image as the input of both S-Net and PR-Net, using a fixed patch size for each dataset. When training the PRS^2 model, the patches from the unlabeled dataset were resized to the patch size of the labeled dataset, i.e., CRAG patches were resized to the GlaS patch size if the labeled samples came from the GlaS dataset, and GlaS patches were resized to the CRAG patch size if the labeled samples came from the CRAG dataset. To further enlarge the training dataset, we employed online data augmentation, including random rotation, shear, shift, zooming, horizontal/vertical flipping, and color normalization. In the test stage, test-time augmentations, including cropping, horizontal/vertical flipping, and rotation, were also used to improve the robustness of segmentation. As a result, each segmentation result is the average of the results obtained on the original image and its three types of augmented copies. Finally, morphological opening with a square structuring element was performed to smooth the segmentation results.
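The test-time averaging step can be sketched as follows. This is a simplified illustration rather than the exact procedure used in the experiments: it assumes the augmented copies are a horizontal flip, a vertical flip, and a 180-degree rotation (cropping is left out), that images and probability maps are nested lists of equal size, and that `predict` is a hypothetical callable standing in for the trained segmentation network.

```python
def identity(img):
    return [row[:] for row in img]


def hflip(img):
    """Mirror left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror top to bottom."""
    return [row[:] for row in img[::-1]]


def rot180(img):
    """Rotate by 180 degrees."""
    return [row[::-1] for row in img[::-1]]


def tta_predict(predict, image):
    """Average `predict` over the image and three augmented copies.

    Each entry pairs an augmentation with the inverse transform that maps
    the prediction back to the original orientation before averaging.
    """
    views = [(identity, identity), (hflip, hflip), (vflip, vflip), (rot180, rot180)]
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    for forward, inverse in views:
        restored = inverse(predict(forward(image)))
        for r in range(height):
            for c in range(width):
                total[r][c] += restored[r][c]
    return [[v / len(views) for v in row] for row in total]


def corner_model(img):
    """Toy stand-in for a network: always fires at the top-left pixel only."""
    return [[1.0 if (r == 0 and c == 0) else 0.0 for c in range(len(img[0]))]
            for r in range(len(img))]


image = [[1.0, 2.0], [3.0, 4.0]]
print(tta_predict(corner_model, image))
```

The toy model shows why the inverse transform matters: its single response is mapped back to a different corner for each view, so the averaged map spreads the response over all four corners instead of stacking it in one place.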

Results on Two Datasets. On the GlaS dataset, we compared the proposed PRS^2 model to six recently published gland segmentation models, including the deep contour-aware network (DCAN) [2], the minimal information loss dilated network (MILD-Net) [7], the shape-aware adversarial learning (SADL) model [19], the rotation equivariant network (Rota-Net) [8], the full resolution convolutional neural network (FullNet) [13], and the deep segmentation-emendation (DSE) model [16]. On the CRAG dataset, we compared our model to three models, i.e., DCAN, MILD-Net, and DSE. The performance of these models is given in Table 1. Note that the performance of all competing models was taken from the literature, and the performance on the GlaS dataset is the average performance on test data Part A and Part B. Finally, we also compared our model to a typical semi-supervised (SS) method on both datasets, i.e., using a trained S-Net to generate segmentation predictions for unlabeled data and using a CRF-like approach to generate the proxy labels for fine-tuning the S-Net.

Table 1 shows that our model achieves the highest Obj-D, second highest Obj-F, and lowest Obj-H on the GlaS dataset. Compared with the DSE model, which performs second best, our model improves the Obj-D by 0.7% and reduces the Obj-H by 0.8. On the CRAG dataset, our model achieves the highest Obj-D, highest Obj-F, and lowest Obj-H, improving the Obj-D, Obj-F and Obj-H from 88.9%, 83.5% and 120.1, which were achieved by the second best model, to 89.2%, 84.3% and 113.1, respectively. The results on both datasets indicate that the proposed PRS^2 model produces more accurate gland segmentation and that its performance is relatively robust.

Dataset  Method      Obj-D (%)  Rank  Obj-F (%)  Rank  Obj-H  Rank  Rank sum
GlaS     DCAN        83.9       8     81.4       8     102.9  8     24
GlaS     MILD-Net    87.5       6     87.9       5     73.7   6     17
GlaS     SADL        87.3       7     88.9       3     76.7   7     17
GlaS     Rota-Net    88.4       5     87.2       6     68.4   5     16
GlaS     FullNet     88.5       4     88.9       3     63.0   4     11
GlaS     DSE         89.9       2     89.4       1     55.9   2     5
GlaS     SS          89.6       3     86.9       7     62.8   3     13
GlaS     Our PRS^2   90.6       1     89.0       2     55.1   1     4
CRAG     DCAN        79.4       5     73.6       5     218.8  5     15
CRAG     MILD-Net    87.5       4     82.5       3     160.1  4     11
CRAG     DSE         88.9       2     83.5       2     120.1  2     6
CRAG     SS          87.6       3     81.6       4     145.0  3     10
CRAG     Our PRS^2   89.2       1     84.3       1     113.1  1     3
Table 1: Gland segmentation performance of the proposed PRS^2 model and recently published models on both GlaS and CRAG datasets. Rank denotes the ranking score for the metric to its left. Note that the performance on the GlaS dataset is the average performance on test data Part A and Part B.
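The ranking scheme behind the rank sums can be reproduced with a few lines of Python. The sketch below copies the GlaS metric values from Table 1 and assumes competition-style ranking, in which tied scores share a rank and the following rank is skipped (this matches the two models tied at rank 3 for Obj-F). The function names are hypothetical.

```python
def competition_ranks(scores, higher_is_better=True):
    """Rank 1 is best; tied scores share a rank and the next rank is skipped."""
    ordered = sorted(scores.values(), reverse=higher_is_better)
    return {name: ordered.index(value) + 1 for name, value in scores.items()}


def rank_sums(table, higher_is_better=(True, True, False)):
    """Sum the per-metric ranks of every method (lower is better)."""
    totals = dict.fromkeys(table, 0)
    for column, flag in enumerate(higher_is_better):
        column_scores = {name: row[column] for name, row in table.items()}
        for name, rank in competition_ranks(column_scores, flag).items():
            totals[name] += rank
    return totals


# (Obj-D %, Obj-F %, Obj-H) on the GlaS test set, copied from Table 1.
GLAS = {
    "DCAN": (83.9, 81.4, 102.9),
    "MILD-Net": (87.5, 87.9, 73.7),
    "SADL": (87.3, 88.9, 76.7),
    "Rota-Net": (88.4, 87.2, 68.4),
    "FullNet": (88.5, 88.9, 63.0),
    "DSE": (89.9, 89.4, 55.9),
    "SS": (89.6, 86.9, 62.8),
    "PRS^2": (90.6, 89.0, 55.1),
}

for name, total in sorted(rank_sums(GLAS).items(), key=lambda item: item[1]):
    print(f"{name:10s} {total}")
```

Obj-D and Obj-F are ranked in descending order and Obj-H in ascending order, since a smaller Hausdorff distance indicates a better shape match.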

4 Discussion

Trade-off between labeled and unlabeled data. A major advantage of our PRS^2 model is that it uses unlabeled images to facilitate model training, leading to (1) a lower requirement for densely annotated training data or (2) improved segmentation performance when the labeled training dataset is small. To validate this, we kept the test set and unlabeled training set unchanged and randomly selected 20% and 50% of the labeled training images, respectively, to repeat the segmentation experiments on both datasets. As a control, we also used the selected labeled training images to train S-Net in a fully-supervised manner. The segmentation performance of our PRS^2 model and S-Net is shown in Fig. 3, from which three conclusions can be drawn. First, the segmentation performance of both models improves as the number of labeled training images increases. Second, by using both labeled and unlabeled images, our model consistently outperforms the fully-supervised S-Net no matter how many labeled training images are used. Third, and more importantly, our model trained with 50% of the labeled images achieves performance comparable to that of the fully-supervised S-Net trained with 100% of the training data on both datasets. Similarly, our model trained with 20% of the labeled images achieves performance comparable to that of the S-Net trained with 50% of the training data. This suggests that our model makes it possible to replace almost half of the labeled training images with unlabeled data while maintaining the segmentation performance.

Figure 3: Obj-D and Obj-F values achieved on two datasets by our semi-supervised PRS^2 model and the fully-supervised S-Net when 20%, 50% and 100% of the labeled training images are used

Multi-level segmentation loss. To demonstrate the performance gain resulting from the proposed multi-level segmentation loss, we also trained the S-Net with different loss functions, including the cross-entropy loss, the Dice loss, and their combination. The results in Table 2 reveal that (1) using the combination of Dice and cross-entropy loss can produce higher Obj-F than using the Dice loss or cross-entropy loss alone, and (2) the superior performance of our multi-level loss over the combination of Dice and cross-entropy loss confirms the effectiveness of using the object-level Dice loss to impose constraints on each individual gland.

Loss functions   GlaS dataset                      CRAG dataset
                 Obj-D (%)  Obj-F (%)  Obj-H       Obj-D (%)  Obj-F (%)  Obj-H
                 86.5       86.2       75.3        84.7       78.9       174.9
                 88.4       86.1       65.5        86.7       77.8       139.5
                 88.7       86.0       66.5        86.3       80.3       157.3
(Ours)           89.4       86.5       64.1        87.1       82.1       138.6
Table 2: Gland segmentation performance of S-Net obtained on two datasets when using different loss functions.

Complexity. The two parameter-sharing DCNNs in our PRS^2 model are trained using the open-source PyTorch software package. In our experiments, it took about 12 hours to train our PRS^2 model (2 hours for the initialization step and 10 hours for the fine-tuning step) and less than 1 second to segment each test image on a server with 4 NVIDIA GeForce RTX 2080 Ti GPUs and 128 GB of memory.

5 Conclusion

In this paper, we propose the PRS^2 model for gland segmentation on histology tissue images, which consists of a supervised segmentation network with a newly designed loss and an unsupervised PR-Net that boosts its image representation ability by exploiting the semantic consistency between each pair of images in the feature space. Our results indicate that this model outperforms six recent methods on the GlaS dataset and three recent methods on the CRAG dataset. Our ablation study suggests the effectiveness of the proposed loss and PR-Net. Although our model is built upon the specific application of gland segmentation, the pairwise relation-based semi-supervised strategy itself is generic and can potentially be applied to other deep model-based medical image segmentation tasks to reduce the requirement for densely annotated training images.


Y. Xie, J. Zhang, and Y. Xia were supported in part by the National Natural Science Foundation of China under Grant 61771397, in part by the Science and Technology Innovation Committee of Shenzhen Municipality, China, under Grant JCYJ20180306171334997, and in part by the Innovation Foundation for Doctor Dissertation of Northwestern Polytechnical University under Grant CX202010.


  • [1] R. Awan, K. Sirinukunwattana, D. Epstein, S. Jefferyes, U. Qidwai, Z. Aftab, I. Mujeeb, D. Snead, and N. Rajpoot (2017) Glandular morphometrics for objective grading of colorectal adenocarcinoma histology images. Scientific Reports 7 (1), pp. 16852.
  • [2] H. Chen, X. Qi, L. Yu, and P. Heng (2016) DCAN: deep contour-aware networks for accurate gland segmentation. In IEEE International Conference on Computer Vision (ICCV), pp. 2487–2496.
  • [3] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV).
  • [4] W. Cui, Y. Liu, Y. Li, M. Guo, Y. Li, X. Li, T. Wang, X. Zeng, and C. Ye (2019) Semi-supervised brain lesion segmentation with an adapted mean teacher model. In International Conference on Information Processing in Medical Imaging (IPMI), pp. 554–565.
  • [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
  • [6] A. F. Gazdar and A. Maitra (2001) Adenocarcinomas. In Encyclopedia of Genetics, S. Brenner and J. H. Miller (Eds.), pp. 9–12. ISBN 978-0-12-227080-2.
  • [7] S. Graham, H. Chen, J. Gamper, Q. Dou, P. Heng, D. Snead, Y. W. Tsang, and N. Rajpoot (2019) MILD-Net: minimal information loss dilated network for gland instance segmentation in colon histology images. Medical Image Analysis 52, pp. 199–211.
  • [8] S. Graham, D. Epstein, and N. Rajpoot (2019) Rota-Net: rotation equivariant network for simultaneous gland and lumen segmentation in colon histology images. In European Congress on Digital Pathology, pp. 109–116.
  • [9] J. Jeong, S. Lee, J. Kim, and N. Kwak (2019) Consistency-based semi-supervised learning for object detection. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10758–10767.
  • [10] Z. Ke, D. Wang, Q. Yan, J. Ren, and R. W. Lau (2019) Dual student: breaking the limits of the teacher in semi-supervised learning. In IEEE International Conference on Computer Vision (ICCV), pp. 6728–6736.
  • [11] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • [12] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow (2018) Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3235–3246.
  • [13] H. Qu, Z. Yan, G. M. Riedlinger, S. De, and D. N. Metaxas (2019) Improving nuclei/gland instance segmentation in histopathology images by full resolution neural network and spatial constrained loss. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 378–386.
  • [14] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez, et al. (2017) Gland segmentation in colon histology images: the GlaS challenge contest. Medical Image Analysis 35, pp. 489–502.
  • [15] K. C. Wong, M. Moradi, H. Tang, and T. Syeda-Mahmood (2018) 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 612–619.
  • [16] Y. Xie, H. Lu, J. Zhang, C. Shen, and Y. Xia (2019) Deep segmentation-emendation model for gland instance segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 469–477.
  • [17] Y. Xie, J. Zhang, and Y. Xia (2019) Semi-supervised adversarial model for benign–malignant lung nodule classification on chest CT. Medical Image Analysis 57, pp. 237–248.
  • [18] Y. Xu, Y. Li, Y. Wang, M. Liu, Y. Fan, M. Lai, I. Eric, and C. Chang (2017) Gland instance segmentation using deep multichannel neural networks. IEEE Transactions on Biomedical Engineering 64 (12), pp. 2901–2912.
  • [19] Z. Yan, X. Yang, and K. Cheng (2020) Enabling a single deep learning model for accurate gland instance segmentation: a shape-aware adversarial learning framework. IEEE Transactions on Medical Imaging, pp. 1–1. ISSN 1558-254X.
  • [20] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4L: self-supervised semi-supervised learning. In IEEE International Conference on Computer Vision (ICCV), pp. 1476–1485.
  • [21] J. Zhang, Y. Xie, P. Zhang, H. Chen, Y. Xia, and C. Shen (2019) Light-weight hybrid convolutional network for liver tumour segmentation. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 10–16.