Exploring Feature Representation Learning for Semi-supervised Medical Image Segmentation

by Huimin Wu, et al.

This paper presents a simple yet effective two-stage framework for semi-supervised medical image segmentation. Our key insight is to explore feature representation learning with labeled and unlabeled (i.e., pseudo-labeled) images to enhance the segmentation performance. In the first stage, we present an aleatoric uncertainty-aware method, namely AUA, to improve the segmentation performance for generating high-quality pseudo labels. Considering the inherent ambiguity of medical images, AUA adaptively regularizes the consistency on images with low ambiguity. To enhance the representation learning, we propose a stage-adaptive contrastive learning method, including a boundary-aware contrastive loss to regularize the labeled images in the first stage and a prototype-aware contrastive loss to optimize both labeled and pseudo-labeled images in the second stage. The boundary-aware contrastive loss optimizes only pixels around the segmentation boundaries to reduce the computational cost. The prototype-aware contrastive loss fully leverages both labeled and pseudo-labeled images by building a centroid for each class, reducing the computational cost of pair-wise comparison. Our method achieves the best results on two public medical image segmentation benchmarks. Notably, our method outperforms the prior state-of-the-art by 5.7% in Dice for colon tumor segmentation while relying on just 5% labeled data.






I Introduction

Medical image segmentation is a foundational task for computer-aided diagnosis and computer-aided surgery. In recent years, considerable efforts have been devoted to designing neural networks for medical image segmentation, such as U-Net [ronneberger2015u], DenseUNet [li2018h], nnUNet [isensee2020nnu], and HyperDenseNet [dolz2018hyperdense]. However, training these models requires a large amount of labeled images. Unlike natural images, the professional expertise required for pixel-wise manual annotation of medical images makes such labeling challenging and time-consuming, resulting in difficulty obtaining large labeled datasets. Hence, semi-supervised learning, which enables training with both labeled and unlabeled data, has become an active research area for medical image segmentation.

Figure 1: Two toy examples in which (a) visualizes the feature space of an indiscriminative semi-supervised model, and (b) visualizes the feature space of a well-clustered semi-supervised model.

A common assumption of semi-supervised learning is that the decision boundary should not pass through high-density regions. Consistency regularization-based techniques [DBLP:conf/miccai/YuWLFH19, luo2021urpc, li2020transformation] place the decision boundary in a low-density area by penalizing prediction variation under different input perturbations. Entropy minimization-based methods aim to achieve high-confidence predictions for unlabeled data either in an explicit manner [DBLP:conf/nips/GrandvaletB04] or an implicit manner [lee2013pseudo, DBLP:conf/miccai/SedaiARJ0SWG19, DBLP:journals/tmi/FanZJZCFSS20, reiss2021every]. As shown in Figure 1, an ideal model should pull together data points of the same class and push apart data points from different classes in the feature space. As the training set of semi-supervised learning includes labeled and unlabeled images, it is challenging to directly optimize the unlabeled images in the feature space without explicit guidance. We observe that, with unlabeled images, most semi-supervised methods [DBLP:conf/miccai/YuWLFH19, luo2021urpc, li2020transformation] achieve more accurate segmentation results than a model trained with only labeled data. Therefore, the pseudo segmentation predicted by a semi-supervised model on unlabeled data is a promising source of supervision that can be made even more stable and precise.

Motivated by this observation, we present a simple yet effective two-stage framework for semi-supervised medical image segmentation, with the key idea of exploring representation learning for segmentation from both labeled and unlabeled images. The first stage aims to generate high-quality pseudo labels, and the second stage uses the pseudo labels to retrain the network and regularize features for both labeled and unlabeled images. Existing uncertainty-based semi-supervised methods [DBLP:conf/miccai/YuWLFH19, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20] have achieved stunning results by considering the reliability of the supervision for unlabeled images. These methods exploit epistemic uncertainty, the uncertainty about the model's parameters arising from a lack of data, either in the output space [DBLP:conf/miccai/YuWLFH19, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20] or in the feature space [DBLP:conf/miccai/WangZTZSZH20], as guidance for identifying trustworthy supervision. Medical images are often noisy, and the boundaries between tissue types may not be well defined, leading to disagreement among human experts [DBLP:conf/nips/KohlRMFLMERR18, DBLP:conf/miccai/BaumgartnerTCHM19, DBLP:conf/nips/MonteiroFCPMKWG20]. However, aleatoric uncertainty, which represents the inherent ambiguity of the input data and cannot be reduced by obtaining more data, is ignored in these methods.

To obtain high-quality pseudo labels for unlabeled images, we present an Aleatoric Uncertainty Adaptive method, namely AUA, for semi-supervised medical image segmentation. Under the framework of the mean teacher model [tarvainen2017mean], to obtain reliable target supervision for unlabeled data, instead of estimating the model's epistemic uncertainty [DBLP:conf/miccai/YuWLFH19, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20], we explore the aleatoric uncertainty of the model with respect to noisy input data. AUA first measures the spatially correlated aleatoric uncertainty by modeling a multivariate normal distribution over the logit space. To effectively utilize unlabeled images, AUA encourages prediction consistency between the teacher model and the student model by adaptively considering the aleatoric uncertainty of each image. Specifically, the consistency regularization automatically emphasizes input images with lower aleatoric uncertainty, i.e., input images with less ambiguity.

In the second stage, we retrain the network with pseudo labels. To effectively regularize feature representation learning in both stages, we propose stage-adaptive feature regularization, including a boundary-aware contrastive loss in the first stage and a prototype-aware contrastive loss in the second stage. The main idea of the boundary-aware contrastive loss is to fully leverage labeled images for representation learning. A straightforward solution is to pull together pixels of the same class and push apart pixels from different classes using a contrastive loss. However, medical images usually contain a large number of pixels, and naively applying a pairwise contrastive loss would lead to a high computational cost and memory consumption. To this end, we present a boundary-aware contrastive loss, where only randomly sampled pixels from the segmentation boundary are optimized. In the second stage, to effectively utilize both labeled and pseudo-labeled (i.e., unlabeled) images for representation learning, we present a prototype-aware contrastive loss, with each pixel's feature pulled closer to its class centroid, i.e., prototype, and pushed further away from the class centroids it does not belong to. The main intuition is that the trained model can generate pseudo labels for unlabeled images in the second stage. Compared with the boundary-aware contrastive loss, the prototype-aware contrastive loss better leverages the pseudo labels, especially those that may not occur at the segmentation boundaries.

In summary, this paper makes the following contributions:

  • We present AUA, an aleatoric uncertainty adaptive consistency regularization method where the student model can learn from the teacher model in an adaptive manner according to the estimated aleatoric uncertainty.

  • We introduce a stage-aware method to explore feature representation learning in a semi-supervised setting. A boundary-aware contrastive loss is developed to enhance the segmentation with only labeled images, and a prototype-aware contrastive loss is proposed to improve the result with both labeled and pseudo-labeled images.

  • Our method achieves state-of-the-art performance on two public datasets. Ablation studies validate the effectiveness of our proposed AUA and feature representation methods. Our code will be released at https://github.com/XMed-Lab/FRL_SemiMedSeg upon acceptance.

II Related Work

We briefly discuss related works in semi-supervised medical image segmentation, including pseudo labeling and consistency regularization. We also discuss some techniques related to contrastive learning and uncertainty estimation.

II-A Semi-supervised Medical Image Segmentation

Semi-supervised learning (SSL) refers to training a model with both labeled and unlabeled images. For medical image segmentation, early work used graph-based methods [DBLP:journals/tmi/SuYHKZ16, DBLP:conf/icpr/BorgaAL16]. Recently, semi-supervised medical image segmentation has featured deep learning. Existing methods can be broadly classified into two categories: pseudo labeling-based methods [lee2013pseudo, reiss2021every, DBLP:conf/cvpr/XieLHL20, DBLP:journals/corr/abs-2012-00827, DBLP:journals/mia/XiaYYLCYZXYR20] and consistency regularization-based methods [DBLP:conf/miccai/YuWLFH19, luo2021urpc, li2020transformation, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20, DBLP:conf/miccai/HangFLYWCQ20, DBLP:conf/miccai/FangL20, DBLP:journals/corr/abs-2103-02911, luo2021semi, li2018semi, DBLP:conf/miccai/BortsovaDHKB19, DBLP:conf/miccai/LiZH20, DBLP:conf/miccai/YangSKW20].

Pseudo Labeling-based Methods. Pseudo labeling-based methods handle label scarcity by estimating pseudo labels on unlabeled data and using all the labeled and pseudo-labeled data to train the model. Self-training is one of the most straightforward solutions [lee2013pseudo, DBLP:conf/cvpr/XieLHL20, DBLP:journals/corr/abs-2012-00827] and has been extended to the biomedical domain for segmentation [DBLP:conf/miccai/SedaiARJ0SWG19, DBLP:journals/tmi/FanZJZCFSS20, DBLP:conf/miccai/BaiOSSRTGKMR17]. The main idea of self-training is that the model is first trained with labeled data only and then generates pseudo labels for unlabeled data. By retraining the model with both labeled and pseudo-labeled images, the model performance can be enhanced. The model can be trained iteratively with these two processes until the performance becomes stable and satisfactory. To reduce the noise in pseudo labels, different methods have been developed, including identifying trustworthy pseudo labels by uncertainty estimation [DBLP:conf/miccai/SedaiARJ0SWG19], using a conditional random field (CRF) [DBLP:conf/nips/KrahenbuhlK11] to refine pseudo labels [DBLP:conf/miccai/BaiOSSRTGKMR17], or using pseudo labels only for fine-tuning [DBLP:journals/tmi/FanZJZCFSS20]. In addition to such offline pseudo label generation strategies, online self-training methods [DBLP:conf/miccai/LiCXMZ20, reiss2021every] have been developed recently, where pseudo labels are generated after each forward propagation and used as immediate supervision.
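The offline self-training loop described above can be sketched in a few lines. Here `model_fit` and `model_predict` are hypothetical stand-ins for any supervised trainer and its inference routine, not functions from the paper:

```python
import numpy as np

def self_train(model_fit, model_predict, X_lab, y_lab, X_unlab, rounds=2):
    """Offline self-training sketch: fit on labeled data, pseudo-label the
    unlabeled pool, then refit on the union of both (repeat a few rounds)."""
    params = model_fit(X_lab, y_lab)                 # supervised warm-up
    for _ in range(rounds):
        pseudo = model_predict(params, X_unlab)      # generate pseudo labels
        X_all = np.concatenate([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo])
        params = model_fit(X_all, y_all)             # retrain on both sets
    return params
```

In practice each round would also filter unreliable pseudo labels (e.g., by uncertainty), which the sketch omits.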

Another pseudo labeling-based approach is co-training [DBLP:conf/colt/BlumM98, DBLP:conf/eccv/QiaoSZWY18, DBLP:conf/ijcai/ChenWGZ18], where multiple learners are trained and their disagreement on unlabeled data is exploited to improve the accuracy of pseudo labels. The basic idea is that each learner can learn different and complementary information from the other learners. In some self-training methods, more than one learner is used, such as in [reiss2021every], but the supervision on unlabeled data is unidirectional: for example, the teacher model [tarvainen2017mean] generates pseudo labels to supervise the student model. In contrast, in a dual-model co-training method such as [DBLP:journals/mia/XiaYYLCYZXYR20], supervision is bidirectional: each base model's supervision on unlabeled data is based on the fused predictions of the other base models, weighted by the confidence of each model.

However, these methods ignore class-aware feature regularization, which is a key focus of this study. We will demonstrate the importance of feature representation learning when training with labeled and pseudo-labeled images.

Consistency Regularization-based Methods. The goal of consistency regularization-based semi-supervised methods [tarvainen2017mean, DBLP:conf/iclr/LaineA17, DBLP:journals/pami/MiyatoMKI19] is to find a model that is not only accurate in its predictions but also invariant to input perturbations, enforcing the decision boundary to traverse the low-density region of the feature space. One line of these methods considers invariance to input domain perturbations. For example, the temporal ensembling model [DBLP:conf/iclr/LaineA17] achieves promising results by accumulating soft pseudo labels on randomly perturbed input images. An extension with soft pseudo label accumulation guided by epistemic uncertainty was proposed in [DBLP:journals/tmi/CaoCLPWC21]: when the epistemic uncertainty of a prediction is high, it contributes less to pseudo label accumulation. The mean teacher model [tarvainen2017mean] achieves invariance to input perturbations by promoting consistency between the predictions of the teacher and the student models, where the input images fed to the teacher model are perturbed with noise. Extensions have also been made from the perspective of reliability evaluation [DBLP:conf/miccai/YuWLFH19, DBLP:conf/miccai/WangZTZSZH20], to provide reliable supervision from the teacher model to the student model, or by considering structural information of foreground objects [DBLP:conf/miccai/HangFLYWCQ20].
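The mean teacher update referenced here is, at its core, an exponential moving average over parameters; a minimal sketch (the `alpha` decay value is illustrative, not taken from the paper):

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Mean-teacher step: each teacher parameter becomes an exponential
    moving average of the corresponding student parameter."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k]
            for k in teacher}
```

The teacher then predicts on a noise-perturbed copy of the input, and the student is trained to match it; only the student receives gradients.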

In addition to input domain perturbation, other perturbations that do not change the semantics of the prediction have been designed to promote consistency: for example, consistency among predictions given by differently designed decoders [DBLP:conf/miccai/FangL20, DBLP:journals/corr/abs-2103-02911], at different scales [luo2021urpc], or across different modalities [luo2021semi], where the perturbations go beyond the input level but should still leave the semantics invariant. Aside from perturbations that lead to invariance in the output, another line of studies [li2018semi, li2020transformation, DBLP:conf/miccai/BortsovaDHKB19] promotes equivariance between the input and the output, because some input-space transforms, especially spatial transforms such as rotations, should induce the same transform in the output space.

Unlike these existing methods that are based on consistency regularization, our method is a two-stage framework, which improves the overall framework by regularizing the feature representation. Moreover, we introduce AUA, an aleatoric uncertainty-aware method, to represent inherent ambiguities in medical images and enhance the segmentation performance by encouraging the consistency for images with low ambiguity.

II-B Contrastive Learning in Semi-supervised Image Segmentation

Note that we exclude self-supervised learning methods in which unlabeled data are used only for a task-agnostic purpose, i.e., pretraining, such as in [DBLP:conf/nips/ChaitanyaEKK20], even though performance under the semi-supervised setting is also reported. We only consider contrastive learning for task-specific use [DBLP:conf/brainles-ws/IwasawaHS20, lai2021semi, DBLP:journals/corr/abs-2012-06985]. Among these works, only [DBLP:journals/corr/abs-2012-06985] aims to promote inter-class separation and intra-class compactness. However, in [DBLP:journals/corr/abs-2012-06985], pseudo labels are obtained from a model trained with labeled data only, whose performance is inferior to our first-stage model, where pseudo labels are obtained from a model that takes advantage of consistency regularization on unlabeled data and feature regularization on labeled data. In [lai2021semi], inter-class separation is considered by taking pixels with different pseudo labels as negative pairs, but intra-class compactness is ignored, since positive pairs are built from the same pixel in different crops, which is essentially an extension of instance discrimination to the segmentation task. To the best of our knowledge, ours is the first study with pixel-level feature regularization aiming at both intra-class compactness and inter-class separation for semi-supervised medical image segmentation.

II-C Uncertainty Estimation in Semi-supervised Medical Image Segmentation

Uncertainties generally fall into two categories: epistemic and aleatoric. Epistemic uncertainty concerns a model's parameters and is caused by a lack of data, while aleatoric uncertainty is caused by intrinsic ambiguities or randomness of the input data and cannot be reduced by introducing more data. Early methods measured uncertainty using particle filtering and CRFs [DBLP:journals/ijcv/BlakeCZ93, DBLP:conf/cvpr/HeZC04]. More recently, in Bayesian networks, epistemic uncertainty is usually estimated with Monte Carlo Dropout [DBLP:conf/nips/KendallG17], which has been extended to the semi-supervised medical image segmentation task [DBLP:conf/miccai/YuWLFH19, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20]. Aleatoric uncertainty is estimated either without considering correlations between pixels [DBLP:conf/nips/KendallG17] or with a limited ability to model spatial correlation, since it is captured by uncorrelated latent variables from a multivariate normal distribution [DBLP:conf/nips/KohlRMFLMERR18, DBLP:conf/miccai/BaumgartnerTCHM19]. Monteiro et al. [DBLP:conf/nips/MonteiroFCPMKWG20] proposed an aleatoric uncertainty estimation technique in which correlations between pixels are considered. Despite the ubiquitous presence of noise and ambiguity in medical images, aleatoric uncertainty has been overlooked for semi-supervised medical image segmentation. In this work, we propose an aleatoric uncertainty adaptive consistency regularization technique, where correlations between pixels are considered when measuring aleatoric uncertainty.
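As a rough illustration of the Monte Carlo estimation style used by these epistemic-uncertainty methods, the predictive entropy over several stochastic forward passes (e.g., with dropout active) can be computed as follows; shapes and the entropy measure are our own simplification, not taken from the cited works:

```python
import numpy as np

def mc_uncertainty(prob_samples):
    """Given T stochastic forward passes of per-pixel class probabilities,
    shaped (T, S, C), return the predictive entropy of the mean prediction
    for each of the S pixels: pixels where the passes disagree score high."""
    mean_p = prob_samples.mean(axis=0)                       # (S, C)
    return -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)   # (S,)
```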

III Method

Figure 2: Overview of our method. The first-stage training is composed of an aleatoric uncertainty-adaptive (AUA) method to optimize output consistency and a boundary-aware contrastive loss (BCL) to regularize labeled images in the feature space, aiming at generating high-quality pseudo labels. In the second stage, the network is retrained with both labeled and pseudo-labeled images, where a prototype-aware contrastive loss (PCL) is used to fully leverage both labeled and pseudo-labeled images for representation learning for image segmentation.

Figure 2 visualizes the overview of the proposed two-stage framework. In the first stage, the AUA module trains a segmentation model that generates high-quality pseudo labels. We then introduce the stage-adaptive contrastive learning method, consisting of a boundary-aware contrastive loss (BCL) on labeled data only in the first stage and a prototype-aware contrastive loss (PCL) on all data in the second stage. By sequentially training through the first and second stages, we generate the final segmentation results.

III-A Aleatoric Uncertainty Adaptive Consistency Regularization (AUA)

Under semi-supervised settings, where limited labeled data are available, a medical image segmentation model can make unreliable predictions on unlabeled data. It is desirable for the segmentation model to be aware of when it is likely to make mistakes. Aleatoric uncertainty, the kind of uncertainty arising from ambiguities in the input data that hinder segmentation, can serve as an indicator of when a model may not perform well. While image ambiguity is a significant issue for medical image segmentation, this effect has not been taken into account in previous semi-supervised medical segmentation studies. This work proposes a new consistency regularization technique where aleatoric uncertainty guides how much the student model should learn from the teacher model.

We first introduce how to estimate aleatoric uncertainty in a stochastic segmentation network. Here, we consider a $C$-class segmentation task on 3D volumes of size $H \times W \times D$, where $H$, $W$, $D$ denote the height, width, and depth, respectively. Given an image $x$ and its ground truth $y$ with the same size, the loss function of a general segmentation network is designed to minimize the negative log-likelihood, formulated as:

$$\mathcal{L} = -\log p(y \mid x) = -\log \int p(y \mid \eta)\, p(\eta \mid x)\, d\eta, \qquad (1)$$

where $\eta$ denotes the logits.

In a deterministic segmentation network, i.e., assuming $p(\eta \mid x) = \delta(\eta - f_\theta(x))$ and independence of each pixel's prediction on the others, where $f_\theta$ is a neural network parameterized by $\theta$ and $\delta$ denotes the Dirac delta function, the loss function in Eq. 1 can be rewritten as:

$$\mathcal{L} = -\sum_{i=1}^{S} \log p(y_i \mid \eta_i). \qquad (2)$$

For simplicity, we use a one-dimensional scalar $i$ to index each pixel out of the whole set of $S = H \times W \times D$ pixels in a 3D volume. The above equation is the cross-entropy function commonly used in segmentation models.

In a stochastic segmentation network, to represent the inter-dependence among pixels and the inherent ambiguities of the input data, we follow [DBLP:conf/nips/MonteiroFCPMKWG20] and assume a multivariate Gaussian distribution over the logits, i.e., $\eta \mid x \sim \mathcal{N}(\mu(x), \Sigma(x))$, with mean $\mu(x)$ and covariance $\Sigma(x)$ predicted by the network. Monte-Carlo integration with $M$ samples is applied to approximate the intractable integral, leading us from Eq. 1 to:

$$\mathcal{L} = -\log \frac{1}{M} \sum_{m=1}^{M} p\big(y \mid \eta^{(m)}\big) = -\operatorname{logsumexp}_{m}\Big(\log p\big(y \mid \eta^{(m)}\big)\Big) + \log M. \qquad (3)$$

The logsumexp operation is for numerical stability, and we refer the calculation of $\log p(y \mid \eta^{(m)})$ to Eq. 2, where $\eta^{(m)}$ is one sample out of $M$. As pointed out by [DBLP:conf/nips/MonteiroFCPMKWG20], the full-rank covariance matrix is computationally infeasible, so we also adopt a low-rank approximation defined as:

$$\Sigma(x) = P(x)\,P(x)^{\top} + D(x), \qquad (4)$$

$$P(x) \in \mathbb{R}^{(S \cdot C) \times R}, \quad D(x) = \operatorname{diag}\big(d(x)\big), \quad d(x) \in \mathbb{R}^{S \cdot C}, \qquad (5)$$

where $P(x)$ denotes the factor part of the low-rank form of the covariance matrix and $D(x)$ denotes the diagonal part. In this way, the computational complexity is reduced from quadratic to linear in $S \cdot C$, where $R$ denotes the rank.
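Sampling logits from this low-rank Gaussian never requires materializing the full covariance: with $\eta = \mu + Pz + \sqrt{d} \odot \epsilon$, $z \sim \mathcal{N}(0, I_R)$ and $\epsilon \sim \mathcal{N}(0, I_n)$, the samples have covariance $PP^{\top} + \operatorname{diag}(d)$. A minimal numpy sketch (variable names are ours):

```python
import numpy as np

def sample_logits(mu, P, d, m_samples, rng=None):
    """Draw logit samples from N(mu, P P^T + diag(d)) via the
    reparameterization eta = mu + P z + sqrt(d) * eps.
    Memory is O(n*R) instead of the O(n^2) full covariance."""
    rng = np.random.default_rng() if rng is None else rng
    n, r = P.shape
    z = rng.standard_normal((m_samples, r))     # low-rank factor noise
    eps = rng.standard_normal((m_samples, n))   # diagonal noise
    return mu + z @ P.T + np.sqrt(d) * eps      # (m_samples, n)
```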

Given an unlabeled image $x$, the predicted distribution by the student model parameterized by $\theta$ is denoted as $p_s = \mathcal{N}(\mu_\theta(x), \Sigma_\theta(x))$. Similarly, we can obtain the teacher model's prediction $p_t$ over a perturbed version of the same input obtained by Gaussian noise injection, where the parameters of the teacher model, denoted as $\theta'$, are updated with an exponential moving average of the parameters of the student model. The consistency between the teacher model's predictions and the student model's predictions on unlabeled data is encouraged by minimizing the generalised energy distance [DBLP:conf/nips/KohlRMFLMERR18, szekely2013energy], which is defined as:

$$D_{GED}^{2}(p_s, p_t) = 2\,\mathbb{E}\big[d(y_s, y_t)\big] - \mathbb{E}\big[d(y_s, y_s')\big] - \mathbb{E}\big[d(y_t, y_t')\big], \qquad (6)$$

where $y_s, y_s'$ are independent samples from the student's predictive distribution and $y_t, y_t'$ from the teacher's. To approximate the intractable expectation operation in Eq. 6, we take $M$ samples out of $p_s$ and $p_t$, respectively. The consistency regularization loss function can be reformulated as:

$$\mathcal{L}_{con} = \frac{2}{M^2}\sum_{m=1}^{M}\sum_{m'=1}^{M} d\big(y_s^{(m)}, y_t^{(m')}\big) - \frac{1}{M^2}\sum_{m=1}^{M}\sum_{m'=1}^{M} d\big(y_s^{(m)}, y_s^{(m')}\big) - \frac{1}{M^2}\sum_{m=1}^{M}\sum_{m'=1}^{M} d\big(y_t^{(m)}, y_t^{(m')}\big). \qquad (7)$$

In Eq. 7, $d(\cdot, \cdot)$ is defined as the Generalized Dice loss [sudre2017generalised]:

$$d(y, \hat{y}) = 1 - \frac{2 \sum_{c=1}^{C} \sum_{i=1}^{S} y_{i,c}\, \hat{y}_{i,c}}{\sum_{c=1}^{C} \sum_{i=1}^{S} \big(y_{i,c} + \hat{y}_{i,c}\big)}, \qquad (8)$$

where $i$ indexes each pixel out of the whole set of $S$ pixels in a 3D volume and $c$ indexes each class out of a total of $C$ classes.

The optimum of Eq. 7 is 0, which means the optimum of the first term equals the sum of the last two. This consistency regularization is adaptive to aleatoric uncertainty in the following sense: if the diversity among samples of the student (or the teacher) model is high, i.e., the values of the last two terms of Eq. 7 are large, indicating high aleatoric uncertainty, then the pairwise similarity between samples from the student and the teacher models, measured by the first term of Eq. 7, is less strictly constrained. On the contrary, for an input image where low diversity is estimated, implying that the aleatoric uncertainty is low and the model is more likely to generalize well, the student model automatically learns more from the teacher model by optimizing the first term to a smaller value.

To summarize, the AUA loss is defined as follows:

$$\mathcal{L}_{AUA} = \mathcal{L} + \lambda_{c}\, \mathcal{L}_{con}, \qquad (9)$$

where $\lambda_c$ is the scaling weight balancing the uncertainty estimation loss (Eq. 3) and the generalised energy distance loss (Eq. 7).
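The Monte Carlo estimate of the generalized-energy-distance consistency term can be sketched with a soft Dice distance between sampled segmentations; this is a simplified 1-D stand-in for the 3D case, with function names of our own choosing:

```python
import numpy as np

def dice_distance(a, b, eps=1e-8):
    """1 - soft Dice overlap between two binary/soft segmentations."""
    inter = (a * b).sum()
    return 1.0 - 2.0 * inter / (a.sum() + b.sum() + eps)

def ged_consistency(student_samples, teacher_samples):
    """Monte Carlo estimate of the generalized energy distance (Eq. 7):
    twice the mean cross-model distance minus each model's own
    within-sample diversity."""
    cross = np.mean([dice_distance(s, t)
                     for s in student_samples for t in teacher_samples])
    s_div = np.mean([dice_distance(a, b)
                     for a in student_samples for b in student_samples])
    t_div = np.mean([dice_distance(a, b)
                     for a in teacher_samples for b in teacher_samples])
    return 2.0 * cross - s_div - t_div
```

When both models are internally diverse (high aleatoric uncertainty), the subtracted diversity terms grow, relaxing how tightly the cross-model term is constrained, which is the adaptivity described above.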

III-B Stage-adaptive Feature Regularization

We introduce a stage-adaptive feature learning method, consisting of a boundary-aware contrastive loss and a prototype-aware contrastive loss, to enhance representation learning with only labeled images and with both labeled and pseudo-labeled images, respectively. A natural solution is a contrastive loss with features of pixels belonging to the same class (i.e., both foreground or both background) as positive pairs and features of pixels from different classes (i.e., one foreground and one background) as negative pairs. This strategy allows pixel-wise regularization but consumes memory quadratic in the number of pixels, so we propose a stage-adaptive contrastive learning method with these concerns properly handled. To reduce the computational cost, in the first stage we only optimize the feature representation of pixels around the segmentation boundaries, using a boundary-aware contrastive loss (BCL). In the second stage, with more accurate pseudo labels on unlabeled data, we introduce a prototype-aware contrastive loss (PCL) to fully leverage both labeled and pseudo-labeled images for representation learning.

III-B1 Boundary-aware Contrastive Learning

To balance the benefits of pixel-wise feature-level regularization against its computational cost, we build positive and negative pairs from a random subset of near-boundary pixels, arriving at the boundary-aware contrastive loss, formally defined as:

$$\mathcal{L}_{BCL} = \sum_{i \in B} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(v_i \cdot v_p / \tau)}{\sum_{a \in A(i)} \exp(v_i \cdot v_a / \tau)}, \qquad (10)$$

where $B$ contains the indexes of randomly sampled near-boundary pixels from an input image, $A(i)$ contains the indexes of the other sampled pixels except pixel $i$, and $P(i)$ contains the indexes of pixels in $A(i)$ belonging to the same class as pixel $i$. The feature vectors $v_i$, $v_p$, and $v_a$ are obtained from a 3-layer convolutional projection head, which is connected after the penultimate layer. The temperature $\tau$ is set to 0.07 following [DBLP:conf/nips/KhoslaTWSTIMLK20].
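The boundary-aware contrastive loss can be illustrated on a toy 1-D "image": boundary pixels are selected, and a supervised contrastive objective is applied only to them. This is an illustrative re-implementation under our own simplifications (1-D neighborhoods, no random subsampling), not the authors' code:

```python
import numpy as np

def boundary_pixels(labels):
    """Indexes of pixels whose left/right neighbor has a different label
    (a minimal 1-D stand-in for a segmentation boundary band)."""
    lab = np.asarray(labels)
    edge = np.zeros(len(lab), dtype=bool)
    edge[1:] |= lab[1:] != lab[:-1]
    edge[:-1] |= lab[:-1] != lab[1:]
    return np.where(edge)[0]

def bcl_loss(feats, labels, idx, tau=0.07):
    """Supervised contrastive loss restricted to sampled pixel indexes
    `idx`; feats are assumed unit-normalized feature vectors."""
    loss, count = 0.0, 0
    for i in idx:
        others = [j for j in idx if j != i]
        pos = [j for j in others if labels[j] == labels[i]]
        if not pos:
            continue
        sims = np.array([feats[i] @ feats[j] / tau for j in others])
        log_denom = np.log(np.exp(sims).sum())
        for j in pos:
            loss += -(feats[i] @ feats[j] / tau - log_denom)
        count += len(pos)
    return loss / max(count, 1)
```

A real implementation would vectorize the pairwise similarities and sample `idx` from a dilated boundary band of the 3D label volume.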

III-B2 Prototype-aware Contrastive Learning

In the second stage, the way to regularize an indiscriminative feature space, as in Figure 1(a), is to encourage each pixel's feature to be closer to the pixels sharing the same label and further away from those of the opposite class, so that the well-clustered feature space in Figure 1(b) is formed. This is defined as:

$$\mathcal{L}_{contrast} = \sum_{i \in I} -\log \frac{\sum_{p \in P(i)} \exp(v_i \cdot v_p / \tau)}{\sum_{p \in P(i)} \exp(v_i \cdot v_p / \tau) + \sum_{n \in N(i)} \exp(v_i \cdot v_n / \tau)}, \qquad (11)$$

where $I$ contains the indexes of all pixels, and $P(i)$ and $N(i)$ contain the indexes of positive pixels, i.e., those sharing the same class as pixel $i$, and negative pixels, i.e., those with labels different from pixel $i$, respectively. Features $v_i$ are extracted from the second-stage model, where $i$ can be the index of any pixel.

In [DBLP:journals/corr/abs-2105-05013], by assuming a Gaussian distribution over the features belonging to each class, the computational cost of Eq. 11 can be reduced from quadratic to linear, leading to a regularization formulated as:

$$\mathcal{L}_{PCL} = \sum_{i \in I} -\log \frac{\exp\!\big(v_i^{\top}\mu_{+}/\tau + v_i^{\top}\Sigma_{+}v_i/2\tau^{2}\big)}{\exp\!\big(v_i^{\top}\mu_{+}/\tau + v_i^{\top}\Sigma_{+}v_i/2\tau^{2}\big) + \exp\!\big(v_i^{\top}\mu_{-}/\tau + v_i^{\top}\Sigma_{-}v_i/2\tau^{2}\big)}, \qquad (12)$$

where $\mu_{+}$ and $\Sigma_{+}$ are the mean and covariance matrix of the positive class for pixel $i$ and, similarly, $\mu_{-}$ and $\Sigma_{-}$ are the mean and covariance matrix of the negative class for pixel $i$. These prototype statistics for each class $c$ are estimated from the first-stage model with a moving average update of the extracted features, with the update at step $t$ formulated as:

$$\mu_c^{t} = \frac{n_c^{t-1}\,\mu_c^{t-1} + m_c^{t}\,\nu_c^{t}}{n_c^{t-1} + m_c^{t}}, \qquad \Sigma_c^{t} = \frac{n_c^{t-1}\,\Sigma_c^{t-1} + m_c^{t}\,S_c^{t}}{n_c^{t-1} + m_c^{t}}, \qquad n_c^{t} = n_c^{t-1} + m_c^{t}, \qquad (13)$$

where $n_c^{t-1}$ denotes the total number of pixels belonging to class $c$ seen before time step $t$, and $m_c^{t}$ denotes the number of pixels of class $c$ in the image loaded at time step $t$. $\nu_c^{t}$ and $S_c^{t}$ denote the mean and covariance, respectively, of the features belonging to class $c$ in the image at step $t$. We refer readers to [DBLP:journals/corr/abs-2105-05013] for the detailed derivation. The final prototypes are estimated after 3000 iterations, and the temperature is set to 100 following [DBLP:journals/corr/abs-2105-05013].
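The count-weighted moving average of the per-class statistics can be sketched as below. As in the text, we show only the simple weighted average and refer to the cited work for the exact covariance derivation (the cross-term between old and new means is omitted in this sketch):

```python
import numpy as np

def update_prototype(mu, sigma, n, batch_feats):
    """Fold the current batch's feature mean/covariance for one class into
    the running prototype statistics, weighted by pixel counts."""
    m = len(batch_feats)
    nu = batch_feats.mean(axis=0)                       # batch mean
    S = np.cov(batch_feats, rowvar=False, bias=True)    # batch covariance
    mu_new = (n * mu + m * nu) / (n + m)
    sigma_new = (n * sigma + m * S) / (n + m)           # cross-term omitted
    return mu_new, sigma_new, n + m
```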

III-C Stage-wise Training as a Unified Framework

To summarize, in the first stage, the loss function is defined as:

$$\mathcal{L}_{stage1} = \mathcal{L}_{AUA} + \lambda_{b}\, \mathcal{L}_{BCL}, \qquad (14)$$

where $\lambda_b$ is the scaling weight for the BCL loss. In this way, pseudo labels with higher quality can be obtained on unlabeled data, thanks to the joint prediction regularization (with AUA) and feature regularization (with BCL), which enables retraining a stronger segmentation model in the second stage by regularizing both predictions and features over the whole dataset in a label-aware manner. The loss function in the second stage is as follows:

$$\mathcal{L}_{stage2} = \mathcal{L}_{seg} + \lambda_{p}\, \mathcal{L}_{PCL}, \qquad (15)$$

where $\mathcal{L}_{seg}$ is defined as the average of the cross-entropy loss and the Dice loss, as a common practice in segmentation, which serves as the pseudo-label supervision, and $\lambda_p$ is the weight for the PCL loss.
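The supervised term, the average of cross-entropy and soft Dice, can be sketched for a binary task as follows (our own minimal formulation on foreground probabilities):

```python
import numpy as np

def seg_loss(prob, target, eps=1e-8):
    """Average of binary cross-entropy and soft Dice loss, the common
    supervised segmentation objective used for (pseudo-)labeled data."""
    ce = -np.mean(target * np.log(prob + eps)
                  + (1 - target) * np.log(1 - prob + eps))
    dice = 1.0 - 2.0 * (prob * target).sum() / (prob.sum() + target.sum() + eps)
    return 0.5 * (ce + dice)
```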

Method Labeled Unlabeled Dice↑ Jaccard↑ ASD[voxel]↓ 95HD[voxel]↓
V-Net 12 0 70.63 56.72 6.29 22.54
V-Net 62 0 81.78 69.65 1.34 5.13
MT [tarvainen2017mean] 12 50 75.85 61.98 3.40 12.59
DAN [DBLP:conf/miccai/ZhangYCFHC17] 12 50 76.74 63.29 2.97 11.13
Entropy Mini[DBLP:conf/cvpr/VuJBCP19] 12 50 75.31 61.73 3.88 11.72
UA-MT [DBLP:conf/miccai/YuWLFH19] 12 50 77.26 63.82 3.06 11.90
CCT [DBLP:conf/cvpr/OualiHT20] 12 50 76.58 62.76 3.69 12.92
SASSNet [DBLP:conf/miccai/LiZH20] 12 50 77.66 64.08 3.05 10.93
DTC [luo2021semi] 12 50 78.27 64.75 2.25 8.36
Ours 12 48 79.81 66.82 1.64 5.90
Table I: A comparison with the state-of-the-art on the Pancreas dataset with 20% labeled data. The up arrow (↑) indicates that a larger number means better performance; the down arrow (↓) indicates that a lower number means better performance.
Method Labeled Unlabeled Dice↑ Jaccard↑ ASD[voxel]↓ 95HD[voxel]↓
V-Net 3 0 30.74 18.84 6.97 26.45
V-Net 60 0 81.46 69.18 1.31 5.09
MT[tarvainen2017mean] 3 57 31.09 18.77 28.14 59.22
UA-MT[DBLP:conf/miccai/YuWLFH19] 3 57 34.46 21.24 25.73 57.40
DTC[luo2021semi] 3 57 48.47 32.71 17.03 42.61
SASSNet[DBLP:conf/miccai/LiZH20] 3 57 51.96 36.03 16.08 45.36
Ours 3 57 56.18 40.05 12.47 34.85
Table II: A comparison with the state-of-the-art on the Pancreas dataset with 5% labeled data.
Method Labeled Unlabeled Dice↑ Jaccard↑ ASD[voxel]↓ 95HD[voxel]↓
V-Net 5 0 34.07 23.09 10.12 26.52
V-Net 100 0 62.31 49.47 2.14 13.49
MT[tarvainen2017mean] 5 95 38.64 26.38 14.41 33.08
UA-MT[DBLP:conf/miccai/YuWLFH19] 5 95 40.61 28.01 15.31 34.92
SASSNet[DBLP:conf/miccai/LiZH20] 5 95 41.64 30.07 11.93 28.96
DTC[luo2021semi] 5 95 43.29 29.84 10.62 26.22
Ours 5 95 49.00 35.15 9.04 22.32
Table III: A comparison with the state-of-the-art on the Colon tumor dataset with 5% labeled data.

IV Experimental Results

IV-A Datasets and Preprocessing

Pancreas CT dataset. The Pancreas CT dataset [DBLP:conf/miccai/RothLFSLTS15] is a public dataset containing 80 scans with a resolution of 512×512 pixels and a slice thickness between 1.5 and 2.5 mm. Each image has a corresponding pixel-wise label, annotated by an expert and verified by a radiologist.

Colon cancer segmentation dataset. The Colon cancer dataset is a subset of the Medical Segmentation Decathlon (MSD) datasets [simpson2019large], consisting of 190 colon cancer CT volumes, of which 126 have pixel-level annotations. Among these annotated volumes, we randomly hold out 26 CT volumes as a test set and use the rest for training.

Preprocessing. For a fair comparison with other methods, we follow the preprocessing in [luo2021semi]: clipping CT images to a range of [−125, 275] HU, resampling images to an isotropic 1×1×1 mm resolution, center-cropping both raw images and annotations around the foreground area with a margin of 25 voxels, and finally normalizing raw images to zero mean and unit variance. On the Pancreas dataset, we apply random crop as an on-the-fly augmentation; the Colon dataset is augmented with random rotation, random flip, and random crop. On both datasets, sub-volumes are randomly cropped from the raw data and fed to the segmentation model for training.
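The intensity clipping, normalization, and on-the-fly random cropping described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names are ours, and the resampling and foreground-cropping steps are omitted.

```python
import numpy as np

def preprocess_ct(volume, hu_min=-125.0, hu_max=275.0):
    """Clip a raw CT volume to a soft-tissue HU window, then normalize
    to zero mean and unit variance (resampling/cropping omitted)."""
    vol = np.clip(volume.astype(np.float32), hu_min, hu_max)
    return (vol - vol.mean()) / (vol.std() + 1e-8)

def random_crop(volume, label, size=(96, 96, 96), rng=None):
    """Randomly crop a paired image/label sub-volume of the given size."""
    rng = rng or np.random.default_rng()
    starts = [rng.integers(0, s - c + 1) for s, c in zip(volume.shape, size)]
    sl = tuple(slice(st, st + c) for st, c in zip(starts, size))
    return volume[sl], label[sl]
```

The same crop indices are applied to image and annotation so the pair stays aligned.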

IV-B Implementation Details


All experiments in this work are implemented in PyTorch 1.6.0 and conducted with Python 3.7.4 on an NVIDIA TITAN RTX GPU.

Backbone. V-Net [DBLP:conf/3dim/MilletariNA16] is used as our backbone, with the last convolutional layer replaced by a 3D 1×1×1 convolutional layer. On top of that, a projection module and an aleatoric uncertainty module are built for feature regularization and aleatoric uncertainty estimation, respectively. Similar to [DBLP:journals/corr/abs-2012-06985], the projection head consists of three convolutional layers, each followed by ReLU activation and batch normalization, except for the last layer, which is followed by a unit-normalization layer. The channel size of each convolutional layer is set to 16. The aleatoric uncertainty module comprises three one-layer branches predicting the means, covariance factors, and covariance diagonals, respectively.
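A possible PyTorch sketch of the projection head just described is given below. The kernel size (1×1×1) and the input channel count are assumptions, since the text only specifies the number of layers, the channel width of 16, and the normalization scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Three 3D conv layers; BatchNorm+ReLU after the first two, and a
    unit-normalization (L2 over channels) after the last, per the text.
    Kernel size 1x1x1 is an assumption."""

    def __init__(self, in_channels=16, channels=16):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, channels, kernel_size=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=1)
        self.bn2 = nn.BatchNorm3d(channels)
        self.conv3 = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.conv3(x)
        # Unit-normalize each voxel's feature vector along the channel axis.
        return F.normalize(x, dim=1)
```

After the final normalization, every voxel's 16-dimensional feature lies on the unit hypersphere, which is the usual precondition for cosine-similarity-based contrastive losses.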

Training details. Our model is trained with an SGD optimizer with momentum 0.9 and weight decay 0.0001 for 6000 iterations. A step-decay learning-rate schedule is applied: the initial learning rate is set to 0.01 and multiplied by 0.1 every 2500 iterations. For each iteration, a training batch containing two labeled and two unlabeled sub-volumes is fed to the proposed model, with each sub-volume randomly cropped to a size of 96×96×96. At test time, predictions on sub-volumes of the same size, obtained with a sliding-window strategy with a stride of 16×16×16, are fused to produce the final results.
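The sliding-window fusion at test time can be sketched as follows, assuming score averaging over overlapping windows (the fusion rule is not stated in the text; `predict_fn`, which maps a patch to per-class probabilities, is a placeholder for the trained network).

```python
import numpy as np

def sliding_window_predict(volume, predict_fn, patch=(96, 96, 96),
                           stride=(16, 16, 16), num_classes=2):
    """Average overlapping patch predictions, then take the argmax.
    predict_fn(sub) must return an array of shape (num_classes, *patch)."""
    D, H, W = volume.shape
    scores = np.zeros((num_classes, D, H, W), dtype=np.float32)
    counts = np.zeros((D, H, W), dtype=np.float32)

    def starts(dim, p, s):
        # Window offsets along one axis; always include the last valid one.
        pos = list(range(0, max(dim - p, 0) + 1, s))
        if pos[-1] != dim - p:
            pos.append(dim - p)
        return pos

    for z in starts(D, patch[0], stride[0]):
        for y in starts(H, patch[1], stride[1]):
            for x in starts(W, patch[2], stride[2]):
                sub = volume[z:z+patch[0], y:y+patch[1], x:x+patch[2]]
                scores[:, z:z+patch[0], y:y+patch[1], x:x+patch[2]] += predict_fn(sub)
                counts[z:z+patch[0], y:y+patch[1], x:x+patch[2]] += 1
    return (scores / counts).argmax(axis=0)
```

With a stride much smaller than the patch size (16 vs. 96), each voxel is covered by many windows, which smooths out boundary artifacts at the cost of more forward passes.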

Evaluation metrics. We use Dice (DI), Jaccard (JA), the average surface distance (ASD), and the 95% Hausdorff distance (95HD) to evaluate the effectiveness of our semi-supervised segmentation method.
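The two overlap metrics can be computed directly from binary masks; a minimal sketch is given below (the surface-distance metrics ASD and 95HD require surface extraction and are omitted here).

```python
import numpy as np

def dice_jaccard(pred, gt):
    """Dice and Jaccard overlap scores for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
    jaccard = inter / (union + 1e-8)
    return float(dice), float(jaccard)
```

Dice and Jaccard are monotonically related (JA = DI / (2 − DI)), so both reward the same rankings but on different scales.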

IV-C Results on Pancreas Dataset

Since the predictions on unlabeled data may be inaccurate in the early stage of training, we follow common practice [DBLP:conf/miccai/YuWLFH19, luo2021semi] and use a Gaussian ramp-up function to control the strength of consistency regularization, where t denotes the current time step and t_max denotes the maximal training step, i.e., 6000 as introduced previously. The constant used to scale BCL is set to 0.09 with 20% labeled data and 0.01 with 5% labeled data. In the second stage of training, the PCL weight is set to 0.1.
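A sketch of the Gaussian ramp-up schedule is shown below. The exact functional form w(t) = w_max · exp(−5 · (1 − t/t_max)²) is the one commonly used in the consistency-regularization literature (e.g., by the mean-teacher line of work the text cites); the constants, and w_max itself, are assumptions rather than values given in the text.

```python
import math

def gaussian_rampup(t, t_max=6000, w_max=1.0):
    """Consistency-weight schedule: near zero at t=0, reaching w_max at t=t_max.
    Common form w(t) = w_max * exp(-5 * (1 - t/t_max)^2); constants assumed."""
    t = min(float(t), t_max)
    return w_max * math.exp(-5.0 * (1.0 - t / t_max) ** 2)
```

Early in training the consistency term is effectively switched off (w(0) = w_max · e⁻⁵ ≈ 0.0067 · w_max), so unreliable predictions on unlabeled data contribute little to the loss.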

Table I shows the results on the Pancreas dataset. Previous methods are benchmarked on the first version of the Pancreas dataset with 12 labeled and 50 unlabeled volumes, in which, however, two duplicates of scan 2 were found. Since a random split could place some of these three identical samples in the training set and the rest in the test set, we use version 2 of the dataset, where the two duplicates are removed, leaving the same number of labeled volumes but two fewer unlabeled volumes, i.e., 48. Even under this stricter setting, our proposed model achieves the best performance among existing works.

The first row, i.e., a fully supervised baseline trained on the partial dataset, gives the lower bound for semi-supervised segmentation methods, whereas the second row, i.e., a fully supervised model trained on the fully labeled dataset, gives the upper bound. Our method achieves 79.81% on Dice, surpassing the current state of the art by 1.54%. Notably, our method comes very close to the fully supervised model trained with human annotations on all volumes, showing the effectiveness of the proposed semi-supervised method.

To further validate our method under a more challenging scenario, we reduce the amount of labeled data to only 5% and use the remaining 95% as unlabeled. As shown in Table II, in such a small-data regime every semi-supervised learning method performs worse than its counterpart in Table I, where 20% of the labels are available, as expected. Our method consistently outperforms the others: it surpasses all other semi-supervised methods and exceeds the current state of the art by 4.22% on Dice, demonstrating that the advantage of our method is even more pronounced in this more challenging setting.

| Method | Dice ↑ | Jaccard ↑ | ASD[voxel] ↓ | 95HD[voxel] ↓ |
|---|---|---|---|---|
| Supervised baseline | 70.63 | 56.72 | 6.29 | 22.54 |
| AUA | 76.13 | 62.19 | 2.25 | 9.35 |
| AUA + BCL (First stage) | 77.15 | 63.34 | 2.04 | 7.00 |
| AUA + BCL + Pseudo labeling | 79.08 | 65.91 | 1.91 | 6.69 |
| AUA + BCL + Pseudo labeling + PCL (Full) | 79.81 | 66.82 | 1.64 | 5.90 |

Table IV: Ablation study on the Pancreas dataset. BCL refers to boundary-aware contrastive learning and PCL refers to prototype-aware contrastive learning. Pseudo labeling refers to directly retraining the network with pseudo labels without PCL.
| Method | Dice ↑ | Jaccard ↑ | ASD[voxel] ↓ | 95HD[voxel] ↓ |
|---|---|---|---|---|
| Supervised baseline | 34.07 | 23.09 | 10.12 | 26.52 |
| AUA | 42.74 | 30.20 | 15.00 | 35.43 |
| AUA + BCL (First stage) | 43.70 | 30.92 | 14.74 | 33.34 |
| AUA + BCL + Pseudo labeling | 46.75 | 33.62 | 12.39 | 28.49 |
| AUA + BCL + Pseudo labeling + PCL (Full) | 49.00 | 35.15 | 9.04 | 22.32 |

Table V: Ablation study on the Colon dataset. BCL refers to boundary-aware contrastive learning and PCL refers to prototype-aware contrastive learning. Pseudo labeling refers to directly retraining the network with pseudo labels without PCL.
Figure 3: A comparison of the segmentation maps produced by state-of-the-art methods [tarvainen2017mean, DBLP:conf/miccai/YuWLFH19, DBLP:conf/miccai/LiZH20, luo2021semi] and our method. Regions highlighted in red are true positives, i.e., correctly predicted foreground pixels; regions in green are false negatives, i.e., foreground pixels incorrectly predicted as background; regions in blue are false positives, i.e., background pixels incorrectly predicted as foreground.

IV-D Results on Colon Dataset

Table III shows the results on the Colon dataset. We use the same Gaussian ramp-up to control the consistency weight, and the scaling weight of BCL is set to 0.03. In the second stage of training, the PCL weight is set to 0.1. We compare our method with several state-of-the-art methods, including MT [tarvainen2017mean], UA-MT [DBLP:conf/miccai/YuWLFH19], SASSNet [DBLP:conf/miccai/LiZH20], and DTC [luo2021semi], using 5% of the data as labeled and the rest as unlabeled. Again, we tune the hyper-parameters of the previous methods so that they reach their best performance on this dataset. Comparing the second row of Table III with that of Table II, we notice that, under the fully supervised setting with the full dataset, performance on the Colon dataset is lower than on the Pancreas dataset, indicating that the Colon dataset is more challenging. Comparing the semi-supervised segmentation methods with the fully supervised baseline trained on the partial dataset, i.e., the first row of Table III, we observe stronger performance, showing that leveraging unlabeled data improves segmentation, as expected. Our method outperforms all previous works by a large margin (5.71% on Dice over DTC), indicating that our method makes better use of the unlabeled data.

IV-E Ablation Studies

Here we ablate each component of our proposed framework on the Pancreas dataset with 20% labeled data (Table IV) and on the Colon dataset with 5% labeled data (Table V). We gradually add each proposed component and report the performance in terms of the four metrics.

On both datasets, as shown in the second row of Tables IV and V, applying AUA to enforce consistency between the teacher and student models already yields superior performance over the fully supervised model, i.e., V-Net. On top of AUA, BCL boosts performance further, showing that feature representation learning provides regularization complementary to consistency regularization. Moreover, with more accurate pseudo labels available, simply applying pseudo labeling brings an additional improvement, as shown in the fourth row of Tables IV and V. Training with PCL on top of pseudo labeling improves the second-stage performance by 0.73% and 2.25% on the Pancreas and Colon datasets, respectively, implying that, despite the inevitable noise in pseudo labels, feature regularization in a pseudo-label-aware manner brings more benefit than the harm caused by noisy training. Most importantly, the performance gains from BCL and PCL demonstrate the effectiveness of the proposed stage-adaptive feature regularization, confirming our motivating intuition.

IV-F A Qualitative Comparison with the State of the Art

We visualize the segmentation predictions of other state-of-the-art methods and of our method in Figure 3, highlighting true positive, false negative, and false positive pixels in red, green, and blue, respectively. In the first case, shown in the first row, the prediction of our method overlaps more with the ground truth, while the other methods fail to recall many foreground pixels. In the second case, with a similar number of true positives, our method suffers fewer false positives and false negatives than the other methods.

V Conclusion

This paper presents a simple yet effective two-stage framework for semi-supervised medical image segmentation, with the key idea of exploring feature representations from labeled and unlabeled images. The first stage aims to generate high-quality pseudo labels, and the second stage refines the network with both labeled and pseudo labeled images. To generate high-quality pseudo labels, we propose a novel consistency regularization technique, called AUA, which adaptively encourages consistency by accounting for the ambiguity of medical images. The proposed stage-adaptive contrastive learning method, comprising a boundary-aware contrastive loss and a prototype-aware contrastive loss, enhances representation learning with labeled images in the first stage and with both labeled and pseudo labeled images in the second stage. Our method achieves the best results on two public medical image segmentation benchmarks, and the ablation study validates the effectiveness of each proposed component. Our future work includes extending this framework to other types of medical data, such as X-ray images, fundus images, and surgical videos.