I Introduction
Medical image segmentation is a foundational task for computer-aided diagnosis and computer-aided surgery. In recent years, considerable effort has been devoted to designing neural networks for medical image segmentation, such as U-Net
[ronneberger2015u], DenseUNet [li2018h], nnU-Net [isensee2020nnu], and HyperDenseNet [dolz2018hyperdense]. However, training these models requires a large amount of labeled images. Unlike natural images, the professional expertise required for pixel-wise manual annotation of medical images makes such labeling challenging and time-consuming, resulting in the difficulty of obtaining large labeled datasets. Hence, semi-supervised learning, which enables training with both labeled and unlabeled data, has become an active research area for medical image segmentation.
A common assumption of semi-supervised learning is that the decision boundary should not pass through high-density regions. Consistency regularization-based techniques
[DBLP:conf/miccai/YuWLFH19, luo2021urpc, li2020transformation] push the decision boundary toward low-density areas by penalizing prediction variation under different input perturbations. Entropy minimization-based methods aim to achieve high-confidence predictions for unlabeled data, either in an explicit manner [DBLP:conf/nips/GrandvaletB04] or an implicit manner [lee2013pseudo, DBLP:conf/miccai/SedaiARJ0SWG19, DBLP:journals/tmi/FanZJZCFSS20, reiss2021every]. As shown in Figure 1, an ideal model should pull together data points of the same class and push apart data points from different classes in the feature space. As the training set of semi-supervised learning includes labeled and unlabeled images, it is challenging to directly optimize the unlabeled images in the feature space without explicit guidance. We observe that, by using unlabeled images, most semi-supervised methods [DBLP:conf/miccai/YuWLFH19, luo2021urpc, li2020transformation] achieve more accurate segmentation results than a model trained with only labeled data. Therefore, the pseudo segmentations predicted by a semi-supervised model on unlabeled data could be made even more stable and precise. Motivated by this observation, we present a simple yet effective two-stage framework for semi-supervised medical image segmentation, with the key idea of exploring representation learning for segmentation from both labeled and unlabeled images. The first stage aims to generate high-quality pseudo labels, and the second stage uses the pseudo labels to retrain the network and regularize the features of both labeled and unlabeled images. Existing uncertainty-based semi-supervised methods [DBLP:conf/miccai/YuWLFH19, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20] have achieved impressive results by considering the reliability of the supervision for the unlabeled images.
These methods exploit epistemic uncertainty, the uncertainty about the model's parameters arising from a lack of data, either in the output space [DBLP:conf/miccai/YuWLFH19, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20] or in the feature space [DBLP:conf/miccai/WangZTZSZH20], as guidance for identifying trustworthy supervision. Medical images are often noisy, and the boundaries between tissue types may not be well defined, leading to disagreement even among human experts [DBLP:conf/nips/KohlRMFLMERR18, DBLP:conf/miccai/BaumgartnerTCHM19, DBLP:conf/nips/MonteiroFCPMKWG20]. However, aleatoric uncertainty, which represents the ambiguity of the input data and is irreducible by obtaining more data, is ignored in these methods.
To obtain high-quality pseudo labels for unlabeled images, we present an Aleatoric Uncertainty Adaptive method, namely AUA, for semi-supervised medical image segmentation. Under the framework of the mean teacher model [tarvainen2017mean], to obtain reliable target supervision for unlabeled data, instead of estimating the model's epistemic uncertainty
[DBLP:conf/miccai/YuWLFH19, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20], we explore the model's aleatoric uncertainty for noisy input data. AUA first measures the spatially correlated aleatoric uncertainty by modeling a multivariate normal distribution over the logit space. To effectively utilize unlabeled images, AUA encourages prediction consistency between the teacher model and the student model by adaptively considering the aleatoric uncertainty of each image. Specifically, the consistency regularization automatically emphasizes input images with lower aleatoric uncertainty,
i.e., input images with less ambiguity. In the second stage, we retrain the network with pseudo labels. To effectively regularize feature representation learning in both stages, we propose stage-adaptive feature regularization, consisting of a boundary-aware contrastive loss in the first stage and a prototype-aware contrastive loss in the second stage. The main idea of the boundary-aware contrastive loss is to fully leverage labeled images for representation learning. A straightforward solution is to pull together pixels of the same class and push apart pixels from different classes using a contrastive loss. However, medical images usually contain a large number of pixels, and naively applying a contrastive loss would lead to high computational cost and memory consumption. To this end, we present a boundary-aware contrastive loss, where only randomly sampled pixels from the segmentation boundary are optimized. In the second stage, to effectively utilize both labeled and pseudo-labeled, i.e., unlabeled, images for representation learning, we present a prototype-aware contrastive loss, with each pixel's feature pulled closer to its class centroid, i.e., prototype, and pushed further away from the class centroids it does not belong to. The main intuition is that the trained model can generate pseudo labels for unlabeled images in the second stage. Compared with the boundary-aware contrastive loss, the prototype-aware contrastive loss better leverages the pseudo labels, especially those that may not occur at the segmentation boundaries.
In summary, this paper makes the following contributions:

We present AUA, an aleatoric uncertainty adaptive consistency regularization method where the student model learns from the teacher model in a manner adaptive to the estimated aleatoric uncertainty.

We introduce a stage-adaptive method to explore feature representation learning in a semi-supervised setting. A boundary-aware contrastive loss is developed to enhance segmentation with only labeled images, and a prototype-aware contrastive loss is proposed to improve the results with both labeled and pseudo-labeled images.

Our method achieves state-of-the-art performance on two public datasets. Ablation studies validate the effectiveness of the proposed AUA and feature representation methods. Our code will be released at https://github.com/XMedLab/FRL_SemiMedSeg upon acceptance.
II Related Work
We briefly review related work on semi-supervised medical image segmentation, including pseudo labeling and consistency regularization, and also discuss techniques related to contrastive learning and uncertainty estimation.
II-A Semi-supervised Medical Image Segmentation
Semi-supervised learning (SSL) refers to training a model with both labeled and unlabeled images. Early work on semi-supervised medical image segmentation used graph-based methods [DBLP:journals/tmi/SuYHKZ16, DBLP:conf/icpr/BorgaAL16]. More recently, semi-supervised medical image segmentation has featured deep learning. The existing methods can be broadly classified into two categories: pseudo labeling-based methods
[lee2013pseudo, reiss2021every, DBLP:conf/cvpr/XieLHL20, DBLP:journals/corr/abs201200827, DBLP:journals/mia/XiaYYLCYZXYR20] and consistency regularization-based methods [DBLP:conf/miccai/YuWLFH19, luo2021urpc, li2020transformation, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20, DBLP:conf/miccai/HangFLYWCQ20, DBLP:conf/miccai/FangL20, DBLP:journals/corr/abs210302911, luo2021semi, li2018semi, DBLP:conf/miccai/BortsovaDHKB19, DBLP:conf/miccai/LiZH20, DBLP:conf/miccai/YangSKW20].

Pseudo Labeling-based Methods. Pseudo labeling-based methods handle label scarcity by estimating pseudo labels on unlabeled data and then using all the labeled and pseudo-labeled data to train the model. Self-training is one of the most straightforward solutions [lee2013pseudo, DBLP:conf/cvpr/XieLHL20, DBLP:journals/corr/abs201200827] and has been extended to segmentation in the biomedical domain [DBLP:conf/miccai/SedaiARJ0SWG19, DBLP:journals/tmi/FanZJZCFSS20, DBLP:conf/miccai/BaiOSSRTGKMR17]. The main idea of self-training is that the model is first trained with labeled data only and then generates pseudo labels for unlabeled data. By retraining the model with both labeled and pseudo-labeled images, the model's performance can be enhanced, and the two steps can be iterated until the performance becomes stable and satisfactory. To reduce the noise in pseudo labels, different methods have been developed, including identifying trustworthy pseudo labels via uncertainty estimation [DBLP:conf/miccai/SedaiARJ0SWG19], refining pseudo labels with a conditional random field (CRF) [DBLP:conf/nips/KrahenbuhlK11] [DBLP:conf/miccai/BaiOSSRTGKMR17], or using pseudo labels only for fine-tuning [DBLP:journals/tmi/FanZJZCFSS20].
In addition to such offline pseudo-label generation strategies, online self-training methods [DBLP:conf/miccai/LiCXMZ20, reiss2021every] have recently been developed, where pseudo labels are generated after each forward propagation and used as immediate supervision.
Another pseudo labeling-based approach is co-training [DBLP:conf/colt/BlumM98, DBLP:conf/eccv/QiaoSZWY18, DBLP:conf/ijcai/ChenWGZ18], where multiple learners are trained and their disagreement on unlabeled data is exploited to improve the accuracy of the pseudo labels. The basic idea is that each learner can learn different and complementary information from the others. Some self-training methods also use more than one learner, such as [reiss2021every], but the supervision on unlabeled data is unidirectional: the teacher model [tarvainen2017mean] generates pseudo labels to supervise the student model. In a dual-model co-training method such as [DBLP:journals/mia/XiaYYLCYZXYR20], by contrast, supervision is bidirectional. Specifically, each base model's supervision on unlabeled data is based on the fused predictions of the other base models, weighted by each model's confidence.
However, these methods ignore class-aware feature regularization, which is a key focus of this study. We will demonstrate the importance of feature representation learning when learning with labeled and pseudo-labeled images.
Consistency Regularization-based Methods. The goal of consistency regularization-based semi-supervised methods [tarvainen2017mean, DBLP:conf/iclr/LaineA17, DBLP:journals/pami/MiyatoMKI19] is to find a model that is not only accurate in its predictions but also invariant to input perturbations, forcing the decision boundary to traverse the low-density region of the feature space. One line of these methods considers invariance to input-domain perturbations. For example, the temporal ensembling model [DBLP:conf/iclr/LaineA17] achieves promising results by accumulating soft pseudo labels on randomly perturbed input images. An extension with soft pseudo-label accumulation guided by epistemic uncertainty was proposed in [DBLP:journals/tmi/CaoCLPWC21]: when the epistemic uncertainty of a prediction is high, it contributes less to the pseudo-label accumulation. The mean teacher model [tarvainen2017mean] achieves invariance to input perturbations by promoting consistency between the predictions of the teacher and student models, where noise is added to the input images fed to the teacher model. Extensions have also been made from the perspective of reliability evaluation [DBLP:conf/miccai/YuWLFH19, DBLP:conf/miccai/WangZTZSZH20], to provide reliable supervision from the teacher model to the student model, or by considering the structural information of foreground objects [DBLP:conf/miccai/HangFLYWCQ20].
In addition to input-domain perturbations, other perturbations that should not change the semantics of the prediction have also been designed and used to promote consistency, for example, consistency among predictions given by differently designed decoders [DBLP:conf/miccai/FangL20, DBLP:journals/corr/abs210302911], at different scales [luo2021urpc], or with different modalities [luo2021semi]. Aside from perturbations that lead to invariance in the output, another line of studies [li2018semi, li2020transformation, DBLP:conf/miccai/BortsovaDHKB19] promotes equivariance between the input and the output, because some input-space transforms, especially spatial transforms such as rotations, should lead to the same transform in the output space.
Unlike these existing consistency regularization-based methods, our method is a two-stage framework that improves the overall pipeline by regularizing the feature representation. Moreover, we introduce AUA, an aleatoric uncertainty-aware method, to represent the inherent ambiguities in medical images and enhance segmentation performance by encouraging consistency for images with low ambiguity.
II-B Contrastive Learning in Semi-supervised Image Segmentation
Note that we exclude self-supervised learning methods in which unlabeled data are used only for task-agnostic purposes, i.e., pre-training, such as [DBLP:conf/nips/ChaitanyaEKK20], even though they also report performance under the semi-supervised setting. We only consider contrastive learning for task-specific use [DBLP:conf/brainlesws/IwasawaHS20, lai2021semi, DBLP:journals/corr/abs201206985]. Among these works, only [DBLP:journals/corr/abs201206985] aims to promote inter-class separation and intra-class compactness. However, in [DBLP:journals/corr/abs201206985], pseudo labels are obtained from a model trained with labeled data only, whose performance is inferior to our first-stage model, in which pseudo labels are obtained from a model that takes advantage of consistency regularization on unlabeled data and feature regularization on labeled data. In [lai2021semi], inter-class separation is considered by taking pixels with different pseudo labels as negative pairs, but intra-class compactness is ignored, since each positive pair is built from the same pixel in different crops, which is essentially an extension of instance discrimination to the segmentation task. To the best of our knowledge, ours is the first study with pixel-level feature regularization aiming at intra-class compactness and inter-class separation for semi-supervised medical image segmentation.
II-C Uncertainty Estimation in Semi-supervised Medical Image Segmentation
Uncertainty generally falls into two categories: epistemic and aleatoric. Epistemic uncertainty concerns a model's parameters and is caused by a lack of data, while aleatoric uncertainty is caused by intrinsic ambiguities or randomness in the input data and cannot be reduced by introducing more data. Early methods measured uncertainty using particle filtering and CRFs [DBLP:journals/ijcv/BlakeCZ93, DBLP:conf/cvpr/HeZC04]. More recently, in Bayesian networks, epistemic uncertainty is usually estimated with Monte Carlo Dropout
[DBLP:conf/nips/KendallG17], which has been extended to the semi-supervised medical image segmentation task [DBLP:conf/miccai/YuWLFH19, DBLP:journals/tmi/CaoCLPWC21, DBLP:conf/miccai/WangZTZSZH20]. Aleatoric uncertainty is estimated either without considering correlations between pixels [DBLP:conf/nips/KendallG17] or with a limited ability to model spatial correlation, since it is captured by uncorrelated latent variables from a multivariate normal distribution [DBLP:conf/nips/KohlRMFLMERR18, DBLP:conf/miccai/BaumgartnerTCHM19]. Monteiro et al. [DBLP:conf/nips/MonteiroFCPMKWG20] proposed an aleatoric uncertainty estimation technique in which correlations between pixels are considered. Despite the ubiquity of noise and ambiguity in medical images, aleatoric uncertainty has so far been overlooked in semi-supervised medical image segmentation. In this work, we propose an aleatoric uncertainty adaptive consistency regularization technique in which correlations between pixels are considered when measuring aleatoric uncertainty.

III Method
Figure 2 visualizes the overview of the proposed two-stage framework. The input image is first fed into the AUA module to obtain a segmentation model that generates high-quality pseudo labels. We then introduce the stage-adaptive contrastive learning method, consisting of a boundary-aware contrastive loss (BCL) on labeled data only in the first stage and a prototype-aware contrastive loss (PCL) on all data in the second stage. By training sequentially through the first and second stages, we generate the final segmentation results.
III-A Aleatoric Uncertainty Adaptive Consistency Regularization (AUA)
Under semi-supervised settings where limited labeled data are available, a medical image segmentation model can make unreliable predictions on unlabeled data, so it is desirable for the segmentation model to be aware of its chance of making mistakes. Aleatoric uncertainty, the kind of uncertainty arising from ambiguities in the input data that hinder segmentation performance, can serve as an indicator of when a model may not perform well. While image ambiguity is a significant issue for medical image segmentation, this effect has not been taken into account in previous semi-supervised medical segmentation studies. This work proposes a new consistency regularization technique in which aleatoric uncertainty guides how much the student model should learn from the teacher model.
We first introduce how to estimate aleatoric uncertainty in a stochastic segmentation network. Here, we consider a $C$-class segmentation task on 3D volumes of size $H \times W \times D$, where $H$, $W$ and $D$ denote the height, width and depth, respectively. Given an image $x$ and its ground truth $y$ of the same size, the loss function of a general segmentation network is designed to minimize the negative log-likelihood, formulated as:

$$\mathcal{L}_{nll} = -\log p(y \mid x) = -\log \int p(y \mid \eta)\, p(\eta \mid x)\, d\eta, \qquad (1)$$

where $\eta$ denotes the logits.

In a deterministic segmentation network, i.e., assuming $p(\eta \mid x) = \delta\big(\eta - f_{\theta}(x)\big)$ and independence of each pixel's prediction from the others, where $f_{\theta}$ is a neural network parameterized by $\theta$ and $\delta$ denotes the Dirac delta function, the loss function in Eq. 1 can be rewritten as:

$$\mathcal{L}_{nll} = -\sum_{i=1}^{N} \log p\big(y_i \mid \eta_i\big), \qquad \eta = f_{\theta}(x). \qquad (2)$$

For simplicity, we use a one-dimensional scalar $i$ to index each pixel out of the whole set of $N$ pixels in a 3D volume. The above equation is the cross-entropy function commonly used in segmentation models.
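As a concrete reference for Eq. 2, the per-pixel cross-entropy can be sketched in NumPy (a minimal illustration only; the flattened `(num_pixels, num_classes)` shape and the mean reduction are our assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Eq. 2 style loss: negative log-likelihood of the true class
    at each pixel, assuming independent per-pixel predictions.
    logits: (num_pixels, num_classes); labels: (num_pixels,)."""
    probs = softmax(logits)
    n = logits.shape[0]
    return float(-np.log(probs[np.arange(n), labels] + 1e-12).mean())
```

With uniform logits over 3 classes, the loss equals log 3, as expected for an uninformative prediction.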
In a stochastic segmentation network, to represent the inter-dependence among pixels and the inherent ambiguities of the input data, we follow [DBLP:conf/nips/MonteiroFCPMKWG20]
and assume a multivariate normal distribution over the logits,
i.e., $p(\eta \mid x) = \mathcal{N}\big(\mu(x), \Sigma(x)\big)$, with mean $\mu(x)$ and covariance $\Sigma(x)$ predicted by the network. Monte Carlo integration with $M$ samples is applied to approximate the intractable integral operation, leading us from Eq. 1 to:

$$\mathcal{L}_{nll} \approx -\log \frac{1}{M} \sum_{m=1}^{M} p\big(y \mid \eta^{(m)}\big), \qquad \eta^{(m)} \sim \mathcal{N}\big(\mu(x), \Sigma(x)\big), \qquad (3)$$

$$= -\operatorname{logsumexp}_{m=1}^{M}\Big(\log p\big(y \mid \eta^{(m)}\big)\Big) + \log M. \qquad (4)$$

The logsumexp operation is for numerical stability, and the calculation of $\log p(y \mid \eta^{(m)})$ follows Eq. 2, where $\eta^{(m)}$ is one sample out of the $M$ drawn. As pointed out by [DBLP:conf/nips/MonteiroFCPMKWG20], the full-rank covariance matrix is computationally infeasible, so we also adopt a low-rank approximation defined as:

$$\Sigma(x) = P(x)\, P(x)^{\top} + D(x), \qquad (5)$$

where $P(x)$ denotes the factor part of the low-rank form of the covariance matrix and $D(x)$ denotes the diagonal part. In this way, the computational complexity can be reduced from quadratic to linear in the number of pixels, with $P(x) \in \mathbb{R}^{NC \times R}$ and $D(x)$ diagonal, where $R$ denotes the rank.
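Under the low-rank parameterization of Eq. 5, logit samples can be drawn without ever materializing the full covariance matrix, via the reparameterization $\eta = \mu + Pz + \sqrt{d}\,\epsilon$ with standard normal $z$ and $\epsilon$. A minimal NumPy sketch (function name and shapes are ours, not the paper's implementation):

```python
import numpy as np

def sample_logits(mean, P, d, num_samples, rng):
    """Draw samples from N(mean, P P^T + diag(d)) using the
    low-rank reparameterization: eta = mean + P z + sqrt(d) * eps.
    mean: (n,), P: (n, r) low-rank factor, d: (n,) diagonal part.
    Cost is O(n * r) per sample instead of O(n^2)."""
    n, r = P.shape
    z = rng.standard_normal((num_samples, r))    # low-rank noise
    eps = rng.standard_normal((num_samples, n))  # per-pixel noise
    return mean + z @ P.T + eps * np.sqrt(d)
```

The empirical covariance of many samples matches $PP^{\top} + \mathrm{diag}(d)$, which is an easy sanity check for the reparameterization.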
Given an unlabeled image $x_u$, the distribution predicted by the student model parameterized by $\theta_s$ is denoted as $p_{\theta_s}(\eta \mid x_u)$. Similarly, we can obtain the teacher model's prediction $p_{\theta_t}(\eta \mid \tilde{x}_u)$ over a perturbed version $\tilde{x}_u$ of the same input, obtained by Gaussian noise injection, where the parameters $\theta_t$ of the teacher model are updated as an exponential moving average of the parameters of the student model. The consistency between the teacher model's predictions and the student model's predictions on unlabeled data is encouraged by minimizing the generalised energy distance [DBLP:conf/nips/KohlRMFLMERR18, szekely2013energy], which is defined as:
$$D_{GED}^{2}\big(p_{\theta_s}, p_{\theta_t}\big) = 2\,\mathbb{E}\big[d(s, t)\big] - \mathbb{E}\big[d(s, s')\big] - \mathbb{E}\big[d(t, t')\big], \qquad (6)$$

where $s, s' \sim p_{\theta_s}(\eta \mid x_u)$ and $t, t' \sim p_{\theta_t}(\eta \mid \tilde{x}_u)$ are independent segmentation samples and $d(\cdot,\cdot)$ is a distance between segmentations.
To approximate the intractable expectation operation in Eq. 6, we take $M$ samples out of $p_{\theta_s}$ and $p_{\theta_t}$, respectively. The consistency regularization loss function can be reformulated as:

$$\mathcal{L}_{con} = \frac{2}{M^2} \sum_{m=1}^{M} \sum_{n=1}^{M} d\big(s_m, t_n\big) - \frac{1}{M^2} \sum_{m=1}^{M} \sum_{n=1}^{M} d\big(s_m, s_n\big) - \frac{1}{M^2} \sum_{m=1}^{M} \sum_{n=1}^{M} d\big(t_m, t_n\big). \qquad (7)$$
In Eq. 7, $d(\cdot,\cdot)$ is defined as the Generalized Dice loss [sudre2017generalised]:

$$d(s, t) = 1 - \frac{2 \sum_{c=1}^{C} \sum_{i=1}^{N} s_{i,c}\, t_{i,c}}{\sum_{c=1}^{C} \sum_{i=1}^{N} \big(s_{i,c} + t_{i,c}\big)}, \qquad (8)$$

where $i$ indexes each pixel out of the whole set of $N$ pixels in a 3D volume and $c$ indexes each class out of a total of $C$ classes.
The optimum of Eq. 7 is 0, which means that, at the optimum, the first term equals the sum of the last two. This consistency regularization is adaptive to aleatoric uncertainty in the following sense: if the diversity among the samples of the student (or teacher) model is high, i.e., the values of the last two terms of Eq. 7 are large, indicating high aleatoric uncertainty, then the pairwise similarity between samples from the student and teacher models, captured by the first term of Eq. 7, is less strictly constrained. On the contrary, for input data on which a low diversity is estimated, implying that the aleatoric uncertainty is low and the model is more likely to generalize well, the student model automatically learns more from the teacher model by optimizing the first term to a smaller value.
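The pairwise structure of Eq. 7, combined with a Dice-style distance in the spirit of Eq. 8, can be sketched as follows (a simplified illustration: the intra-model averages keep the zero-distance diagonal pairs for brevity, which slightly biases those terms):

```python
import numpy as np

def dice_distance(p, q, eps=1e-6):
    """1 - Dice overlap between two soft segmentations of shape
    (num_pixels, num_classes); a simplified stand-in for Eq. 8."""
    inter = (p * q).sum(axis=0)
    denom = (p + q).sum(axis=0)
    return float(1.0 - ((2 * inter + eps) / (denom + eps)).mean())

def ged_consistency(student_samples, teacher_samples):
    """Eq. 7 style estimator: 2 E[d(s,t)] - E[d(s,s')] - E[d(t,t')],
    averaged over all pairs of Monte Carlo samples. High intra-model
    diversity (large last two terms) loosens the cross-model term."""
    def mean_pairwise(A, B):
        return np.mean([dice_distance(a, b) for a in A for b in B])
    return (2 * mean_pairwise(student_samples, teacher_samples)
            - mean_pairwise(student_samples, student_samples)
            - mean_pairwise(teacher_samples, teacher_samples))
```

When the student and teacher produce identical sample sets, the estimator is exactly zero, matching the stated optimum of Eq. 7.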
To summarize, the AUA loss is defined as follows:

$$\mathcal{L}_{AUA} = \mathcal{L}_{nll} + \lambda_{c}\, \mathcal{L}_{con}, \qquad (9)$$

where $\lambda_{c}$ is the scaling weight balancing the uncertainty estimation loss and the generalised energy distance loss.
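The teacher update used throughout Sec. III-A, an exponential moving average (EMA) of the student's parameters, amounts to one line per parameter. A minimal sketch (the decay value 0.99 is a typical choice in mean-teacher methods, not a value stated in this text):

```python
def ema_update(teacher, student, alpha=0.99):
    """One EMA step: each teacher parameter moves a small step
    toward the corresponding student parameter. `alpha` is the
    decay; larger alpha means a slower-moving, smoother teacher."""
    return [alpha * t + (1 - alpha) * s for t, s in zip(teacher, student)]
```

Applied after every training iteration, this makes the teacher a temporal ensemble of recent student weights.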
III-B Stage-adaptive Feature Regularization
We introduce a stage-adaptive feature learning method, consisting of a boundary-aware contrastive loss and a prototype-aware contrastive loss, to enhance representation learning with only labeled images and with both labeled and pseudo-labeled images, respectively. A natural solution is a contrastive loss with features of pixels belonging to the same class (i.e., both foreground or both background) as positive pairs and features of pixels belonging to different classes (i.e., one foreground, the other background) as negative pairs. This strategy allows pixel-wise regularization but consumes memory quadratic in the number of pixels, so we propose a stage-adaptive contrastive learning method that handles these concerns. To reduce the computational cost, in the first stage we only optimize the feature representations of pixels around the segmentation boundaries, using a boundary-aware contrastive loss (BCL). In the second stage, with more accurate pseudo labels on unlabeled data, we introduce a prototype-aware contrastive loss (PCL) to fully leverage both labeled and pseudo-labeled images for representation learning.
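The pixel-pair contrastive idea described above, same-class pixels as positives and different-class pixels as negatives, can be sketched for a single anchor pixel as a supervised InfoNCE term (a simplified illustration using the temperature of 0.07 adopted later in the paper; function and variable names are ours):

```python
import numpy as np

def supcon_loss(features, labels, idx, tau=0.07):
    """Supervised contrastive term for one anchor pixel `idx`.
    features: (n, dim) L2-normalized embeddings of sampled pixels;
    labels: their (pseudo) classes. Assumes the anchor has at least
    one positive among the other sampled pixels."""
    anchor = features[idx]
    others = [j for j in range(len(features)) if j != idx]
    sims = np.array([features[j] @ anchor / tau for j in others])
    log_denom = np.log(np.exp(sims).sum())          # over all others
    pos = [k for k, j in enumerate(others) if labels[j] == labels[idx]]
    return float(-np.mean(sims[pos] - log_denom))   # mean over positives
```

The loss is near zero when the anchor is already aligned with its same-class pixels and far from the rest, and large when a different-class pixel sits closest to the anchor.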
III-B1 Boundary-aware Contrastive Learning
As a balance between the benefits of pixel-wise feature-level regularization and its computational costs, we build positive and negative pairs from a random subset of near-boundary pixels, arriving at the boundary-aware contrastive loss, formally defined as:

$$\mathcal{L}_{BCL} = \sum_{i \in \mathcal{B}} \frac{-1}{|\mathcal{P}(i)|} \sum_{j \in \mathcal{P}(i)} \log \frac{\exp\big(v_i \cdot v_j / \tau\big)}{\sum_{k \in \mathcal{A}(i)} \exp\big(v_i \cdot v_k / \tau\big)}, \qquad (10)$$

where $\mathcal{B}$ contains the indexes of randomly sampled near-boundary pixels from an input image, $\mathcal{A}(i)$ contains the indexes of all the other sampled pixels except pixel $i$, and $\mathcal{P}(i)$ contains the indexes of pixels in $\mathcal{A}(i)$ belonging to the same class as pixel $i$. The feature vectors $v_i$, $v_j$ and $v_k$ are obtained from a 3-layer convolutional projection head, which is attached after the penultimate layer. The temperature $\tau$ is set to 0.07 following [DBLP:conf/nips/KhoslaTWSTIMLK20].

III-B2 Prototype-aware Contrastive Learning
In the second stage, the way to regularize an indiscriminative feature space, as in Figure 1(a), is to encourage each pixel's feature to be closer to the other pixels that share the same label and further away from those of the opposite class, so that the feature space in Figure 1(b) is formed. This is defined as:

$$\mathcal{L} = \sum_{i \in \mathcal{I}} \frac{-1}{|\mathcal{P}(i)|} \sum_{j \in \mathcal{P}(i)} \log \frac{\exp\big(v_i \cdot v_j / \tau\big)}{\exp\big(v_i \cdot v_j / \tau\big) + \sum_{k \in \mathcal{N}(i)} \exp\big(v_i \cdot v_k / \tau\big)}, \qquad (11)$$

where $\mathcal{I}$ contains the indexes of all pixels, and $\mathcal{P}(i)$ and $\mathcal{N}(i)$ contain the indexes of positive pixels, i.e., those sharing the same class as pixel $i$, and negative pixels, i.e., those with labels different from pixel $i$, respectively. The features extracted from the second-stage model are denoted as $v_i$, where $i$ can be the index of any pixel.

In [DBLP:journals/corr/abs210505013], by assuming a Gaussian distribution for the features belonging to each class, the computational cost of Eq. 11 can be reduced from quadratic to linear, leading to a regularization formulated as:
$$\mathcal{L}_{PCL} = -\sum_{i \in \mathcal{I}} \log \frac{\exp\big(v_i \cdot \mu_i^{+} / \tau + v_i^{\top} \Sigma_i^{+} v_i / 2\tau^{2}\big)}{\exp\big(v_i \cdot \mu_i^{+} / \tau + v_i^{\top} \Sigma_i^{+} v_i / 2\tau^{2}\big) + \exp\big(v_i \cdot \mu_i^{-} / \tau + v_i^{\top} \Sigma_i^{-} v_i / 2\tau^{2}\big)}, \qquad (12)$$

where $\mu_i^{+}$ and $\Sigma_i^{+}$ are the mean and covariance matrix of the positive class for pixel $i$ and, similarly, $\mu_i^{-}$ and $\Sigma_i^{-}$ are the mean and covariance matrix of the negative class corresponding to pixel $i$. These prototype statistics for each class are estimated from the first-stage model with a moving-average update of the extracted features, with each update at step $t$ formulated as:

$$\mu_c^{(t)} = \frac{n_c^{(t-1)}\, \mu_c^{(t-1)} + m_c^{(t)}\, \bar{v}_c^{(t)}}{n_c^{(t-1)} + m_c^{(t)}}, \qquad \Sigma_c^{(t)} = \frac{n_c^{(t-1)}\, \Sigma_c^{(t-1)} + m_c^{(t)}\, \bar{\Sigma}_c^{(t)}}{n_c^{(t-1)} + m_c^{(t)}}, \qquad (13)$$

where $n_c^{(t-1)}$ denotes the total number of pixels belonging to class $c$ seen before time step $t$, and $m_c^{(t)}$ denotes the number of pixels of class $c$ in the image loaded at time step $t$. $\bar{v}_c^{(t)}$ and $\bar{\Sigma}_c^{(t)}$ denote the mean and covariance, respectively, of the features belonging to class $c$ in the image at step $t$. We refer readers to [DBLP:journals/corr/abs210505013] for the detailed derivation. The final prototypes are estimated after 3000 iterations, and the temperature is set to 100 following [DBLP:journals/corr/abs210505013].
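The running prototype update of Eq. 13 reduces to a pixel-count weighted average of the statistics seen so far and the statistics of the current batch. A sketch for the class mean (the covariance is updated analogously; the function name is ours):

```python
import numpy as np

def update_prototype(mean, count, batch_mean, batch_count):
    """Moving-average prototype update in the spirit of Eq. 13:
    the stored class mean is a pixel-count weighted average of all
    features of that class seen so far. Returns the new mean and
    the updated running pixel count."""
    total = count + batch_count
    new_mean = (count * np.asarray(mean)
                + batch_count * np.asarray(batch_mean)) / total
    return new_mean, total
```

Because the update is exact for the mean, applying it sequentially over batches yields the same prototype as averaging all pixels of the class at once.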
III-C Stage-wise Training as a Unified Framework
To summarize, in the first stage, the loss function is defined as:

$$\mathcal{L}_{stage1} = \mathcal{L}_{AUA} + \lambda_{b}\, \mathcal{L}_{BCL}, \qquad (14)$$

where $\lambda_{b}$ is the scaling weight for the BCL loss. In this way, pseudo labels of higher quality can be obtained on the unlabeled data thanks to joint prediction regularization (with AUA) and feature regularization (with BCL), which enables retraining a stronger segmentation model in the second stage by regularizing both predictions and features over the whole dataset in a label-aware manner. The loss function in the second stage is as follows:

$$\mathcal{L}_{stage2} = \mathcal{L}_{seg} + \lambda_{p}\, \mathcal{L}_{PCL}, \qquad (15)$$

where $\mathcal{L}_{seg}$, defined as the average of the cross-entropy loss and the Dice loss as is common practice in segmentation, provides the pseudo-label supervision, and $\lambda_{p}$ is the weight for the PCL loss.
Table I: Comparison on the Pancreas CT dataset (20% labeled).

Method | Labeled scans | Unlabeled scans | Dice (%) | Jaccard (%) | ASD [voxel] | 95HD [voxel]
V-Net | 12 | 0 | 70.63 | 56.72 | 6.29 | 22.54
V-Net | 62 | 0 | 81.78 | 69.65 | 1.34 | 5.13
MT [tarvainen2017mean] | 12 | 50 | 75.85 | 61.98 | 3.40 | 12.59
DAN [DBLP:conf/miccai/ZhangYCFHC17] | 12 | 50 | 76.74 | 63.29 | 2.97 | 11.13
Entropy Mini [DBLP:conf/cvpr/VuJBCP19] | 12 | 50 | 75.31 | 61.73 | 3.88 | 11.72
UA-MT [DBLP:conf/miccai/YuWLFH19] | 12 | 50 | 77.26 | 63.82 | 3.06 | 11.90
CCT [DBLP:conf/cvpr/OualiHT20] | 12 | 50 | 76.58 | 62.76 | 3.69 | 12.92
SASSNet [DBLP:conf/miccai/LiZH20] | 12 | 50 | 77.66 | 64.08 | 3.05 | 10.93
DTC [luo2021semi] | 12 | 50 | 78.27 | 64.75 | 2.25 | 8.36
Ours | 12 | 48 | 79.81 | 66.82 | 1.64 | 5.90
Table II: Comparison on the Pancreas CT dataset (5% labeled).

Method | Labeled scans | Unlabeled scans | Dice (%) | Jaccard (%) | ASD [voxel] | 95HD [voxel]
V-Net | 3 | 0 | 30.74 | 18.84 | 6.97 | 26.45
V-Net | 60 | 0 | 81.46 | 69.18 | 1.31 | 5.09
MT [tarvainen2017mean] | 3 | 57 | 31.09 | 18.77 | 28.14 | 59.22
UA-MT [DBLP:conf/miccai/YuWLFH19] | 3 | 57 | 34.46 | 21.24 | 25.73 | 57.40
DTC [luo2021semi] | 3 | 57 | 48.47 | 32.71 | 17.03 | 42.61
SASSNet [DBLP:conf/miccai/LiZH20] | 3 | 57 | 51.96 | 36.03 | 16.08 | 45.36
Ours | 3 | 57 | 56.18 | 40.05 | 12.47 | 34.85
Table III: Comparison on the Colon cancer dataset.

Method | Labeled scans | Unlabeled scans | Dice (%) | Jaccard (%) | ASD [voxel] | 95HD [voxel]
V-Net | 5 | 0 | 34.07 | 23.09 | 10.12 | 26.52
V-Net | 100 | 0 | 62.31 | 49.47 | 2.14 | 13.49
MT [tarvainen2017mean] | 5 | 95 | 38.64 | 26.38 | 14.41 | 33.08
UA-MT [DBLP:conf/miccai/YuWLFH19] | 5 | 95 | 40.61 | 28.01 | 15.31 | 34.92
SASSNet [DBLP:conf/miccai/LiZH20] | 5 | 95 | 41.64 | 30.07 | 11.93 | 28.96
DTC [luo2021semi] | 5 | 95 | 43.29 | 29.84 | 10.62 | 26.22
Ours | 5 | 95 | 49.00 | 35.15 | 9.04 | 22.32
IV Experimental Results
IV-A Datasets and Preprocessing
Pancreas CT dataset. The Pancreas CT dataset [DBLP:conf/miccai/RothLFSLTS15] is a public dataset containing 80 scans with a resolution of 512×512 pixels and slice thickness between 1.5 and 2.5 mm. Each scan has a corresponding pixel-wise label, annotated by an expert and verified by a radiologist.
Colon cancer segmentation dataset. The Colon cancer dataset is a subset of the Medical Segmentation Decathlon (MSD) datasets [simpson2019large], consisting of 190 colon cancer CT volumes. Pixel-level label annotations are given for 126 CT volumes. Among these, we randomly split off 26 CT volumes as a test set and use the rest for training.
Preprocessing. For a fair comparison with other methods, we follow the preprocessing in [luo2021semi]: clipping CT images to a range of [125, 275] HU values, resampling images to 1×1×1 mm resolution, center-cropping both raw images and annotations around the foreground area with a margin of 25 voxels, and finally normalizing raw images to zero mean and unit variance. On the Pancreas dataset, we apply random crop as an on-the-fly augmentation, and the Colon dataset is augmented with random rotation, random flip, and random crop. On both datasets, sub-volumes are randomly cropped from the raw data and fed to the segmentation model for training.

IV-B Implementation Details
Environment.
All experiments in this work are implemented in PyTorch 1.6.0 and conducted with Python 3.7.4 on an NVIDIA TITAN RTX GPU.
Backbone. V-Net [DBLP:conf/3dim/MilletariNA16] is used as our backbone, with the last convolutional layer replaced by a 3D 1×1×1 convolutional layer. On top of that, a projection module and an aleatoric uncertainty module are built for feature regularization and aleatoric uncertainty estimation, respectively. Similar to [DBLP:journals/corr/abs201206985], the projection head consists of 3 convolutional layers, each followed by a ReLU activation and batch normalization, except for the last layer, which is followed by a unit-normalization layer. The channel size of each convolutional layer is set to 16. The aleatoric uncertainty module comprises three 1-layer branches predicting the means, covariance factors, and covariance diagonals, respectively.
Training details. Our model is trained with an SGD optimizer with 0.9 momentum and 0.0001 weight decay for 6000 iterations. A step-decay learning-rate schedule is applied, where the initial learning rate is set to 0.01 and multiplied by 0.1 every 2500 iterations. For each iteration, a training batch containing two labeled and two unlabeled sub-volumes is fed to the proposed model, with each sub-volume randomly cropped to a size of 96×96×96. On the test set, predictions on sub-volumes of the same size, obtained with a sliding-window strategy with a stride of 16×16×16, are fused to produce the final results.

Evaluation metrics. We use the Dice score (DI), Jaccard index (JA), average surface distance (ASD), and 95% Hausdorff distance (95HD) to evaluate the effectiveness of our semi-supervised segmentation method.
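For reference, the two overlap metrics above can be computed from binary masks as follows (a standard formulation; ASD and 95HD additionally require surface-distance computations and are omitted here, and non-empty masks are assumed):

```python
import numpy as np

def dice_jaccard(pred, gt):
    """Dice and Jaccard overlap (in %) between two binary masks.
    Assumes at least one of the masks is non-empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum())
    return 100.0 * dice, 100.0 * inter / union
```

Dice is always at least as large as Jaccard; the two agree only at 0 and at perfect overlap.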
IV-C Results on Pancreas Dataset
Since the predictions on unlabeled data may be inaccurate in the early stage of training, we follow common practice [DBLP:conf/miccai/YuWLFH19, luo2021semi] and use a Gaussian ramp-up function to control the strength of the consistency regularization, where $t$ denotes the current time step and $t_{max}$ denotes the maximal training step, i.e., 6000 as introduced previously. The constant used to scale BCL, i.e., $\lambda_b$, is set to 0.09 given 20% labeled data and 0.01 given 5% labeled data. In the second stage of training, the PCL weight $\lambda_p$ is set to 0.1.
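The extracted text does not show the ramp-up formula; a common Gaussian ramp-up used in this line of work [DBLP:conf/miccai/YuWLFH19] is $\exp(-5(1 - t/t_{max})^2)$, sketched below (the constant 5 is an assumption, not a value stated here):

```python
import numpy as np

def gaussian_rampup(t, t_max):
    """Gaussian ramp-up weight in (0, 1]: small early in training,
    monotonically approaching 1 as t reaches t_max, so consistency
    regularization only kicks in once predictions stabilize."""
    t = min(max(t, 0), t_max)  # clamp to the training horizon
    return float(np.exp(-5.0 * (1.0 - t / t_max) ** 2))
```

The resulting weight would typically multiply $\lambda_c$ in Eq. 9 at each iteration.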
Table I shows the results on the Pancreas dataset. Previous methods are benchmarked on the first version of the Pancreas dataset with 12 labeled volumes and 50 unlabeled volumes, in which, however, two duplicates of scan 2 are found. Since, after a random split, some of these three identical samples could fall in the training set and the rest in the test set, we use version 2, where the two duplicates are removed, leaving the same number of labeled volumes but two fewer, i.e., 48, unlabeled volumes. Even under this stricter setting, our proposed model achieves the best performance among existing works.
The first row, i.e., a fully supervised baseline on the partial dataset, shows the lower bound of the semi-supervised segmentation methods, whereas the second row, i.e., a fully supervised model on the fully labeled dataset, shows the upper-bound performance. We can observe that our method achieves 79.81% Dice, surpassing the current state of the art by 1.54%. Notably, our method is very close to the fully supervised model that uses human annotations for all volumes, showing the effectiveness of the proposed semi-supervised method.
To further validate our method under a more challenging scenario, we reduce the amount of labeled data to only 5% and use the remaining 95% as unlabeled. As shown in Table II, in such a small-data regime every semi-supervised learning method drops in performance compared with its counterpart in Table I, where 20% of the labels are available, as expected. Our method consistently outperforms the others: it surpasses all competing semi-supervised methods and outperforms the current state-of-the-art by 4.22% on Dice, demonstrating that the advantage of our method is even more pronounced in this more challenging setting.
Table IV. Ablation study on the Pancreas dataset.

Method | Dice | Jaccard | ASD [voxel] | 95HD [voxel]
Supervised baseline | 70.63 | 56.72 | 6.29 | 22.54
AUA | 76.13 | 62.19 | 2.25 | 9.35
AUA + BCL (first stage) | 77.15 | 63.34 | 2.04 | 7.00
AUA + BCL + pseudo labeling | 79.08 | 65.91 | 1.91 | 6.69
AUA + BCL + pseudo labeling + PCL (full) | 79.81 | 66.82 | 1.64 | 5.90
Table V. Ablation study on the Colon dataset.

Method | Dice | Jaccard | ASD [voxel] | 95HD [voxel]
Supervised baseline | 34.07 | 23.09 | 10.12 | 26.52
AUA | 42.74 | 30.20 | 15.00 | 35.43
AUA + BCL (first stage) | 43.70 | 30.92 | 14.74 | 33.34
AUA + BCL + pseudo labeling | 46.75 | 33.62 | 12.39 | 28.49
AUA + BCL + pseudo labeling + PCL (full) | 49.00 | 35.15 | 9.04 | 22.32
IV-D Results on Colon Dataset
Table III shows the results on the colon dataset. The scaling weight of BCL is set to 0.03, and in the second stage of training the PCL weight is set to 0.1. We compare our method with several state-of-the-art methods, including MT [tarvainen2017mean], UA-MT [DBLP:conf/miccai/YuWLFH19], SASSNet [DBLP:conf/miccai/LiZH20] and DTC [luo2021semi], using the same portion of the data as labeled and the rest as unlabeled. Again, we tune the hyperparameters of previous methods so that they reach their best performance on this dataset. Comparing the second row of Table III with Table II, we notice that under the fully supervised setting with the full dataset, performance on the Colon dataset is lower than on the Pancreas dataset, indicating that the Colon dataset is more challenging. Comparing the semi-supervised segmentation methods with the fully supervised setting on the partial dataset, i.e., the first row of Table III, we observe stronger performance, showing that leveraging unlabeled data improves segmentation, as expected. Our method achieves superior performance compared with all previous works by a large margin, which indicates that our method makes better use of the unlabeled data.
IV-E Ablation Studies
Here we ablate each component of our proposed framework on the Pancreas dataset with 20% labeled data (Table IV) and on the Colon dataset (Table V). We gradually add each proposed component and report performance in terms of the four metrics.
On both datasets, as shown in the second row of Tables IV and V, applying AUA to measure the consistency between the teacher and student models achieves superior performance over the fully supervised model, i.e., V-Net. On top of AUA, BCL boosts performance further, showing that feature representation learning provides regularization complementary to consistency regularization. With more accurate pseudo labels, simply applying pseudo labeling yields a further improvement, as shown in the fourth row of Tables IV and V. Training with PCL on top of pseudo labeling improves the second-stage performance by 0.73% and 2.25% on the Pancreas and Colon datasets, respectively, implying that, given the inevitable noise in pseudo labels, feature regularization in a pseudo-label-aware manner brings more benefit than the negative effects of noisy training. Most importantly, the performance gains from BCL and PCL demonstrate the effectiveness of the proposed stage-adaptive feature regularization, confirming our motivating intuition.
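A generic prototype-based contrastive loss of the family PCL belongs to can be sketched as follows. This is an InfoNCE-style illustration, assuming class prototypes are the mean features of each (pseudo-)labeled class and a temperature of 0.1; it is not the authors' exact formulation:

```python
import numpy as np

def prototype_contrastive_loss(features, labels, temperature=0.1):
    """InfoNCE-style loss pulling each feature toward its class prototype.

    features: (N, D) L2-normalized feature vectors.
    labels:   (N,) integer (pseudo-)labels.
    """
    classes = np.unique(labels)
    # Class prototypes: mean feature of each class, re-normalized.
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    logits = features @ protos.T / temperature      # (N, C) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy against each sample's own class prototype.
    idx = np.searchsorted(classes, labels)
    return -log_prob[np.arange(len(labels)), idx].mean()
```

When features of the same class cluster tightly around their prototype and away from other prototypes, the loss approaches zero; mixed clusters are penalized.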
IV-F Qualitative Comparison with State-of-the-art Methods
We visualize the segmentation predictions obtained from other state-of-the-art methods and from ours in Figure 3. True positive, false negative and false positive pixels are highlighted in red, green and blue, respectively. In the first case, shown in the first row, the prediction of our method overlaps more with the ground truth, while the predictions of other methods fail to recall many foreground pixels. In the second case, with a similar number of true positives, our method suffers fewer false positives and false negatives than the other methods.
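The color coding described above can be reproduced with a few lines of NumPy (a sketch, not the authors' plotting code):

```python
import numpy as np

def error_overlay(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Color-code a binary prediction against the ground truth mask.

    True positives -> red, false negatives -> green, false positives -> blue.
    Returns an (H, W, 3) uint8 RGB image; true negatives stay black.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)
    rgb[pred & gt] = (255, 0, 0)    # true positive
    rgb[~pred & gt] = (0, 255, 0)   # false negative
    rgb[pred & ~gt] = (0, 0, 255)   # false positive
    return rgb
```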
V Conclusion
This paper presents a simple yet effective two-stage framework for semi-supervised medical image segmentation, with the key idea of exploiting the feature representations of both labeled and unlabeled images. The first stage generates high-quality pseudo labels, and the second stage refines the network with both labeled and pseudo-labeled images. To generate high-quality pseudo labels, we propose a novel consistency regularization technique, called AUA, that adaptively encourages consistency by considering the ambiguity of medical images. The proposed stage-adaptive contrastive learning method, comprising a boundary-aware contrastive loss and a prototype-aware contrastive loss, enhances representation learning with labeled images in the first stage and with both labeled and pseudo-labeled images in the second stage. Our method achieves the best results on two public medical image segmentation benchmarks, and the ablation study validates the effectiveness of each proposed component. Future work includes extending this approach to other types of medical data, such as X-ray images, fundus images and surgical videos.