Large fully-annotated datasets are crucial to the generalization ability of deep neural networks. However, the manual labeling of medical images requires great effort from experienced clinical experts, which is both expensive and time-consuming. To alleviate this, existing works have exploited weakly labeled and unlabeled training data to assist model training, such as semi-supervised learning (SSL) [NIPS2017_68053af2, souly2017semi, mittal2019semi] and weakly-supervised learning (WSL) [wei2016stc, pathak2015constrained, khoreva2017simple]. However, SSL generally requires part of the images in the dataset to be accurately and precisely annotated. As an alternative, we propose to investigate a specific form of WSL approaches, which only utilize scribble annotations for model training.
WSL is proposed to exploit weak annotations, such as image-level labels, sparse annotations, and noisy annotations [tajbakhsh2020embracing]. Among them, scribbles, as the images in Figure 1 (a) illustrate, are one of the most convenient forms of weak label and have great potential in medical image segmentation [Can2018LearningTS]. However, due to the lack of supervision, it is still arduous to learn the shape priors of objects, which makes the segmentation of the boundaries particularly difficult.
Existing scribble learning methods mainly fall into two groups. The first line of research leverages a priori assumptions to expand scribble annotations [tajbakhsh2020embracing], such as labeling pixels with similar gray values and similar positions as the same category [l2016inscribblesup, ji2019scribble]. However, the process of scribble expansion may generate noisy labels, which deteriorate the segmentation performance of trained models. The second group learns adversarial shape priors, but requires extra fully-annotated masks [9389796, Larrazabal2020PostDAEAP, zhang2020accl].
A line of augmentation strategies, well known as mixup, has been proposed to generate previously-unseen virtual examples [zhang2018mixup, devries2017cutout, yun2019cutmix, kimICML20, kim2021comixup]. However, these strategies were proposed for image classification, and they may change the shape priors of target objects, leading to unrealistic results in a segmentation task. When only scribble supervision is available, the segmentation performance using mixup augmentation could become even worse and unstable, due to the lack of precise annotations.
To address the above-mentioned challenges, we propose CycleMix to learn segmentation from scribbles. As illustrated in Figure 1, CycleMix maximizes the supervision of scribbles based on mix augmentation and random occlusion, and regularizes the training of models using consistency losses. Firstly, we surmise that a segmentation model should benefit from finer gradient flow via a larger proportion of annotated pixels. Therefore, we propose a two-step mix augmentation strategy to augment supervision, including image combination to increase scribbles and random occlusion to reduce scribbles. In addition, we develop a two-level consistency regularization, at both the global and local levels. The global consistency loss penalizes the inconsistent segmentation of the same image patch in two scenarios, i.e., in the original image and in the mixed image; while the local consistency loss minimizes the distance between a prediction and its largest connected component, exploiting the prior knowledge of anatomy that the target structures are interconnected.
The contributions of this paper are summarized as follows:
We propose a novel weakly-supervised segmentation framework for scribble supervision, i.e., CycleMix, by integrating mix augmentation of supervision and regularization of supervision from consistency, and introduce a new scribble annotated cardiac segmentation dataset of MSCMRseg.
To the best of our knowledge, the proposed CycleMix is the first framework to incorporate mixup strategies for augmentation of weakly-supervised segmentation, where one can achieve both increments and decrements of scribbles from the mixed training images.
We propose the consistency losses to regularize the limited supervision from scribbles by penalizing inconsistent segmentation results, at both the global and local levels, which can lead to profound improvement of model performance.
CycleMix has been evaluated on two open datasets, i.e., ACDC and MSCMRseg, and demonstrated promising performance by generating segmentation accuracy comparable to or even better than that of the fully-supervised approaches.
2 Related works
2.1 Learning from scribble supervision
Scribble refers to sparse annotations where masks are provided for a small fraction of pixels in images [tajbakhsh2020embracing]. Existing methods mostly used a selective pixel loss for annotated pixels. Some works [bai2018recurrent, l2016inscribblesup, ji2019scribble] attempted to expand scribbles or reconstruct the complete mask for model training. However, the pixel-relabeling process required iterative training, which is slow and prone to noisy labels. To avoid relabeling, several works utilized conditional random fields to refine the segmentation results in post-processing [chen2017deeplab, Can2018LearningTS] or as a trainable layer [zheng2015conditional, Tang2018OnRL]. However, these methods could not provide better supervision for model training. Other works [9389796, zhang2020accl] included a new module to evaluate the quality of segmentation masks, which encourages the predictions to be realistic. For example, Gabriele et al. proposed multi-scale attention gates in adversarial training, and Zhang et al. [zhang2020accl] used a PatchGAN discriminator [isola2017image] to leverage shape priors. However, these methods required an additional data source of complete masks.
2.2 Mixup augmentations
Data augmentation plays a vital role in preventing models from overfitting to the limited training data and enhancing the generalization ability of neural networks. Mixup augmentations refer to a line of strategies which combine two images and their corresponding labels [zhang2018mixup, devries2017cutout, yun2019cutmix, kimICML20, kim2021comixup]. Compared with conventional augmentation methods, e.g., rotation and flipping, mixup approaches can increase the scribble annotations of the augmented image through the mix operation. Zhang et al. [zhang2018mixup] introduced MixUp, which performed linear interpolation between two images and their labels. Manifold MixUp [verma2019manifold] extended the mixup operation from input images to hidden features. Cutout [devries2017cutout] randomly dropped out square regions of images, and CutMix [yun2019cutmix] replaced the dropped areas with patches from other images. Puzzle Mix [kimICML20] introduced a new mixup method based on saliency and local statistics. Co-mixup [kim2021comixup] extended the mixup between two images to multiple images, and encouraged the supermodular diversity of mixed images.
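To make the difference between interpolation-based and replacement-based mixing concrete, the following is a minimal NumPy sketch of MixUp-style interpolation and CutMix-style patch replacement on toy arrays. The function names, the fixed interpolation weight `lam` and the fixed rectangle are assumptions made for illustration; they are not taken from the cited implementations, which sample these quantities randomly.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two images and their (one-hot) labels."""
    mixed_x = lam * x1 + (1.0 - lam) * x2
    mixed_y = lam * y1 + (1.0 - lam) * y2
    return mixed_x, mixed_y

def cutmix(x1, x2, top, left, height, width):
    """Paste a rectangular patch of x2 into a copy of x1 (images only)."""
    mixed = x1.copy()
    mixed[top:top + height, left:left + width] = x2[top:top + height, left:left + width]
    return mixed

# Toy data: an all-zero image of class 0 and an all-one image of class 1.
x1, x2 = np.zeros((4, 4)), np.ones((4, 4))
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

xm, ym = mixup(x1, y1, x2, y2, lam=0.75)                # every pixel becomes 0.25
xc = cutmix(x1, x2, top=1, left=1, height=2, width=2)   # a 2x2 patch of ones
```

Interpolation changes every pixel intensity, whereas patch replacement keeps intensities intact but can cut through a target structure, which is the shape-prior concern raised for segmentation.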
In medical imaging, mixup augmentation has been applied to semi-supervised image segmentation [chaitanya2019semi] and object detection tasks [wang2020focalmix]. Chaitanya et al. [chaitanya2019semi] concluded that mixup could lead to an impressive performance gain on semi-supervised segmentation. Although the mixed images might not look realistic, the mixed soft labels can provide more information to facilitate the training of models [chaitanya2019semi, hinton2015distilling].
2.3 Consistency regularization
Consistency strategies take advantage of the fact that if the same image is perturbed, the segmentation results should remain consistent. Consistency regularization has been widely applied in image translation and semi-supervised learning. CycleGAN [zhu2017unpaired] leveraged forward-backward consistency to enhance the ability of image-to-image translation. In the semi-supervised setting, consistency is enforced over two augmented versions of input images to obtain stable predictions for unlabeled images [laine2016temporal, NIPS2017_68053af2, ouali2020semi]. In this work, we propose to utilize consistency at both the global and local levels to leverage the mix-invariant property and the interconnectedness of target structures.
The proposed CycleMix is composed of two new strategies, i.e., mix augmentation of scribble supervision and cycle consistency for regularization of supervision. The former aims to achieve the increments and decrements of scribbles by two-step mix-based image combination and random occlusion; the latter is designed to regularize the supervision in model training via a two-level consistency penalty. Figure 2 presents the framework of the neural network implementation of CycleMix.
3.1 Mix augmentation of scribble supervision
In this section, we extend the mixup strategy to a two-stage augmentation of scribble supervision. In the first stage, we increase the amount of scribbles by image combination, referred to as increments of scribbles, which mixes two images to maximize the saliency. In the second stage, we perform an operation of random occlusion, by replacing a certain area containing scribbles with background, which results in decrements of scribbles. Finally, the augmentation of supervision is achieved via a dedicated loss function on the generated mixup images.
3.1.1 Increments of scribbles
We surmise that a segmentation model will benefit from the finer gradient flow through larger proportions of annotated pixels that is obtained by increasing scribbles. Furthermore, we observe that the scribble-annotated area generally has high saliency. Therefore, we propose to maximize the scribble annotation of mixed images by maximizing the saliency of the mixed training images. Here, we adopt Puzzle Mix [kimICML20] to utilize saliency and local statistic features. Note that the proposed method is applicable to other mixup strategies, such as MixUp [zhang2018mixup], CutMix [yun2019cutmix] and Co-mixup [kim2021comixup]. Readers could refer to the supplementary material for a comparison study.
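We apply Puzzle Mix to both images and their corresponding scribble labels. Let two $n$-dimensional images with annotations be $(x_1, y_1)$, $(x_2, y_2)$. The mixed result transported from the two training data, denoted as $(x'_{12}, y'_{12})$, is computed by:
$$x'_{12} = \mathcal{M}(x_1, x_2), \quad y'_{12} = \mathcal{M}(y_1, y_2), \tag{1}$$
$$\mathcal{M}(a, b) = (1 - z) \odot \Pi_1^{\top} a + z \odot \Pi_2^{\top} b, \tag{2}$$
where $\mathcal{M}(a, b)$ is the mixup function on $a$ and $b$; $\Pi_1$ and $\Pi_2$ represent the transportation matrices of dimension $n \times n$; $z$ denotes a mask in $[0, 1]$ of dimension $n$; $\odot$ refers to the element-wise multiplication. The parameter set, $\{\Pi_1, \Pi_2, z\}$, is aimed to maximize the saliency of the mixed image, which is computed by,
$$\{\Pi_1, \Pi_2, z\} = \arg\max_{\Pi_1, \Pi_2, z} \sum_{i=1}^{n} \left[ (1 - z) \odot \Pi_1^{\top} s(x_1) + z \odot \Pi_2^{\top} s(x_2) \right]_i, \tag{3}$$
where $s(x)$ is the saliency of image $x$ and is computed by taking the norm of the gradient value. For this optimization, one could refer to [kimICML20] for more details.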
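As an illustration of the mixing function only (not of the saliency optimization), the sketch below applies hand-picked transport matrices and a binary mask to two flattened toy images. The name `mix`, the symbols `pi1`, `pi2`, `z`, the permutation `shift` and all numeric values are assumptions chosen for the example; Puzzle Mix itself obtains the transport and the mask by optimization.

```python
import numpy as np

def mix(a, b, pi1, pi2, z):
    """Compute (1 - z) * (pi1^T a) + z * (pi2^T b) for flattened n-vectors."""
    return (1.0 - z) * (pi1.T @ a) + z * (pi2.T @ b)

n = 4
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([10.0, 20.0, 30.0, 40.0])

identity = np.eye(n)                    # leaves the first input in place
shift = np.roll(np.eye(n), 1, axis=1)   # permutation that moves pixel i to i + 1
z = np.array([0.0, 0.0, 1.0, 1.0])      # take the last two pixels from the second input

x12 = mix(x1, x2, identity, shift, z)   # first input is x1
x21 = mix(x2, x1, identity, shift, z)   # same parameters, swapped inputs
```

Swapping the two inputs while keeping the same parameters yields a different mixed image, which is the asymmetry that motivates computing losses for both orderings.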
3.1.2 Decrements of scribbles
To further augment scribble supervision, we propose to randomly occlude a region containing scribbles from the mixed images, to generate more training images. This strategy results in decrements of scribbles in the mixed image, and has been proved to be effective in enhancing performance of object localization [yun2019cutmix].
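Let $(x''_{12}, y''_{12})$ be the pair of new training data generated from the mixed pair $(x'_{12}, y'_{12})$. We apply a randomly rotated rectangular area to occlude the image and turn the occluded scribbles into background,
$$x''_{12} = (1 - \mathbf{1}_b) \odot x'_{12}, \quad y''_{12} = (1 - \mathbf{1}_b) \odot y'_{12}, \tag{4}$$
where $\mathbf{1}_b$ is a binary rectangular mask of dimension $n$. In our experiment, we chose a rectangle with size of .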
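A minimal sketch of the occlusion step on a toy image and scribble map follows. It simplifies the description above by using an axis-aligned rectangle (the rotation is omitted), and the label encoding (`BACKGROUND = 0`, `UNLABELED = 255`), the function name `occlude` and the rectangle size are assumptions made for the example.

```python
import numpy as np

BACKGROUND = 0    # assumed class index of the background
UNLABELED = 255   # assumed marker for pixels without a scribble

def occlude(image, label, top, left, height, width):
    """Zero out a rectangle in the image and relabel it as background."""
    box = np.zeros(image.shape, dtype=bool)
    box[top:top + height, left:left + width] = True
    occluded_image = np.where(box, 0.0, image)
    occluded_label = np.where(box, BACKGROUND, label)
    return occluded_image, occluded_label, box

image = np.ones((6, 6))
label = np.full((6, 6), UNLABELED)
label[2, 1:5] = 1                       # a short horizontal scribble of class 1

# Sample the rectangle position at random (hypothetical 3x2 rectangle).
rng = np.random.default_rng(0)
height, width = 3, 2
top = int(rng.integers(0, image.shape[0] - height + 1))
left = int(rng.integers(0, image.shape[1] - width + 1))

occluded_image, occluded_label, box = occlude(image, label, top, left, height, width)
```

Every pixel inside the rectangle becomes an annotated background pixel, so the occlusion removes foreground scribbles while adding background supervision.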
3.1.3 Scribble supervision
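For scribble supervision, we apply the cross-entropy function solely on the annotated pixels, ignoring the unlabeled pixels whose ground truth labels are unknown. Hence, the loss for unmixed samples $x_1$ and $x_2$ is formulated as:
$$\mathcal{L}_{unmix} = \mathcal{L}_{pCE}(\hat{y}_1, y_1) + \mathcal{L}_{pCE}(\hat{y}_2, y_2), \tag{5}$$
where $\hat{y}$ denotes the predicted segmentation of the corresponding image $x$, and,
$$\mathcal{L}_{pCE}(\hat{y}, y) = -\sum_{i \in \Omega_L} \sum_{k \in K} y[i, k] \log \hat{y}[i, k], \tag{6}$$
where $K$ is the index set of labels, $y[i, k]$ indicates the $k$-th element of the label vector of the $i$-th pixel, $\hat{y}[i, k]$ equals the probability that the $i$-th pixel belongs to the $k$-th class, and $\Omega_L$ refers to the set of pixels with scribble annotation, to which the loss is applied.
Furthermore, since the operation of Puzzle Mix is not symmetric, namely $\mathcal{M}(x_1, x_2) \neq \mathcal{M}(x_2, x_1)$, we use a symmetrical loss, referred to as the mixed loss $\mathcal{L}_{mix}$, for the generated samples $x''_{12}$ and $x''_{21}$,
$$\mathcal{L}_{mix} = \mathcal{L}_{pCE}(\hat{y}''_{12}, y''_{12}) + \mathcal{L}_{pCE}(\hat{y}''_{21}, y''_{21}). \tag{7}$$
The loss for augmented scribble supervision is given by,
$$\mathcal{L}_{sup} = \lambda_u \mathcal{L}_{unmix} + \lambda_m \mathcal{L}_{mix}, \tag{8}$$
where $\lambda_u$, $\lambda_m$ are the balancing parameters.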
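The cross-entropy restricted to annotated pixels can be sketched in NumPy as follows. The marker `UNLABELED = -1`, the function name `partial_cross_entropy` and the choice to sum (rather than average) over annotated pixels are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

UNLABELED = -1  # assumed marker for pixels without a scribble annotation

def partial_cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy summed over annotated pixels only.

    probs:  (H, W, K) array of per-pixel class probabilities.
    labels: (H, W) integer array; UNLABELED marks pixels to ignore.
    """
    annotated = labels != UNLABELED
    if not annotated.any():
        return 0.0
    rows, cols = np.nonzero(annotated)
    # Probability assigned to the annotated class at each annotated pixel.
    picked = probs[rows, cols, labels[rows, cols]]
    return float(-np.log(picked + eps).sum())

probs = np.array([[[0.50, 0.50], [0.90, 0.10]],
                  [[0.25, 0.75], [0.60, 0.40]]])
labels = np.array([[0, UNLABELED],
                   [1, UNLABELED]])

loss = partial_cross_entropy(probs, labels)

# Changing the prediction at an unlabeled pixel leaves the loss untouched.
altered = probs.copy()
altered[0, 1] = [0.1, 0.9]
loss_altered = partial_cross_entropy(altered, labels)
```

Only the two annotated pixels contribute, so gradients would flow solely through scribble locations; this is why a larger proportion of annotated pixels provides more supervision.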
3.2 Regularization of supervision via cycle consistency
In this section, we introduce two regularization terms, i.e., the global consistency loss and the local consistency loss.
3.2.1 Global consistency
The objective of global consistency is to leverage the mix-invariant property, which requires the same image patch to behave consistently in two scenarios, i.e., the original image and the mixed image. Therefore, we propose the global consistency loss to penalize the inconsistent segmentation.
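For images $x_1$, $x_2$ and their mixed image $x'_{12}$, the corresponding segmentation is represented as $\hat{y}_1 = f(x_1)$, $\hat{y}_2 = f(x_2)$ and $\hat{y}'_{12} = f(x'_{12})$, where $f$ is the segmentor. Assume the parameters of the mixing function, i.e., $\Pi_1$, $\Pi_2$, and $z$ in $\mathcal{M}$ in Eq. (2), remain unchanged, one should have,
$$\mathcal{M}(f(x_1), f(x_2)) = f(\mathcal{M}(x_1, x_2)). \tag{9}$$
This is the global consistency requiring the mixed segmentation of images $x_1$ and $x_2$ to be consistent with the segmentation of the mixed image after the same mixing operation. Taking the random occlusion operation into account, with $\mathbf{1}_b$ the binary occlusion mask, we modify Eq. (9) as follows,
$$(1 - \mathbf{1}_b) \odot \mathcal{M}(f(x_1), f(x_2)) = f\big((1 - \mathbf{1}_b) \odot \mathcal{M}(x_1, x_2)\big). \tag{10}$$
We propose to use a symmetrical metric based on the negative cosine similarity between two segmentation results as the global consistency loss [chen2020simsiam, grill2020bootstrap],
$$\mathcal{L}_{global} = \mathcal{L}_{ncs}(u_{12}, v_{12}) + \mathcal{L}_{ncs}(u_{21}, v_{21}), \tag{11}$$
where $u_{12} = (1 - \mathbf{1}_b) \odot \mathcal{M}(f(x_1), f(x_2))$ and $v_{12} = f\big((1 - \mathbf{1}_b) \odot \mathcal{M}(x_1, x_2)\big)$ are respectively the mixed segmentation and the segmentation of the mixed image, and likewise for $u_{21}$ and $v_{21}$; $\mathcal{L}_{ncs}$ is the negative cosine similarity and is defined as,
$$\mathcal{L}_{ncs}(p, q) = -\frac{p \cdot q}{\|p\|_2 \, \|q\|_2}. \tag{12}$$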
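The negative cosine similarity and its use over both mixing orders can be sketched as below. The function names, the small `eps` stabilizer and the toy probability maps are assumptions for illustration; in practice the inputs would be the mixed segmentations and the segmentations of the mixed images produced by the network.

```python
import numpy as np

def negative_cosine_similarity(p, q, eps=1e-8):
    """Return -cos(p, q) for two arrays, flattened to vectors."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    return float(-(p @ q) / (np.linalg.norm(p) * np.linalg.norm(q) + eps))

def global_consistency_loss(mixed_seg_12, seg_of_mixed_12, mixed_seg_21, seg_of_mixed_21):
    """Sum the negative cosine similarity over both mixing orders."""
    return (negative_cosine_similarity(mixed_seg_12, seg_of_mixed_12)
            + negative_cosine_similarity(mixed_seg_21, seg_of_mixed_21))

# Toy foreground-probability maps standing in for network outputs.
a = np.array([[0.9, 0.1], [0.2, 0.8]])
b = np.array([[0.1, 0.9], [0.8, 0.2]])

agree = global_consistency_loss(a, a, b, b)      # both pairs identical
disagree = global_consistency_loss(a, b, b, a)   # both pairs mismatched
```

The loss reaches its minimum of about -2 when both pairs agree exactly and grows as the mixed segmentation and the segmentation of the mixed image diverge.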
3.2.2 Local consistency
For a target object, the mixup operation often causes disconnected structure in the mixed image. This phenomenon makes it particularly difficult for a segmentation model to learn the shape priors of target objects.
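Leveraging the fact that the target structure can be interconnected in many medical applications, we propose the local consistency to eliminate the discrete results. For unmixed images $x_1$ and $x_2$, with segmentations $\hat{y}_1 = f(x_1)$ and $\hat{y}_2 = f(x_2)$ produced by the segmentor $f$, the local consistency loss is formulated as:
$$\mathcal{L}_{local} = \mathcal{L}_{ncs}(\hat{y}_1, \mathcal{C}(\hat{y}_1)) + \mathcal{L}_{ncs}(\hat{y}_2, \mathcal{C}(\hat{y}_2)), \tag{13}$$
where $\mathcal{C}(\cdot)$ is a morphological operation on a segmentation result, which outputs the largest connected area of each non-background class in the input segmentation, and $\mathcal{L}_{ncs}$ is the negative cosine similarity. The purpose of Eq. (13) is to minimize the distance between segmentation results and their largest connected areas. As formulated in Eq. (11), we use the symmetrical negative cosine similarity as the metric of distance.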
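The largest-connected-area operation and the resulting distance can be sketched for a single foreground class as follows. The breadth-first search, the 4-connectivity and the function names are assumptions of this sketch; the text does not specify the connectivity or the implementation of the morphological operation.

```python
from collections import deque
import numpy as np

def largest_connected_component(mask):
    """Return a boolean mask keeping only the largest 4-connected region."""
    mask = np.asarray(mask, dtype=bool)
    height, width = mask.shape
    seen = np.zeros_like(mask)
    best = []
    for r in range(height):
        for c in range(width):
            if not mask[r, c] or seen[r, c]:
                continue
            # Breadth-first search over one connected region.
            seen[r, c] = True
            queue, component = deque([(r, c)]), []
            while queue:
                i, j = queue.popleft()
                component.append((i, j))
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < height and 0 <= nj < width
                            and mask[ni, nj] and not seen[ni, nj]):
                        seen[ni, nj] = True
                        queue.append((ni, nj))
            if len(component) > len(best):
                best = component
    kept = np.zeros_like(mask)
    for i, j in best:
        kept[i, j] = True
    return kept

def negative_cosine_similarity(p, q, eps=1e-8):
    """Return -cos(p, q) for two arrays, flattened to vectors."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    return float(-(p @ q) / (np.linalg.norm(p) * np.linalg.norm(q) + eps))

# A 3-pixel structure plus one isolated false-positive pixel.
seg = np.array([[1, 1, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]], dtype=bool)
kept = largest_connected_component(seg)
local_loss = negative_cosine_similarity(seg, kept)
```

The isolated pixel is removed from `kept`, so the loss stays above its minimum of -1 until the prediction itself forms a single connected region.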
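Finally, the training objective is formulated as:
$$\mathcal{L} = \mathcal{L}_{sup} + \lambda_1 \mathcal{L}_{global} + \lambda_2 \mathcal{L}_{local}, \tag{14}$$
where $\mathcal{L}_{sup}$ is the loss for augmented scribble supervision, $\mathcal{L}_{global}$ and $\mathcal{L}_{local}$ are the global and local consistency losses, and $\lambda_1$ and $\lambda_2$ are hyperparameters to balance the relative importance of different loss components.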
4.1 Data and evaluation metric
CycleMix is evaluated on two open datasets, i.e., ACDC and MSCMRseg, on which rich results have been reported in literature for comparisons. In addition, we use ACDC dataset for extensive parameter studies.
The ACDC dataset is composed of 2-dimensional cine-MRI images from 100 patients. The cine-MRI images were obtained using two MRI scanners of various magnetic strengths and different resolutions. For each patient, manual annotations of right ventricle (RV), left ventricle (LV) and myocardium (MYO) are provided for both the end-diastolic (ED) and end-systolic (ES) phases. Following , the 100 subjects in the ACDC dataset were randomly divided into 3 sets of 70 (training), 15 (validation), and 15 (test) subjects for experiments. To compare with the previous state-of-the-art methods, which use unpaired masks to learn shape priors, we further divided the training set into two halves, 35 training images with scribble labels and 35 mask images with heart segmentation. Unless specified, we only used the 35 training images when training the proposed CycleMix and baselines.
MSCMRseg [8458220, Zhuang2016MultivariateMM] contains late gadolinium enhancement (LGE) MRI images collected from 45 patients with cardiomyopathy, which presents more challenges for automatic segmentation than unenhanced cardiac MRI. Gold standard segmentation of LV, MYO, RV of these images has also been released by the organizers. Following [yue2019cardiac], we randomly divided the images from 45 patients into 3 sets, including 25 for training, 5 for validation and 20 for test.
Scribble annotations. For the ACDC dataset, we used the released expert-made scribble annotations . To obtain realistic scribble annotations, we further manually annotated the MSCMRseg dataset, following the principles in . The average image coverages of scribbles for background, RV, MYO, LV are 3.4%, 27.7%, 31.3%, and 24.1%, respectively. Figure 3 presents two exemplar images and their annotations from the two datasets. Please refer to the supplementary material for more details of the scribble annotations.
Evaluation. We adopted the Dice coefficient [10.2307/1932409] to evaluate the performance of each method, which gauges the similarity of two segmentation masks.
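For reference, the Dice coefficient of two binary masks is twice the size of their overlap divided by the sum of their sizes. The sketch below is a minimal NumPy version for one class; the function name `dice` and the convention of returning 1.0 when both masks are empty are assumptions of the example.

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, target).sum() / total)

pred = np.array([[1, 1, 0],
                 [1, 0, 0]])
target = np.array([[1, 1, 0],
                   [0, 1, 0]])

score = dice(pred, target)  # overlap of 2 pixels, 3 pixels in each mask
```

For multi-class segmentation such as LV, MYO and RV, the score would be computed per class on the corresponding binary masks and then averaged.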
4.2 Experimental setup
Implementation Details. We adopted the 2D variant of UNet [baumgartner2017exploration], denoted as UNet, as the network architecture of CycleMix for all experiments, which was implemented using PyTorch. Since the provided images have different resolutions, we first resampled them and their annotations into a common in-plane resolution of mm. Then, all images were cropped or padded to the same image size of pixel. During training, we normalized the intensity of each image to zero mean and unit variance. The learning rate was fixed to 0.0001. We empirically set and in Eq. (14). All models were trained using one single NVIDIA 3090Ti 24GB GPU for 1000 epochs.
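The crop-or-pad and intensity normalization steps can be sketched as follows. The centred cropping, the zero padding, the function names and the target size used in the example are assumptions; the text only states that images were cropped or padded to a common size and normalized to zero mean and unit variance.

```python
import numpy as np

def crop_or_pad(image, size):
    """Centre-crop or zero-pad a 2D image to shape (size, size)."""
    out = np.zeros((size, size), dtype=image.dtype)
    h, w = image.shape
    ch, cw = min(h, size), min(w, size)            # extent that is copied
    src_top, src_left = (h - ch) // 2, (w - cw) // 2
    dst_top, dst_left = (size - ch) // 2, (size - cw) // 2
    out[dst_top:dst_top + ch, dst_left:dst_left + cw] = \
        image[src_top:src_top + ch, src_left:src_left + cw]
    return out

def normalize(image, eps=1e-8):
    """Rescale intensities to zero mean and unit variance."""
    return (image - image.mean()) / (image.std() + eps)

raw = np.ones((5, 3))                  # taller and narrower than the target
resized = crop_or_pad(raw, 4)          # cropped in height, padded in width

intensities = np.arange(16, dtype=float).reshape(4, 4)
normalized = normalize(intensities)
```

The same `crop_or_pad` would be applied to the scribble map so that images and annotations stay aligned; padding a label map with an "unlabeled" value instead of zero may be preferable, depending on the label encoding.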
Baseline settings. The proposed CycleMix was trained with scribble annotations. Firstly, we compared it with baselines trained on scribble-annotated datasets. Recently, several works have leveraged GANs to learn shape priors. We also compared with these challenging benchmarks, which require extra unpaired segmentation masks to train the GANs. Finally, we considered several supervised methods as upper bounds, which were trained on fully-annotated datasets.
Baselines: We first compared to UNet trained with the cross entropy loss of annotated pixels in [Tang2018OnRL]. Then, we applied different mix-up augmentation strategies to UNet, i.e., MixUp [zhang2018mixup], CutMix [yun2019cutmix], Puzzle Mix [kimICML20], and Co-mixup [kim2021comixup]. Finally, we included the experiment results on the ACDC dataset reported in  for reference, i.e., UNet [tang2018normalized], UNet, UNet [zheng2015conditional], TS-UNet [Can2018LearningTS].
Challenging benchmarks: The above baselines do not leverage additional unpaired segmentation masks during training. For more challenging benchmarks, we compared with four works using extra unpaired data to learn shape priors, including PostDAE [Larrazabal2020PostDAEAP], UNet , ACCL [zhang2020accl], MAAG . We referred to their segmentation results reported in  on the ACDC dataset for comparison.
Supervised methods: Finally, we performed the comparison in fully-supervised segmentation. Firstly, we applied UNet in [baumgartner2017exploration] to the training data of full annotations using conventional cross entropy loss, referred to as UNet. Then, we applied Puzzle Mix augmentation strategy to UNet, and obtained the Puzzle Mix. Finally, we trained CycleMix with fully annotated data, denoted as CycleMix, and compared with UNet and Puzzle Mix on both ACDC and MSCMRseg datasets.
4.3 Comparison with different mix-up strategies
Table 1 presents the performance of CycleMix on the ACDC and MSCMRseg datasets. We compared with different data augmentation methods, i.e., MixUp, Cutout, CutMix, Puzzle Mix, and Co-mixup, as strong baselines. Here, we used 35 subjects for training, and the results using 70 training images are presented in the supplementary material.
When only scribble annotations are available, Puzzle Mix achieved poor performance, with average Dice Scores of on ACDC dataset and on MSCMRseg. With our proposed augmentation and regularization of supervision, CycleMix boosted the performance to reach Dice of and for the two datasets, respectively, demonstrating improvements of and .
Furthermore, the average Dice Score of CycleMix not only surpassed all weakly-supervised baselines by a large margin, but also exceeded the two fully-supervised methods. Particularly on the challenging task of MSCMRseg dataset, CycleMix achieved an average Dice of 0.800, with increment than CutMix, which ranks second in the scribble supervision leaderboard. For the fully-supervised methods, one can observe that CycleMix (marginally) outperformed both UNet and Puzzle Mix in Table 1. Specifically, CycleMix with scribble supervision obtained an average improvement of 0.8% (84.8% vs 84.0%) and 1.1% (80.0% vs 78.9%) on the ACDC and MSCMRseg datasets, respectively.
Figure 4 visualizes results on the worst and median cases selected using the fully-supervised UNet. It is observed that Puzzle Mix could fail in scribble supervision-based segmentation, especially on the challenging task of MSCMRseg. This may be due to its transportation strategy of image patches, which is more likely to change the shape of the target structure than other mix-up strategies based on linear interpolation or local replacement. Similar behavior could be seen from Co-mixup, which adopts a transportation strategy similar to that of Puzzle Mix. Therefore, it is more difficult for the segmentation model to learn the shape priors, especially in the case of a small training dataset. CycleMix overcomes this disadvantage by combining the losses of mixed images and unmixed images, i.e., the mixed loss and the unmix loss, and leveraging consistency regularization to preserve shape priors, which will be further explored in the ablation study.
4.4 Comparison with weakly-supervised methods
Table 2 presents the results on the ACDC dataset. The previous best method, MAAG, exploited the unpaired masks from 35 additional subjects, and achieved 81.6% Dice Score with the assistance of a multi-scale GAN. Without these masks, CycleMix still achieved a new state-of-the-art (SOTA) average Dice of 84.8%, with a promising margin over MAAG. For the RV structure with more shape variation, CycleMix obtained a remarkable gain of 11.1% over MAAG (86.3% vs 75.2%). For the other methods, CycleMix demonstrated more significant performance improvements. We concluded that despite the additional masks, the models could learn only very limited shape priors through a GAN when the number of training images is small. Thanks to the mix augmentation and consistency regularization for scribble supervision, CycleMix learned robust shape priors and set a new SOTA of segmentation.
Moreover, as one can observe from the upper part of Table 2, CycleMix consistently outperformed all the other scribble supervision-based methods. Particularly, CycleMix obtained average performance gain up to than UNet which ranks the 2nd.
4.5 Ablation study
This section studies the effectiveness of our proposed strategies, including the usage of the unmix loss, mixed loss, global consistency loss, random occlusion, and local consistency loss. Table 3 presents the details.
Effectiveness of global consistency: UNet (#1) with the cross entropy loss of annotated pixels achieved an average Dice Score of 75.2%. When we added the mixed loss as an additional segmentation loss, the average performance increased by 5.7% (75.2% to 80.9%); and when the global consistency was included for regularization, the average Dice was further boosted to 83.0%. This was attributed to the fact that the global consistency could encourage the segmentation model to learn the mix-invariant property, and enhance the ability of the model to learn robust shape priors.
Effectiveness of random occlusion: For model #4, we observed that random occlusion brought a convincing average Dice Score improvement of 1.0% (84.0% vs 83.0%), demonstrating its effectiveness in enhancing the localization ability of the model via additional augmentation of scribble supervision.
Effectiveness of local consistency: When local consistency was adopted for shape regularization, model #5 performed marginally better than model #4, with an increase of 0.8% in average Dice Score (84.8% vs 84.0%). Particularly on the MYO structure, local consistency helped obtain a statistically significant improvement of Dice, indicating the benefit of local consistency in shape regularization for the segmentation of challenging structures.
4.6 Data sensitivity study
This study investigates the performance of CycleMix with different numbers of training images of scribble annotation and full annotation. For this study, we included all the 70 training images from ACDC and altered the ratio between the two sets of annotations. Table 4 presents the results.
Interestingly, one can observe that when the ratio of full annotation reaches 20% (56:14), CycleMix outperformed the fully-supervised UNet by a margin of 1.0% (87.2% vs 86.2%) on the average Dice. As expected, the performance of CycleMix tended to increase as the ratio of fully-annotated subjects increased. One can observe that the general performance of CycleMix converged when the ratio of fully-annotated data reached 40%. This confirms that CycleMix could achieve a satisfactory segmentation result with a relatively small amount of full annotations.
4.7 Experiments on fully-annotated data
Table 5 provides the Dice Score of fully-supervised segmentation on the ACDC and MSCMRseg datasets. With fully-annotated labels, Puzzle Mix demonstrated competitive performance, improving the average Dice of UNet from to on ACDC, and from to on MSCMRseg. By contrast, CycleMix could improve more, but the margins were not as large as those under scribble supervision. This indicates that CycleMix can excel in both scribble supervision-based and fully-supervised segmentation, but its advantage could be more evident in the former applications, for which CycleMix has been specifically designed.
In this paper, we have investigated a novel weakly-supervised learning framework, CycleMix, to learn segmentation from scribble supervision. The proposed method utilizes mix augmentation of supervision and cycle consistency of segmentation to enhance the generalization ability of segmentation models. CycleMix was evaluated on two open datasets, i.e., ACDC and MSCMRseg, and achieved the new state-of-the-art performance.