[MICCAI2021] This is an official PyTorch implementation for "Duo-SegNet: Adversarial Dual-Views for Semi-Supervised Medical Image Segmentation"
Segmentation of images is a long-standing challenge in medical AI. This is mainly because training a neural network to perform image segmentation requires a significant amount of pixel-level annotated data, which is often unavailable. To address this issue, we propose a semi-supervised image segmentation technique based on the concept of multi-view learning. In contrast to the previous art, we introduce an adversarial form of dual-view training and employ a critic to formulate the learning problem in multi-view training as a min-max problem. Thorough quantitative and qualitative evaluations on several datasets indicate that our proposed method outperforms state-of-the-art medical image segmentation algorithms consistently and comfortably. The code is publicly available at https://github.com/himashi92/Duo-SegNet
In this paper, we propose a semi-supervised technique based on the concept of multi-view learning to segment medical images. Accurate segmentation of medical images is a key step in developing Computer-Aided Diagnosis (CAD) and in automating various clinical tasks such as image-guided interventions.
The prevailing idea in medical image segmentation is to employ an encoder-decoder structure (e.g., UNet and its variants [20, 11]) and formulate the problem as a dense classification or regression problem, depending on the type of input and the desired output. Most existing algorithms breeze through the supervised setting when sizable, annotated datasets are available. However, annotating large-scale datasets for image segmentation is challenging and expensive. On the one hand, most medical image modalities are hefty in size (e.g., 3D volumes as in MRI and CT), and hence annotation is extremely laborious. On the other hand, annotating medical images requires expert knowledge and cannot be crowd-sourced. In addition, medical images often contain low-contrast slices and ambiguous regions, which makes annotation even more difficult. While obtaining large annotated datasets is challenging, unlabeled data comes (almost) for free and is abundant.
Multi-view learning makes use of multiple distinct views of data and benefits from the resulting relationships to achieve accurate models. The principle of consensus states that by minimizing the disagreement on multiple distinct hypotheses, the error of each hypothesis will be minimized. To be specific, suppose two distinct hypotheses are defined on a common data distribution. Under some mild assumptions, and as shown in prior work, the probability that the two hypotheses disagree upper-bounds the error of each of them.
That is, reducing the disagreement of the two hypotheses minimizes the error of each hypothesis. The consensus principle provides an efficient way, with strong theoretical properties, to benefit from unlabeled data, as shown for example in the celebrated work of Blum and Mitchell. Except for a handful of studies [14, 12], multi-view learning has mostly been studied under the classification regime [6, 5]. This raises a natural question: is multi-view learning beneficial when it comes to segmentation, and if so, how must it be formulated? Our work takes a step in this direction and provides a way to exploit unlabeled data for image segmentation. In particular, we propose a dual-view UNet model and equip it with a critic network. Each UNet provides a view of the data distribution and is trained by minimizing a supervised loss on the labeled data and a disagreement loss on the unlabeled data. The critic serves two objectives: (1) to ensure that the output of the UNets resembles the ground-truth segmentation masks (in the ideal case, the critic cannot distinguish whether its input is a ground-truth mask or a prediction mask from the UNets), and (2) to identify confident parts of a prediction mask in order to enforce agreement across the views. We stress that, unlike classification, where (dis)agreement can be formulated readily between predictions, segmentation is a dense prediction problem: a prediction mask comprises tens of thousands of pixel-level predictions, if not more. A naive treatment of the (dis)agreement loss could lead to inferior results, as the networks can simply overfit to agree on the background, which is the dominant part of the prediction mask in many cases. Therefore, to train the Duo-SegNet, we formulate the learning as a min-max problem in which a critic stands in as a quantitative subjective referee.
Thorough empirical evaluations demonstrate that our method performs well both qualitatively and quantitatively while using only a small fraction of labeled data.
In short, we make the following contributions in this work: (1) we propose a dual-view learning scheme for semi-supervised medical image segmentation, and (2) we make use of a critic to regularize the training and to identify confident parts of a prediction mask for learning from unlabeled data.
We start this section by providing an overview of the Duo-SegNet (see Fig. 1 for a conceptual illustration). We are given a labeled set, in which each pair consists of an image and its associated ground-truth mask, together with a set of unlabeled images. The primary objective is to learn a segmentation model from both sets.
The Duo-SegNet includes two basic modules: the dual-view segmentation networks and a critic, both shown in Fig. 1. Each leg of the dual-view network follows an encoder-decoder design and is structured as a UNet. By incorporating a critic, we implicitly force the segmentation networks to produce predictions that are holistically more similar to the desired masks. The parameters of the model are the parameters of the dual-view network and those of the critic network. We refer to the parameters of the two segmentation networks collectively, to avoid cluttering the equations. To train the model, we propose optimizing the following min-max problem:
The familiar min-max problem in Eq. (1) encourages the dual-view segmentation networks to yield segmentation masks that look realistic, by deceiving the critic. As in standard multi-view training, our approach adapts dual-view training to train two segmentation models collaboratively. We propose to train the dual-view segmentation networks by minimizing a multitask loss function consisting of three loss terms:
where the three terms denote the supervised, the unsupervised and the critic loss, respectively. Each term is weighted by a hyper-parameter of the algorithm, which controls its contribution. We note that the supervised and unsupervised losses depend only on the dual-view networks, while the critic loss is defined on the parameters of the whole model. In practice, we find that tuning these hyper-parameters is not difficult, and the Duo-SegNet works robustly as long as they are set within a reasonable range. For example, we use a single fixed setting of the weights in all experiments in this paper. To better understand Eq. (2), we start with the supervised loss. This loss makes use of the labeled data and can be defined as a cross-entropy loss (Eq. (3)), a dice loss (Eq. (4)), or a linear combination of both, which is common practice in segmentation.
where we use and .
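The supervised term can be illustrated with a small sketch. It is written in plain Python on flattened masks so that the arithmetic stays visible; the function names, the smoothing constant `smooth`, and the mixing weight `alpha` are assumptions made for illustration and are not taken from the paper. It covers a single view only, whereas the method applies the supervised loss to both segmentation networks.

```python
import math


def cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (|pred| + |target| + smooth)."""
    overlap = sum(p * y for p, y in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)


def supervised_loss(pred, target, alpha=0.5):
    """Linear combination of both losses; `alpha` is a hypothetical mixing weight."""
    return alpha * cross_entropy(pred, target) + (1.0 - alpha) * dice_loss(pred, target)


mask = [1, 0, 1, 0]
print(supervised_loss([0.9, 0.1, 0.8, 0.2], mask))  # close prediction, small loss
print(supervised_loss([0.5, 0.5, 0.5, 0.5], mask))  # uninformative prediction, larger loss
```

In a PyTorch implementation the same quantities would normally be computed on tensors, with the loss summed over the two views.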
We define the unsupervised loss as a means to realize the principle of consensus:
We identify Eq. (5) as a symmetric form of the cross-entropy loss. In essence, the unsupervised loss acts as an agreement loss during dual-view training. The idea is that the two segmentation networks should generate similar segmentation masks for unlabeled data. In the proposed method, diversity between the two views is obtained from the predictions of the segmentation models. Unlike [12, 14], where adversarial examples for one model are used to teach the other model in the ensemble, we make use of a critic to promote diversity between the segmentation models for both the labeled and the unlabeled distributions. When using the labeled data, the two segmentation networks are supervised by both the supervised loss with respect to the ground truth and the adversarial loss. We define the normalized loss of the critic for the labeled prediction distribution as:
where the target label indicates whether the sample is generated by the segmentation network or drawn from the ground-truth labels. For unlabeled data, we cannot apply any supervised loss, since no ground-truth annotation is available. However, the adversarial loss can still be applied, as it only requires knowing whether a mask comes from the ground-truth labels or is generated by the segmentation networks. The adversarial loss for the distribution of unlabeled predictions is defined as:
The rationale for having a critic is that a well-trained critic can produce a pixel-wise confidence (or uncertainty) map, which imposes a higher-order consistency measure between prediction and ground truth. The critic thus infers the pixels where the prediction masks are close enough to the ground-truth distribution.
Therefore, the critic loss is defined as:
With this aggregated adversarial loss, we co-train the two segmentation networks to fool the critic by maximizing the confidence that the predicted segmentation was generated from the ground-truth distribution. To train the critic, we use the labeled prediction distribution along with the given ground-truth segmentation masks and calculate the normalized loss as in Eq. (6). The overall min-max optimization process is summarized in Algorithm 1.
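The interplay of the agreement loss, the critic loss and the weighted total described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the authors' implementation: the critic is assumed to output per-pixel scores in (0, 1), ground-truth masks are assumed to carry label 1 and predicted masks label 0, and the weights `w_sup`, `w_unsup` and `w_critic` are hypothetical placeholders, because the values used in the paper are not recoverable from this text.

```python
import math

EPS = 1e-7


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def bce(pred, target):
    """Mean binary cross-entropy; `target` may be soft (values in [0, 1])."""
    return sum(
        -(t * math.log(_clamp(p)) + (1.0 - t) * math.log(1.0 - _clamp(p)))
        for p, t in zip(pred, target)
    ) / len(pred)


def agreement_loss(pred_a, pred_b):
    """Symmetric cross-entropy between the predictions of the two views."""
    return bce(pred_a, pred_b) + bce(pred_b, pred_a)


def critic_loss(critic_scores, is_ground_truth):
    """Critic side: label 1 for ground-truth masks, 0 for predictions (assumed)."""
    label = 1.0 if is_ground_truth else 0.0
    return bce(critic_scores, [label] * len(critic_scores))


def fooling_loss(critic_scores):
    """Segmentation side: push the critic's scores on predictions toward 1."""
    return bce(critic_scores, [1.0] * len(critic_scores))


def total_loss(l_sup, l_unsup, l_critic, w_sup=1.0, w_unsup=1.0, w_critic=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return w_sup * l_sup + w_unsup * l_unsup + w_critic * l_critic


view_a = [0.9, 0.1, 0.8]
view_b = [0.8, 0.2, 0.9]
scores_on_predictions = [0.3, 0.4, 0.2]
loss = total_loss(
    l_sup=0.25,  # placeholder value for the supervised term
    l_unsup=agreement_loss(view_a, view_b),
    l_critic=fooling_loss(scores_on_predictions),
)
print(loss)
```

The segmentation networks would minimize `total_loss`, while the critic is updated separately with `critic_loss`, which gives the alternating min-max structure.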
We begin by briefly discussing alternative approaches to semi-supervised learning, with a focus on those most relevant to our approach, namely pseudo labelling, the mean teacher model, Virtual Adversarial Training (VAT), and the recently published deep co-training. Our goal here is to discuss the similarities and differences between our proposed approach and some of the methods that have been adopted for semi-supervised image segmentation.
In pseudo labelling, as the name implies, the model uses its predicted labels for the unlabeled data and treats them as ground-truth labels for training. The drawback is that incorrect pseudo labels may diminish the generalization performance and weaken the training of deep neural networks. In contrast, by incorporating a critic, Duo-SegNet increases the tolerance to such incorrect pseudo labels, which stabilizes the generalization performance. Similar to Duo-SegNet, the mean teacher approach benefits from two neural networks, namely the teacher and the student networks. While the student model is trained in a stochastic manner, the parameters of the teacher model are updated slowly by a form of moving average of the student's parameters. This, in turn, results in better robustness to prediction errors, as one can hope that averaging attenuates the effect of noisy gradients (caused by incorrect pseudo labels). Unlike the mean teacher model, Duo-SegNet trains both networks simultaneously, and the models can learn from one another during training. VAT can be understood as an effective regularization method that optimizes the generalization power of a model on unlabeled examples. This is achieved by generating adversarial perturbations of the model input and then making the model robust to these perturbations. In contrast to VAT, Duo-SegNet makes use of a critic to judge whether predictions come from the same distribution as the labeled examples or from a different one. In this way, the segmentation networks are encouraged to generate similar predictive distributions for both labeled and unlabeled data. Similar to our work, in co-training two models are alternately trained on distinct views, while learning from each other is encouraged. Recently, Peng et al. introduced a deep co-training method for semi-supervised image segmentation, based on the approach presented by Qiao et al. for image recognition.
In this approach, diversity among views is achieved via adversarial examples, following VAT. We note that adversarial examples, from a theoretical point of view, cannot guarantee diversity, especially when unlabeled data is considered. That is, a wrong prediction can be amplified further once adversarial examples are constructed from it. In contrast, inspired by Generative Adversarial Networks (GANs) and GAN-based medical imaging applications, including medical image segmentation, reconstruction and domain adaptation, our proposed method gracefully embeds a min-max formulation in dual-view learning for segmenting medical images, where high-confidence predictions for unlabeled data are leveraged. The resulting approach is simple and effective.
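For contrast with the simultaneous training used in Duo-SegNet, the moving-average teacher update of the mean teacher approach discussed above can be sketched in a few lines. Parameters are represented here as a plain dictionary of floats, and the `decay` value is a hypothetical choice for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher = {"w": 0.0}
student = {"w": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher["w"])  # 0.875
```

The teacher moves toward the student only gradually, which is why a single noisy student update has a limited effect on it.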
The proposed model is developed in PyTorch. Training was done from scratch, without using any pre-trained model weights. For training the segmentation networks and the critic, we use the SGD optimizer (learning rate 1e-02) and the RMSProp optimizer (learning rate 5e-05), respectively. We divide each original dataset into a training set (80%) and a test set (20%). Experiments were conducted with 5%, 20% and 50% of the training set labeled.
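The data protocol above can be made concrete with a short sketch. It assumes a random shuffle with a fixed seed and rounding to the nearest integer; the function name `split_dataset` and these details are illustrative assumptions, not the procedure used in the released code.

```python
import random


def split_dataset(items, test_ratio=0.2, labeled_ratio=0.05, seed=0):
    """Split items into (labeled train, unlabeled train, test) subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(round(len(items) * test_ratio))
    test, train = items[:n_test], items[n_test:]
    n_labeled = max(1, int(round(len(train) * labeled_ratio)))
    return train[:n_labeled], train[n_labeled:], test


# Example with 670 items and 5% of the training set labeled.
labeled, unlabeled, test = split_dataset(range(670), labeled_ratio=0.05)
print(len(labeled), len(unlabeled), len(test))  # 27 509 134
```

Changing `labeled_ratio` to 0.2 or 0.5 gives the other two settings, while the test set stays the same for a given seed.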
We use three medical image datasets for model evaluation, covering three medical image modalities: 670 Fluorescence Microscopy (FM) images from Nuclei, 20 MRI volumes from Heart and 41 CT volumes from Spleen. For our experiments, 2D images are obtained by slicing the high-resolution MRI and CT volumes of the Heart (2271 slices) and Spleen (3650 slices) datasets. Each slice is then resized to a fixed resolution.
We compare our proposed method with Mean Teacher, Pseudo Labelling, VAT, Deep Co-training and a fully supervised U-Net. For all baselines, we follow the same configuration as for our method. All approaches are evaluated using (1) the Dice-Sørensen coefficient (DSC) and (2) the Mean Absolute Error (MAE).
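Both evaluation metrics are simple to compute. The sketch below works on flattened masks in plain Python; binarizing the prediction at a `threshold` of 0.5 for the DSC and computing the MAE directly on probabilities are assumptions, since the exact evaluation settings are not stated here.

```python
def dice_coefficient(pred, target, threshold=0.5, eps=1e-8):
    """Dice-Sorensen coefficient between a thresholded prediction and a 0/1 mask."""
    pred_bin = [1 if p >= threshold else 0 for p in pred]
    overlap = sum(p * t for p, t in zip(pred_bin, target))
    return (2.0 * overlap + eps) / (sum(pred_bin) + sum(target) + eps)


def mean_absolute_error(pred, target):
    """Mean absolute difference between predicted probabilities and the mask."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


pred = [0.9, 0.8, 0.2, 0.1]
mask = [1, 0, 1, 0]
print(dice_coefficient(pred, mask))     # about 0.5
print(mean_absolute_error(pred, mask))  # about 0.45
```

A higher DSC and a lower MAE indicate a better segmentation.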
The qualitative results for the proposed and competing methods are shown in Fig. 3. A quantitative comparison of the proposed method with the four state-of-the-art methods is given in Table 1. The results reveal that the proposed method comfortably outperforms the other studied methods for smaller fractions of annotated data (e.g., Spleen 5%). The gap between the Duo-SegNet and its competitors decreases on the Nuclei dataset as the amount of labeled data increases. That said, we still observe a significant improvement on the Heart and Spleen datasets. The proposed network can produce both accurate prediction masks and confidence maps, which indicate the regions of the prediction distribution that are close to the ground-truth label distribution. This is useful when training on unlabeled data. Fig. 2 shows a visual analysis of the confidence maps.
[Figure: qualitative comparison. Columns: Input, GT, Fully Supervised, Mean Teacher, Pseudo Labelling, VAT, Deep Co-training, Duo-SegNet (Ours).]
We also perform ablation studies to show the effectiveness of adding a critic to dual-view learning in the semi-supervised setting and the importance of the dual-view network structure. In our algorithm, we benefit from unlabeled data via (1) a criss-cross exchange of confident regions and (2) improving the critic, which in essence minimizes an upper bound of the error. To justify this, we conducted an additional experiment without unlabeled data. The performance of the segmentation model also varies with the value of the hyper-parameter under study, which we choose from a fixed range in our experiments. All experiments in Table 2 are conducted on the Spleen dataset with 5% of annotated data.
We proposed an adversarial dual-view learning approach for semi-supervised medical image segmentation and demonstrated its effectiveness on three publicly available medical datasets. Our extensive experiments showed that introducing a min-max paradigm into a multi-view learning scheme sharpens the boundaries between different regions in prediction masks and yields a performance close to full supervision with limited annotations. The dual-view training can still be improved by self-tuning mechanisms, which we will consider in future work.
2018 Data Science Bowl. https://www.kaggle.com/c/data-science-bowl-2018
Proc. of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.
A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 393–400.
Proc. European Conference on Computer Vision (ECCV), pp. 135–152.