Deep convolutional neural networks have achieved remarkable progress on a variety of medical image computing tasks. A common problem when applying supervised deep learning methods to medical images is the lack of labeled data, which is very expensive and time-consuming to collect. In this paper, we present a novel semi-supervised method for medical image segmentation, where the network is optimized by the weighted combination of a common supervised loss for labeled inputs only and a regularization loss for both labeled and unlabeled data. To utilize the unlabeled data, our method encourages consistent predictions of the network-in-training for the same input under different regularizations. Targeting the semi-supervised segmentation problem, we enhance the effect of regularization for pixel-level predictions by introducing a transformation consistent scheme, covering rotation and flipping, in our self-ensembling model. We have extensively validated the proposed semi-supervised method on three typical yet challenging medical image segmentation tasks: (i) skin lesion segmentation from dermoscopy images on the International Skin Imaging Collaboration (ISIC) 2017 dataset, (ii) optic disc segmentation from fundus images on the Retinal Fundus Glaucoma Challenge (REFUGE) dataset, and (iii) liver segmentation from volumetric CT scans on the Liver Tumor Segmentation Challenge (LiTS) dataset. Compared to the state of the art, our proposed method shows superior segmentation performance on challenging 2D/3D medical images, demonstrating the effectiveness of our semi-supervised method for medical image segmentation.
Segmenting anatomical structures or abnormal regions from medical images, such as dermoscopy images, fundus images and 3D computed tomography (CT) scans, is of great significance for clinical practice, especially for disease diagnosis and treatment planning. Recently, deep learning techniques have made impressive progress on semantic image segmentation tasks and have become a popular choice in both the computer vision and medical imaging communities [1, 2]. The success of deep neural networks usually relies on massive labeled datasets. However, it is hard and expensive to obtain labeled data, notably in the medical imaging domain, where only experts can provide reliable annotations. For example, there are thousands of dermoscopy image records in the clinical center, but melanoma delineations by experienced dermatologists are very scarce; see Figure 1. Such cases can also be observed in optic disc segmentation from retinal fundus images, and especially in liver segmentation from CT scans, where delineating organs from volumetric images in a slice-by-slice manner is very time-consuming and expensive. The lack of labeled data motivates the study of methods that can be trained with limited supervision, such as semi-supervised learning [4, 5, 6], weakly supervised learning [7, 8, 9] and unsupervised domain adaptation [10, 11, 12]. In this paper, we focus on semi-supervised segmentation approaches, considering that it is relatively easy to acquire a large amount of unlabeled medical image data.
Semi-supervised learning aims to learn from a limited amount of labeled data and an arbitrary amount of unlabeled data, which is a fundamental and challenging problem with a high impact on real-world clinical applications. The semi-supervised problem has been widely studied in the medical image research community [13, 14, 15, 16, 17].
Recent progress in semi-supervised learning for medical image segmentation has featured deep learning [18, 5, 19, 20, 21]. Bai et al. present a semi-supervised deep learning model for cardiac MR image segmentation, where the segmented label maps from unlabeled data are incrementally added into the training set to refine the segmentation network. Other semi-supervised learning methods are based on recent techniques such as the variational autoencoder (VAE) and the generative adversarial network (GAN). We tackle the semi-supervised segmentation problem from a different point of view. Building on the success of the self-ensembling model in the semi-supervised classification problem, we further advance the method to medical image segmentation tasks, including 2D cases and 3D cases.
In this paper, we present a novel semi-supervised learning method based on a self-ensembling strategy for medical image segmentation tasks. The whole framework is trained with a weighted combination of the supervised loss and the unsupervised loss. The supervised loss is designed to utilize the labeled data for accurate prediction. To leverage the unlabeled data, our self-ensembling method encourages a consistent prediction of the network for the same input data under different regularizations, e.g., randomized Gaussian noise, network dropout and randomized data transformation. In particular, we design our method to account for the challenging segmentation task, in which a pixel-level classification must be predicted. We observe that in the segmentation problem, if one transforms (e.g., rotates) the input image, the expected prediction should be transformed in the same manner. In practice, however, when the inputs of CNNs are rotated, the corresponding network predictions do not necessarily rotate in the same way. In this regard, we take advantage of this property by introducing a transformation (i.e., rotation, flipping) consistent scheme at the input and output space of our network. Specifically, we design the unsupervised loss by minimizing the differences between the network predictions under different transformations of the same input. We extensively evaluate our method for semi-supervised medical image segmentation on three representative segmentation tasks, i.e., skin lesion segmentation from dermoscopy images, optic disc segmentation from retinal images, and liver segmentation from CT scans. For training on 3D CT images, we conduct experiments with 2D and 3D convolutional neural networks, respectively. To train with a 2D convolutional neural network, we slice the volumetric data into groups of three adjacent slices and concatenate the network outputs to form the result. We also show that our method performs well with the 3D convolutional neural network.
In summary, our semi-supervised method achieves significant improvements compared with the supervised baseline, and also outperforms other semi-supervised segmentation methods. A preliminary version of this work was presented in . The main contributions of this paper are:
We present a simple and effective semi-supervised segmentation method for various medical image segmentation tasks. Our method is flexible and can be easily applied to both 2D and 3D convolutional neural networks.
To better utilize the unlabeled data for segmentation tasks, we propose a transformation consistent self-ensembling model (TCSM), which is effective for the semi-supervised segmentation problem.
Extensive experiments on three representative yet challenging medical image segmentation tasks, including 2D and 3D datasets, demonstrate the effectiveness of our semi-supervised method over other methods.
Our method outperforms other state-of-the-art methods and establishes a new record on the ISIC 2017 skin lesion segmentation dataset with a semi-supervised method.
The remainder of this paper is organized as follows. We review the related techniques in Section II and elaborate on the semi-supervised method in Section III. The experimental results and ablation analysis on dermoscopy images, retinal fundus images and liver CT scans are shown in Section IV. We further discuss our method in Section V and draw conclusions in Section VI.
Semi-supervised segmentation for medical images. Early works for semi-supervised medical image segmentation are mainly based on hand-crafted features [13, 14, 15, 16, 17]. For example, You et al. combined radial projection and self-training learning to get an improved overall segmentation of retinal vessels from fundus images. Portela et al. presented a clustering based semi-supervised Gaussian Mixture Model (GMM) to automatically segment brain MR images. Later on, Gu et al. proposed a semi-supervised method for vessel segmentation by constructing forest oriented super pixels. For skin lesion segmentation, Jaisakthi et al. designed a semi-supervised method based on K-means clustering and a flood fill algorithm. However, these semi-supervised methods are based on hand-crafted features, which suffer from limited representation capacity.
Recent progress in semi-supervised segmentation has featured deep learning. An iterative approach is proposed by Bai et al. for cardiac segmentation from MR images, where the network parameters and the segmentation masks for the unlabeled data are alternately updated. Generative model based semi-supervised approaches are also popular in the medical image analysis community [5, 19, 25, 26]. Sedai et al. introduced a variational autoencoder (VAE) for optic cup segmentation from retinal fundus images. They learned the feature embedding from unlabeled images using the VAE, and then combined the feature embedding with the segmentation autoencoder trained on the labeled images for pixel-wise segmentation of the cup region. To involve the unlabeled data in the training, Nie et al. presented an attention-based GAN approach to select the trustworthy regions of the unlabeled data to train the segmentation network. Another GAN based work employed the cycle-consistency principle for cardiac MR image segmentation. More recently, Ganaye et al. proposed a semi-supervised method for brain structure segmentation by taking advantage of the invariant nature and semantic constraints of anatomical structures. Multi-view co-training based methods [4, 27] have been explored on 3D medical data. In contrast, our method takes advantage of transformation consistency and the self-ensembling model, which is simple yet effective for medical image segmentation tasks.
Transformation equivariant representation. There is a body of related literature on equivariant representations, where transformation equivariance is encoded into the network to explore the network equivariance property [28, 29, 23]. For example, Cohen and Welling proposed the group equivariant neural network to improve the network generalization, where equivariance to $90^\circ$-rotations and dihedral flips is encoded by copying the transformed filters at different rotation-flip combinations. Concurrently, Dieleman et al. designed four different equivariant operations that preserve feature map transformations by rotating feature maps instead of filters. Recently, Worrall et al. restricted the filters to circular harmonics to achieve continuous $360^\circ$-rotation equivariance. However, these works aim to encode equivariance into the network to improve the generalization capability of the network, while our method aims to better utilize the unlabeled data in semi-supervised learning.
Medical image segmentation. Early approaches to medical image segmentation mainly focused on thresholding, statistical shape models and machine learning related methods [32, 33, 34, 35, 36]. Recently, many researchers employed deep learning based methods for medical image segmentation [37, 38, 39]. These deep learning based methods achieved promising results on skin lesion segmentation, optic disc segmentation and liver segmentation [40, 41, 42, 43, 44]. Yu et al. explored the network depth property and developed a deep residual network for automatic skin lesion segmentation, where several residual blocks were stacked together to increase the network representative capability. Yuan et al. presented a 19-layer deep convolutional neural network and trained it in an end-to-end manner for skin lesion segmentation. As for optic disc segmentation, Fu et al. presented an M-Net for joint optic cup (OC) and optic disc (OD) segmentation, and a disc-aware network was designed for glaucoma screening by an ensemble of different feature streams of the network. For liver segmentation, Chlebus et al. presented a cascaded FCN combined with hand-crafted features. Li et al. presented a 2D-3D hybrid architecture for liver and tumor segmentation from CT images. Although these approaches achieve good results in the experiments, they are based on fully supervised learning, requiring massive pixel-wise annotations from experienced dermatologists or radiologists.
Figure 2 gives an overview of our proposed transformation consistent self-ensembling model (TCSM) for semi-supervised medical image segmentation. The transformation operations are added to a standard fully convolutional network (FCN). The total loss function is the weighted combination of the cross-entropy loss and the mean square error loss, where the cross-entropy loss is optimized on the labeled data and the mean square error loss is calculated on both labeled and unlabeled data. The framework is trained for medical image segmentation in a semi-supervised way.
To ease the description of our method, we first formulate the semi-supervised segmentation task in general, where the training set consists of $N$ inputs in total, including $M$ labeled inputs and $N-M$ unlabeled inputs. We denote the labeled set as $\mathcal{L}=\{(x_i, y_i)\}_{i=1}^{M}$ and the unlabeled set as $\mathcal{U}=\{x_i\}_{i=M+1}^{N}$, where $x_i$ is the input image and $y_i$ is the ground-truth label for 2D medical images, e.g., retinal fundus images and dermoscopy images. The general semi-supervised segmentation learning task can be formulated as learning the network parameters $\theta$ by optimizing:
$$\min_{\theta} \sum_{i=1}^{M} l\big(f(x_i;\theta), y_i\big) + \lambda R(\theta, \mathcal{L}, \mathcal{U}),$$
where $l$ denotes the supervised loss function and $R$ represents the regularization (unsupervised) loss. $f$ denotes the segmentation neural network. The first term in the loss function is the cross-entropy loss, which evaluates the correctness of the network output on labeled inputs only. The second term is a regularization loss, which utilizes both labeled and unlabeled inputs. $\lambda$ is a weighting factor that controls how strong the regularization is.
Recent progress in semi-supervised learning shows promising results with self-ensembling methods [47, 22]. This success relies on the smoothness assumption; that is, data points close to each other in the image space are likely to be the same in the label space. Specifically, these methods focus on improving the quality of targets by using self-ensembling and exploring different perturbations. The perturbations include input noise and network dropout. The network with the regularization loss encourages the predictions to be consistent and is expected to give better predictions. The regularization loss can be described as:
$$R(\theta, \mathcal{L}, \mathcal{U}) = \sum_{i=1}^{N} \mathbb{E}_{\xi', \xi} \big\| f(x_i;\theta,\xi') - f(x_i;\theta,\xi) \big\|^2,$$
where $f$ is the segmentation network with parameters $\theta$, and $\xi$ and $\xi'$ refer to different regularizations or perturbations of the input data. In our work, we share the same spirit with these methods by designing different perturbations for the input data. Specifically, we design the regularization term as a consistency loss to encourage smooth predictions for the same data under different regularizations or perturbations (e.g., Gaussian noise, network dropout, and randomized data transformation).
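The weighted objective described above can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `cross_entropy`, `consistency` and `total_loss` are hypothetical, predictions are assumed to be softmax probabilities, and both losses are averaged over pixels rather than summed.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-7):
    """Mean pixel-wise cross-entropy.
    probs: (B, H, W, C) class probabilities; labels: (B, H, W) integer class ids."""
    b, h, w = np.indices(labels.shape)
    picked = probs[b, h, w, labels]  # probability assigned to the true class
    return float(-np.log(picked + eps).mean())

def consistency(pred_a, pred_b):
    """Mean squared difference between two predictions of the same inputs."""
    return float(np.mean((pred_a - pred_b) ** 2))

def total_loss(probs_labeled, labels, pred_a, pred_b, lam):
    """Supervised loss on labeled inputs plus lam times the consistency loss,
    where pred_a and pred_b cover both labeled and unlabeled inputs."""
    return cross_entropy(probs_labeled, labels) + lam * consistency(pred_a, pred_b)
```

Here `pred_a` and `pred_b` stand for two forward passes of the same batch under different perturbations; how those perturbations are produced is left to the caller.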
In this subsection, we introduce how to effectively design the randomized data transformation regularization for the segmentation problem, i.e., the transformation consistent self-ensembling model (TCSM). In general self-ensembling semi-supervised learning, most regularizations and perturbations are designed for the classification problem. However, in the medical image domain, the accurate segmentation of important structures or lesions is a very challenging, practical problem, and perturbations for segmentation tasks are more worth exploring. One prominent difference between these two common tasks is that the classification problem is transformation invariant while the segmentation task is expected to be transformation equivariant. Specifically, for image classification, the convolutional neural network only recognizes the presence or absence of an object in the whole image. In other words, the classification result should remain the same no matter what data transformation (e.g., translation, rotation, or flipping) is applied to the input image. However, for image segmentation, if the input image is rotated, the segmentation mask is expected to rotate in the same way as the original mask, although the corresponding pixel-wise predictions remain the same; see examples in Figure 3 (a). In general, however, convolutions are not transformation equivariant (transformation in this work refers to flipping and rotation), meaning that if one rotates or flips the CNN input, then the feature maps do not necessarily rotate in a meaningful manner, as shown in Figure 3 (b). Therefore, a convolutional network consisting of a series of convolutions is also not transformation equivariant. Formally, every transformation $\pi \in \Pi$ of the input $x$ is associated with a transformation $\psi \in \Psi$ of the outputs; that is, $\psi[f(x)] = f(\pi[x])$, but in general $\pi \neq \psi$.
This phenomenon limits the unsupervised regularization effect of randomized data transformation for the segmentation problem. To enhance the regularization and more effectively utilize unlabeled data in our segmentation task, we introduce a transformation consistent scheme in the unsupervised regularization term. Specifically, this transformation consistent scheme is embedded into the framework by approximating the output transformation $\psi$ with the input transformation $\pi$ at the input and output space. The detailed illustration of the framework is shown in Figure 2, and the pseudocode is presented in Algorithm 1. Under the transformation consistent scheme and other perturbations (e.g., Gaussian noise and network dropout), each input $x_i$ is fed into the network twice to acquire two outputs $z_i$ and $\tilde{z}_i$. More specifically, the transformation consistent scheme consists of three applications of a transformation operation $\pi_i$; see Figure 2. For one training input $x_i$, in the first evaluation, the operation $\pi_i$ is applied to the input image, while in the second evaluation, the operation $\pi_i$ is applied to the prediction map. Random perturbations (e.g., Gaussian noise and network dropout) are also applied in the network during both evaluations. By minimizing the difference between $z_i$ and $\tilde{z}_i$ with a mean square error loss function, the network is regularized to be transformation consistent, which increases the network generalization capacity. Notably, the regularization loss is evaluated on both labeled and unlabeled inputs. To utilize the labeled data $(x_i, y_i)$, the same operation $\pi_i$ is also performed on $y_i$, and the network is optimized with the standard cross-entropy loss. Finally, the network is trained by minimizing the weighted combination of the unsupervised regularization loss and the supervised cross-entropy loss. Note that we employed the same data augmentation in the training procedure of all the experiments for fair comparison. However, our method is different from traditional data augmentation. Specifically, our method utilizes the unlabeled data by minimizing the difference between network outputs under transformed inputs, while complying with the smoothness assumption.
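To make the two evaluations concrete, the following NumPy sketch applies one randomly chosen rotation/flip either to the input or to the prediction and measures the mean square difference. It is a simplified illustration under assumptions: `toy_net` is a hypothetical stand-in for the segmentation network (a horizontal finite difference, chosen because it is clearly not rotation equivariant), and Gaussian noise and dropout are omitted.

```python
import numpy as np

def apply_op(arr, k, flip):
    """Rotate by k * 90 degrees in the image plane, then optionally flip horizontally."""
    out = np.rot90(arr, k=k, axes=(0, 1))
    return out[:, ::-1] if flip else out

def toy_net(image):
    """Hypothetical stand-in for the network: a horizontal finite difference.
    Like a generic convolution, it is not rotation equivariant."""
    return image - np.roll(image, shift=1, axis=1)

def transformation_consistency(image, net, rng):
    """MSE between 'predict on the transformed input' and 'transform the prediction'."""
    k, flip = int(rng.integers(0, 4)), bool(rng.integers(0, 2))
    z = net(apply_op(image, k, flip))        # operation applied to the input image
    z_tilde = apply_op(net(image), k, flip)  # operation applied to the prediction map
    return float(np.mean((z - z_tilde) ** 2))
```

For a perfectly equivariant network (for example the identity map) this loss is zero for every operation; for `toy_net` it is generally positive, which is the gap the regularization term pushes the network to close.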
TCSM with 2D medical images
For dermoscopy images and retinal fundus images, we employ the 2D DenseUNet architecture as our baseline model. Compared to the standard DenseNet, we add a decoder part for the segmentation tasks. The decoder has four blocks, and each block consists of upsampling, convolutional, batch normalization and ReLU activation layers. A UNet-like skip connection is added between the final convolution layer of each dense block in the encoder and the convolution layer in the decoder. The final prediction layer is a convolution layer with 2 output channels. Before the final convolution layer, we add a dropout layer with a drop rate of 0.3.
TCSM with 3D medical images. To generalize our method to 3D medical images, e.g., liver CT scans, we train TCSM with 2D DenseUNet and 3D U-Net, respectively. For training DenseUNet on liver CT scans, the volumetric data, including both raw images and volumetric labels, is sliced into a large number of groups of three adjacent slices. The label of the middle slice in each group is used as the ground truth. In the testing stage, the network output is the concatenation of the sequential predictions on groups of three adjacent slices from the volumetric images. For training with 3D U-Net, we follow the original setting with the following modifications. We modify the base filter parameter to 16 to accommodate the input size. The optimizer is SGD with a learning rate of 0.01. A batch normalization layer is employed to facilitate the training process, and the loss function is changed to the standard weighted cross-entropy loss.
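The slicing step for the 2D network can be sketched as follows. This is an illustrative assumption about the data layout, not the authors' code: the volume is taken to be a `(D, H, W)` array, the three slices are stacked as channels, and the first and last slices (which lack a neighbor) are simply skipped. The name `adjacent_slice_triplets` is hypothetical.

```python
import numpy as np

def adjacent_slice_triplets(volume, labels):
    """Yield (three-slice input, middle-slice label) pairs from (D, H, W) arrays."""
    depth = volume.shape[0]
    for z in range(1, depth - 1):
        # Stack the previous, current and next slice as three channels: (H, W, 3).
        stack = np.stack([volume[z - 1], volume[z], volume[z + 1]], axis=-1)
        yield stack, labels[z]
```

At test time, the per-triplet predictions for the middle slices would be concatenated along the depth axis to rebuild a volumetric output.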
Details of TCSM
The transformation consistent scheme includes the horizontal flipping operation as well as four kinds of rotation operations applied to the input, with angles of $\gamma \cdot 90^\circ$ where $\gamma \in \{0, 1, 2, 3\}$.
During each training pass, one operation is randomly chosen and applied.
We avoid the other angles for implementation simplification, but the proposed framework can be generalized to other angles in general.
To keep the balance of two terms in the loss function, we evenly and randomly select the labeled and the unlabeled samples in each minibatch.
The time-dependent warming up function $\lambda(T)$ is a weighting factor that balances the supervised loss and the regularization loss. This weighting function is a Gaussian ramp-up curve, $\lambda(T) = k \cdot e^{-5(1-T)^2}$, where $T$ denotes the training epoch and $k$ scales the maximum value of the weighting function. In our experiments, we empirically set $k$ as 1.0.
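A Gaussian ramp-up weight of this kind can be sketched in a few lines. The sketch assumes the commonly used form $k \cdot e^{-5(1-t)^2}$ with $t$ the training progress normalized to $[0, 1]$; the normalization and the `rampup_epochs` parameter are assumptions for illustration and are not taken from the text.

```python
import math

def rampup_weight(epoch, rampup_epochs, k=1.0):
    """Gaussian ramp-up: k * exp(-5 * (1 - t)^2), with t = epoch / rampup_epochs
    clipped to [0, 1]. Returns k once the ramp-up period is over."""
    if rampup_epochs <= 0:
        return k
    t = min(max(epoch / rampup_epochs, 0.0), 1.0)
    return k * math.exp(-5.0 * (1.0 - t) ** 2)
```

With this shape the regularization term contributes very little at the start of training, when predictions are unreliable, and reaches its full weight `k` at the end of the ramp-up.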
The model was implemented using the Keras package and was trained with the stochastic gradient descent (SGD) algorithm (momentum 0.9, minibatch size 10). The initial learning rate was 0.01 and was decayed during training. We use standard data augmentation techniques on-the-fly to avoid overfitting. The data augmentation includes random flipping, rotation, and scaling with a random scale factor from 0.9 to 1.1. Note that all the experiments employed data augmentation for fair comparison.
In the inference phase, we remove the transformation operations in the network and run a single test with the original input for fair comparison. After getting the probability map from the network, we first apply thresholding at 0.5 to get the binary segmentation result, and then use a morphology operation, i.e., filling holes, to get the final segmentation result.
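The post-processing step can be illustrated with a small self-contained sketch. It assumes a 2D probability map, treats values at or above the threshold as foreground, and fills holes with a border flood fill under 4-connectivity; these choices, and the names `fill_holes` and `postprocess`, are illustrative assumptions rather than the authors' implementation.

```python
from collections import deque
import numpy as np

def fill_holes(mask):
    """Fill background regions not connected to the image border (4-connectivity)."""
    h, w = mask.shape
    outside = np.zeros((h, w), dtype=bool)
    # Seed the flood fill with every background pixel on the border.
    queue = deque((r, c) for r in range(h) for c in range(w)
                  if (r in (0, h - 1) or c in (0, w - 1)) and not mask[r, c])
    for r, c in queue:
        outside[r, c] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] and not outside[nr, nc]:
                outside[nr, nc] = True
                queue.append((nr, nc))
    # Anything that is not outside background is foreground or a filled hole.
    return ~outside

def postprocess(prob_map, threshold=0.5):
    """Binarize a probability map and fill interior holes."""
    return fill_holes(prob_map >= threshold)
```

A library routine for hole filling would normally be preferable in practice; the explicit flood fill is used here only to keep the example dependency-light.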
To evaluate the effectiveness of our method, we conduct experiments on various modalities of medical images, including dermoscopy images, retinal fundus images and liver CT scans.
Dermoscopy image dataset. The dermoscopy image dataset in our experiments is the 2017 ISIC skin lesion segmentation challenge dataset .
It includes a training set with 2000 annotated dermoscopic images, a validation set with 150 images, and a testing set with 600 images.
The image size ranges from to .
To balance segmentation performance and computational cost, we first resize all the images using bicubic interpolation.
Retinal fundus image dataset. The fundus image dataset is acquired from the MICCAI 2018 Retinal Fundus Glaucoma Challenge (REFUGE; https://refuge.grand-challenge.org/iChallenge-AMD/). Manual pixel-wise annotations of the optic disc were obtained by seven independent ophthalmologists from Zhongshan Ophthalmic Center, Sun Yat-sen University, China. The experiments are conducted on the released training dataset, which contains 400 retinal images. The training dataset is randomly split into a training set and a test set, and we resize all the images using bicubic interpolation.
Liver segmentation dataset. The liver segmentation dataset is from the 2017 Liver Tumor Segmentation Challenge (LiTS; https://competitions.codalab.org/competitions/17094#participate-get_data). The LiTS dataset contains 131 and 70 contrast-enhanced 3D abdominal CT scans for training and testing, respectively. The dataset was acquired with different scanners and protocols at six different clinical sites, with a largely varying in-plane resolution from 0.55 mm to 1.0 mm and slice spacing from 0.45 mm to 6.0 mm.
For the dermoscopy image dataset, we use five evaluation metrics to measure the segmentation performance, including Jaccard index (JA), Dice coefficient (DI), pixel-wise accuracy (AC), sensitivity (SE) and specificity (SP). They are defined as:
$$JA = \frac{TP}{TP+FN+FP}, \quad DI = \frac{2 \cdot TP}{2 \cdot TP+FN+FP}, \quad AC = \frac{TP+TN}{TP+FP+TN+FN}, \quad SE = \frac{TP}{TP+FN}, \quad SP = \frac{TN}{TN+FP},$$
where $TP$, $TN$, $FP$, and $FN$ refer to the number of true positives, true negatives, false positives, and false negatives, respectively. For the retinal fundus image dataset, we use JA to measure the optic disc segmentation accuracy. For the liver CT dataset, the Dice per case score is employed to measure the accuracy of the liver segmentation result, according to the evaluation of the 2017 LiTS challenge.
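As a quick worked example, the five metrics can be computed from the confusion counts using their standard definitions. The function name `segmentation_metrics` is hypothetical, and the sketch does not guard against empty denominators (for example, a case with no positive pixels at all).

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Standard pixel-wise metrics from true/false positive/negative counts."""
    return {
        "JA": tp / (tp + fp + fn),             # Jaccard index
        "DI": 2 * tp / (2 * tp + fp + fn),     # Dice coefficient
        "AC": (tp + tn) / (tp + tn + fp + fn), # pixel-wise accuracy
        "SE": tp / (tp + fn),                  # sensitivity
        "SP": tn / (tn + fp),                  # specificity
    }
```

For instance, with 6 true positives, 10 true negatives, 2 false positives and 2 false negatives, JA is 6/10 = 0.6 and DI is 12/16 = 0.75, which also illustrates that DI is never smaller than JA for the same prediction.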
We report the performance of our method trained with only 50 labeled images and 1950 unlabeled images. Note that the labeled images are randomly selected from the whole dataset. Table I shows the experiments with the supervised method, the supervised method with regularization, and our semi-supervised method on the validation dataset. We use the same network architecture (DenseUNet) in all these experiments for fair comparison. The supervised experiment is optimized by the standard cross-entropy loss on the 50 labeled images. The supervised with regularization experiment is also trained with 50 labeled images, but its total loss function is a weighted combination of the cross-entropy loss and the regularization loss, which is the same as our TCSM loss function. The TCSM experiment is trained with 50 labeled and 1950 unlabeled images in the semi-supervised manner. From Table I, our semi-supervised method achieves higher performance than the supervised counterpart on all the evaluation metrics, with prominent improvements of 2.46%, 2.64%, and 3.60% on JA, DI and SE, respectively. It is worth mentioning that the supervised with regularization experiment improves on supervised training due to the regularization loss on the labeled images; see "supervised+regu" in Table I. The consistent improvements of "supervised+regu" on all evaluation metrics demonstrate that the regularization loss is also effective for the labeled images. Figure 4 presents some segmentation results (red contour) of the supervised method (left) and our method (right). Compared with the segmentation contour achieved by the supervised method (left column), the contour from the semi-supervised method fits the ground-truth boundary more consistently. This observation shows the effectiveness of our semi-supervised learning method, i.e., TCSM, compared with the supervised method.
To show the effectiveness of the transformation consistent regularization scheme, we conduct an ablation analysis of our method on the dermoscopy image dataset. We compare our method with the most common perturbation regularizations, i.e., Gaussian noise and network dropout. Table II shows the experimental results, where "Ours-A" refers to semi-supervised learning with Gaussian noise and dropout regularization, "Ours-B" denotes semi-supervised learning with transformation consistent regularization, and "Ours" refers to the experiment with all of these regularizations. Note that all experiments are conducted on the same training data with 50 labeled and 1950 unlabeled images. As shown in Table II, both kinds of regularization independently contribute to the performance gains of semi-supervised learning. The improvement from transformation consistent regularization is very competitive compared with the performance increment from Gaussian noise and dropout regularization. We also observe that these two regularizations are complementary: when both kinds of regularization are employed, the performance is further enhanced.
Table III shows the lesion segmentation results of our semi-supervised method (trained with labeled data and unlabeled data) and supervised method (trained only with labeled data) under different number of labeled/unlabeled images. We draw the JA score of the results in Figure 5. It is obvious that the semi-supervised method consistently performs better than the supervised method in different labeled/unlabeled data settings, which demonstrates that our method effectively utilizes unlabeled data and is beneficial to the performance gains. Note that in all semi-supervised learning experiments, we train the network with 2000 images in total, including labeled images and unlabeled images. As expected, the performance of supervised training increases when more labeled training images are available; see the blue line in Figure 5. At the same time, the segmentation performance of semi-supervised learning can also be increased with more labeled training images; see the orange line in Figure 5. The performance gap between supervised training and semi-supervised learning narrows as more labeled samples are available, which conforms with our expectation. When the amount of labeled dataset is small, our method can gain a large improvement, since the regularization loss can effectively leverage more information from the unlabeled data. Comparatively, as the number of labeled data increases, the improvement becomes limited. This is partially because the labeled and unlabeled data are randomly selected from the same dataset and a large amount of labeled data may reach the upper bound performance of the dataset.
From the comparison in Figure 5 between the semi-supervised method and the supervised method trained with 2000 labeled images, we observe that our method increases the JA even when all labels are used (from 79.60% to 79.95%). This improvement indicates that the unsupervised loss also regularizes the learning on labeled data. In other words, the consistency requirement in the regularization term encourages the network to learn more robust features, which improves the segmentation performance.
Table IV: Results on the ISIC 2017 skin lesion segmentation challenge (testing set).
|Method|Labeled/Unlabeled|JA|DI|AC|SE|SP|
|Our Semi-supervised Method|300/1700|0.798|0.874|0.943|0.879|0.953|
|Yuan et al.|2000/0|0.765|0.849|0.934|0.825|0.975|
|Venkatesh et al.|2000/0|0.764|0.856|0.936|0.83|0.976|
|Berseth et al.|2000/0|0.762|0.847|0.932|0.820|0.978|
|Bi et al.|2000/0|0.760|0.844|0.934|0.802|0.985|
|Shenzhen U (Lee)|2000/0|0.718|0.810|0.922|0.789|0.975|
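Table V: JA of different semi-supervised methods on the validation set (50 labeled and 1950 unlabeled images).
|Method|Backbone|JA|Improvement|
|Bai et al.|DenseUNet|74.40%|1.55%|
|Hung et al.|DenseUNet|73.31%|0.46%|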
We compare our method with the latest semi-supervised segmentation method in the medical imaging community and with an adversarial learning based semi-supervised method. Note that the method for medical image segmentation adopts an idea similar to that of the adversarial learning based method. For a fair comparison, we re-implement their methods with the same network backbone on this dataset. We conduct experiments with the setting of 50 labeled images and 1950 unlabeled images. Table V shows the JA performance of the different methods on the validation set. As shown in Table V, our proposed method achieves a 2.46% JA improvement by utilizing unlabeled data, whereas the methods of Bai et al. and Hung et al. achieve only 1.55% and 0.46% JA improvements, respectively. This comparison shows the effectiveness of our semi-supervised segmentation method relative to other semi-supervised methods.
We also compare our method with state-of-the-art methods submitted to the ISIC 2017 skin lesion segmentation challenge. There were 21 submissions in total, and the top results are listed in Table IV. Note that the final rank is determined according to JA on the testing set. We trained two models: a semi-supervised model with 300 labeled images and 1700 unlabeled images, and a supervised model with only the 300 labeled images. The supervised model is denoted as our baseline model. As shown in Table IV, our semi-supervised method achieved the best performance on the benchmark, outperforming the state-of-the-art method by 3.3% on JA (from 76.5% to 79.8%). The performance gains on DI and SE are consistent with that on JA, with 2.5% and 5.4% improvements, respectively. Our baseline model with 300 labeled images also outperforms some of the other methods, owing to the state-of-the-art network architecture. Building on this strong baseline, our semi-supervised learning method makes further significant improvements, which demonstrates the effectiveness of the overall method.
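For reference, the three measures discussed above can be computed from the overlap between a binary prediction and its ground-truth mask. The following is a small sketch using the standard definitions of the Jaccard index (JA), Dice coefficient (DI), and sensitivity (SE); the function name and the toy masks are hypothetical, and the sketch assumes that the union of the two masks is non-empty.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Return JA, DI and SE for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.sum(pred & target)    # lesion pixels correctly predicted
    fp = np.sum(pred & ~target)   # background predicted as lesion
    fn = np.sum(~pred & target)   # lesion pixels that were missed
    return {
        "JA": tp / (tp + fp + fn),          # intersection over union
        "DI": 2 * tp / (2 * tp + fp + fn),  # Dice coefficient
        "SE": tp / (tp + fn),               # sensitivity (recall)
    }

# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
scores = segmentation_metrics(pred, target)
```

For these toy masks JA is 2/4, DI is 4/6, and SE is 2/3, which also illustrates the identity DI = 2·JA / (1 + JA) that ties the two overlap measures together.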
We report the performance of our method for optic disc segmentation from retinal fundus images. The 400 training images from the REFUGE challenge were randomly split into training and test sets at a ratio of 9:1. For training the semi-supervised model, only a portion of the labels (i.e., 10% and 20%) in the training set was used. We preprocessed all input images by subtracting the mean RGB values computed over the training set. When training the supervised model, the loss function was the traditional cross-entropy loss, and we used the SGD algorithm with a learning rate of 0.01 and a momentum of 0.9. To train the semi-supervised model, we added the extra unsupervised regularization loss and changed the learning rate to 0.001.
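The data preparation described above can be sketched as follows. The snippet uses synthetic 8×8 arrays in place of fundus images, a fixed random seed, and a hypothetical helper `select_labeled`; it also assumes that the 10% and 20% fractions are taken over the 360 training images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Split the 400 images 9:1 into training and test sets.
indices = rng.permutation(400)
train_idx, test_idx = indices[:360], indices[360:]

def select_labeled(train_idx, fraction, rng):
    """Pick the subset of training images whose labels are kept."""
    num_labeled = int(round(fraction * len(train_idx)))
    return rng.choice(train_idx, size=num_labeled, replace=False)

labeled_10 = select_labeled(train_idx, 0.10, rng)
labeled_20 = select_labeled(train_idx, 0.20, rng)

# Synthetic stand-in images with shape (N, H, W, 3).
images = rng.integers(0, 256, size=(400, 8, 8, 3)).astype(float)

# Per-channel mean over the training images only, subtracted from every image.
mean_rgb = images[train_idx].mean(axis=(0, 1, 2))
centered = images - mean_rgb
```

Under these assumptions the split yields 360 training and 40 test images, of which 36 (10%) or 72 (20%) training images keep their labels; the remaining training images contribute only through the unsupervised regularization loss.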
We report the JA performance of the supervised and semi-supervised models under the settings of 10% and 20% labeled training images. In Table VII, we also report the results of two other representative semi-supervised methods. Our method achieves a 1.52% improvement under the 10% labeled training setting, which is the highest among all these methods. The improvement achieved by our method under the 20% labeled training setting is also the highest. Figure 6 shows some visual segmentation results of our semi-supervised method, where our method better captures the boundary of the optic disc structure.
Table VI: Dice per case of liver segmentation on the LiTS dataset under the 10% and 20% labeled training settings.
|Method|Backbone|10% labeled|Improvement|20% labeled|Improvement|
|Bai et al.|DenseUNet|90.11%|1.36%|92.22%|0.81%|
|Hung et al.|DenseUNet|89.55%|0.80%|91.65%|0.24%|
|Supervised|3D U-Net - 4 blocks|88.55%|-|91.10%|-|
|Bai et al.|3D U-Net - 4 blocks|90.36%|1.81%|91.68%|0.58%|
|Our|3D U-Net - 4 blocks|91.57%|3.02%|92.05%|0.95%|
|Supervised|3D U-Net - 5 blocks|87.97%|-|88.55%|-|
|Bai et al.|3D U-Net - 5 blocks|89.64%|1.67%|89.65%|1.10%|
|Our|3D U-Net - 5 blocks|90.24%|2.27%|90.53%|1.98%|
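Table VII: JA of optic disc segmentation on the REFUGE dataset under the 10% and 20% labeled training settings.
|Method|Backbone|10% labeled|Improvement|20% labeled|Improvement|
|Bai et al.|DenseUNet|94.20%|0.59%|95.01%|0.40%|
|Hung et al.|DenseUNet|94.32%|0.71%|94.93%|0.32%|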
For this dataset, we evaluate the performance of liver segmentation from CT volumes. Under our semi-supervised setting, we randomly split the original 131 training volumes from the challenge into 118 training volumes and 13 testing volumes. For image preprocessing, we truncated the intensity values of all scans to the range of [-200, 250] HU to remove irrelevant details. We run experiments with a 2D DenseUNet and a 3D U-Net to verify the effectiveness of our method. For the 3D U-Net, the input is a randomly cropped sub-volume, which leverages the information from the third dimension. We also trained two U-Net variants, with 4 blocks and 5 blocks, to verify the effectiveness of our method on the 3D CT scans.
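The intensity truncation and random cropping can be illustrated with a short NumPy sketch. The HU window follows the text above, whereas the function names, the synthetic volume, and in particular the crop shape are hypothetical placeholders, not the values used in the experiments.

```python
import numpy as np

def clip_hu(volume, low=-200.0, high=250.0):
    """Truncate CT intensities to the [low, high] HU window."""
    return np.clip(volume, low, high)

def random_crop_3d(volume, crop_shape, rng):
    """Cut a random sub-volume of the given shape out of a 3D array."""
    starts = [rng.integers(0, dim - c + 1) for dim, c in zip(volume.shape, crop_shape)]
    slices = tuple(slice(s, s + c) for s, c in zip(starts, crop_shape))
    return volume[slices]

rng = np.random.default_rng(0)
ct = rng.uniform(-1000.0, 1500.0, size=(32, 64, 64))  # synthetic stand-in for a CT scan
clipped = clip_hu(ct)
patch = random_crop_3d(clipped, (16, 32, 32), rng)    # hypothetical crop shape
```

Clipping before cropping keeps every sampled sub-volume in the same intensity range, so the network never sees values outside the chosen window.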
Following the evaluation of the 2017 LiTS challenge, we employed the Dice per case score to evaluate the liver segmentation results, which refers to the average Dice score per volume. In Table VI, we report the performance of our method and of two other semi-supervised methods under the settings of 10% and 20% labeled training images. With the DenseUNet baseline, our approach achieves the highest performance improvement in both the 10% and 20% labeled training settings, with 2.40% and 2.17% improvements, respectively. For the 3D U-Net, the variant with 4 blocks achieves better results than the variant with 5 blocks. In the semi-supervised setting, our method consistently achieves higher performance than that of Bai et al. in both the 10% and 20% settings. We also visualize some liver segmentation results from CT scans in the second row of Figure 6.
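To make the distinction concrete, the sketch below computes the Dice per case score as the mean of per-volume Dice values and contrasts it with a Dice score pooled over all voxels. The function names and toy volumes are hypothetical, and returning 1.0 when both masks are empty is an assumed convention.

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention for two empty masks
    return 2.0 * np.sum(pred & target) / denom

def dice_per_case(preds, targets):
    """Average of the Dice scores computed separately for each volume."""
    return float(np.mean([dice(p, t) for p, t in zip(preds, targets)]))

def dice_pooled(preds, targets):
    """Dice computed once over the voxels of all volumes together."""
    inter = sum(np.sum(np.asarray(p, bool) & np.asarray(t, bool)) for p, t in zip(preds, targets))
    denom = sum(np.sum(p) + np.sum(t) for p, t in zip(preds, targets))
    return 2.0 * inter / denom

# Two toy "volumes" (flattened): a large, perfectly segmented liver and a small one with an error.
preds = [np.ones(8, dtype=int), np.array([1, 1, 0, 0])]
targets = [np.ones(8, dtype=int), np.array([1, 0, 0, 0])]

per_case = dice_per_case(preds, targets)  # (1.0 + 2/3) / 2
pooled = dice_pooled(preds, targets)      # 18 / 19
```

Averaging per case weights every volume equally, so an error on a small liver lowers the score more than it would under pooling, where large volumes dominate.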
Supervised deep learning has proven extremely effective for many problems in the medical imaging community. However, the promising performance of supervised learning relies heavily on the availability of massive annotations. Developing new learning methods that work with limited annotation will largely advance real-world clinical applications. In this work, we focus on developing semi-supervised learning methods for medical image segmentation. These methods have great potential to reduce the annotation effort by taking advantage of large amounts of unlabeled data, and to make progress beyond supervised learning. The key insight of our semi-supervised learning method is the transformation consistent self-ensembling strategy. The extensive experiments on three representative and challenging datasets demonstrate the improvements achieved by our method.
Medical image data come in different formats, such as 2D in-plane scans (e.g., dermoscopy images and fundus images) and 3D volumetric data (e.g., MRI and CT). In this paper, we employ both 2D and 3D networks to conduct segmentation for these data formats. Our method is flexible and can be easily applied to both 2D and 3D networks. It is worth mentioning that the recent works [4, 27] are specifically designed for 3D volume data by considering three-view co-training, i.e., the coronal, sagittal and axial views of the volumetric data. In contrast, we aim for a more general approach that is applicable to both 2D and 3D medical images. For 3D semi-supervised learning, a promising direction may be to design specific methods that consider the inherent 3D properties of volumetric data.
The recent works on network equivariance [23, 28, 29] improve the generalization capacity of the trained network by exploiting the equivariance property. For example, Cohen and Welling presented a group equivariant neural network that is equivariant to 90-degree rotations and dihedral flips, aiming at improving the generalization capacity and achieving higher results with a comparable number of weights. Our method also leverages the transformation consistency principle, but, differently, we aim at the semi-supervised segmentation task. Moreover, if we trained such equivariant networks, e.g., the harmonic network, in a semi-supervised way to leverage the unlabeled data, the transformation regularization would ideally have no effect, since the network outputs are already consistent when the transformation is applied to the input images. Therefore, the limited regularization would restrict the performance improvement obtained from the unlabeled data.
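The argument that a transformation regularizer vanishes for an exactly equivariant network can be checked numerically. The sketch below does not implement a group equivariant or harmonic network; as a simpler stand-in, it makes a hypothetical non-equivariant map `plain_net` equivariant to 90-degree rotations by averaging it over the four rotations, and then evaluates an assumed mean squared consistency term for both maps.

```python
import numpy as np

def plain_net(x):
    """Hypothetical non-equivariant map: mixes each pixel with its left neighbour."""
    return np.tanh(x + 0.5 * np.roll(x, 1, axis=1))

def rotation_averaged_net(x):
    """Average plain_net over the four 90-degree rotations.

    For square inputs this construction is equivariant to 90-degree rotations
    regardless of how plain_net behaves.
    """
    outputs = [np.rot90(plain_net(np.rot90(x, k)), -k) for k in range(4)]
    return np.mean(outputs, axis=0)

def rotation_consistency(net, x, k=1):
    """Mean squared difference between rotating the output and rotating the input."""
    return float(np.mean((np.rot90(net(x), k) - net(np.rot90(x, k))) ** 2))

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # synthetic square input

loss_plain = rotation_consistency(plain_net, image)
loss_equivariant = rotation_consistency(rotation_averaged_net, image)
```

In this sketch `loss_plain` is clearly positive, whereas `loss_equivariant` is zero up to floating-point rounding, so the consistency term would provide no training signal for the equivariant map.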
One limitation of our method is that we assume both labeled and unlabeled data come from the same distribution. However, in real-world clinical applications, the labeled and unlabeled data may not be collected from the same distribution, and there may exist a domain shift between them. Oliver et al. demonstrated that the performance of semi-supervised learning methods can degrade substantially when the unlabeled dataset contains out-of-distribution examples. However, most current semi-supervised approaches for medical image segmentation do not consider this issue. Therefore, in the future, we would like to explore domain adaptation techniques and investigate how to combine them with the self-ensembling strategy, to bring our method closer to real-world clinical applications.
In this paper, we present a novel semi-supervised learning method for medical image segmentation. The whole framework is trained with a weighted combination of the supervised loss and the unsupervised loss. Specifically, we introduce a transformation consistent self-ensembling model for the segmentation task, which enhances the regularization effect to utilize the unlabeled data and can be easily applied to both 2D and 3D networks. Comprehensive experimental analysis on three medical imaging datasets, i.e., a skin lesion dataset, a retinal image dataset and a liver CT dataset, demonstrates the effectiveness of our method. Our method is general and can be applied to other semi-supervised medical image analysis problems. Future work includes investigating domain adaptation techniques to enhance the effectiveness of our semi-supervised learning method.
M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications of deep learning and reinforcement learning to biological data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2063–2079, 2018.
V. Cheplygina, M. de Bruijne, and J. P. Pluim, “Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis,” arXiv preprint, 2018.
N. Dong, M. Kampffmeyer, X. Liang, Z. Wang, W. Dai, and E. Xing, “Unsupervised domain adaptation for automatic estimation of cardiothoracic ratio,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 544–552.