Although deep learning-based approaches have achieved great success in medical image segmentation, they usually require large amounts of well-annotated data, which can be extremely expensive to obtain in the field of medical image analysis. Unlabeled data, on the other hand, is much easier to acquire. Semi-supervised learning and unsupervised domain adaptation both take advantage of unlabeled data, and they are closely related to each other. In this paper, we propose uncertainty-aware multi-view co-training (UMCT), a unified framework that addresses these two tasks for volumetric medical image segmentation. Our framework is capable of efficiently utilizing unlabeled data for better performance. We first rotate and permute the 3D volumes into multiple views and train a 3D deep network on each view. We then apply co-training by enforcing multi-view consistency on unlabeled data, where an uncertainty estimate of each view is utilized to achieve accurate labeling. Experiments on the NIH pancreas segmentation dataset and a multi-organ segmentation dataset show state-of-the-art performance of the proposed framework on semi-supervised medical image segmentation. Under unsupervised domain adaptation settings, we validate the effectiveness of this work by adapting our multi-organ segmentation model to two pathological organs from the Medical Segmentation Decathlon datasets. Additionally, we show that our UMCT-DA model can even effectively handle the challenging situation where labeled source data is inaccessible, demonstrating strong potential for real-world applications.
Deep learning has achieved great success in various computer vision tasks, such as 2D image recognition [31, 56, 59, 22, 26] and semantic segmentation [39, 7, 67, 8]. However, deep networks usually rely on large-scale labeled datasets for training. When it comes to medical volumetric data, human labeling can be extremely costly and often requires expert domain knowledge. Medical image segmentation (i.e. labeling tissues and organs in CT and MRI scans) plays a critical role in biomedical image analysis and surgical planning. Deep learning-based approaches have been widely adopted for this task and have led to state-of-the-art performance [48, 40, 64, 65]. However, acquiring well-annotated segmentation labels in medical images requires both the high-level expertise of radiologists and careful manual labeling of object masks or surface boundaries.
In this paper, we aim to design an approach that can utilize large-scale unlabeled data to improve volumetric medical image segmentation, and that is applicable to both semi-supervised learning (SSL) and unsupervised domain adaptation (UDA). SSL and UDA share a common setting by assuming the availability of a labeled training set as well as an unlabeled one. The difference between the two tasks is that the labeled and unlabeled sets are assumed to follow the same distribution in SSL, while a larger domain shift between them is assumed in the UDA setting. Despite this difference, approaches to these two tasks are often closely related. SSL approaches such as self-training [49, 37, 2], co-training [5, 70] and GAN-based methods [11, 32] have been widely applied to UDA [9, 24, 61, 55, 14, 73, 72], and vice versa.
We further extend the idea of co-training to 3D volumetric data. Typical co-training requires at least two views (i.e. sources) of the data, each of which should be sufficient to train a classifier on. Co-training minimizes the disagreement between views by assigning pseudo labels from one view to another on unlabeled data. Co-training has further been proved to have PAC-like guarantees on semi-supervised learning under the additional assumption that the two views are conditionally independent given the category. Since most computer vision tasks have only one source of data, encouraging view differences is a crucial factor for successful co-training. For example, deep co-training trains multiple deep networks to act as different views and utilizes adversarial examples to address this issue. Another aspect of co-training to emphasize is view confidence estimation. In multi-view settings, as the differences between views grow, the quality of each prediction becomes less and less guaranteed, which may result in bad pseudo labels that are harmful if used in the training process. Co-training could benefit from trusting reliable predictions and down-weighting unreliable ones. However, distinguishing reliable from unreliable predictions is challenging for unlabeled data due to the lack of ground truth.
To address the above two important issues, we propose an uncertainty-aware multi-view co-training (UMCT) framework, shown in Fig. 2. We introduce view differences by exploring multiple viewpoints of 3D data through spatial transformations, such as rotation and permutation. The permutation here is defined as a rearrangement of the coordinate system, such as transpose and flip, and a “view” is defined as the transformed input data after permutation. Hence, our multi-view approach naturally applies to analyzing 3D data and can be integrated with the proposed co-training framework. Fig. 1 gives an example of the intuition behind our approach in a two-view scenario. On unlabeled data, we propose to maximize the similarity of the predictions between the two views, resulting in improved segmentation performance on each view. Another key component is view confidence estimation. We propose to estimate the uncertainty of the predictions in each view with Bayesian deep networks by adding dropout to the architectures. A confidence score is computed based on epistemic uncertainty, which can act as a weight for each prediction. After propagation through this uncertainty-weighted label fusion module (ULF), a set of more accurate pseudo labels is obtained for each view, which is used as the supervision signal for unlabeled data.
UMCT was previously published as a conference paper, in which we verified its effectiveness under standard semi-supervised settings on individual organs. In this paper, we extensively validate our approach on more challenging tasks, e.g. multi-organ segmentation. Moreover, we apply our approach to the task of unsupervised domain adaptation, with a labeled source domain and an unlabeled target domain. In medical image analysis, this is considered an important task, since one should prefer a model or an approach that can generalize across datasets from different data sources (e.g. types of machines, acquisition protocols, and characteristics of patients). In addition to the original experiments on the NIH pancreas dataset, we validate our approach on a multi-organ dataset with 8 labeled abdominal organs under semi-supervised settings. We then utilize our co-training approach to adapt the multi-organ model to the Medical Segmentation Decathlon (MSD) pathological liver and pancreas datasets. We even push our approach one step further by assuming that we only have the source model in the absence of source data. With very simple modifications, our final model UMCT-DA shows strong potential in this challenging scenario.
2 Related Work
Emerging semi-supervised approaches have been successfully applied to image recognition using deep neural networks [1, 47, 53, 33, 60, 41, 6]. These algorithms mostly rely on additional regularization terms to train the networks to be resistant to some specific noise. A recent approach extended the co-training strategy to 2D deep networks and multiple views, using adversarial examples to encourage view differences and boost performance.
Semi-supervised medical image analysis. Current semi-supervised medical image analysis methods have been categorized into three types: self-training (teacher-student models), co-training (with hand-crafted features) and graph-based approaches (mostly applications of graph-cut based optimization). One line of work introduced a deep network based self-training framework with conditional random field (CRF) based iterative refinement for medical image segmentation. Another trained three 2D networks on three planar slices of the 3D data and fused them in each self-training iteration to obtain a stronger student model. The self-ensembling approach has also been extended with 90-degree rotations to make the network rotation-invariant. Generative adversarial network (GAN) based approaches have recently become popular for medical imaging as well [12, 28, 43]. Moreover, mixed supervision [54, 42], which combines dense label masks with weak labels such as bounding boxes or slice- and image-level labels, is another field of study that aims to alleviate labeling efforts and is related to semi-supervised medical image analysis.
Uncertainty Estimation. Traditional approaches include particle filtering and CRFs [4, 23]. For deep learning, uncertainty is more often measured with Bayesian deep networks [15, 16, 30]. In our work, we emphasize the importance of uncertainty estimation in semi-supervised learning, since most of the training data here is not annotated. We propose to estimate the confidence of each view in our co-training framework via Bayesian uncertainty estimation.
2D/3D hybrid networks. 2D networks and 3D networks both have advantages and limitations. The former benefit from 2D pre-trained weights and well-studied architectures on natural images, while the latter better explore 3D information by utilizing 3D convolution kernels. [63, 35] use either 2D probability maps or 2D feature maps for building 3D models. A 3D architecture that can be initialized by 2D pre-trained models has also been proposed. Moreover, [51, 69] illustrate the effectiveness of multi-view training on 2D slices, even by simply averaging multi-planar results, indicating that complementary latent information exists in the biases of 2D networks. This inspired us to train 3D multi-view networks with 2D initializations jointly, using an additional loss function that encourages the networks to learn from one another.
Unsupervised Domain Adaptation. In contrast to semi-supervised learning, domain adaptation problems often involve two datasets that have different distributions. Under unsupervised domain adaptation (UDA) settings, networks are trained with a labeled source domain and an unlabeled target domain. Traditional approaches [52, 20, 44, 58] align domains with statistical constraints. Recent works [17, 18, 24, 61, 25, 55, 14, 73, 72] utilize adversarial training and self-training to adapt features between the source domain and the target domain. In the field of medical image analysis, [29, 13, 45] have investigated this topic with existing approaches, i.e. adversarial training and self-training.
3 Problem Definitions
Before we describe our proposed approach, we first discuss the definitions of, and relationships between, the three problems: semi-supervised learning (SSL), unsupervised domain adaptation (UDA) and UDA without data from the source domain. Table 1 compares the three problems.
Semi-supervised learning. Under standard semi-supervised learning (SSL) settings, the whole available dataset consists of a labeled dataset and an unlabeled dataset. Labeled data comes as pairs of an input and its ground truth, while unlabeled data consists of inputs only. We aim to improve performance on a specific task with unlabeled data. When we consider volumetric medical image segmentation, each input is a three-dimensional tensor and the ground truth is a densely-labeled voxel-wise 3D segmentation mask.
Unsupervised domain adaptation (UDA) assumes a labeled source domain dataset and an unlabeled target domain dataset, where the data distributions are different but the tasks are identical. Our goal is to achieve relatively high performance on a specific task in the target domain. In medical image analysis, domain gaps can result from differences in imaging modalities (e.g. CT / MRI / PET), image quality or imaging protocols (e.g. various machine types and doses of radiation), types of patients (e.g. healthy or with disease), and combinations thereof. The difference between UDA and SSL lies only in the data distributions, so SSL approaches can also be applied to solve UDA problems. In our paper, we illustrate that our proposed approach can effectively handle both problems.
UDA without data from the source domain has barely been investigated in the literature but is an important challenge to be addressed in the field of medical imaging. Here we assume that a model pre-trained on the source domain and unlabeled data from the target domain are available. Unlike in standard UDA, data from the source domain is absent.
In our work, we aim to propose a unified approach that is capable of solving the three tasks described above.
|UDA w/o||N/A||unlabeled||no||on dataset 1|
4 Uncertainty-aware Multi-view Co-training
In this section, we introduce our framework of uncertainty-aware multi-view co-training (UMCT) for semi-supervised segmentation and domain adaptation. UMCT is designed to effectively utilize unlabeled data and was first targeted at semi-supervised segmentation of volumetric medical images. In the following sections, we explain how this is achieved in our 3D framework: a general mathematical formulation of the approach is given in Sec 4.1; we then demonstrate how to encourage view differences in Sec 4.2 and how to compute the confidence of each view by uncertainty estimation in Sec 4.3, which are the two factors that boost the performance of co-training. Finally, the UMCT-DA model for unsupervised domain adaptation is introduced in Sec 4.4.
4.1 Overall Framework
We first consider the task of semi-supervised segmentation for 3D data. Recall that we have a labeled set and an unlabeled set. Each labeled data pair consists of an input volume and its ground truth, while unlabeled data consists of input volumes only. The ground truth is a voxel-wise segmentation label map that has the same shape as the input volume.
For each input volume, we generate several different views of the 3D data by applying one transformation (rotation or permutation) per view, resulting in multi-view inputs. Such operations introduce a data-level view difference. One model per view is then trained on the corresponding view of the data. For labeled data, a supervised loss function is optimized that measures the similarity between the prediction of each view and the ground truth. It uses a standard loss function for segmentation tasks, applied to the voxel-wise prediction score maps of each view after the inverse rotation or permutation.
For unlabeled data, we make a co-training assumption under a semi-supervised setting. The co-training strategy assumes that the predictions on each view should reach a consensus, so the prediction of each model can act as a pseudo label to supervise the other views in order to learn from unlabeled data. However, since the predictions of the views are expected to be diverse after encouraging view differences, the quality of each view’s prediction needs to be measured before generating trustworthy pseudo labels. This is accomplished via the uncertainty-weighted label fusion module (ULF) introduced in Sec 4.3. With ULF, the co-training loss for unlabeled data compares the prediction of each view with the pseudo label that ULF computes for that view; the ULF computation is explained further in Sec 4.3.
Overall, the combined loss function joins the supervised loss on labeled data with the co-training loss on unlabeled data, balanced by a tunable weight coefficient.
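As a reading aid, one way to write the objective described in this subsection is sketched below. The notation is an assumption and is not taken from the original: $\mathcal{S}$ and $\mathcal{U}$ stand for the labeled and unlabeled sets, $N$ for the number of views, $p_i(\mathbf{X})$ for the prediction of view $i$ mapped back to the original orientation, $\hat{\mathbf{Y}}_i$ for the pseudo label that ULF produces for view $i$, $\mathcal{L}$ for the segmentation loss and $\lambda$ for the weight coefficient. Summing over views and attaching $\lambda$ to the co-training term are likewise assumptions.

```latex
\begin{align}
\mathcal{L}_{\mathrm{sup}}(\mathbf{X}, \mathbf{Y}) &= \sum_{i=1}^{N} \mathcal{L}\bigl(p_i(\mathbf{X}), \mathbf{Y}\bigr), \\
\mathcal{L}_{\mathrm{cot}}(\mathbf{X}) &= \sum_{i=1}^{N} \mathcal{L}\bigl(p_i(\mathbf{X}), \hat{\mathbf{Y}}_i\bigr), \\
\min \; & \sum_{(\mathbf{X}, \mathbf{Y}) \in \mathcal{S}} \mathcal{L}_{\mathrm{sup}}(\mathbf{X}, \mathbf{Y})
        + \lambda \sum_{\mathbf{X} \in \mathcal{U}} \mathcal{L}_{\mathrm{cot}}(\mathbf{X}).
\end{align}
```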
4.2 Encouraging View Differences
A successful co-training requires the “views” to be different in order to learn complementary information in the training procedure. In our framework, several techniques are proposed to encourage view differences, both at the data level and the feature level of the neural networks.
3D multi-view generation. As stated above, in order to generate multi-view data, we transform the input volume into multiple views by rotations or permutations (a permutation rearranges the dimensions of an array in a specific order). For three-view co-training, these can correspond to the coronal, sagittal and axial views in medical imaging, which match the multi-planar reformatted views that radiologists typically use to analyze an image. Such an operation is a natural way to introduce data-level view differences.
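The view generation step can be sketched with NumPy as follows. This is a minimal illustration, not the original implementation: the axis orders in `VIEW_AXES`, the view names and the helper names `to_view` / `from_view` are hypothetical, and which axis order corresponds to which anatomical plane depends on how the volume is stored.

```python
import numpy as np

# Hypothetical axis orders for a volume stored as (D, H, W).
VIEW_AXES = {
    "view_0": (0, 1, 2),
    "view_1": (1, 0, 2),
    "view_2": (2, 0, 1),
}

def to_view(volume, axes, flip=False):
    """Permute the axes of a 3D volume and optionally flip the last axis."""
    out = np.transpose(volume, axes)
    return out[..., ::-1] if flip else out

def from_view(view, axes, flip=False):
    """Undo to_view so that predictions from all views are voxel-aligned."""
    out = view[..., ::-1] if flip else view
    inverse = tuple(int(a) for a in np.argsort(axes))
    return np.transpose(out, inverse)

volume = np.arange(2 * 3 * 4).reshape(2, 3, 4)
# Three axis orders, each with and without a flip, give six views.
views = {
    (name, flip): to_view(volume, axes, flip)
    for name, axes in VIEW_AXES.items()
    for flip in (False, True)
}
```

Applying `from_view` to a network output restores the original orientation, which is what allows predictions from different views to be compared voxel by voxel.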
Asymmetric 3D kernels and 2D initialization. The co-training assumption encourages models to make similar predictions on both labeled and unlabeled data, which can potentially lead to collapsed neural networks, a phenomenon that results in a sudden and significant drop in validation accuracy during the training of co-training algorithms. In our multi-view settings, this could also happen when the models from different views only learn permuted or rotated versions of the same kernels, resulting in exactly the same learned feature representation despite the difference in viewpoint. To address this problem, we further encourage view differences at the feature level by designing a task-specific model. We propose to use asymmetric 3D models initialized with 2D pre-trained weights as the backbone network of each view, which encourages each view to learn diverse features. In practice, we replace the symmetric 3D convolutional kernels with asymmetric ones in each branch after the permutation, to avoid learning symmetrical representations among views. This structure also makes it convenient to initialize the model with 2D pre-trained weights and fine-tune it in a 3D fashion.
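To illustrate why an asymmetric kernel is convenient to initialize from 2D weights, the sketch below lifts a 2D convolution weight to a 3D one with a single slice along the third spatial axis. The `k x k x 1` shape and the helper name `inflate_2d_kernel` are assumptions made for illustration; the exact kernel sizes are not stated in this text.

```python
import numpy as np

def inflate_2d_kernel(weight_2d):
    """Lift a 2D conv weight (out, in, k, k) to a 3D weight (out, in, k, k, 1).

    The added axis has length 1, so the 3D kernel applies the pre-trained
    2D filter within each slice and does not mix neighbouring slices.
    """
    return weight_2d[..., np.newaxis]

rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 4, 3, 3))  # stand-in for pre-trained 2D weights
w3d = inflate_2d_kernel(w2d)
print(w3d.shape)  # (8, 4, 3, 3, 1)
```

Because each branch receives a differently permuted volume, the thin kernel axis falls on a different anatomical axis in each view, so the branches cannot simply learn rotated copies of one another's kernels.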
4.3 Computing Reliable Pseudo Labels for Unlabeled Data with Uncertainty Estimation
Encouraging view differences means enlarging the variance of each view’s prediction. This raises the question of which view we should trust most on unlabeled data during co-training. Bad predictions from one view may hurt the training procedure of other views through pseudo-label assignments. Conversely, trusting a good prediction as a “strong” label during co-training boosts the performance of the other views and thus improves overall semi-supervised learning. Instead of assigning a pseudo label to each view directly from the predictions of the other views, we propose an adaptive approach, namely the uncertainty-weighted label fusion module (ULF), to fuse the outputs of different views. ULF is built on top of all the views: it takes the predictions of each view as input and outputs a set of pseudo labels for each view.
Motivated by uncertainty measurements in Bayesian deep networks, we measure the uncertainty of each view branch for each training sample after turning our model into a Bayesian deep network by adding dropout layers. Of the two candidate types of uncertainty, aleatoric and epistemic, we choose to compute the epistemic uncertainty, which is driven by the lack of training data. Such a measurement fits the goal of semi-supervised learning: to improve the model’s generalizability by exploring unlabeled data. Given a Bayesian deep network, the epistemic uncertainty is estimated from a set of sampled outputs. These sampled outputs are obtained by feeding the same input volume into the sub-networks defined by different random dropout configurations. The voxel-wise epistemic uncertainty is then estimated as the statistical variance of the sampled predictions. More details are available in Sec 4.5.
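The sampling procedure described above can be sketched as follows. This is a toy illustration under stated assumptions: `forward` is a hypothetical callable that runs one stochastic pass with a freshly drawn dropout mask, `toy_forward` is a stand-in for a real network, and the number of samples is arbitrary.

```python
import numpy as np

def mc_dropout_uncertainty(forward, volume, num_samples=10, seed=0):
    """Estimate voxel-wise epistemic uncertainty from stochastic passes.

    `forward(volume, rng)` must apply a new random dropout mask on every
    call and return a probability map with the same shape as `volume`.
    """
    rng = np.random.default_rng(seed)
    samples = np.stack([forward(volume, rng) for _ in range(num_samples)])
    mean = samples.mean(axis=0)      # averaged prediction
    variance = samples.var(axis=0)   # voxel-wise variance across samples
    return mean, variance

def toy_forward(volume, rng, drop_prob=0.5):
    """Stand-in for a network: dropout on the input followed by a sigmoid."""
    mask = rng.random(volume.shape) >= drop_prob
    dropped = volume * mask / (1.0 - drop_prob)
    return 1.0 / (1.0 + np.exp(-dropped))

volume = np.linspace(-2.0, 2.0, 4 * 4 * 4).reshape(4, 4, 4)
mean, variance = mc_dropout_uncertainty(toy_forward, volume)
```

In a real network the dropout layers would stay active at inference time, so that repeated passes on the same volume yield different outputs.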
With a transformation function, we can transform the uncertainty score into a confidence score. In practice, we simply use the reciprocal of the uncertainty (see Sec 4.5). After normalization over all views, the confidence score acts as the weight with which each prediction contributes to the pseudo labels of other views. The pseudo label assigned to a single view is thus a confidence-weighted combination of the predictions from all the other views.
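A minimal sketch of this fusion step is given below, assuming one scalar (image-wise) uncertainty per view and the reciprocal as the confidence transformation. The function name `ulf_pseudo_labels` and the small constant `eps` are illustrative choices, not part of the original method description.

```python
import numpy as np

def ulf_pseudo_labels(predictions, uncertainties, eps=1e-8):
    """Fuse voxel-aligned view predictions into one pseudo label per view.

    predictions:   list of N arrays with identical shape.
    uncertainties: list of N scalars, one uncertainty per view.
    The pseudo label for view i is the confidence-weighted average of the
    predictions of all views j != i.
    """
    preds = np.stack(predictions)
    confidence = 1.0 / (np.asarray(uncertainties, dtype=float) + eps)
    pseudo_labels = []
    for i in range(len(predictions)):
        others = [j for j in range(len(predictions)) if j != i]
        weights = confidence[others] / confidence[others].sum()
        pseudo_labels.append(np.tensordot(weights, preds[others], axes=1))
    return pseudo_labels

# Three toy views with constant predictions and different uncertainties.
preds = [np.full((2, 2, 2), v) for v in (0.2, 0.6, 1.0)]
fused = ulf_pseudo_labels(preds, uncertainties=[1.0, 1.0, 3.0])
```

With only two views the weights reduce to 1, so each view simply receives the other view's prediction as its pseudo label.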
4.4 UMCT-DA model for unsupervised domain adaptation
Standard unsupervised domain adaptation (UDA)
We also validate our approach in the unsupervised domain adaptation setting, where a labeled source domain and an unlabeled target domain are available for training. The task is shared between the two domains and the ultimate goal is to achieve good performance on the target domain test data. Apart from the domain shift between labeled and unlabeled data, the overall settings of semi-supervised learning (SSL) and unsupervised domain adaptation (UDA) are the same. Hence, we can directly apply UMCT to solve this problem. The optimization objective keeps the same form, with the supervised loss computed on the labeled source domain and the co-training loss computed on the unlabeled target domain.
UDA without source domain data
Standard UDA methods usually require access to source domain data so that joint training can be performed while adapting to the target domain. Here we consider a more challenging setting where source domain data is unavailable and only a deep network model pre-trained on the source domain is available. In our co-training framework, when source data is unavailable, we can still fine-tune the pre-trained model on the unlabeled target data by iteratively refining pseudo labels. Since no labeled data is available in this setting, the objective function of UMCT-DA is built on the pseudo-label supervision obtained from co-training on the unlabeled target data.
4.5 Implementation Details
Network Structure. In practice, we build an encoder-decoder network based on ResNet-18 and modify it into a 3D version. In the encoder, the first convolution layer is extended to 3D kernels for low-level 3D feature extraction. All other convolution layers are simply changed into asymmetric kernels that can be trained as 3D convolution layers. In the decoder, we adopt 3 skip connections from the encoder, followed by 3D convolutions, to provide the low-level cues needed for accurate boundary prediction in segmentation tasks.
Uncertainty-weighted Label Fusion. For view confidence estimation, we modify the network into a Bayesian deep network by adding dropout layers. We sample multiple outputs for each view and compute the voxel-wise epistemic uncertainty. Since we use the Dice loss, a common loss function for medical image segmentation that is computed at the image level, an image-wise uncertainty estimate is most suitable. We thus sum the voxel-wise uncertainty over the whole volume to estimate the uncertainty of each view. We then simply use the reciprocal as the confidence transformation function to compute the confidence score. The resulting pseudo label assigned to each view is a weighted average of the predictions of multiple views based on the normalized confidence score.
Loss Function. We extend the Dice loss to multi-class targets and use it as our training objective function.
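One common way to extend the Dice loss to several classes is to compute a soft Dice score per class and average over classes, as sketched below. This particular form (plain sums in the denominator, a small smoothing term `eps`, and an unweighted mean over classes) is an assumption and may differ from the formulation actually used.

```python
import numpy as np

def multiclass_dice_loss(probs, target, eps=1e-5):
    """Soft Dice loss averaged over classes.

    probs:  array of shape (C, D, H, W) with per-class probabilities.
    target: integer array of shape (D, H, W) with labels in [0, C).
    """
    num_classes = probs.shape[0]
    one_hot = np.stack([target == c for c in range(num_classes)]).astype(probs.dtype)
    spatial_axes = (1, 2, 3)
    intersection = (probs * one_hot).sum(axis=spatial_axes)
    denominator = probs.sum(axis=spatial_axes) + one_hot.sum(axis=spatial_axes)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice_per_class.mean()

target = np.zeros((4, 4, 4), dtype=int)
target[1:3, 1:3, 1:3] = 1
# A perfect prediction is the one-hot encoding of the target itself.
perfect = np.stack([target == 0, target == 1]).astype(float)
loss_perfect = multiclass_dice_loss(perfect, target)
loss_inverted = multiclass_dice_loss(perfect[::-1], target)
```

Because the score is computed over the whole volume for each class, small organs contribute to the loss on the same footing as large ones.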
Data Pre-Processing. All the training and testing data are first re-sampled to an isotropic volume resolution of 1.0 mm along each axis. Data intensities are normalized to have zero mean and unit variance. We adopt patch-based training and sample fixed-size training patches with a fixed ratio between foreground and background.
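The normalization and patch sampling steps might look like the sketch below. The patch size, the foreground probability and the helper names are hypothetical, since the concrete values are not given here, and re-sampling to isotropic resolution is omitted.

```python
import numpy as np

def normalize_intensity(volume):
    """Normalize a volume to zero mean and unit variance."""
    volume = volume.astype(np.float32)
    return (volume - volume.mean()) / (volume.std() + 1e-8)

def sample_patch(volume, label, patch_size, foreground_prob, rng):
    """Crop one patch, centered on a foreground voxel with the given probability.

    Assumes the volume is at least as large as the patch along every axis.
    """
    shape = np.array(volume.shape)
    size = np.array(patch_size)
    foreground = np.argwhere(label > 0)
    if len(foreground) > 0 and rng.random() < foreground_prob:
        center = foreground[rng.integers(len(foreground))]
    else:
        center = np.array([rng.integers(s) for s in shape])
    # Shift the patch where needed so that it stays inside the volume.
    start = np.clip(center - size // 2, 0, shape - size)
    region = tuple(slice(int(a), int(a + b)) for a, b in zip(start, size))
    return volume[region], label[region]

rng = np.random.default_rng(0)
volume = rng.normal(100.0, 20.0, size=(16, 16, 16))
label = np.zeros((16, 16, 16), dtype=int)
label[10:13, 2:5, 12:15] = 1
normalized = normalize_intensity(volume)
patch, patch_label = sample_patch(normalized, label, (8, 8, 8), 0.5, rng)
```

Biasing the sampling towards foreground voxels keeps small organs represented in the training patches.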
Training. Our training algorithm is shown in Algorithm 1. We first train the views separately on the labeled data and then conduct co-training by fine-tuning the weights. The stochastic gradient descent (SGD) optimizer is used in both stages. In the view-wise training stage, a constant learning rate policy with momentum and weight decay is used for 20k iterations. In the co-training stage, we adopt a constant learning rate policy and train for 5k iterations. We report results for the value of the co-training weight coefficient that gave the best performance. The batch size is 20 in co-training, of which 4 images are labeled and 16 are unlabeled, maintaining a labeled-to-unlabeled ratio of 1:4.
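The 1:4 batch composition used in the co-training stage can be sketched as a simple sampler. The function name and the case identifiers are hypothetical, and sampling without replacement within a batch is an assumption.

```python
import random

def mixed_batches(labeled_ids, unlabeled_ids, num_labeled=4,
                  num_unlabeled=16, num_iterations=3, seed=0):
    """Yield (labeled, unlabeled) id lists with a fixed 4:16 composition."""
    rng = random.Random(seed)
    for _ in range(num_iterations):
        yield (rng.sample(labeled_ids, num_labeled),
               rng.sample(unlabeled_ids, num_unlabeled))

# Example with 6 labeled and 56 unlabeled training cases.
labeled_ids = list(range(6))
unlabeled_ids = list(range(100, 156))
batches = list(mixed_batches(labeled_ids, unlabeled_ids))
```

Each batch then feeds the supervised loss with its labeled part and the co-training loss with its unlabeled part.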
Our framework is implemented in PyTorch. For 3D ResNet-18 on the NIH dataset, the whole co-training procedure takes 24 hours on a single NVIDIA Titan RTX GPU with 24 GB of memory. In our implementation, training occupies 15 GB of GPU memory in total.
Testing. In the testing phase, there are two ways to produce the final output: either choose the prediction of one single view, or ensemble the predictions of the multi-view outputs with majority voting. We report both results in subsequent sections for fair comparison with the baselines, since the multiple view networks can be thought of as similar to an ensemble of several single view models. The experimental results show that our model improves the performance in both settings (single view and multi-view ensemble) over all the other approaches. We use sliding-window testing and re-sample our testing results back to the original image resolution to obtain the final results. Testing time for each case is on the order of minutes, depending on the size of the input volume.
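The majority-voting ensemble can be sketched as below, assuming that the label maps of all views have already been mapped back to a common orientation. The function name is illustrative, and ties are broken in favour of the lowest class index, which is a choice of this sketch.

```python
import numpy as np

def majority_vote(label_maps, num_classes):
    """Combine aligned integer label maps by voxel-wise majority voting."""
    stacked = np.stack(label_maps)
    # votes[c] counts how many views predicted class c at each voxel.
    votes = np.stack([(stacked == c).sum(axis=0) for c in range(num_classes)])
    # argmax returns the first maximum, so ties go to the lowest class index.
    return votes.argmax(axis=0)

view_a = np.array([[[0, 1, 1]]])
view_b = np.array([[[0, 1, 0]]])
view_c = np.array([[[1, 1, 0]]])
ensemble = majority_vote([view_a, view_b, view_c], num_classes=2)
print(ensemble)  # [[[0 1 0]]]
```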
|Method||Backbone||10% lab||20% lab|
|DMPCT ||2D ResNet-101||63.45||66.75|
|DCT  (2v)||3D ResNet-18||71.43||77.54|
|TCSE ||3D ResNet-18||73.87||76.46|
|Ours (2 views)||3D ResNet-18||75.63||79.77|
|Ours (3 views)||3D ResNet-18||77.55||80.14|
|Ours (6 views)||3D ResNet-18||77.87||80.35|
|Ours (ensemble)||3D ResNet-18||78.77||81.18|
In this section, we first evaluate our framework under semi-supervised settings on the NIH pancreas segmentation dataset, with cases from a healthy patient population (e.g. kidney donors; see https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT), and on a multi-organ segmentation dataset with eight abdominal organs, with conditions mostly unrelated to the organs of interest (e.g. colorectal cancer or ventral hernia; see https://www.synapse.org/#!Synapse:syn3193805/wiki/217789). We provide detailed experiments, including ablation studies, on the former dataset. Note that the volumes in each dataset come from different patients and were separated at the patient level into the training, validation and testing splits. Next, we validate the capability of our approach on the task of unsupervised domain adaptation, which is critical but under-investigated in the field of medical image analysis. The multi-organ segmentation dataset serves as source data. The adaptation targets are two pathological organ datasets, i.e. the pancreas and liver datasets of the Medical Segmentation Decathlon (MSD), both of which can include tumors in their respective organs. In a stricter setting, we also evaluate our approach when source data is inaccessible (UDA without source data).
5.1 NIH Pancreas Segmentation Dataset
The NIH pancreas segmentation dataset contains 82 abdominal CT volumes. The width and height of each volume are 512, while the axial view slice number can vary from 181 to 466. Under semi-supervised settings, the dataset is randomly split into 20 testing cases and 62 training cases. We report the results of 10% labeled training cases (6 labeled and 56 unlabeled), 20% labeled training cases (12 labeled and 50 unlabeled) and 100% labeled training cases.
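The labeled and unlabeled counts quoted above follow from rounding the labeled fraction of the 62 training cases, as the short sketch below shows. The rounding rule and the function name are assumptions chosen to reproduce the reported counts.

```python
import random

def split_labeled_unlabeled(case_ids, labeled_fraction, seed=0):
    """Randomly split training cases into labeled and unlabeled subsets."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    num_labeled = round(labeled_fraction * len(ids))
    return ids[:num_labeled], ids[num_labeled:]

training_cases = list(range(62))
for fraction in (0.1, 0.2):
    labeled, unlabeled = split_labeled_unlabeled(training_cases, fraction)
    print(fraction, len(labeled), len(unlabeled))
# 0.1 6 56
# 0.2 12 50
```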
In Table 2, we first report the average DSC score over all single views for a fair comparison (2 views to 6 views, last 2-4 rows), which can be viewed as the average performance of a single view model. Then we report the multi-view ensemble results (6 view ensemble, last row), where we align the multi-view prediction maps to the same view (axial) and average the prediction maps at each voxel to make a final prediction. For 2-view co-training, we use the axial and coronal views. For 3-view co-training, we use the axial, coronal and sagittal views. For 6-view co-training, we use the axial, coronal and sagittal views as well as the horizontally flipped versions of these three views. The first row gives the supervised training results, using only labeled data and trained on the axial view. The segmentation accuracy is evaluated by the Dice-Sørensen coefficient (DSC). A large improvement over the fully supervised baselines in terms of single view performance can be observed, showing that our approach effectively leverages the unlabeled data. A Wilcoxon signed-rank test against the supervised baseline’s results (20% labeling) shows a significant improvement of our approach, with a p-value of 0.0022. Fig. 3 shows 3 cases in 2D and 3D, visualized with ITK-SNAP. In addition, our model is compared with the state-of-the-art semi-supervised approach of deep co-training (DCT) and with recent semi-supervised medical segmentation approaches. In particular, we compare to TCSE, which extends the self-ensembling model with transformation-consistent constraints, and to DMPCT, which extends the self-training procedure by iteratively updating pseudo labels on unlabeled data using a fusion of three 2D networks trained on cross-sectional views. The results reported in Table 2 are based on our careful re-implementations in order to allow a fair comparison.
The implementations of DCT and TCSE operate on the axial view of our single view branch with the same backbone structure (our customized 3D ResNet-18 model). Our co-training approach achieves a gain of about 4% in the 10% labeled and 90% unlabeled setting. We also find that the improvements of the other approaches are small in the 20% setting (only 1% compared to the baseline), while ours is still capable of achieving a reasonable performance gain as the amount of labeled data grows. For DMPCT, which is a 2D approach, the original experiment was conducted on 50 labeled cases. We change its backbone network from FCN to DeepLab v2 in order to fit our stricter settings (6 and 12 labeled cases). This modification leads to an improvement of 3% in 100% fully supervised training (from 73% to 76%). This approach outputs its result using a majority-voting ensemble of the three slice-wise 2D models obtained from its semi-supervised training procedure.
Since the main difference in two-view learning between our approach and DCT is the way of encouraging view differences, the results illustrate the effectiveness of our multi-view analysis combined with asymmetric feature learning for 3D co-training. With more views, our uncertainty-weighted label fusion can further improve co-training performance. We report ablation studies later in this section.
5.1.2 Analysis and ablation studies
Data utilization efficiency
We perform a study on data utilization efficiency of our approach compared to the baseline fully-supervised network (3D ResNet-18). Fig. 4 shows the performance change according to labeled data proportion on NIH pancreas segmentation. From the plot, one can see that when labeled data is over 80%, simple supervised training (with 3D ResNet-18) suffices. Note that our approach with 20% labeled data (DSC 80.35%) performs better than 60% supervised training (DSC 78.95%). At such a performance, our approach can save 70% of the labeling efforts.
Effect of backbone structure
Our backbone selection (a 2D-initialized, heavily asymmetric 3D architecture) introduces 2D biases in the training phase while benefiting from 2D pre-trained models. We have claimed that we can utilize the complementary information from 3-view networks while exploring the unlabeled data with UMCT. To test this, we perform an ablation study on the network structure using V-Net, a common 3D segmentation network whose kernels are symmetric in all dimensions. This network also has a similar number of parameters to our customized 3D ResNet-18, see Table 3. The results of V-Net show that our multi-view co-training can be generally and successfully applied to 3D networks. Although the fully supervised results are similar, our ResNet-18 outperforms V-Net (see Table 3), illustrating that our asymmetric design, which encourages view differences, brings advantages over traditional 3D deep networks.
Uncertainty-weighted label fusion (ULF)
ULF plays an important role in pruning out bad predictions and keeping good ones as supervision to train the other views. Table 4 gives the single view results of the experiments with multiple views. The performance improves with more views. For two views, ULF is not applicable, since only one view prediction is available as a pseudo label for the other view. For three views and six views, ULF helps boost the performance, illustrating the effectiveness of our proposed approach for view confidence estimation.
|3 views + ULF||77.55|
|6 views + ULF||77.87|
|Supervised (upper bound)||94.20||93.90||71.89||66.74||94.78||88.60||81.46||71.29|
|10% lab+90% unlab (ours)||91.14||92.35||58.29||57.61||92.23||79.67||73.86||57.50|
|20% lab+80% unlab (ours)||92.80||92.99||66.29||65.01||93.93||83.67||77.91||63.34|
5.2 Multi-organ Segmentation Dataset
Next, we validate our approach on multi-organ data. The dataset we use is a multi-organ re-annotated version that combines two public datasets: The Cancer Imaging Archive (TCIA) Pancreas-CT data set and the Beyond the Cranial Vault (BTCV) Abdomen data set. We perform 4-fold cross validation on 90 cases in total. In each fold, we then randomly split the training cases into a labeled set and an unlabeled set. We train our models with labeled data ratios of 10% and 20%, which approximately correspond to (7,81) and (13,75) (labeled, unlabeled) cases, and validate on 22 cases in each fold. Results are shown in Table 5 and an example is shown in Fig 5.
Our approach improves consistently over almost every organ under every labeled-unlabeled ratio of data. The results illustrate the ability of our approach to handle the situation of complex multi-organ settings.
5.3 Unsupervised domain adaptation from multi-organ segmentation to MSD Dataset
We aim to adapt a model trained on the TCIA multi-organ dataset, without any target labels, to the pancreas and liver cases of the Medical Segmentation Decathlon (MSD) challenge , which can exhibit tumors. The target domains are shifted from the source domain because of differences in (i) image quality and contrast, and (ii) textures due to the presence of pancreatic or hepatic tumors.
The MSD pancreas dataset contains 281 CT scans in the portal venous phase, all of which are pathological cases with pancreas and tumor annotations. We randomly split the whole dataset into 200 cases for training (without labels) and 81 cases for validation. Since the source domain (the multi-organ dataset) contains only healthy pancreases, we aim to segment the whole pancreas region (pancreas and tumor combined). For the MSD liver dataset, we likewise aim to segment the whole liver region, with a random split of 100 training cases (unlabeled) and 31 cases for validation. 118 of the 131 cases contain hepatic tumors.
Table 6 shows the results of the unsupervised domain adaptation experiments. The first row gives the segmentation performance (in terms of DSC) on the pancreas and liver of the original multi-organ validation set. The remaining rows give DSC scores on the MSD liver and pancreas validation sets. In the standard UDA setting (UMCT w/ source), our approach achieves significant improvements (1.12% on liver and 4.70% on pancreas) over the source-only version (direct test on MSD). Given the strong performance of self-training-based approaches [73, 72] for unsupervised domain adaptation, we implement a vanilla self-training method under our settings (denoted as "Self-training"). We first run the model on the unlabeled set and then use its predictions as pseudo labels to train on the whole dataset. We repeat these two steps every 1k iterations and train for 5k iterations in total, in line with the proposed co-training scheme. We also implement another baseline (AdaptSegNet , denoted as "Adv training"), which applies adversarial training to the predicted segmentation masks. The segmentation network serves as the generator: it outputs segmentation masks under a segmentation loss and tries to fool a patch-based discriminator (a 3D version of the discriminator used in AdaptSegNet ) under a GAN loss . The discriminator is trained jointly to distinguish between the predicted mask and the ground-truth mask on unlabeled data. Our approach significantly outperforms all of the baselines. We also show one example for each organ in Fig 6.
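The schedule of the vanilla self-training baseline described above can be sketched as follows. This is only an illustration of the refresh-then-train loop: the callables `predict` and `train_step`, the cyclic sampling over the pooled data, and the function name `run_self_training` are assumptions introduced here, not details from the experiments.

```python
def run_self_training(predict, train_step, labeled, unlabeled,
                      total_iters=5000, refresh_every=1000):
    """Alternate between pseudo-labeling and training.

    predict(x)       -> pseudo label for an unlabeled sample (hypothetical)
    train_step(x, y) -> one optimization step on a sample (hypothetical)
    labeled          -> list of (x, y) pairs from the source domain
    unlabeled        -> list of x samples from the target domain
    Returns the number of times the pseudo labels were regenerated.
    """
    pool = list(labeled)
    refreshes = 0
    for it in range(total_iters):
        if it % refresh_every == 0:
            # Step 1: run the current model on the unlabeled set.
            pseudo = [(x, predict(x)) for x in unlabeled]
            # Step 2 will train on labeled data plus these pseudo labels.
            pool = list(labeled) + pseudo
            refreshes += 1
        # Cyclic sampling is a simplification of mini-batch sampling.
        x, y = pool[it % len(pool)]
        train_step(x, y)
    return refreshes

# Toy usage with stand-in callables.
seen = []
count = run_self_training(
    predict=lambda x: x > 0,
    train_step=lambda x, y: seen.append((x, y)),
    labeled=[(1.0, True)],
    unlabeled=[-1.0, 2.0],
    total_iters=50,
    refresh_every=10,
)
```

With the default arguments (5k iterations, refresh every 1k), the pseudo labels are regenerated five times over the course of training.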
The last row gives the results in the absence of source domain data. In this setting, only a pre-trained source domain model and unlabeled target domain data are available. Our UMCT-DA model (last row) handles this case by training on the target dataset with the co-training loss alone, and achieves results comparable to the standard UDA setting even without source domain data.
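To make the source-free setting concrete, the sketch below computes a leave-one-out consistency loss on a single unlabeled target volume, with no labeled term at all. It is a simplification under stated assumptions: the pseudo label for each view is the plain average of the other views (standing in for the uncertainty-weighted fusion), the loss is a soft Dice loss, and the names `soft_dice_loss` and `source_free_cotraining_loss` are hypothetical.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """1 - soft Dice between two foreground probability maps."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def source_free_cotraining_loss(view_probs):
    """Average leave-one-out consistency loss over all views.

    view_probs: array of shape (V, D, H, W) with each view's foreground
                probabilities on one unlabeled target volume.
    """
    view_probs = np.asarray(view_probs, dtype=np.float64)
    losses = []
    for i in range(view_probs.shape[0]):
        # Pseudo label for view i: mean of the other views (assumption;
        # the full method would weight views by estimated confidence).
        pseudo = np.delete(view_probs, i, axis=0).mean(axis=0)
        losses.append(soft_dice_loss(view_probs[i], pseudo))
    return float(np.mean(losses))

# Toy usage: three views that agree, then one view that disagrees.
mask = np.array([[[1.0, 0.0], [0.0, 1.0]]])
agree = source_free_cotraining_loss(np.stack([mask, mask, mask]))
disagree = source_free_cotraining_loss(np.stack([mask, 1.0 - mask, mask]))
```

Because the loss depends only on agreement between views, it can be evaluated on target volumes without any source images or labels; in the toy usage, `disagree` is larger than `agree`.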
6.1 Impact on large-scale benchmarks
Under fully supervised training, our team NVDLMED was ranked the place in the first phase and the place in the final validation phase of the Medical Segmentation Decathlon Challenge  (challenge leaderboard available at http://medicaldecathlon.com/results.html). We applied our 3-view co-training framework, using axial, coronal and sagittal views, to ten medical image segmentation tasks simultaneously. The winning team  applied heavy model selection and ensembling by cross-validation on the training set, while we used a fixed framework without complicated data augmentation. Although not originally designed to improve fully supervised training, our approach still demonstrated the effectiveness and robustness of co-training from multiple views.
6.2 Magnitude of domain shift
In this work, the domain shift stems mainly from the different sources of the CT scans and from the pathological versus healthy status of the abdominal organs. We consider this a reasonable domain shift between CT datasets that originate from different hospitals and patient populations. Typically, directly transferring a model from one dataset to another (TCIA to MSD in our case) causes a significant drop in performance (average Dice from 95% to 92% for liver and from 81% to 70% for pancreas). While this shift is relatively small compared with, for example, cross-modality testing (say, CT to MRI), such a drop is unacceptable when considering these models for potential clinical applications. Given the importance of this topic, we shed light on how well our semi-supervised approach performs on UDA tasks, given the similarity of the two problems (discussed in Sec 3). Other types of domain shift in medical images, though not investigated in this paper, are also of great importance. Domain adaptation under larger shifts, such as modality changes (e.g. CT to MRI adaptation) and contrast and resolution differences, remains an active research topic.
7 Summary & Conclusion
In this paper, we presented uncertainty-aware multi-view co-training (UMCT), aimed at semi-supervised learning and domain adaptation. We extended dual-view co-training and deep co-training to 3D volumetric image data by analysing the data from different viewpoints, estimating the uncertainty of each view, and enforcing multi-view consistency on large-scale unlabeled data. Our approach was first validated on the NIH pancreas dataset, where we outperformed other approaches by a large margin. We further applied our approach to multi-organ datasets and found significant improvements for each organ. Finally, we adapted the multi-organ model to MSD pathological pancreas and liver cases in an unsupervised manner. Our UMCT-DA model achieved good performance even in the absence of source domain data, illustrating strong potential for real-world applications in medical image segmentation.
In the future, we plan to conduct further research in the following aspects. Currently the views of co-training are fixed and pre-defined, so one feasible idea is to incorporate more views and random views. This could increase the robustness of our model and lead to better performance. For domain adaptation, we will also try to explore co-training based approaches on other types of domain shifts including but not limited to image modality changes and contrast variants. We believe co-training based approaches will make a contribution to large scale medical image analysis with limited human annotations.
-  (2014) Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pp. 3365–3373. Cited by: §2.
-  (2017) Semi-supervised learning for network-based cardiac mr image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 253–260. Cited by: §1, §2.
- Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7 (Nov), pp. 2399–2434. Cited by: §2.
-  (1993) A framework for spatiotemporal control in the tracking of visual contours. International Journal of Computer Vision 11 (2), pp. 127–145. Cited by: §2.
- Combining labeled and unlabeled data with co-training. Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100. Cited by: §1, §1, §2.
- Tri-net for semi-supervised deep learning. Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2014–2020. Cited by: §2.
-  (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1, §5.1.1.
-  (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §1.
-  (2011) Co-training for domain adaptation. In Advances in neural information processing systems, pp. 2456–2464. Cited by: §1.
- Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. arXiv preprint arXiv:1804.06353. Cited by: §2.
-  (2017) Good semi-supervised learning that requires a bad gan. In Advances in neural information processing systems, pp. 6510–6520. Cited by: §1.
-  (2018) Unsupervised domain adaptation for automatic estimation of cardiothoracic ratio. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 544–552. Cited by: §2.
-  (2018) Unsupervised cross-modality domain adaptation of convnets for biomedical image segmentations with adversarial loss. arXiv preprint arXiv:1804.10916. Cited by: §2.
-  (2018) Self-ensembling for visual domain adaptation. Cited by: §1, §2.
-  (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §1, §2.
-  (2016) Uncertainty in deep learning. University of Cambridge. Cited by: §2.
-  (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §2.
-  (2018) Automatic multi-organ segmentation on abdominal ct with dense v-networks. IEEE transactions on medical imaging 37 (8), pp. 1822–1834. Cited by: §1, §5.2, §5.
- Geodesic flow kernel for unsupervised domain adaptation. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066–2073. Cited by: §2.
-  (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: §1, §5.3.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.5.
-  (2004) Multiscale conditional random fields for image labeling. In Computer vision and pattern recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE computer society conference on, Vol. 2, pp. II–II. Cited by: §2.
-  (2017) Cycada: cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213. Cited by: §1, §2.
-  (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649. Cited by: §2.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
-  (2018) Nnu-net: self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486. Cited by: §6.1.
-  (2018) Tumor-aware, adversarial domain adaptation from ct to mri for lung cancer segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 777–785. Cited by: §2.
-  (2017) Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International conference on information processing in medical imaging, pp. 597–609. Cited by: §2.
-  (2017) What uncertainties do we need in bayesian deep learning for computer vision?. In Advances in neural information processing systems, pp. 5574–5584. Cited by: §1, §2, §4.3, §4.3.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
-  (2017) Semi-supervised learning with gans: manifold invariance with improved inference. In Advances in Neural Information Processing Systems, pp. 5534–5544. Cited by: §1.
-  (2017) Temporal ensembling for semi-supervised learning. International Conference on Learning Representations (ICLR). Cited by: §2, §2, §5.1.1.
-  (2015) MICCAI multi-atlas labeling beyond the cranial vault–workshop and challenge. Cited by: §5.2, §5.
-  (2017) H-denseunet: hybrid densely connected unet for liver and liver tumor segmentation from ct volumes. IEEE Transactions on Medical Imaging. Cited by: §2.
-  (2018) Semi-supervised skin lesion segmentation via transformation consistent self-ensembling model. BMVC. Cited by: §2, Table 2, §5.1.1, §5.1.1.
-  (2008) A self-training semi-supervised svm algorithm and its application in an eeg-based brain computer interface speller system. Pattern Recognition Letters 29 (9), pp. 1285–1294. Cited by: §1.
-  (2018) 3d anisotropic hybrid network: transferring convolutional features from 2d images to 3d anisotropic volumes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 851–858. Cited by: §2, §4.5.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1, §5.1.1.
-  (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §1, §4.5, §4.5, §5.1.2.
-  (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
-  (2019) Deep learning with mixed supervision for brain tumor segmentation. Journal of Medical Imaging 6 (3), pp. 034002. Cited by: §2.
-  (2018) ASDNet: attention based semi-supervised deep networks for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 370–378. Cited by: §2.
-  (2010) Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22 (2), pp. 199–210. Cited by: §2.
-  (2019) Unsupervised domain adaptation for medical imaging segmentation with self-ensembling. NeuroImage 194, pp. 1–11. Cited by: §2.
-  (2018) Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–152. Cited by: §1, §2, §4.2, Table 2, §5.1.1, §5.1.1, §5.1.1.
-  (2015) Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554. Cited by: §2.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1.
-  (2005) Semi-supervised self-training of object detection models.. WACV/MOTION 2. Cited by: §1.
-  (2015) Deeporgan: multi-level deep convolutional networks for automated pancreas segmentation. In MICCAI, Cited by: §1, §5.2, §5.
-  (2018) Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation. Medical image analysis 45, pp. 94–107. Cited by: §2.
-  (2010) Adapting visual category models to new domains. In European conference on computer vision, pp. 213–226. Cited by: §2.
-  (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 1163–1171. Cited by: §2.
-  (2018) MS-net: mixed-supervision fully-convolutional networks for full-resolution segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 379–387. Cited by: §2.
-  (2018) A dirt-t approach to unsupervised domain adaptation. Cited by: §1, §2.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
-  (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063. Cited by: §1, §5.3, §5, §6.1.
-  (2015) Subspace distribution alignment for unsupervised domain adaptation.. In BMVC, Vol. 4, pp. 24–1. Cited by: §2.
-  (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1.
-  (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §2.
-  (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: §1, §2, §5.3.
-  (2020) 3D semi-supervised learning with uncertainty-aware multi-view co-training. In The IEEE Winter Conference on Applications of Computer Vision, pp. 3646–3655. Cited by: §1.
-  (2018) Bridging the gap between 2d and 3d organ segmentation with volumetric fusion net. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 445–453. Cited by: §2.
- Volumetric convnets with mixed residual connections for automated prostate segmentation from 3D MR images. In AAAI, Cited by: §1.
- Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. Cited by: §1.
-  (2006) User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31 (3), pp. 1116–1128. Cited by: §5.1.1.
-  (2017) Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890. Cited by: §1.
-  (2019) Semi-supervised multi-organ segmentation via multi-planar co-training. WACV. Cited by: §2, Table 2, §5.1.1, §5.1.1.
-  (2017) A fixed-point model for pancreas segmentation in abdominal ct scans. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 693–701. Cited by: §2.
-  (2005) Semi-supervised regression with co-training.. In IJCAI, Vol. 5, pp. 908–913. Cited by: §1, §2.
-  (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on knowledge and Data Engineering 17 (11), pp. 1529–1541. Cited by: §2.
-  (2019) Confidence regularized self-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5982–5991. Cited by: §1, §2, §5.3.
-  (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305. Cited by: §1, §2, §5.3.