Dual-Teacher: Integrating Intra-domain and Inter-domain Teachers for Annotation-efficient Cardiac Segmentation

07/13/2020 ∙ by Kang Li, et al. ∙ The Chinese University of Hong Kong

Medical image annotations are prohibitively time-consuming and expensive to obtain. To alleviate annotation scarcity, many approaches have been developed to efficiently utilize extra information, e.g., semi-supervised learning, which further explores plentiful unlabeled data, and domain adaptation, including multi-modality learning and unsupervised domain adaptation, which resorts to the prior knowledge from an additional modality. In this paper, we investigate the feasibility of simultaneously leveraging abundant unlabeled data and well-established cross-modality data for annotation-efficient medical image segmentation. To this end, we propose a novel semi-supervised domain adaptation approach, namely Dual-Teacher, where the student model not only learns from labeled target data (e.g., CT), but also explores unlabeled target data and labeled source data (e.g., MR) through two teacher models. Specifically, the student model learns the knowledge of unlabeled target data from the intra-domain teacher by encouraging prediction consistency, and the shape priors embedded in labeled source data from the inter-domain teacher via knowledge distillation. Consequently, the student model can effectively exploit the information from all three data resources and comprehensively integrate them to achieve improved performance. We conduct extensive experiments on the MM-WHS 2017 dataset and demonstrate that our approach is able to concurrently utilize unlabeled data and cross-modality data with superior performance, outperforming semi-supervised learning and domain adaptation methods by a large margin.




1 Introduction

Deep convolutional neural networks (CNNs) have made great progress in various medical image segmentation applications [15, 19]. This success relies partly on massive datasets with abundant annotations. However, collecting and labeling such large-scale datasets is prohibitively time-consuming and expensive, especially in the medical area, since it requires diagnostic expertise and meticulous work [13]. Plenty of effort has been devoted to alleviating annotation scarcity by utilizing extra supervision. Among these efforts, semi-supervised learning and domain adaptation are two widely studied learning approaches that are attracting increasing interest.

Semi-supervised learning (SSL) aims to leverage unlabeled data to reduce the need for manual annotations [12, 11, 20]. For example, Lee et al. [12] proposed to generate pseudo labels for unlabeled data with a pretrained model, and to use them to further finetune the model for improved performance. Recently, self-ensembling methods [11, 20] have achieved state-of-the-art performance on many semi-supervised learning benchmarks. Laine et al. [11] proposed the temporal ensembling method, which encourages consensus between the exponential moving average (EMA) predictions and the current predictions for unlabeled data. Tarvainen et al. [20] proposed the mean-teacher framework, which enforces prediction consistency between the current training model and the corresponding EMA model. Although semi-supervised learning has made great progress in utilizing unlabeled data within the same domain, it leaves rich cross-modality data unexploited.

Considering that multi-modality data is widely available in medical imaging, recent works have studied domain adaptation (DA) to leverage the shape priors of another modality for enhanced segmentation performance [5, 18, 9, 8]. Among them, multi-modality learning (MML) exploits the labeled data from a related modality (i.e., source domain) to facilitate segmentation on the modality of interest (i.e., target domain) [22, 21, 10, 3]. Valindria et al. [21] proposed a dual-stream approach to integrate the prior knowledge from unpaired multi-modality data for improved multi-organ segmentation, and reported that the X-shape architecture achieved the leading performance among all architectures considered. Since multi-modality learning requires annotations for both modalities, unsupervised domain adaptation (UDA) extends it to a broader range of applications [16, 4, 2]. In the UDA setting, source domain annotations are still required, while no target domain annotations are needed. Contemporary unsupervised domain adaptation methods attempt to extract domain-invariant representations: Dou et al. [4] performed the adaptation in feature space, and Chen et al. [2] explored both feature-level and image-level adaptation in a synergistic manner.

All approaches mentioned above have demonstrated their feasibility in the medical area. However, semi-supervised learning concentrates only on leveraging unlabeled data from the same domain as the labeled data, ignoring the rich prior knowledge (e.g., shape priors) across modalities. Domain adaptation, in turn, can utilize cross-modality prior knowledge, but it does not exploit unlabeled target data and still leaves considerable room for improvement. These observations motivate us to explore the feasibility of integrating the merits of both semi-supervised learning and domain adaptation by concurrently leveraging all available data resources, including limited labeled target data, abundant unlabeled target data and well-established labeled source data, to enhance the segmentation performance on the target domain.

Figure 1: Overview of our framework. The student model learns from the labeled target data by the segmentation loss, and concurrently acquires the knowledge of the labeled source data from the inter-domain teacher by the knowledge distillation loss, as well as the knowledge of the unlabeled target data from the intra-domain teacher by the consistency loss. In this way, the student model integrates and leverages the knowledge of all three data resources simultaneously, leading to better generalization on the target domain. In the inference phase, only the student model is used to predict.

In this paper, we propose a novel semi-supervised domain adaptation framework, namely Dual-Teacher, to simultaneously leverage abundant unlabeled data and widely available cross-modality data to mitigate the need for tedious medical annotations. We implement it with the teacher-student framework [14] and adopt two teacher models in network training, where one teacher guides the student model with intra-domain knowledge embedded in the unlabeled target domain data (e.g., CT), while the other teacher instructs the student model with inter-domain knowledge embedded in the labeled source domain data (e.g., MR). To be specific, our Dual-Teacher framework consists of three components: (1) the intra-domain teacher, which employs the self-ensembling model of the student network to leverage unlabeled target data and transfers the acquired knowledge to the student model by enforcing prediction consistency; (2) the inter-domain teacher, which adopts an image translation model, i.e., CycleGAN [24], to narrow the appearance gap across modalities and transfers the prior knowledge in the source domain to the student model via knowledge distillation; and (3) the student model, which not only learns directly from limited labeled target data, but also acquires auxiliary intra-domain and inter-domain knowledge transferred from the two teachers. The whole framework is trained in an end-to-end manner to seamlessly integrate the knowledge of all data resources into the student model. We extensively evaluated our approach on the MM-WHS 2017 dataset [25], and achieved superior performance compared to semi-supervised learning methods and domain adaptation methods.

2 Methodology

In our problem setting, we are given a set of source images and their annotations in the source domain (e.g., labeled MR data). In addition, we are given a limited number of annotated target domain samples (e.g., labeled CT data) and abundant unlabeled target domain data (e.g., unlabeled CT data). Normally, we assume that the number of labeled target samples is far smaller than the number of unlabeled target samples. Our goal is to exploit these data resources to enhance the performance in the target domain (e.g., CT). Fig. 1 overviews our proposed Dual-Teacher framework, which consists of an inter-domain teacher model, an intra-domain teacher model, and a student model. The inter-domain teacher model and the intra-domain teacher model explore the knowledge embedded in the labeled source data and the unlabeled target data, respectively, and simultaneously transfer this knowledge to the student model for comprehensive integration and thorough exploitation.

2.1 Inter-domain Teacher

Despite the consistent shape priors shared between the source domain (e.g., MR) and the target domain (e.g., CT), the two domains differ in many aspects, such as appearance and image distribution [17, 7]. We therefore first reduce the appearance discrepancy with an appearance alignment module. Various image translation models can be adopted; here we use CycleGAN [24] to translate source samples into synthetic target-style samples, which form a synthetic target set. After appearance alignment, we input the synthetic samples into the inter-domain teacher, which is implemented as a segmentation network. Supervised by the corresponding source labels $y^{s}$, the inter-domain teacher learns the prior knowledge in the source domain by minimizing

$$\mathcal{L}^{inter}_{seg} = \mathcal{L}_{ce}(y^{s}, p^{inter}) + \mathcal{L}_{dice}(y^{s}, p^{inter}),$$

where $\mathcal{L}_{ce}$ and $\mathcal{L}_{dice}$ denote cross-entropy loss and dice loss, respectively, and $p^{inter}$ represents the inter-domain teacher predictions taking the synthetic samples as inputs. To transfer the acquired knowledge from the inter-domain teacher to the student, we further feed the same synthetic samples into both the inter-domain teacher model and the student model. Since the inter-domain teacher has acquired reliable source domain knowledge from the source annotations, we encourage the student model to produce outputs similar to those of the inter-domain teacher model via the knowledge distillation loss $\mathcal{L}_{kd}$. Following previous works [6, 1], $\mathcal{L}_{kd}$ is computed between $p^{inter}$ and $p^{stu}$, the predictions of the inter-domain teacher model and the student model, respectively.
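To make the two losses of this subsection concrete, the following is a minimal sketch in plain Python. It assumes that the teacher's segmentation loss is an unweighted sum of cross-entropy and soft dice, and that the distillation loss is a cross-entropy with the teacher's soft predictions as targets; the second choice is an assumption made for illustration, since the exact form is not shown here. All function names and the toy values are hypothetical.

```python
import math

def cross_entropy(target, pred, eps=1e-8):
    """Mean cross-entropy over pixels; each pixel is a list of class probabilities."""
    total = 0.0
    for t_px, p_px in zip(target, pred):
        total -= sum(t * math.log(p + eps) for t, p in zip(t_px, p_px))
    return total / len(target)

def soft_dice_loss(target, pred, eps=1e-8):
    """One minus the soft dice score averaged over classes."""
    num_classes = len(target[0])
    dice_sum = 0.0
    for c in range(num_classes):
        inter = sum(t[c] * p[c] for t, p in zip(target, pred))
        denom = sum(t[c] for t in target) + sum(p[c] for p in pred)
        dice_sum += (2.0 * inter + eps) / (denom + eps)
    return 1.0 - dice_sum / num_classes

def inter_teacher_seg_loss(labels_one_hot, teacher_pred):
    """Supervised loss of the inter-domain teacher on synthetic target-style samples."""
    return cross_entropy(labels_one_hot, teacher_pred) + soft_dice_loss(labels_one_hot, teacher_pred)

def distillation_loss(teacher_pred, student_pred):
    """Assumed form: cross-entropy using the teacher's soft predictions as targets."""
    return cross_entropy(teacher_pred, student_pred)

# Toy example: three pixels, two classes (values are illustrative only).
labels = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
teacher = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
student = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]

seg = inter_teacher_seg_loss(labels, teacher)
kd = distillation_loss(teacher, student)
print(f"teacher segmentation loss: {seg:.4f}, distillation loss: {kd:.4f}")
```

In a real implementation the teacher's predictions would typically be treated as constants when the distillation loss is backpropagated, so that only the student is updated by it.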

2.2 Intra-domain Teacher

As the unlabeled target data have no expert-annotated labels to directly guide network learning, recent works [20] propose to temporally ensemble the models at different training steps to obtain reliable predictions. Inspired by them, we design the intra-domain teacher model with the same network architecture as the student model and update its weights as the exponential moving average (EMA) of the student model weights over training steps. Specifically, at training step $t$, the weights $\theta'_t$ of the intra-domain teacher model are updated as

$$\theta'_t = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_t,$$

where $\theta_t$ denotes the student model weights at step $t$ and $\alpha$ is the EMA decay rate that controls the updating rate. To transfer the knowledge from the intra-domain teacher to the student, we add different noise $\xi$ and $\xi'$ to the same unlabeled sample $x$ and feed the two noisy versions into the student model and the intra-domain teacher model, respectively. Under small perturbations, e.g., Gaussian noise, the outputs of the student model and the corresponding EMA model (i.e., the intra-domain teacher model) should be consistent. Therefore, we encourage them to generate consistent predictions via the consistency loss

$$\mathcal{L}_{con} = \mathcal{L}_{mse}\big(f(x; \theta, \xi),\; f(x; \theta', \xi')\big),$$

where $\mathcal{L}_{mse}$ denotes the mean squared error loss, and $f(x; \theta, \xi)$ and $f(x; \theta', \xi')$ represent the outputs of the student model (with weights $\theta$ and noise $\xi$) and the intra-domain teacher model (with weights $\theta'$ and noise $\xi'$), respectively.
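The EMA update and the consistency loss described above can be sketched as follows. This is an illustration in plain Python rather than the authors' implementation: `toy_model` is a hypothetical stand-in for the segmentation network, and the decay rate, noise level and weights are illustrative values chosen only for the example.

```python
import random

def ema_update(teacher_weights, student_weights, alpha):
    """Per-parameter EMA: new_teacher = alpha * teacher + (1 - alpha) * student."""
    return [alpha * tw + (1.0 - alpha) * sw
            for tw, sw in zip(teacher_weights, student_weights)]

def add_gaussian_noise(sample, sigma, rng):
    """Perturb every input value with zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in sample]

def consistency_loss(student_out, teacher_out):
    """Mean squared error between student and teacher outputs."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)

def toy_model(weights, sample):
    """Hypothetical stand-in for a segmentation network: one linear score per value."""
    w, b = weights
    return [w * v + b for v in sample]

rng = random.Random(0)
student_w = [1.0, 0.0]
teacher_w = [0.5, 0.2]
alpha = 0.99  # illustrative decay rate, not taken from the text

# The teacher follows the student slowly through the EMA update.
teacher_w = ema_update(teacher_w, student_w, alpha)

# The same unlabeled sample is perturbed independently for each model.
x = [0.1, 0.4, 0.7]
student_out = toy_model(student_w, add_gaussian_noise(x, 0.01, rng))
teacher_out = toy_model(teacher_w, add_gaussian_noise(x, 0.01, rng))
loss = consistency_loss(student_out, teacher_out)
```

Because the teacher is an average of past student weights, it receives no gradient of its own; only the student is optimized against the consistency loss.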

2.3 Student Model and Overall Training Strategies

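The student model learns explicitly from the labeled target data, supervised by their labels via the segmentation loss $\mathcal{L}_{seg}$. Meanwhile, it concurrently acquires the knowledge of the labeled source data and the unlabeled target data from the two teacher models and integrates them comprehensively. In particular, the student model attains inter-domain knowledge through the knowledge distillation loss $\mathcal{L}_{kd}$ (Section 2.1), and intra-domain knowledge through the prediction consistency loss $\mathcal{L}_{con}$ (Section 2.2). Overall, the training objective for the student model is formulated as

$$\mathcal{L} = \mathcal{L}_{seg} + \lambda_{kd}\,\mathcal{L}_{kd} + \lambda_{con}\,\mathcal{L}_{con},$$

where $\lambda_{kd}$ and $\lambda_{con}$ are hyperparameters that balance the weights of $\mathcal{L}_{kd}$ and $\mathcal{L}_{con}$.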

Our whole framework is updated in an end-to-end manner. In each iteration, we first optimize the inter-domain teacher model, then update the intra-domain teacher model with the EMA parameters of the student network, and finally optimize the student model. In this way, no pre-training stage is required, and the student model updates its parameters synchronously with the teacher models in an online manner.
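The combined objective and the update order described above can be sketched as below. The three update callables are hypothetical placeholders for the actual optimization steps, and the loss values and weights are illustrative numbers, not settings from the experiments.

```python
def student_objective(seg_loss, kd_loss, con_loss, lambda_kd, lambda_con):
    """Weighted sum of the segmentation, distillation and consistency losses."""
    return seg_loss + lambda_kd * kd_loss + lambda_con * con_loss

def train_one_iteration(update_inter_teacher, update_intra_teacher_ema, update_student):
    """Run the three updates in the order given in the text."""
    update_inter_teacher()      # supervised step on synthetic target-style samples
    update_intra_teacher_ema()  # EMA of the current student weights
    update_student()            # step on the combined student objective

# Record the order of the calls with simple placeholder callables.
order = []
train_one_iteration(
    lambda: order.append("inter_teacher"),
    lambda: order.append("intra_teacher_ema"),
    lambda: order.append("student"),
)

# Illustrative loss values and weights.
total = student_objective(0.40, 0.25, 0.02, lambda_kd=1.0, lambda_con=0.1)
```

Keeping the objective as a plain weighted sum makes it easy to switch off one teacher (by setting its weight to zero) when reproducing one-teacher ablations.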

3 Experiments

3.0.1 Dataset and pre-processing

We evaluated our method on the Multi-modality Whole Heart Segmentation (MM-WHS) 2017 dataset [25], which provides 20 annotated MR and 20 annotated CT volumes. We employed CT as the target domain and MR as the source domain, and randomly split the 20 CT volumes into four folds to perform four-fold cross validation. In each fold, we validated on five CT volumes, and trained our framework with the 20 MR volumes as labeled source data, five randomly chosen CT volumes as labeled target data, and the remaining 10 CT volumes as unlabeled target data. For pre-processing, we resampled all data to unit spacing and cropped them around the heart region, following previous work [2]. To avoid overfitting, we applied on-the-fly data augmentation with random affine transformations and random rotation. We evaluated our method with the dice coefficient on all seven heart substructures: the left ventricle blood cavity (LV), the right ventricle blood cavity (RV), the left atrium blood cavity (LA), the right atrium blood cavity (RA), the myocardium of the left ventricle (MYO), the ascending aorta (AA), and the pulmonary artery (PA) [25].
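As an illustration of the data split and the evaluation metric described above, the sketch below divides 20 CT volume indices into validation, labeled and unlabeled subsets for one fold, and computes the dice coefficient for a pair of binary masks. The function names, the seeding scheme and the toy masks are assumptions made for this example only.

```python
import random

def split_ct_volumes(volume_ids, fold, num_folds=4, num_labeled=5, seed=0):
    """Return (validation, labeled, unlabeled) CT ids for one cross-validation fold."""
    ids = list(volume_ids)
    random.Random(seed).shuffle(ids)  # same shuffle for every fold
    fold_size = len(ids) // num_folds
    val = ids[fold * fold_size:(fold + 1) * fold_size]
    train = [v for v in ids if v not in val]
    # Randomly pick the labeled subset; the rest is treated as unlabeled.
    labeled = random.Random(seed + 1 + fold).sample(train, num_labeled)
    unlabeled = [v for v in train if v not in labeled]
    return val, labeled, unlabeled

def dice_coefficient(pred_mask, gt_mask):
    """Dice = 2 * |P and G| / (|P| + |G|) for flat binary masks of 0/1 values."""
    inter = sum(p * g for p, g in zip(pred_mask, gt_mask))
    total = sum(pred_mask) + sum(gt_mask)
    return 1.0 if total == 0 else 2.0 * inter / total

val, labeled, unlabeled = split_ct_volumes(range(20), fold=0)
print(len(val), len(labeled), len(unlabeled))

# One substructure, four voxels: a single overlapping voxel.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

For multi-class segmentation, the same dice function would be applied once per substructure on the corresponding binary masks and the seven scores averaged.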

3.0.2 Implementation details

In our framework, the student model and the two teacher models were implemented with the same network backbone, U-Net [19]. The weight of the knowledge distillation loss for the inter-domain teacher was set empirically. For the intra-domain teacher, we closely followed the experiment configurations in previous work [23] for the EMA decay rate, and the weight of the consistency loss was changed dynamically over time as a function of the current and the last training epoch. To optimize the appearance alignment module, we followed the setting in [24]. The student model and the two teacher models were optimized with the Adam optimizer until the networks converged.
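The exact weighting function and its constants are not shown in this text. As a purely illustrative sketch, the code below implements a Gaussian ramp-up of the kind commonly used in self-ensembling methods such as temporal ensembling, where the weight grows smoothly from a small value to its maximum over training. The function name, `max_weight` and `sharpness` are assumptions, not the paper's settings.

```python
import math

def gaussian_rampup(epoch, last_epoch, max_weight=1.0, sharpness=5.0):
    """Weight w(t) = max_weight * exp(-sharpness * (1 - t / t_max)^2).

    The epoch is clipped to [0, last_epoch], so the weight stays at
    max_weight once the last epoch has been reached.
    """
    t = min(max(epoch, 0), last_epoch) / last_epoch
    return max_weight * math.exp(-sharpness * (1.0 - t) ** 2)

# Weight of the consistency term at a few points of a 100-epoch run.
schedule = [gaussian_rampup(e, 100) for e in (0, 25, 50, 75, 100)]
```

A schedule of this shape keeps the consistency term small early in training, when the EMA teacher is still unreliable, and lets it dominate later.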

Method            Avg      Dice of heart substructures
Supervised-only   0.7273   0.7113  0.7346  0.8086  0.7099  0.6524  0.8707  0.6037
Dou et al. [4]    0.6635   0.5664  0.7655  0.7654  0.6230  0.6600  0.7138  0.5505
Chen et al. [2]   0.7138   0.6573  0.8290  0.8306  0.7804  0.7082  0.7089  0.4827
Finetune          0.7313   0.7533  0.8081  0.7825  0.6412  0.5928  0.8466  0.6943
Joint training    0.7875   0.7816  0.8312  0.8469  0.7699  0.7008  0.8802  0.7019
X-shape [21]      0.7643   0.7317  0.8361  0.8432  0.7259  0.7453  0.8968  0.5709
MT [20]           0.8165   0.7764  0.8712  0.8748  0.7930  0.7051  0.9274  0.7677

Table 1: Comparison with other methods. The dice of all heart substructures and the average of them are reported here.

Figure 2: Visual comparisons with other methods. Due to page limit, we only present the methods with the best mean dice in MML and UDA (i.e., Joint training and Chen et al. [2]). As observed, our predictions are more similar to the ground truth than the others.

3.0.3 Comparison with other methods

To demonstrate the effectiveness of our proposed semi-supervised domain adaptation (SSDA) method in leveraging unlabeled data and cross-modality data, we compare it with both semi-supervised learning methods and domain adaptation methods. We first compare with the model trained with only the limited labeled CT data (referred to as Supervised-only), and take the mean-teacher (MT) method [20] as the semi-supervised learning (SSL) baseline. For domain adaptation methods, besides straightforward methods such as finetuning and joint training, we also compare with the X-shape model [21] in multi-modality learning (MML). In addition, we consider two unsupervised domain adaptation (UDA) methods for comparison, i.e., Dou et al. [4] and Chen et al. [2], which achieve state-of-the-art performance in cardiac segmentation.

As presented in Table 1, the supervised-only model achieves 0.7273 in mean dice by taking only limited labeled target data in network training. When two types of data resources are available, UDA methods achieve mean dice comparable to the supervised-only model by utilizing labeled source data and unlabeled target data. Compared with the supervised-only method, MML-based Joint training and SSL-based MT [20] further improve the segmentation performance to 0.7875 and 0.8165 in mean dice, respectively, demonstrating the effectiveness of leveraging cross-modality data and unlabeled data for improving segmentation performance. By simultaneously exploiting all data resources, our Dual-Teacher reaches a mean dice of 0.8604 (Table 2) and outperforms the best unsupervised domain adaptation (Chen et al. [2]), multi-modality learning (Joint training) and semi-supervised learning (MT [20]) methods by a large margin, i.e., increases of 0.1466, 0.0729 and 0.0439 in mean dice, respectively, validating the feasibility of our proposed semi-supervised domain adaptation approach.
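The averages and margins discussed above can be checked directly from the reported numbers. The short script below recomputes each method's mean dice from its seven substructure scores in Table 1 and its gap to the Dual-Teacher mean dice of 0.8604 reported in Table 2; only values that appear in the tables are used, and the variable names are illustrative.

```python
# (reported average, seven substructure dice scores) copied from Table 1.
table1 = {
    "Supervised-only": (0.7273, [0.7113, 0.7346, 0.8086, 0.7099, 0.6524, 0.8707, 0.6037]),
    "Chen et al. [2]": (0.7138, [0.6573, 0.8290, 0.8306, 0.7804, 0.7082, 0.7089, 0.4827]),
    "Joint training": (0.7875, [0.7816, 0.8312, 0.8469, 0.7699, 0.7008, 0.8802, 0.7019]),
    "MT [20]": (0.8165, [0.7764, 0.8712, 0.8748, 0.7930, 0.7051, 0.9274, 0.7677]),
}
dual_teacher_mean = 0.8604  # mean dice of Dual-Teacher from Table 2

recomputed = {}
gaps = {}
for name, (reported_avg, scores) in table1.items():
    recomputed[name] = sum(scores) / len(scores)
    gaps[name] = dual_teacher_mean - reported_avg
    print(f"{name}: reported {reported_avg:.4f}, "
          f"recomputed {recomputed[name]:.4f}, gap {gaps[name]:.4f}")
```

The recomputed averages agree with the reported ones up to rounding of the per-substructure scores.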

We also present visual comparisons in Fig. 2. Due to page limit, we only present the predictions of the methods with the best mean dice in MML (i.e., Joint training) and UDA (i.e., Chen et al. [2]). Our method identifies heart substructures with cleaner and more accurate boundaries, and produces fewer false positive predictions and results more similar to the ground truth than the other methods.

Methods                                   Mean Dice
No-Teacher     Baseline                   0.7330
               GAN-baseline               0.7510
One-Teacher    W/o inter-domain teacher   0.8477
               W/o intra-domain teacher   0.7984
Dual-Teacher (Ours)                       0.8604

Table 2: Analysis of our method. We report the mean dice of all cardiac substructures.

3.0.4 Analysis of our method

We further compare with other methods that also utilize all three types of data in SSDA, and analyze the key components of our method in Table 2. Given the labeled source data, labeled target data and unlabeled target data in SSDA, one straightforward method is to train on the labeled source data and labeled target data jointly, and to deploy the Pseudo-label method [12] to utilize the unlabeled target data; we consider this our Baseline. A more effective version of the baseline (referred to as GAN-baseline) applies an appearance alignment module (e.g., CycleGAN [24]) to the source data to minimize the appearance difference, and then follows the previous routine by jointly training on the synthetic target data and the labeled target data and applying the Pseudo-label method [12] to the unlabeled target data. For the Baseline and GAN-baseline, no teacher-student scheme is applied. Moreover, we conduct two further experiments: (i) without the inter-domain teacher, where we substitute it with a joint-training network equipped with the appearance alignment module to handle the labeled source data and labeled target data, and (ii) without the intra-domain teacher, where we replace it with the Pseudo-label method [12] to handle the unlabeled target data.
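For readers unfamiliar with the Pseudo-label method [12] used in these baselines, the core idea is to treat the class with the highest predicted probability as if it were the true label of an unlabeled sample. A minimal per-pixel sketch follows; the function names and the toy predictions are illustrative.

```python
def pseudo_label(probabilities):
    """Index of the class with the highest predicted probability."""
    return max(range(len(probabilities)), key=lambda c: probabilities[c])

def pseudo_label_map(prob_map):
    """Apply the rule to every pixel of a flattened probability map."""
    return [pseudo_label(p) for p in prob_map]

# Predictions of the current model on three unlabeled pixels, three classes.
unlabeled_pred = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
labels = pseudo_label_map(unlabeled_pred)
```

Because the generated labels come from the model's own predictions, early mistakes can be reinforced in later training rounds, which is the bias accumulation that the ablation without the intra-domain teacher is meant to expose.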

The results are shown in Table 2. Without any knowledge transfer from teacher models, neither the knowledge in the labeled source data nor that in the unlabeled target data is well exploited. Since GAN-baseline adopts special treatment to narrow the appearance gap, it performs better than the Baseline model (0.7510 vs. 0.7330 in mean dice), but it still leaves large room for improvement compared to our method. Without the intra-domain teacher, the pseudo-label bias gradually accumulates and deteriorates the segmentation performance, giving a mean dice 0.0620 lower than our Dual-Teacher framework. Without the inter-domain teacher, the mean dice is 0.0127 lower than that of our method, indicating that the prior knowledge of the source data is not effectively utilized. These comparisons show that each teacher model plays a crucial role in our framework and that further improvements are achieved when the two are combined.

4 Conclusion

We present a novel annotation-efficient semi-supervised domain adaptation framework for multi-modality cardiac segmentation. Our method integrates the inter-domain teacher model to leverage cross-modality priors from the source domain, and the intra-domain teacher model to exploit the knowledge embedded in unlabeled target data. Both teacher models transfer the learnt knowledge to the student model, thereby seamlessly combining the merits of semi-supervised learning and domain adaptation. We extensively evaluated our method on the MM-WHS 2017 dataset. Our method can simultaneously utilize cross-modality data and unlabeled data, and outperforms state-of-the-art semi-supervised learning and domain adaptation methods.

Acknowledgments. The work described in this paper was supported by Key-Area Research and Development Program of Guangdong Province, China under Project No. 2020B010165004, Hong Kong Innovation and Technology Fund under Project No. ITS/426/17FP and ITS/311/18FP and National Natural Science Foundation of China under Project No. U1813204.


  • [1] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton (2018) Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235.
  • [2] C. Chen, Q. Dou, H. Chen, J. Qin, and P. Heng (2019) Synergistic image and feature adaptation: towards cross-modality domain adaptation for medical image segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 865–872.
  • [3] Q. Dou, Q. Liu, P. A. Heng, and B. Glocker (2020) Unpaired multi-modal segmentation via knowledge distillation. IEEE Transactions on Medical Imaging.
  • [4] Q. Dou, C. Ouyang, C. Chen, H. Chen, and P. Heng (2018) Unsupervised cross-modality domain adaptation of convnets for biomedical image segmentations with adversarial loss. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 691–697.
  • [5] M. Ghafoorian, A. Mehrtash, T. Kapur, N. Karssemeijer, E. Marchiori, M. Pesteie, C. R. Guttmann, F. de Leeuw, C. M. Tempany, B. van Ginneken, et al. (2017) Transfer learning for domain adaptation in MRI: application in brain lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 516–524.
  • [6] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • [7] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pp. 1989–1998.
  • [8] Y. Huo, Z. Xu, H. Moon, S. Bao, A. Assad, T. K. Moyo, M. R. Savona, R. G. Abramson, and B. A. Landman (2018) SynSeg-Net: synthetic segmentation without target modality ground truth. IEEE Transactions on Medical Imaging 38 (4), pp. 1016–1025.
  • [9] J. Jiang, Y. Hu, N. Tyagi, P. Zhang, A. Rimner, G. S. Mageras, J. O. Deasy, and H. Veeraraghavan (2018) Tumor-aware, adversarial domain adaptation from CT to MRI for lung cancer segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 777–785.
  • [10] J. Jue, H. Jason, T. Neelam, R. Andreas, B. L. Sean, D. O. Joseph, and V. Harini (2019) Integrating cross-modality hallucinated MRI with CT to aid mediastinal lung tumor segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 221–229.
  • [11] S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
  • [12] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
  • [13] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez (2017) A survey on deep learning in medical image analysis. Medical Image Analysis 42, pp. 60–88.
  • [14] F. Liu, C. Deng, F. Bi, and Y. Yang (2016) Dual teaching: a practical semi-supervised wrapper method. arXiv preprint arXiv:1611.03981.
  • [15] F. Milletari, N. Navab, and S. Ahmadi (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
  • [16] M. Orbes-Arteaga, J. Cardoso, L. Sørensen, C. Igel, S. Ourselin, M. Modat, M. Nielsen, and A. Pai (2019) Knowledge distillation for semi-supervised domain adaptation. In OR 2.0 Context-Aware Operating Theaters and Machine Learning in Clinical Neuroimaging, pp. 68–76.
  • [17] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
  • [18] C. S. Perone, P. Ballester, R. C. Barros, and J. Cohen-Adad (2019) Unsupervised domain adaptation for medical imaging segmentation with self-ensembling. NeuroImage 194, pp. 1–11.
  • [19] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  • [20] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204.
  • [21] V. V. Valindria, N. Pawlowski, M. Rajchl, I. Lavdas, E. O. Aboagye, A. G. Rockall, D. Rueckert, and B. Glocker (2018) Multi-modal learning from unpaired images: application to multi-organ segmentation in CT and MRI. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 547–556.
  • [22] G. Van Tulder and M. de Bruijne (2018) Learning cross-modality representations from multi-modal images. IEEE Transactions on Medical Imaging 38 (2), pp. 638–648.
  • [23] L. Yu, S. Wang, X. Li, C. Fu, and P. Heng (2019) Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 605–613.
  • [24] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.
  • [25] X. Zhuang, L. Li, C. Payer, D. Štern, M. Urschler, M. P. Heinrich, J. Oster, C. Wang, Ö. Smedby, C. Bian, et al. (2019) Evaluation of algorithms for multi-modality whole heart segmentation: an open-access grand challenge. Medical Image Analysis 58, pp. 101537.