Generative Self-training for Cross-domain Unsupervised Tagged-to-Cine MRI Synthesis

by Xiaofeng Liu, et al.

Self-training based unsupervised domain adaptation (UDA) has shown great potential to address the problem of domain shift, when applying a deep learning model trained on a source domain to unlabeled target domains. However, while self-training UDA has demonstrated its effectiveness on discriminative tasks, such as classification and segmentation, via reliable pseudo-label selection based on the softmax discrete histogram, self-training UDA for generative tasks, such as image synthesis, has not been fully investigated. In this work, we propose a novel generative self-training (GST) UDA framework with continuous value prediction and a regression objective for cross-domain image synthesis. Specifically, we propose to filter the pseudo-labels with an uncertainty mask, and to quantify the predictive confidence of generated images with practical variational Bayes learning. Fast test-time adaptation is achieved by a round-based alternating optimization scheme. We validated our framework on the tagged-to-cine magnetic resonance imaging (MRI) synthesis problem, where datasets in the source and target domains were acquired from different scanners or centers. Extensive validations were carried out to verify our framework against popular adversarial training UDA methods. Results show that our GST, using tagged MRI of test subjects in new target domains, improved the synthesis quality by a large margin, compared with the adversarial training UDA methods.








1 Introduction

Deep learning has advanced state-of-the-art machine learning approaches and excelled at learning representations suitable for numerous discriminative and generative tasks [29, 22, 14, 21]. However, a deep learning model trained on labeled data from a source domain, in general, performs poorly on unlabeled data from unseen target domains, partly because of discrepancies between source and target data distributions, i.e., domain shift [15]. The problem of domain shift in medical imaging arises because data are often acquired from different scanners, protocols, or centers [17]. This issue has motivated many researchers to investigate unsupervised domain adaptation (UDA), which aims to transfer knowledge learned from a labeled source domain to different but related unlabeled target domains [30, 33].

There has been a great deal of work to alleviate the domain shift using UDA [30]. Early methods attempted to learn domain-invariant representations or to take instance importance into consideration to bridge the gap between the source and target domains. In addition, due to the ability of deep learning to disentangle explanatory factors of variations, efforts have been made to learn more transferable features. Recent works in UDA incorporated discrepancy measures into network architectures to align feature distributions between source and target domains [19, 18]. This was achieved by either minimizing the discrepancy between feature distribution statistics, e.g., maximum mean discrepancy (MMD), or adversarially learning the feature representations to fool a domain classifier in a two-player minimax game.


Recently, self-training based UDA has emerged as a powerful means to counter unknown labels in the target domain [33], surpassing adversarial learning-based methods on many discriminative UDA benchmarks, e.g., classification and segmentation (i.e., pixel-wise classification) [31, 23, 26]. The core idea behind deep self-training based UDA is to iteratively generate a set of one-hot (or smoothed) pseudo-labels in the target domain, followed by retraining the network on these pseudo-labels with target data [33]. Since outputs of the previous round can be noisy, it is critical to select only high-confidence predictions as reliable pseudo-labels. In discriminative self-training with a softmax output unit and a cross-entropy objective, it is natural to define the confidence of a sample as the maximum of its output softmax probabilities [33]. Calibrating the uncertainty of a regression task, however, can be more challenging. Because of insufficient target data and unreliable pseudo-labels, there can be both epistemic and aleatoric uncertainties [3] in self-training UDA. In addition, while self-training UDA has demonstrated its effectiveness on classification and segmentation, via reliable pseudo-label selection based on the softmax discrete histogram, the same approach for generative tasks, such as image synthesis, remains underexplored.

In this work, we propose a novel generative self-training (GST) UDA framework with continuous value prediction and a regression objective for tagged-to-cine magnetic resonance (MR) image synthesis. More specifically, we propose to filter the pseudo-labels with an uncertainty mask, and to quantify the predictive confidence of generated images with practical variational Bayes learning. Fast test-time adaptation is achieved by a round-based alternating optimization scheme. Our contributions are summarized as follows:

  • We propose to achieve cross-scanner and cross-center test-time UDA of tagged-to-cine MR image synthesis, which can potentially reduce the extra cine MRI acquisition time and cost.

  • A novel GST UDA scheme is proposed, which controls the confident pseudo-label (continuous value) selection with a practical Bayesian uncertainty mask. Both the aleatoric and epistemic uncertainties in GST UDA are investigated.

  • Both quantitative and qualitative evaluation results, using a total of 1,768 paired slices of tagged and cine MRI from the source domain and tagged MR slices of target subjects from the cross-scanner and cross-center target domains, demonstrate the validity of our proposed GST framework and its superiority to conventional adversarial training based UDA methods.

Figure 1: Illustration of our generative self-training UDA for tagged-to-cine MR image synthesis. In each iteration, two-step alternating training is carried out.

2 Methodology

In our setting of UDA image synthesis, we have paired resized tagged MR images $\{x_i^s\}_{i=1}^{N_s}$ and cine MR images $\{y_i^s\}_{i=1}^{N_s}$, indexed by $i$, from the source domain $\mathcal{D}_s$, and target samples $\{x_j^t\}_{j=1}^{N_t}$ from the unlabeled target domain $\mathcal{D}_t$, indexed by $j$. In both training and testing, the ground-truth target labels, i.e., cine MR images in the target domain, are inaccessible, and the pseudo-label $\tilde{y}_j^t$ of $x_j^t$ is iteratively generated in a self-training scheme [33, 16]. In this work, we adopt the U-Net-based Pix2Pix [9] as our translator backbone, and initialize the network parameters by pre-training on the labeled source domain $\mathcal{D}_s$. In what follows, alternating-optimization-based self-training is applied to gradually update the U-Net part for target-domain image synthesis by training on both $\mathcal{D}_s$ and $\mathcal{D}_t$. Fig. 1 illustrates the proposed algorithm flow, which is detailed below.

2.1 Generative Self-training UDA

The conventional self-training regards the pseudo-label as a learnable latent variable in the form of a categorical histogram, and assigns an all-zero vector label to uncertain samples or pixels to filter them out of the loss calculation [33, 16]. Since not all pseudo-labels are reliable, we define a confidence threshold to progressively select confident pseudo-labels [32]. This is akin to self-paced learning, which learns samples in an easy-to-hard order [12, 27]. In classification or segmentation tasks, the confidence can simply be measured by the maximum softmax output probability [33]. The output of a generation task, however, consists of continuous values, so setting the pseudo-label to 0 cannot drop an uncertain sample from the regression loss calculation.

Therefore, we first propose to formulate generative self-training as a unified regression loss minimization scheme, where pseudo-labels can take pixel-wise continuous values and uncertain pixels are indicated by an uncertainty mask $m_n$, where $n$ indexes the pixels in an image, and $m_n \in \{0, 1\}$:

$$\min_{w,\, \tilde{y}^t}\ \mathcal{L} = \underbrace{\sum_{i=1}^{N_s} \sum_{n=1}^{N} \big\| G_w(x_i^s)_n - y_{i,n}^s \big\|_2^2}_{\mathcal{L}_s} + \underbrace{\sum_{j=1}^{N_t} \sum_{n=1}^{N} m_{j,n} \big\| G_w(x_j^t)_n - \tilde{y}_{j,n}^t \big\|_2^2}_{\mathcal{L}_t},$$

where $\tilde{y}_{j,n}^t$ is the pseudo-label of the $n$-th pixel of the $j$-th target image. For example, $y_{i,n}^s$ indicates the $n$-th pixel of the $i$-th source-domain ground-truth cine MR image $y_i^s$. $G_w(x_i^s)$ and $G_w(x_j^t)$ represent the generated source and target images, respectively. $\mathcal{L}_s$ and $\mathcal{L}_t$ are the regression losses of the source and target domain samples, respectively. Notably, there is only one network, parameterized with $w$, which is updated with the loss in both domains.
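As a concrete illustration, the masked objective above can be sketched in a few lines of NumPy (a minimal sketch under our own variable names, not the authors' implementation): uncertain target pixels are dropped from the loss via the binary mask, rather than by zeroing their labels.

```python
import numpy as np

def masked_regression_loss(pred_src, y_src, pred_tgt, pseudo_tgt, mask_tgt):
    """Unified regression objective: source L2 loss plus mask-filtered target L2 loss.

    pred_src, y_src      : source predictions and ground-truth cine images
    pred_tgt, pseudo_tgt : target predictions and continuous pseudo-labels
    mask_tgt             : binary uncertainty mask; 1 keeps a pixel, 0 drops it
    """
    loss_src = np.sum((pred_src - y_src) ** 2)
    # Uncertain target pixels are removed from the loss rather than
    # zeroed in the label, since the labels here are continuous values.
    loss_tgt = np.sum(mask_tgt * (pred_tgt - pseudo_tgt) ** 2)
    return loss_src + loss_tgt
```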

$u_n$ is the to-be-estimated uncertainty of a pixel and determines the value of the uncertainty mask $m_n$ with a threshold $u_{th}$. $u_{th}$ is a critical parameter that controls pseudo-label learning and selection, and it is determined by a single meta portion parameter $p$, indicating the portion of pixels to be selected in the target domain. Empirically, we define $u_{th}$ in each iteration by sorting $u_n$ in increasing order and setting $u_{th}$ to the largest uncertainty value within the most confident $p$ portion of pixels.
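The portion-based threshold can be sketched as follows (a hypothetical helper; the names `uncertainty_mask`, `u`, and `p` are ours, not the paper's code):

```python
import numpy as np

def uncertainty_mask(u, p):
    """Select the p fraction of most confident (lowest-uncertainty) pixels.

    u : array of per-pixel uncertainties for a target image
    p : meta portion parameter in (0, 1]
    Returns the binary mask and the threshold u_th.
    """
    k = max(1, int(np.ceil(p * u.size)))   # number of pixels to keep
    u_th = np.sort(u.ravel())[k - 1]       # k-th smallest uncertainty
    mask = (u <= u_th).astype(np.float32)  # 1 = confident, 0 = filtered out
    return mask, u_th
```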

2.2 Bayesian Uncertainty Mask for Target Samples

Determining the mask value $m_n$ for a target sample requires estimating the uncertainty $u_n$ in our self-training UDA. Notably, the lack of sufficient target-domain data can result in epistemic uncertainty w.r.t. the model parameters, while the noisy pseudo-labels can lead to aleatoric uncertainty [3, 11, 8].

To counter this, we model the epistemic uncertainty via Bayesian neural networks, which learn a posterior distribution $p(w \mid \mathcal{D})$ over the probabilistic model parameters rather than a set of deterministic parameters [25]. In particular, a tractable solution is to replace the true posterior distribution with a variational approximation $q(w)$, and dropout variational inference is a practical technique for this, which can be seen as using the Bernoulli distribution as the approximation distribution [5]. Performing $T$ predictions with independent dropout sampling is referred to as Monte Carlo (MC) dropout. We use the mean squared error (MSE) to measure the epistemic uncertainty as in [25], which assesses a one-dimensional regression model similar to [4]. Therefore, the epistemic uncertainty with MSE of each pixel over $T$ dropout generations is given by

$$u_n^{ep} = \frac{1}{T} \sum_{t=1}^{T} \big( \hat{y}_{n,t} - \bar{y}_n \big)^2,$$

where $\bar{y}_n = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_{n,t}$ is the predictive mean of the $T$ stochastic outputs $\hat{y}_{n,t}$.
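A minimal sketch of this MC-dropout estimate (the `predict` callable stands in for the translator with dropout kept active at test time; all names are ours, not the authors'):

```python
import numpy as np

def mc_dropout_uncertainty(predict, x, T=10):
    """Epistemic uncertainty via T stochastic forward passes (MC dropout).

    predict : callable returning one stochastic prediction per call
    x       : input image passed to `predict`
    Returns the predictive mean (used as the pseudo-label) and the
    per-pixel variance across the T samples (the epistemic term).
    """
    samples = np.stack([predict(x) for _ in range(T)])  # (T, ...) stacked outputs
    y_bar = samples.mean(axis=0)                        # predictive mean
    u_ep = ((samples - y_bar) ** 2).mean(axis=0)        # MSE around the mean
    return y_bar, u_ep
```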

Because of the different hardness and divergence, and because the pseudo-label noise can vary across target samples $x_j^t$, heteroscedastic aleatoric uncertainty modeling is required [24, 13]. In this work, we use our network to transform $x_j^t$, with its head split to predict both the pseudo-label $\tilde{y}_j^t$ and a variance map $\hat{\sigma}_j^2$, whose element $\hat{\sigma}_{j,n}^2$ is the predicted variance for the $n$-th pixel. We do not need “uncertainty labels” to learn the variance prediction. Rather, we can learn $\hat{\sigma}^2$ implicitly from a regression loss function [13, 11]. The masked regression loss can be formulated as

$$\mathcal{L}_t = \sum_{j=1}^{N_t} \sum_{n=1}^{N} m_{j,n} \left[ \frac{\big\| G_w(x_j^t)_n - \tilde{y}_{j,n}^t \big\|_2^2}{2 \hat{\sigma}_{j,n}^2} + \frac{1}{2} \log \hat{\sigma}_{j,n}^2 \right],$$

which consists of a variance-normalized residual regression term and an uncertainty regularization term. The second, regularization term keeps the network from predicting an infinite uncertainty, i.e., zero loss, for all the data points. The averaged aleatoric uncertainty over $T$ MC dropout passes can then be measured by the mean of the predicted variances, $u_n^{al} = \frac{1}{T} \sum_{t=1}^{T} \hat{\sigma}_{n,t}^2$ [13, 11].
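This heteroscedastic loss can be sketched as follows, in the style of Kendall and Gal [11]. Predicting the log-variance rather than the variance, as is common practice for numerical stability, is our assumption rather than a detail stated in the text:

```python
import numpy as np

def heteroscedastic_masked_loss(pred, pseudo, log_var, mask):
    """Masked regression loss with a learned per-pixel variance.

    pred    : network output pixels
    pseudo  : continuous pseudo-labels
    log_var : predicted per-pixel log-variance (assumed parameterization)
    mask    : binary uncertainty mask
    """
    var = np.exp(log_var)
    # Residuals are down-weighted on pixels the network declares uncertain...
    residual = (pred - pseudo) ** 2 / (2.0 * var)
    # ...while the log-variance term penalizes predicting infinite
    # uncertainty everywhere, which would otherwise zero the loss.
    regularizer = 0.5 * log_var
    return np.sum(mask * (residual + regularizer))
```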

Moreover, minimizing Eq. (4) can be regarded as a Lagrangian with a multiplier applied to a constraint on the predicted variance, where the multiplier indicates the strength of the applied constraint. The constraint term essentially controls the target-domain predictive uncertainty, which is helpful for UDA [7]. Our final pixel-wise self-training UDA uncertainty is a combination of the two uncertainties [11].

2.3 Training Protocol

As pointed out in [6], directly optimizing the self-training objectives can be difficult, and thus deterministic annealing expectation maximization (EM) algorithms are often used instead. Specifically, generative self-training can be solved by alternating optimization based on the following steps a) and b).

a) Pseudo-label and uncertainty mask generation.  With the current network parameters $w$, apply MC dropout for $T$ image translations of each target-domain tagged MR image $x_j^t$. We estimate the pixel-wise uncertainty $u_n$, and calculate the uncertainty mask $m_n$ with the threshold $u_{th}$. We set the pseudo-label of each selected pixel in this round to the predictive mean $\bar{y}_n$, i.e., the average value of the $T$ outputs.

b) Network retraining.  Fix the pseudo-labels $\tilde{y}^t$ and the uncertainty mask $m$, and solve

$$\min_{w}\ \mathcal{L}_s + \mathcal{L}_t$$

to update $w$. Carrying out steps a) and b) once is defined as one round in self-training. Intuitively, step a) is equivalent to simultaneously conducting pseudo-label learning and selection. To solve step b), we can use a typical gradient method, e.g., stochastic gradient descent (SGD). The meta parameter $p$ is linearly increased from 30% to 80% over the course of training to incorporate more pseudo-labels in the subsequent rounds, as in [33].
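The growing pseudo-label portion can be sketched with a hypothetical schedule helper; the 30%-to-80% endpoints come from the text, while the exact per-round interpolation is our assumption:

```python
def linear_portion_schedule(round_idx, n_rounds, p_start=0.30, p_end=0.80):
    """Meta portion parameter p grows linearly across self-training rounds,
    so later rounds admit more pseudo-labeled pixels (as in [33])."""
    if n_rounds == 1:
        return p_end
    frac = round_idx / (n_rounds - 1)  # 0 at the first round, 1 at the last
    return p_start + frac * (p_end - p_start)
```

Each round would then call step a) with the current `p` to rebuild the mask, followed by step b) to retrain the network on the enlarged pseudo-label set.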

3 Experiments and Results

Figure 2: Comparison of different UDA methods on the cross-scanner tagged-to-cine MR image synthesis task, including our proposed GST, GST-A, and GST-E, adversarial UDA [2]*, and Pix2Pix [9] without adaptation. * indicates the first attempt at tagged-to-cine MR image synthesis. GT indicates the ground-truth.

We evaluated our framework on both cross-scanner and cross-center tagged-to-cine MR image synthesis tasks. For the labeled source domain, a total of 1,768 paired tagged and cine MR images from 10 healthy subjects at clinical center A were acquired. We followed the test time UDA setting [10], which uses only one unlabeled target subject in UDA training and testing.

For a fair comparison, we adopted Pix2Pix [9] for our source-domain training as in [20], and used the trained U-Net as the source model for all of the comparison methods. To align the absolute value of each loss term, we empirically set the balancing weights. Our framework was implemented using the PyTorch deep learning toolbox. The GST training was performed on a V100 GPU, which took about 30 min. We note that the $T$ MC dropout passes can be processed in parallel. In each iteration, we sampled the same number of source and target domain samples.

| Methods      | L1 ↓ (cross-scanner) | SSIM ↑ (cross-scanner) | PSNR ↑ (cross-scanner) | IS ↑ (cross-scanner) | IS ↑ (cross-center) |
| w/o UDA [9]  | 176.4±0.1 | 0.8325±0.0012 | 26.31±0.05 | 8.73±0.12  | 5.32±0.11 |
| ADDA [28]    | 168.2±0.2 | 0.8784±0.0013 | 33.15±0.04 | 10.38±0.11 | 8.69±0.10 |
| GAUDA [2]    | 161.7±0.1 | 0.8813±0.0012 | 33.27±0.06 | 10.62±0.13 | 8.83±0.14 |
| GST          | 158.6±0.2 | 0.9078±0.0011 | 34.48±0.05 | 12.63±0.12 | 9.76±0.11 |
| GST-A        | 159.5±0.3 | 0.8997±0.0011 | 34.03±0.04 | 12.03±0.12 | 9.54±0.13 |
| GST-E        | 159.8±0.1 | 0.9026±0.0013 | 34.05±0.05 | 11.95±0.11 | 9.58±0.12 |
Table 1: Numerical comparisons of cross-scanner and cross-center evaluations. ± standard deviation is reported over three evaluations.

3.1 Cross-scanner tagged-to-cine MR image synthesis

In the cross-scanner image synthesis setting, a total of 1,014 paired tagged and cine MR images from 5 healthy subjects in the target domain were acquired at clinical center A with a different scanner. As a result, there was an appearance discrepancy between the source and target domains.

The synthesis results using the source-domain Pix2Pix [9] without UDA training, gradually adversarial UDA (GAUDA) [2], and our proposed framework are shown in Fig. 2. Note that GAUDA with source-domain initialization took about 2 hours for training, which was four times slower than our GST framework. In addition, it was challenging to stabilize the adversarial training [1], which yielded checkerboard artifacts. Furthermore, the content hallucinated by the domain-wise distribution alignment loss exhibited relatively large differences in tongue shape and texture compared with the real cine MR images. By contrast, our framework achieved the adaptation with relatively limited target data in the test-time UDA setting [10], with a faster convergence time. In addition, our framework did not rely on adversarial training and generated visually pleasing results with better structural consistency, as shown in Fig. 2, which is crucial for subsequent analyses such as segmentation.

For an ablation study, in Fig. 2, we show the performance of GST without the aleatoric or epistemic uncertainty for the uncertainty mask, i.e., GST-A or GST-E. Without measuring the aleatoric uncertainty caused by the inaccurate label, GST-A exhibited a small distortion of the shape and boundary. Without measuring the epistemic uncertainty, GST-E yielded noisier results than GST.

The synthesized images were expected to have realistic-looking textures and to be structurally cohesive with their corresponding ground-truth images. For quantitative evaluation, we adopted widely used evaluation metrics: mean L1 error, structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and the unsupervised inception score (IS) [20]. Table 1 lists numerical comparisons using 5 testing subjects. The proposed GST outperformed GAUDA [2] and ADDA [28] w.r.t. L1 error, SSIM, PSNR, and IS by a large margin.

3.2 Cross-center tagged-to-cine MR image synthesis

To further demonstrate the generality of our framework for the cross-center tagged-to-cine MR image synthesis task, we collected 120 tagged MR slices of a subject at clinical center B with a different scanner. As a result, the data at clinical center B had different soft tissue contrast and tag spacing, compared with clinical center A, and the head position was also different.

The qualitative results in Fig. 3 show that the anatomical structure of the tongue is better maintained using our framework with both the aleatoric and epistemic uncertainties. Due to the large domain gap between the datasets of the two centers, the overall synthesis quality was not as good as in the cross-scanner image synthesis task, as visually assessed. In Table 1, we provide the quantitative comparison using IS, which does not need paired ground-truth cine MR images [20]. Consistent with the cross-scanner setting, our GST outperformed the adversarial training methods, including GAUDA and ADDA [2, 28], indicating that self-training can be a powerful technique for generative UDA tasks, similar to conventional discriminative self-training [33, 16].

Figure 3: Comparison of different UDA methods on the cross-center tagged-to-cine MR image synthesis task, including our proposed GST, GST-A, and GST-E, adversarial UDA [2]*, and Pix2Pix [9] without adaptation. * indicates the first attempt at tagged-to-cine MR image synthesis.

4 Discussion and Conclusion

In this work, we presented a novel generative self-training framework for UDA and applied it to cross-scanner and cross-center tagged-to-cine MR image synthesis tasks. With a practical yet principled Bayesian uncertainty mask, our framework was able to control the confident pseudo-label selection. In addition, we systematically investigated both the aleatoric and epistemic uncertainties in generative self-training UDA. Our experimental results demonstrated that our framework yielded superior performance, compared with popular adversarial training UDA methods, as quantitatively and qualitatively assessed. The synthesized cine MRI with test-time UDA can potentially be used to segment the tongue and to observe surface motion, without additional acquisition cost and time.


This work is supported by NIH R01DC014717, R01DC018511, and R01CA133015.


  • [1] T. Che, X. Liu, S. Li, Y. Ge, R. Zhang, C. Xiong, and Y. Bengio (2021) Deep verifier networks: verification of deep discriminative models with deep generative models. AAAI. Cited by: §3.1.
  • [2] S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian (2020) Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12455–12464. Cited by: Figure 2, Figure 3, §3.1, §3.1, §3.2, Table 1.
  • [3] A. Der Kiureghian and O. Ditlevsen (2009) Aleatory or epistemic? Does it matter?. Structural Safety 31 (2), pp. 105–112. Cited by: §1, §2.2.
  • [4] W. Fruehwirt, A. D. Cobb, M. Mairhofer, L. Weydemann, H. Garn, R. Schmidt, T. Benke, P. Dal-Bianco, G. Ransmayr, M. Waser, et al. (2018) Bayesian deep neural networks for low-cost neurophysiological markers of alzheimer’s disease severity. arXiv preprint arXiv:1812.04994. Cited by: §2.2.
  • [5] Y. Gal and Z. Ghahramani (2015) Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158. Cited by: §2.2.
  • [6] Y. Grandvalet and Y. Bengio (2006) Entropy regularization. In Semi-Supervised Learning, MIT Press. Cited by: §2.3.
  • [7] L. Han, Y. Zou, R. Gao, L. Wang, and D. Metaxas (2019) Unsupervised domain adaptation via calibrating uncertainties. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 99–102. Cited by: §2.2.
  • [8] S. Hu, D. Worrall, S. Knegt, B. Veeling, H. Huisman, and M. Welling (2019) Supervised uncertainty quantification for segmentation with multiple annotations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 137–145. Cited by: §2.2.
  • [9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 1125–1134. Cited by: §2, Figure 2, Figure 3, §3.1, Table 1, §3.
  • [10] N. Karani, E. Erdil, K. Chaitanya, and E. Konukoglu (2021) Test-time adaptable neural networks for robust medical image segmentation. Medical Image Analysis 68, pp. 101907. Cited by: §3.1, §3.
  • [11] A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. arXiv preprint arXiv:1703.04977. Cited by: §2.2, §2.2, §2.2.
  • [12] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197. Cited by: §2.1.
  • [13] Q. V. Le, A. J. Smola, and S. Canu (2005) Heteroscedastic gaussian process regression. In Proceedings of the 22nd international conference on Machine learning, pp. 489–496. Cited by: §2.2.
  • [14] X. Liu, F. Fan, L. Kong, Z. Diao, W. Xie, J. Lu, and J. You (2020) Unimodal regularized neuron stick-breaking for ordinal classification. Neurocomputing 388, pp. 34–44. Cited by: §1.
  • [15] X. Liu, B. Hu, L. Jin, X. Han, F. Xing, J. Ouyang, J. Lu, G. El Fakhri, and J. Woo (2021) Domain generalization under conditional and label shifts via variational Bayesian inference. In IJCAI. Cited by: §1.
  • [16] X. Liu, B. Hu, X. Liu, J. Lu, J. You, and L. Kong (2020) Energy-constrained self-training for unsupervised domain adaptation. ICPR. Cited by: §2.1, §2, §3.2.
  • [17] X. Liu, X. Liu, B. Hu, W. Ji, F. Xing, J. Lu, J. You, C. J. Kuo, G. E. Fakhri, and J. Woo (2021) Subtype-aware unsupervised domain adaptation for medical diagnosis. AAAI. Cited by: §1.
  • [18] X. Liu, F. Xing, G. El Fakhri, and J. Woo (2021) A unified conditional disentanglement framework for multimodal brain mr image translation. In ISBI, pp. 10–14. Cited by: §1.
  • [19] X. Liu, F. Xing, G. El Fakhri, and J. Woo (2021) Adapting off-the-shelf source segmenter for target medical image segmentation. In MICCAI, Cited by: §1.
  • [20] X. Liu, F. Xing, J. L. Prince, A. Carass, M. Stone, G. E. Fakhri, and J. Woo (2021) Dual-cycle constrained bijective VAE-GAN for tagged-to-cine magnetic resonance image synthesis. ISBI. Cited by: §3.1, §3.2, §3.
  • [21] X. Liu, F. Xing, C. Yang, C. J. Kuo, G. El Fakhri, and J. Woo (2021) Symmetric-constrained irregular structure inpainting for brain mri registration with tumor pathology. In Brainlesion: glioma, multiple sclerosis, stroke and traumatic brain injuries. BrainLes (Workshop), Vol. 12658, pp. 80. Cited by: §1.
  • [22] X. Liu, Y. Zou, Y. Song, C. Yang, J. You, and B. K Vijaya Kumar (2018) Ordinal regression with neuron stick-breaking for medical diagnosis. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §1.
  • [23] K. Mei, C. Zhu, J. Zou, and S. Zhang (2020) Instance adaptive self-training for unsupervised domain adaptation. ECCV. Cited by: §1.
  • [24] D. A. Nix and A. S. Weigend (1994) Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), Vol. 1, pp. 55–60. Cited by: §2.2.
  • [25] C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer school on machine learning, pp. 63–71. Cited by: §2.2.
  • [26] I. Shin, S. Woo, F. Pan, and I. S. Kweon (2020) Two-phase pseudo label densification for self-training based domain adaptation. In European Conference on Computer Vision, pp. 532–548. Cited by: §1.
  • [27] K. Tang, V. Ramanathan, L. Fei-Fei, and D. Koller (2012) Shifting weights: adapting object detectors from image to video. In NIPS, Cited by: §2.1.
  • [28] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR, Cited by: §3.1, §3.2, Table 1.
  • [29] J. Wang, X. Liu, F. Wang, L. Zheng, F. Gao, H. Zhang, X. Zhang, W. Xie, and B. Wang (2021) Automated interpretation of congenital heart disease from multi-view echocardiograms. Medical Image Analysis 69, pp. 101942. Cited by: §1.
  • [30] M. Wang and W. Deng (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153. Cited by: §1, §1.
  • [31] C. Wei, K. Shen, Y. Chen, and T. Ma (2021) Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622. Cited by: §1.
  • [32] X. Zhu (2007) Semi-supervised learning tutorial. In ICML tutorial, Cited by: §2.1.
  • [33] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang (2019) Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5982–5991. Cited by: §1, §1, §2.1, §2.3, §2, §3.2.