1 Introduction
Deep learning has advanced the state of the art in machine learning and has excelled at learning representations suitable for numerous discriminative and generative tasks
[29, 22, 14, 21]. However, a deep learning model trained on labeled data from a source domain generally performs poorly on unlabeled data from unseen target domains, partly because of discrepancies between the source and target data distributions, i.e., domain shift [15]. The problem of domain shift in medical imaging arises because data are often acquired from different scanners, protocols, or centers [17]. This issue has motivated many researchers to investigate unsupervised domain adaptation (UDA), which aims to transfer knowledge learned from a labeled source domain to different but related unlabeled target domains [30, 33]. There has been a great deal of work on alleviating domain shift with UDA [30]. Early methods attempted to learn domain-invariant representations or to take instance importance into consideration to bridge the gap between the source and target domains. In addition, owing to the ability of deep learning to disentangle explanatory factors of variation, efforts have been made to learn more transferable features. Recent works in UDA incorporate discrepancy measures into network architectures to align feature distributions between the source and target domains [19, 18]
. This was achieved either by minimizing the discrepancy between feature distribution statistics, e.g., maximum mean discrepancy (MMD), or by adversarially learning feature representations to fool a domain classifier in a two-player minimax game
[19]. Recently, self-training-based UDA has emerged as a powerful means to counter unknown labels in the target domain [33], surpassing adversarial learning-based methods on many discriminative UDA benchmarks, e.g., classification and segmentation (i.e., pixel-wise classification) [31, 23, 26]. The core idea behind deep self-training-based UDA is to iteratively generate a set of one-hot (or smoothed) pseudo-labels in the target domain, and then to retrain the network on the target data with these pseudo-labels [33]
. Since the outputs of the previous round can be noisy, it is critical to select only high-confidence predictions as reliable pseudo-labels. In discriminative self-training with a softmax output unit and a cross-entropy objective, it is natural to define the confidence of a sample as the maximum of its output softmax probabilities
[33]. Calibrating the uncertainty of a regression task, however, can be more challenging. Because of insufficient target data and unreliable pseudo-labels, both epistemic and aleatoric uncertainties [3] can arise in self-training UDA. In addition, while self-training UDA has demonstrated its effectiveness on classification and segmentation via reliable pseudo-label selection based on the softmax discrete histogram, the same approach remains underexplored for generative tasks such as image synthesis.

In this work, we propose a novel generative self-training (GST) UDA framework with continuous-value prediction and a regression objective for tagged-to-cine magnetic resonance (MR) image synthesis. More specifically, we propose to filter the pseudo-labels with an uncertainty mask and to quantify the predictive confidence of the generated images with practical variational Bayes learning. Fast test-time adaptation is achieved by a round-based alternating optimization scheme. Our contributions are summarized as follows:
We propose to achieve cross-scanner and cross-center test-time UDA for tagged-to-cine MR image synthesis, which can potentially reduce the extra cine MRI acquisition time and cost.
A novel GST UDA scheme is proposed, which controls confident pseudo-label (continuous-value) selection with a practical Bayesian uncertainty mask. Both the aleatoric and epistemic uncertainties in GST UDA are investigated.
Both quantitative and qualitative evaluation results, using a total of 1,768 paired slices of tagged and cine MRI from the source domain and tagged MR slices of target subjects from the cross-scanner and cross-center target domains, demonstrate the validity of our proposed GST framework and its superiority to conventional adversarial-training-based UDA methods.
2 Methodology
In our setting of UDA image synthesis, we have paired resized tagged MR images $\{x_i^s\}_{i=1}^{N_s}$ and cine MR images $\{y_i^s\}_{i=1}^{N_s}$, indexed by $i$, from the source domain $\mathcal{S}$, and target samples $\{x_j^t\}_{j=1}^{N_t}$ from the unlabeled target domain $\mathcal{T}$, indexed by $j$. In both training and testing, the ground-truth target labels, i.e., cine MR images in the target domain, are inaccessible, and the pseudo-label $\hat{y}_j^t$ of $x_j^t$ is iteratively generated in a self-training scheme [33, 16]
. In this work, we adopt the U-Net-based Pix2Pix [9] as our translator backbone and initialize the network parameters $\theta$ by pre-training on the labeled source domain $\mathcal{S}$. In what follows, alternating-optimization-based self-training is applied to gradually update the U-Net part for target-domain image synthesis by training on both $\mathcal{S}$ and $\mathcal{T}$. Fig. 1 illustrates the proposed algorithm flow, which is detailed below.

2.1 Generative Self-training UDA
Conventional self-training regards the pseudo-label $\hat{y}$ as a learnable latent variable in the form of a categorical histogram, and assigns an all-zero vector label to uncertain samples or pixels to filter them out of the loss calculation [33, 16]. Since not all pseudo-labels are reliable, we define a confidence threshold $\tau$ to progressively select confident pseudo-labels [32]. This is akin to self-paced learning, which learns samples in an easy-to-hard order [12, 27]. In classification or segmentation tasks, the confidence can simply be measured by the maximum softmax output histogram probability [33]. The output of a generation task, however, consists of continuous values, so setting the pseudo-label to 0 cannot drop an uncertain sample from the regression loss calculation.

Therefore, we first propose to formulate generative self-training as a unified regression loss minimization scheme, where pseudo-labels take pixel-wise continuous values and uncertain pixels are indicated with a binary uncertainty mask $m_j^n$, where $n \in \{1,\dots,N\}$ indexes the pixels in each image:
(1) $\mathcal{L}_s(\theta)=\sum_{i=1}^{N_s}\sum_{n=1}^{N}\left(\tilde{y}_i^{s,n}-y_i^{s,n}\right)^2$

(2) $\mathcal{L}_t(\theta,\hat{y}^t)=\sum_{j=1}^{N_t}\sum_{n=1}^{N} m_j^n\left(\tilde{y}_j^{t,n}-\hat{y}_j^{t,n}\right)^2$
where $\tilde{y}=f(x;\theta)$ denotes the translator output and $m_j^n\in\{0,1\}$. For example, $y_i^{s,n}$ indicates the $n$-th pixel of the $i$-th source-domain ground-truth cine MR image $y_i^s$. $\tilde{y}_i^s$ and $\tilde{y}_j^t$ represent the generated source and target images, respectively. $\mathcal{L}_s$ and $\mathcal{L}_t$ are the regression losses of the source- and target-domain samples, respectively. Notably, there is only one network, parameterized with $\theta$, which is updated with the loss in both domains.
$u_j^n$ is the to-be-estimated uncertainty of a pixel and determines the value of the uncertainty mask $m_j^n$ via a threshold $\tau$. $\tau$ is a critical parameter that controls pseudo-label learning and selection, and it is determined by a single meta portion parameter $p$, indicating the portion of pixels to be selected in the target domain. Empirically, we define $\tau$ in each iteration by sorting the uncertainties $u_j^n$ in increasing order and setting $\tau$ to the minimum value of the top-$p$ percentile rank.

2.2 Bayesian Uncertainty Mask for Target Samples
Determining the mask value $m_j^n$ for a target sample requires estimating the uncertainty $u_j^n$ in our self-training UDA. Notably, the lack of sufficient target-domain data can result in epistemic uncertainty w.r.t. the model parameters, while the noisy pseudo-labels can lead to aleatoric uncertainty [3, 11, 8].
To counter this, we model the epistemic uncertainty via Bayesian neural networks, which learn a posterior distribution over the probabilistic model parameters rather than a set of deterministic parameters [25]. In particular, a tractable solution is to replace the true posterior distribution with a variational approximation, and dropout variational inference offers a practical technique; this can be seen as using the Bernoulli distribution as the approximating distribution [5]. Prediction repeated $K$ times with independent dropout sampling is referred to as Monte Carlo (MC) dropout. We use the mean squared error (MSE) to measure the epistemic uncertainty as in [25], which assesses a one-dimensional regression model similar to [4]. Therefore, the epistemic uncertainty with MSE of each pixel over $K$ dropout generations is given by

(3) $u_j^{n,ep}=\frac{1}{K}\sum_{k=1}^{K}\left(\tilde{y}_{j,k}^{t,n}-\mu_j^n\right)^2$
where $\mu_j^n=\frac{1}{K}\sum_{k=1}^{K}\tilde{y}_{j,k}^{t,n}$ is the predictive mean of $\tilde{y}_j^{t,n}$.
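To make Eq. (3) concrete, the following NumPy sketch computes the per-pixel predictive mean and epistemic uncertainty from $K$ stochastic forward passes. The function name and the simulated predictions are ours, not from the paper; in practice the $K$ outputs would come from the translator with dropout kept active at test time.

```python
import numpy as np

def epistemic_uncertainty(mc_outputs):
    """Per-pixel epistemic uncertainty from K MC-dropout predictions.

    mc_outputs: array of shape (K, H, W) holding K stochastic
                translations of the same target tagged MR slice.
    Returns the predictive mean (later reused as the pseudo-label)
    and the per-pixel MSE around that mean, as in Eq. (3).
    """
    mu = mc_outputs.mean(axis=0)                  # predictive mean over K passes
    u_ep = ((mc_outputs - mu) ** 2).mean(axis=0)  # spread across the K passes
    return mu, u_ep

# Simulated example: K = 8 dropout passes over a 4x4 slice.
rng = np.random.default_rng(0)
mc = rng.normal(loc=0.5, scale=0.1, size=(8, 4, 4))
mu, u_ep = epistemic_uncertainty(mc)
```

Note that this quantity is simply the variance of the $K$ predictions at each pixel, so it vanishes when the dropout samples agree.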
Because of the different hardness and divergence, and because the pseudo-label noise can vary for different $x_j^t$, heteroscedastic uncertainty modeling is required [24, 13]. In this work, we use our network to transform $x_j^t$, with its head split to predict both $\tilde{y}_j^t$ and the variance map $\sigma_j^2$, whose element $(\sigma_j^n)^2$ is the predicted variance for the $n$-th pixel. We do not need "uncertainty labels" to learn variance prediction. Rather, we can learn $\sigma_j^2$ implicitly from a regression loss function [13, 11]. The masked regression loss can be formulated as

(4) $\mathcal{L}_t(\theta,\hat{y}^t)=\sum_{j=1}^{N_t}\sum_{n=1}^{N} m_j^n\left[\frac{\left(\tilde{y}_j^{t,n}-\hat{y}_j^{t,n}\right)^2}{2(\sigma_j^n)^2}+\frac{1}{2}\log(\sigma_j^n)^2\right]$
which consists of a variance-normalized residual regression term and an uncertainty regularization term. The regularization term keeps the network from predicting an infinite uncertainty, i.e., zero loss, for all data points. Then, the averaged aleatoric uncertainty over $K$ MC dropout samples can be measured by $u_j^{n,al}=\frac{1}{K}\sum_{k=1}^{K}(\sigma_{j,k}^n)^2$ [13, 11].
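A minimal NumPy sketch of the masked loss in Eq. (4). Following common practice, the network head is assumed to output the log-variance $\log\sigma^2$ for numerical stability; that choice, and the function name, are our assumptions rather than details stated here.

```python
import numpy as np

def masked_heteroscedastic_loss(pred, pseudo_label, log_var, mask):
    """Masked regression loss with aleatoric (heteroscedastic) weighting.

    pred:         generated target image (network output)
    pseudo_label: current pseudo-label for the same image
    log_var:      predicted per-pixel log-variance log(sigma^2)
    mask:         binary uncertainty mask (1 = confident pixel)
    """
    residual = (pred - pseudo_label) ** 2
    # variance-normalised residual + uncertainty regularisation term
    per_pixel = 0.5 * np.exp(-log_var) * residual + 0.5 * log_var
    return (mask * per_pixel).sum() / max(mask.sum(), 1.0)

# With a perfect prediction and unit variance the loss is zero.
pred = np.full((4, 4), 0.5)
loss = masked_heteroscedastic_loss(pred, pred.copy(), np.zeros((4, 4)), np.ones((4, 4)))
```

Predicting the log-variance avoids a division by zero and an explicit positivity constraint on $\sigma^2$.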
Moreover, minimizing Eq. (4) can be regarded as minimizing a Lagrangian whose multiplier indicates the strength of a constraint applied to the predicted variance. The constraint term essentially controls the target-domain predictive uncertainty, which is helpful for UDA [7]. Our final pixel-wise self-training UDA uncertainty is a combination of the two uncertainties, $u_j^n = u_j^{n,ep} + u_j^{n,al}$ [11].
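To illustrate the combination of the two uncertainties, here is a standalone NumPy sketch following the common decomposition of [11]: the epistemic part is the variance of the $K$ MC-dropout predictions and the aleatoric part is the average of the $K$ predicted variances. The additive combination rule is our reading of the text, so treat this as a sketch under that assumption.

```python
import numpy as np

def total_uncertainty(mc_means, mc_vars):
    """Combine epistemic and aleatoric uncertainty per pixel.

    mc_means: (K, H, W) predicted images from K MC-dropout passes.
    mc_vars:  (K, H, W) predicted per-pixel variances from the same passes.
    Returns the per-pixel total uncertainty used to build the mask.
    """
    u_ep = mc_means.var(axis=0)   # epistemic: spread of the K predictions
    u_al = mc_vars.mean(axis=0)   # aleatoric: averaged predicted variance
    return u_ep + u_al

# Toy example with K = 8 passes over a 4x4 slice.
rng = np.random.default_rng(0)
u = total_uncertainty(rng.normal(size=(8, 4, 4)),
                      rng.uniform(0.1, 0.2, size=(8, 4, 4)))
```

Pixels with a small total uncertainty are the ones retained by the mask in the next self-training round.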
2.3 Training Protocol
As pointed out in [6], directly optimizing self-training objectives can be difficult, so deterministic annealing expectation-maximization (EM) algorithms are often used instead. Specifically, generative self-training can be solved by alternating optimization over the following steps a) and b).

a) Pseudo-label and uncertainty mask generation. With the current $\theta$, apply MC dropout for $K$ image translations of each target-domain tagged MR image $x_j^t$. We estimate the pixel-wise uncertainty $u_j^n$ and calculate the uncertainty mask $m_j^n$ with the threshold $\tau$. We set the pseudo-label of each selected pixel in this round to $\hat{y}_j^{t,n}=\mu_j^n$, i.e., the average value of the $K$ outputs.
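Step a) can be sketched as follows. This standalone NumPy toy uses only the variance of the $K$ dropout outputs as the uncertainty (the paper's full scheme combines it with the predicted aleatoric variance) and derives the threshold $\tau$ from the portion parameter $p$ via a quantile; the function name is ours.

```python
import numpy as np

def pseudo_label_step(mc_outputs, p):
    """Step a): pseudo-label and uncertainty-mask generation.

    mc_outputs: (K, H, W) MC-dropout translations of one target slice.
    p:          portion of pixels to keep as confident pseudo-labels.
    Returns the pseudo-label (mean of the K outputs) and a binary mask
    selecting the p fraction of pixels with the lowest uncertainty.
    """
    pseudo = mc_outputs.mean(axis=0)   # average of the K outputs
    u = mc_outputs.var(axis=0)         # per-pixel uncertainty (epistemic only here)
    tau = np.quantile(u, p)            # threshold from the portion parameter p
    mask = (u <= tau).astype(np.float32)
    return pseudo, mask

rng = np.random.default_rng(1)
mc = rng.normal(size=(8, 4, 4))
pseudo, mask = pseudo_label_step(mc, 0.5)   # keep the most confident half
```

The subsequent retraining step then computes the target-domain loss only over pixels where the mask is 1.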
b) Network retraining. Fix $\hat{y}^t$ and $m$, and solve:

(5) $\min_{\theta}\ \mathcal{L}_s(\theta)+\mathcal{L}_t(\theta,\hat{y}^t)$
to update $\theta$. Carrying out steps a) and b) once is defined as one round of self-training. Intuitively, step a) is equivalent to simultaneously conducting pseudo-label learning and selection. To solve step b), we can use a typical gradient method, e.g., stochastic gradient descent (SGD). The meta portion parameter $p$ is linearly increased from 30% to 80% over the course of training to incorporate more pseudo-labels in subsequent rounds, as in [33].

3 Experiments and Results
We evaluated our framework on both cross-scanner and cross-center tagged-to-cine MR image synthesis tasks. For the labeled source domain, a total of 1,768 paired tagged and cine MR images from 10 healthy subjects were acquired at clinical center A. We followed the test-time UDA setting [10], which uses only one unlabeled target subject in UDA training and testing.
For a fair comparison, we adopted Pix2Pix [9] for our source-domain training as in [20], and used the trained U-Net as the source model for all of the comparison methods. To align the absolute value of each loss, we set the loss weights empirically. Our framework was implemented using the PyTorch deep learning toolbox. The GST training was performed on a V100 GPU and took about 30 min. We note that the $K$ MC dropout predictions can be processed in parallel. In each iteration, we sampled the same number of source- and target-domain samples.

Table 1: Quantitative comparison on the cross-scanner and cross-center tagged-to-cine MR image synthesis tasks. L1, SSIM, and PSNR are reported for the cross-scanner task; IS is reported for both.

Methods      | L1         | SSIM           | PSNR        | IS (cross-scanner) | IS (cross-center)
w/o UDA [9]  | 176.4±0.1  | 0.8325±0.0012  | 26.31±0.05  | 8.73±0.12          | 5.32±0.11
ADDA [28]    | 168.2±0.2  | 0.8784±0.0013  | 33.15±0.04  | 10.38±0.11         | 8.69±0.10
GAUDA [2]    | 161.7±0.1  | 0.8813±0.0012  | 33.27±0.06  | 10.62±0.13         | 8.83±0.14
GST          | 158.6±0.2  | 0.9078±0.0011  | 34.48±0.05  | 12.63±0.12         | 9.76±0.11
GST-A        | 159.5±0.3  | 0.8997±0.0011  | 34.03±0.04  | 12.03±0.12         | 9.54±0.13
GST-E        | 159.8±0.1  | 0.9026±0.0013  | 34.05±0.05  | 11.95±0.11         | 9.58±0.12
3.1 Cross-scanner tagged-to-cine MR image synthesis
In the cross-scanner image synthesis setting, a total of 1,014 paired tagged and cine MR images from 5 healthy subjects in the target domain were acquired at clinical center A with a different scanner. As a result, there was an appearance discrepancy between the source and target domains.
The synthesis results using the source-domain Pix2Pix [9] without UDA training, gradually adversarial UDA (GAUDA) [2], and our proposed framework are shown in Fig. 2. Note that GAUDA with source-domain initialization took about 2 hours for training, four times slower than our GST framework. In addition, it was challenging to stabilize the adversarial training [1], which yielded checkerboard artifacts. Furthermore, the content hallucinated by the domain-wise distribution alignment loss produced a relatively large difference in tongue shape and texture compared with the real cine MR images. By contrast, our framework achieved adaptation with relatively limited target data in the test-time UDA setting [10] and converged faster. In addition, our framework does not rely on adversarial training and generated visually pleasing results with better structural consistency, as shown in Fig. 2, which is crucial for subsequent analyses such as segmentation.
For an ablation study, Fig. 2 also shows the performance of GST without the aleatoric or the epistemic uncertainty in the uncertainty mask, i.e., GST-A or GST-E. Without measuring the aleatoric uncertainty caused by inaccurate labels, GST-A exhibited small distortions of shape and boundary. Without measuring the epistemic uncertainty, GST-E yielded noisier results than GST.
The synthesized images were expected to have realistic-looking textures and to be structurally cohesive with their corresponding ground-truth images. For quantitative evaluation, we adopted widely used evaluation metrics: mean L1 error, structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and the unsupervised inception score (IS) [20]. Table 1 lists numerical comparisons using the 5 testing subjects. The proposed GST outperformed GAUDA [2] and ADDA [28] w.r.t. L1 error, SSIM, PSNR, and IS by a large margin.

3.2 Cross-center tagged-to-cine MR image synthesis
To further demonstrate the generality of our framework for the cross-center tagged-to-cine MR image synthesis task, we collected 120 tagged MR slices of a subject at clinical center B with a different scanner. As a result, the data at clinical center B had different soft-tissue contrast and tag spacing compared with clinical center A, and the head position also differed.
The qualitative results in Fig. 3 show that the anatomical structure of the tongue is better maintained by our framework with both the aleatoric and epistemic uncertainties. Due to the large domain gap between the two centers, the overall synthesis quality was not as good as in the cross-scanner image synthesis task, as visually assessed. In Table 1, we provide the quantitative comparison using IS, which does not need the paired ground-truth cine MR images [20]. Consistent with the cross-scanner setting, our GST outperformed the adversarial training methods, including GAUDA and ADDA [2, 28], indicating that self-training can be a powerful technique for generative UDA tasks, similar to conventional discriminative self-training [33, 16].
4 Discussion and Conclusion
In this work, we presented a novel generative self-training framework for UDA and applied it to cross-scanner and cross-center tagged-to-cine MR image synthesis tasks. With a practical yet principled Bayesian uncertainty mask, our framework is able to control confident pseudo-label selection. In addition, we systematically investigated both the aleatoric and epistemic uncertainties in generative self-training UDA. Our experimental results demonstrated that our framework yields superior performance compared with popular adversarial training UDA methods, as assessed both quantitatively and qualitatively. The cine MRI synthesized with test-time UDA can potentially be used to segment the tongue and to observe surface motion, without additional acquisition cost and time.
Acknowledgements
This work is supported by NIH R01DC014717, R01DC018511, and R01CA133015.
References
[1] (2021) Deep verifier networks: verification of deep discriminative models with deep generative models. AAAI.
[2] (2020) Gradually vanishing bridge for adversarial domain adaptation. In CVPR, pp. 12455–12464.
[3] (2009) Aleatory or epistemic? Does it matter? Structural Safety 31(2), pp. 105–112.
[4] (2018) Bayesian deep neural networks for low-cost neurophysiological markers of Alzheimer's disease severity. arXiv preprint arXiv:1812.04994.
[5] (2015) Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158.
[6] (2006) Entropy regularization.
[7] (2019) Unsupervised domain adaptation via calibrating uncertainties. In CVPR Workshops, pp. 99–102.
[8] (2019) Supervised uncertainty quantification for segmentation with multiple annotations. In MICCAI, pp. 137–145.
[9] (2017) Image-to-image translation with conditional adversarial networks. In CVPR, pp. 1125–1134.
[10] (2021) Test-time adaptable neural networks for robust medical image segmentation. Medical Image Analysis 68, 101907.
[11] (2017) What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977.
[12] (2010) Self-paced learning for latent variable models. In NeurIPS, pp. 1189–1197.
[13] (2005) Heteroscedastic Gaussian process regression. In ICML, pp. 489–496.
[14] (2020) Unimodal regularized neuron stick-breaking for ordinal classification. Neurocomputing 388, pp. 34–44.
[15] (2021) Domain generalization under conditional and label shifts via variational Bayesian inference. In IJCAI.
[16] (2020) Energy-constrained self-training for unsupervised domain adaptation. ICPR.
[17] (2021) Subtype-aware unsupervised domain adaptation for medical diagnosis. AAAI.
[18] (2021) A unified conditional disentanglement framework for multimodal brain MR image translation. In ISBI, pp. 10–14.
[19] (2021) Adapting off-the-shelf source segmenter for target medical image segmentation. In MICCAI.
[20] (2021) Dual-cycle constrained bijective VAE-GAN for tagged-to-cine magnetic resonance image synthesis. ISBI.
[21] (2021) Symmetric-constrained irregular structure inpainting for brain MRI registration with tumor pathology. In BrainLes (Workshop), Vol. 12658, pp. 80.
[22] (2018) Ordinal regression with neuron stick-breaking for medical diagnosis. In ECCV Workshops.
[23] (2020) Instance adaptive self-training for unsupervised domain adaptation. ECCV.
[24] (1994) Estimating the mean and variance of the target probability distribution. In ICNN'94, Vol. 1, pp. 55–60.
[25] (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71.
[26] (2020) Two-phase pseudo label densification for self-training based domain adaptation. In ECCV, pp. 532–548.
[27] (2012) Shifting weights: adapting object detectors from image to video. In NIPS.
[28] (2017) Adversarial discriminative domain adaptation. In CVPR.
[29] (2021) Automated interpretation of congenital heart disease from multi-view echocardiograms. Medical Image Analysis 69, 101942.
[30] (2018) Deep visual domain adaptation: a survey. Neurocomputing 312, pp. 135–153.
[31] (2021) Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622.
[32] (2007) Semi-supervised learning tutorial. In ICML Tutorial.
[33] (2019) Confidence regularized self-training. In ICCV, pp. 5982–5991.