Unsupervised domain adaptation for medical imaging segmentation with self-ensembling

by Christian S. Perone, et al.

Recent deep learning methods for the medical imaging domain have reached state-of-the-art results and even surpassed human judgment in several tasks. Those models, however, when trained to reduce the empirical risk on a single domain, fail to generalize when applied to other domains, a very common scenario in medical imaging due to the variability of images and anatomical structures, even within the same imaging modality. In this work, we extend the method of unsupervised domain adaptation using self-ensembling to the semantic segmentation task and explore multiple facets of the method in a realistic small-data regime using a publicly available magnetic resonance imaging (MRI) dataset. Through an extensive evaluation, we show that self-ensembling can indeed improve the generalization of the models, even when using a small amount of unlabelled data.




1 Introduction

In the past few years, the research community has witnessed the fast developmental pace of deep learning [LeCun et al. (2015)] approaches for unstructured data analysis, arguably establishing an important scientific milestone. Deep neural networks constitute a paradigmatic shift from traditional machine learning approaches for unstructured data. Whereas the latter rely on hand-crafted feature engineering for improving learning over images, text, audio, and similar unstructured input spaces, deep neural networks are capable of automatically learning robust hierarchical features, in what is known today as representation learning. Deep learning approaches have achieved human-level performance on many tasks, sometimes surpassing it on applications such as natural image classification [He et al. (2016)] or arrhythmia detection from electrocardiograms [Rajpurkar et al. (2017)].

Due to its popularity and strong results in many domains, deep learning attracted a lot of attention from the medical imaging community. A recent survey by Litjens et al. [Litjens et al. (2017)] analyzes more than 300 studies from the medical imaging domain, and the authors found that deep neural networks became pervasive throughout the entire field of medical imaging, with a significant increase in the number of published studies between 2015 and 2016. The survey also identifies that the most addressed task is image segmentation, potentially due to the importance of quantification of anatomical structures and pathologies [Gros et al. (2018)], as opposed to less informative tasks such as classification of pathologies or detection of structures.

Deep neural networks are thus becoming the norm in the medical imaging field, though there are still several unsolved challenges that need to be properly addressed. For instance, one of the most well-known problems is the high sample complexity, or how much data deep learning requires to accurately learn and perform well on unseen images, which is linked to the concepts of model complexity and generalization, an active research topic in learning theory [Neyshabur et al. (2017)].

The large amount of data required to train deep neural networks can be partially mitigated with techniques such as transfer learning [Yosinski et al. (2014), Zamir et al. (2018)]. However, transfer learning is problematic in medical imaging because a large dataset is still required for the models to benefit from the inductive transfer process. Differently from natural images, where annotations can be easily and quickly provided by non-experts, medical images require careful and time-consuming analysis from trained experts such as radiologists.

Yet another challenge when deploying deep learning models for medical imaging analysis – and perhaps one of the most difficult to solve – is the so-called data distribution shift: variability inherent to different imaging protocols can result in significantly different data distributions. Therefore, models trained under the empirical risk minimization (ERM) principle might fail to generalize to other domains due to ERM's strong assumptions. ERM is the statistical learning principle behind many machine learning methods, and it offers good learning guarantees and bounds if its assumptions hold, such as the assumption that the training and test datasets come from the same domain. However, as we saw, this assumption is usually broken in real application scenarios.

When a deep learning model that assumes independent and identically-distributed (iid) data is trained with images from one domain and then deployed on images from a different domain (e.g., a distinct center), which follow a distinct probability distribution, its performance degrades by a large margin. A concrete example of domain shift can be found in magnetic resonance imaging (MRI), where the same machine vendor using the same protocol for the same subject can produce different images. Variability is even more salient between different centers, where there are differences in machine vendor, protocol, and resolution, among others. A visual example of inter-center differences in data distribution can be seen in Figure 1, where we show samples from different centers of the Gray Matter (GM) segmentation challenge dataset [Prados et al. (2017)]. In Figure 2, we show the voxel intensity distribution for the same dataset.

Figure 1: MRI axial-slice samples from four different centers (UCL, Montreal, Zurich, Vanderbilt) that collaborated to the SCGM Segmentation Challenge [Prados et al. (2017)], reproduced from [Perone & Cohen-Adad (2018)]. Top row: original MRI images. Bottom row: crop of the spinal cord (green rectangle). Best viewed in color.
Figure 2: MRI axial-slice pixel intensity distribution from four different centers (UCL, Montreal, Zurich, Vanderbilt) that collaborated to the SCGM Segmentation Challenge [Prados et al. (2017)].

Although this distribution shift is very common in medical imaging, the problem is surprisingly ignored during the design of many different challenges in the field. It is very common to have same-domain data (same machine, protocol, etc.) in both the training and test sets. However, this homogeneous data split often does not represent reality and in many cases will produce over-optimistic evaluation results. In realistic scenarios, it is very rare to have labeled data available from a new center before training a model, and it is very common to use a model pre-trained on a different domain on completely different data. It is therefore paramount to have a proper evaluation that avoids contaminating the test set with data from the same domain present in the training set, lest one incur the detrimental effects of inadequate evaluations [Zech et al. (2018)].

The problem of learning a classifier, or any other predictor, under a shift between the training and target/test distributions is known as "domain adaptation" (DA). In this work we extend a previously-developed method [French et al. (2017)] for DA, based on the Mean Teacher [Tarvainen & Valpola (2017)] approach, to segmentation tasks, the most addressed task in medical imaging.

We provide the following contributions: we extend the unsupervised domain adaptation method using self-ensembling to the semantic segmentation task. To the best of our knowledge, this is the first time this method is used for semantic segmentation and also in the medical imaging domain. We explore many components of the model, such as different consistency losses, and we perform an extensive evaluation and ablation experiments on a realistic small-data-regime dataset from the magnetic resonance imaging (MRI) domain. We also provide visualizations to gain insights into the model dynamics for the unsupervised domain adaptation task.

This paper is organized as follows. In Section 2 we present related work, whereas in Section 3 we give a short formalization of the unsupervised domain adaptation task and its connection with semi-supervised learning. In Section 4 we detail our method in terms of model architecture and corresponding design decisions. In Section 5 we describe the dataset we use in our experiments and how we performed the data split for the domain adaptation scenario. In Section 6 we provide the experiment results, followed by an ablation study in Section 7. In Section 8 we provide visual insights regarding the model's adaptation dynamics for multiple domains. Finally, in Section 9 we discuss our findings and the corresponding limitations of this work. In the spirit of open science and reproducibility, we also provide more information regarding data and source-code availability in Section 10.

2 Related Work

Deep learning based methods for segmentation in medical imaging have been vastly explored in recent years [Litjens et al. (2017)] and vary in the specifics of how they handle the task. Most of the initial work focused on patch-based segmentation [Coupé et al. (2011)], preceding the pioneering deep learning models. With the growing interest in deep learning for several computer vision tasks, the first attempts at using Convolutional Neural Networks (CNNs) for image segmentation were based on processing image patches through a sliding window, which yielded segmented patches. Those independently segmented patches were then concatenated to create the final segmented image [Lai (2015)]. The main drawbacks of this approach are its computational cost – several forward passes are needed to generate the final result – as well as inconsistency in predictions – which can be mitigated with overlapping sliding windows.

Even though patch-wise methods are still being researched [Hou et al. (2016)] and have led to several advances in segmentation [Lai (2015)], the most common deep architecture for segmentation nowadays is the so-called Fully Convolutional Network (FCN) [Long et al. (2015)]. This architecture is based solely on convolutional layers, with the final result not depending on fully-connected layers. FCNs can provide a fully-segmented image within a single forward pass, with variable output size depending on the input tensor size. One of the most well-known FCNs for medical imaging is U-Net [Ronneberger et al. (2015)], which combines convolutional, downsampling, and upsampling operations with non-residual skip connections. We make use of U-Net throughout this work, aiming for generalizable conclusions. This architecture is further discussed in Section 4.3.


Deep Domain Adaptation (DDA), a field not specific to medical imaging, has been widely studied in recent years [Wang & Deng (2018)]. We can divide the literature on DDA as follows: (i) methods based on building domain-invariant feature spaces through auto-encoders [Ghifary et al. (2016)], adversarial training [Ganin et al. (2016)], GANs [Hoffman et al. (2017), Sankaranarayanan et al. (2018)], or disentanglement strategies [Liu et al. (2018), Cao et al. (2018)]; (ii) methods based on the analysis of higher-order statistics [Li et al. (2016), Sun & Saenko (2016)]; (iii) methods based on explicit discrepancy between source and target domains [Tzeng et al. (2014)]; and (iv) methods based on implicit discrepancy between domains, also known as self-ensembling [French et al. (2017), Tarvainen & Valpola (2017)].

In [Hoffman et al. (2017)], the authors train GANs with cycle-consistent loss functions [Zhu et al. (2017)] to remap the distribution from the source to the target dataset, thus creating target-domain-specific features for completing the task. In [Sankaranarayanan et al. (2018)], GANs are employed as a means of learning aligned embeddings for both domains. Similarly, disentangled representations for each domain have been proposed [Liu et al. (2018), Cao et al. (2018)] with the goal of generating a feature space capable of separating domain-dependent and domain-invariant information.

In [Li et al. (2016)], the authors propose changing parameters of the neural network layers to adapt domains by directly computing or optimizing higher-order statistics. More specifically, they propose an alternative to batch normalization called Adaptive Batch Normalization (AdaBN) that computes different statistics for the source and target domains, hence creating domain-invariant features that are normalized according to the respective domain. In a similar fashion, Deep CORAL [Sun & Saenko (2016)] provides a loss function for minimizing the covariance mismatch between target and source domain features.

Discrepancy-based methods pose a different approach to DDA. By directly minimizing the discrepancy between activations from the source and target domains, the network learns to generate reasonable predictions while incorporating information from the target domain. The seminal work of Tzeng et al. [Tzeng et al. (2014)] directly minimizes the discrepancy between activations of a specific layer for labeled samples from the source set and unlabeled samples from the target set.

Implicit discrepancy-based methods such as self-ensembling [French et al. (2017)] have become widely used for unsupervised domain adaptation. Self-ensembling is based on the Mean Teacher network [Tarvainen & Valpola (2017)], which was first introduced for semi-supervised learning tasks. Due to the similarity between unsupervised domain adaptation and semi-supervised learning, very few adjustments need to be made to employ the method for the purposes of DDA. Mean Teacher optimizes a task loss and a consistency loss, the latter minimizing the discrepancy between predictions on the source and target dataset. We further detail how Mean Teacher works in Section 4.1.

There are few studies that report on the consequences of domain discrepancy for medical imaging by making use of the unsupervised domain adaptation literature. The work in [AlBadawy et al. (2018)] discusses the impact of deep learning models across different institutions, showing a statistically-significant performance decrease in cross-institutional train-and-test protocols. A few studies attempt to directly approach domain adaptation in medical imaging through adversarial training [Kamnitsas et al. (2017), Chen et al. (2018), Zhang et al. (2018), Lafarge et al. (2017), Javanmardi & Tasdizen (2018), Dou et al. (2018)], some generating artificial images for leveraging training data [Mahmood et al. (2018), Madani et al. (2018)]. Nevertheless, to the best of our knowledge, we are the first to address the problem of domain shift in medical imaging segmentation by extending the unsupervised domain adaptation self-ensembling method to the semantic segmentation task.

3 Semi-Supervised Learning and
Unsupervised Domain Adaptation

A common approach for leveraging training when few labeled examples are available is semi-supervised learning, which is defined as follows: given a labeled dataset and a (usually larger) unlabeled dataset, learn from both the available labeled and unlabeled data in order to either improve a supervised learning task (say, classification) or an unsupervised learning task (say, clustering).

Semi-supervised learning methods tend to perform well when the unlabeled data actually come from the same distribution as the labeled data. This allows the learning algorithm to leverage the unlabeled data, which usually make up the majority of examples. As promising as semi-supervised learning is, the assumption that the distribution of the unlabeled data is similar to that of the labeled data often fails in real-world applications. We refer the reader to a thorough evaluation of semi-supervised learning methods and their downfalls in [Odena et al. (2018)].

It is very common for models to be applied in scenarios significantly different from those in which they were originally trained. Examples include different weather conditions for outdoor activity recognition, or different cities for training autonomous vehicles. Those changes in scenario shift the data distribution, harming the quality of the predictions in cases where the model was not properly adapted to the desired condition.

The difference between the distributions of the examples used in the training and test sets is called domain shift, which is formally defined as follows. Consider a source dataset with input distribution p_S(x) and label distribution p_S(y), as well as a target dataset with input distribution p_T(x) and labels p_T(y), where p_S(x) ≠ p_T(x). Domain adaptation can be addressed via a supervised approach when labeled data from the target domain are available, or via unsupervised learning when only unlabeled data are available for the target domain.

When a method addresses the problem of domain adaptation using only unlabeled data from the target domain, which is the most common and useful scenario, the task at hand is called unsupervised domain adaptation. Unsupervised domain adaptation methods assume that samples from both the source and target input distributions are available, while labels are available only for the source domain. Hence, the task is to leverage knowledge about the target domain using the available unlabeled target data.

4 Method

This section details the base domain adaptation methods we are using for the task of medical imaging analysis. We further discuss the changes that are needed for allowing unsupervised domain adaptation on segmentation tasks instead of the typical classification scenario. We also detail the most important aspects one has to address regarding the segmentation of medical images.

4.1 Self-Ensembling and Mean Teacher

Self-Ensembling was originally conceived as a viable strategy for generating predictions on unlabeled data [Laine & Aila (2016)]. Ensembled predictions are used as targets for the unlabeled examples, leveraging them for semi-supervised learning. The original paper proposes two different models for self-ensembling. The first model, called the Π model, employs a consistency loss between predictions on the same input. Each input from a batch is passed through the neural network twice with different augmentation parameters, yielding two different predictions. The squared difference between those predictions is minimized alongside the cross-entropy for labeled examples. The second model, called Temporal Ensembling, works on the assumption that, as training progresses, averaging the predictions over time on unlabeled samples may yield a better approximation of the correct label. This pseudo-label is then used as a target during training. The squared difference between the averaged prediction and the current one is minimized alongside the cross-entropy for labeled examples. The network performs an exponential moving average (EMA) to update the generated targets every epoch, as follows:

Z_i ← α Z_i + (1 − α) z_i,     (1)

where Z_i is the accumulated (ensembled) prediction for example i, z_i is the prediction at the current epoch, and α is the EMA decay hyperparameter.


Self-Ensembling was later extended to directly combine model weights instead of predictions. This adaptation is called Mean Teacher [Tarvainen & Valpola (2017)]. Analogous to Eq. (1) for updating the target pseudo-labels, Mean Teacher updates the model weights at each step, generating a somewhat improved model compared to the model without the EMA, a framework linked to Polyak-Ruppert averaging [Polyak & Juditsky (1992), Ruppert (1988)]. In this scenario, the EMA model is named the teacher and the standard model the student. The update function is as follows:



θ′_t = α θ′_{t−1} + (1 − α) θ_t,     (2)

where θ_t are the student weights at training step t, θ′_t are the teacher weights, and α is a hyperparameter that modulates the importance of the current model's weights with respect to previous models. The best results are found when the factor α is increased later in training. This is arguably due to the fact that, as training progresses, the average ends up favoring the current model, and so α should be larger to give more importance to previous models.
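As a minimal sketch (pure Python over flattened weight lists for illustration; a real implementation would iterate over the framework's parameter tensors), the teacher update can be written as:

```python
def ema_update(teacher_weights, student_weights, alpha):
    """Update each teacher weight as an exponential moving average of the
    corresponding student weight: w_t <- alpha * w_t + (1 - alpha) * w_s.
    Both arguments are parallel lists of floats standing in for the
    flattened model parameters."""
    return [alpha * wt + (1.0 - alpha) * ws
            for wt, ws in zip(teacher_weights, student_weights)]

# With alpha = 0.99 the teacher moves only 1% toward the student per step.
teacher = ema_update([0.0, 1.0], [1.0, 1.0], alpha=0.99)
```

Note that the teacher is never updated by gradient descent; only the student receives gradients, and the teacher passively tracks it through this average.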

Each training step involves a loss component for both labeled and unlabeled data. All samples from a batch are evaluated by both the student and teacher models. Predictions from both models are compared via the consistency loss. The labeled data, however, is also compared to its ground truth, as traditionally performed in segmentation tasks, in what we call the task loss:


ℒ = ℒ_task + λ_c ℒ_cons + λ_r ℒ_reg,     (3)

where λ_c and λ_r are the Lagrange multipliers that represent, respectively, the consistency and regularization weights. The consistency weight λ_c was empirically found to improve results when varied through time, given that in the earlier training steps the network is still generating poor results. The consistency weight follows a sigmoid ramp-up, saturating at a given hyperparameter value.
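A sketch of how the total objective and the ramped consistency weight can be combined (the exp(−5(1−t)²) ramp shape is the one commonly used in the self-ensembling literature; the ramp length and maximum weight below are illustrative placeholders, not the paper's tuned values):

```python
import math

def consistency_weight(epoch, rampup_epochs, max_weight):
    """Sigmoid-shaped ramp-up for the consistency weight, saturating at
    `max_weight` once `rampup_epochs` epochs have elapsed."""
    t = min(epoch / rampup_epochs, 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)

def total_loss(task_loss, consistency_loss, epoch,
               rampup_epochs=80, max_weight=1.0):
    """Combined objective: task loss on labeled source data plus the
    ramped consistency loss on unlabeled target data."""
    w = consistency_weight(epoch, rampup_epochs, max_weight)
    return task_loss + w * consistency_loss
```

Early in training the consistency term contributes almost nothing (the student predictions are still poor), and it reaches full strength only after the ramp-up.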

Mean Teacher also follows the dynamics of model distillation [Hinton et al. (2015)]. In that scenario, a trained model is used to predict instances, and its output is used as labels for another, smaller model. This is considered good practice, as soft labels tend to better represent the characteristics of the classes (e.g., the representation distance between a Siberian Husky and an Alaskan Malamute should arguably be smaller than the distance between a Siberian Husky and a Persian Cat). Unlike traditional distillation formulations, the Mean Teacher framework also uses the teacher model to generate labels for unlabeled data, and the teacher is a model of the same size that is simultaneously updated during training.

The Mean Teacher framework was also extended for unsupervised domain adaptation in [French et al. (2017)]. Among the proposed changes, the authors modify the data batches so that every batch consists of images from both the source and target domains. At each step, the student model evaluates images from the source domain and computes gradients via a task loss based on the ground truth. The target domain images, which are unlabeled, are used to compute the consistency loss by comparing predictions from the student and teacher models. Differently from its original formulation, the teacher model only has access to unlabeled examples (in this case, examples from the target domain). Each loss function is thus responsible for improving learning on a single domain. The task loss is evaluated by comparing predictions against the ground truth for the labeled examples (source domain). For the consistency loss, MSE is often used to compare the predictions of the student and teacher models on the unlabeled examples (target domain).

4.2 Adapting Mean Teacher for Segmentation Tasks

Both the original Mean Teacher and its adaptation for unsupervised domain adaptation rely on the cross-entropy classification cost. Since we are dealing not with classification but with a segmentation task, we need to minimize a different loss function that takes into consideration the specificities of that task. Originally proposed in [Milletari et al. (2016)], the dice loss generates reliable segmentation predictions due to its insensitivity to class imbalance:


ℒ_dice = 1 − (2 Σ_i p_i g_i) / (Σ_i p_i + Σ_i g_i),     (4)

where p and g are the flattened predictions and ground-truth values for an instance, respectively. Dice was kept as the task loss for both the baseline and adaptation experiments. Note that the dice loss is computed for the entire batch at once, unlike the typical strategy of averaging per sample when using cross-entropy, for instance.
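A batch-wise soft dice loss along these lines can be sketched as follows (plain Python over flattened lists for clarity; the sum-denominator variant is shown, and the small epsilon guarding against empty masks is an implementation detail added here):

```python
def dice_loss(preds, targets, eps=1e-8):
    """Soft dice loss over flattened predictions and ground truth.
    Computed once over the entire (flattened) batch rather than averaged
    per sample, as described in the text."""
    intersection = sum(p * g for p, g in zip(preds, targets))
    denominator = sum(preds) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)
```

A perfect binary prediction yields a loss near 0, while completely disjoint predictions yield a loss near 1.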

A second problem when training the student and teacher models for segmentation tasks is the inconsistency introduced when an affine transformation such as translation or rotation (or any other spatial transformation) is applied with different parameters to the inputs of the teacher and student models. To solve that problem, we used the same approach employed by [Perone & Cohen-Adad (2018)], shown in Figure 4. The augmentation, depicted by the transformation t(x; φ), where x is the input data and φ are the transformation parameters (e.g., the rotation angle), is applied to the student model input before the forward pass, while for the teacher model it is applied with the same parameters in a delayed fashion on the teacher predictions, causing both predictions to be spatially aligned for the consistency loss. This is possible because back-propagation takes place only for the student model, therefore there is no need to differentiate through the delayed augmentation of the teacher model. An overview of the proposed method can be seen in Figure 3. Examples of images after data augmentation and their respective compensated ground truth are shown in Figure 5.
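The alignment trick can be illustrated with a toy transform (a hypothetical 90-degree rotation standing in for the sampled augmentation t(x; φ); `student_forward` and `teacher_forward` are placeholders for the two network passes):

```python
def rot90(image):
    """Toy spatial transform: rotate a 2D grid (list of rows) by 90 degrees.
    Stands in for the sampled augmentation (rotation, translation, ...)."""
    return [list(row) for row in zip(*image[::-1])]

def consistency_pair(x_target, student_forward, teacher_forward,
                     transform=rot90):
    """The student sees the transformed input, while the teacher sees the
    original input and its *prediction* is transformed with the same
    parameters, so both outputs are spatially aligned for the consistency
    loss. Only the student branch needs gradients, so the delayed transform
    on the teacher prediction need not be differentiable."""
    student_pred = student_forward(transform(x_target))
    teacher_pred = transform(teacher_forward(x_target))
    return student_pred, teacher_pred
```

With equivariant models and the same transformation parameters on both branches, the two predictions land in the same spatial frame, so a plain voxel-wise consistency loss applies.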

Figure 3: Overview of the proposed method. The green panel represents the traditional supervision signal. (1) The source domain input data is augmented by the transformation t(x; φ) and fed into the student model. (2) The teacher model parameters are updated with an exponential moving average (EMA) of the student weights. (3) The traditional segmentation loss, where the supervision signal is provided by the source domain labels. (4) The unlabeled input data from the target domain is transformed with t(x; φ′) before the student model forward pass (note the different parametrization φ′). (5) The teacher model predictions are augmented by the transformation with the same parametrization φ′ from Step 4. (6) The consistency loss, which enforces consistency between the student predictions and the teacher predictions.
Figure 4: Data augmentation scheme used to overcome the spatial misalignment between student and teacher model predictions. The same augmentation parameters are used for the input data for the student model and on the teacher model predictions.
Figure 5: MRI axial-slice samples from the SCGM Segmentation Challenge [Prados et al. (2017)]. The ground truth is shown in green. Best viewed in color.

4.3 Model architecture employed

A U-Net [Ronneberger et al. (2015)] model architecture with 15 layers, group normalization [Wu & He (2018)] (discussed later), and dropout was employed for all experiments. Since U-Net is widely applied in biomedical imaging, we believe that our results would generalize to a wide spectrum of applications.

To produce a fair comparison, we followed the recommendations from [Oliver et al. (2018)] and kept the same model for the baseline and for our method, thus avoiding conflated comparisons. Even though the Mean Teacher method usually also acts as a regularizer, we kept the same regularization weights for all comparisons; it is important to note, however, that the regularization could be adjusted to improve the results of the Mean Teacher even further.

4.4 Baseline employed

We conducted a hyperparameter search to find a good baseline model. This search yielded a mini-batch size of 12 and a dropout rate of 0.5. For training, we used the Adam optimizer [Kingma & Ba (2015)] with an ℓ2 penalty factor and the β1 and β2 momentum parameters. As for the learning rate, we used a sigmoid ramp-up until epoch 50 followed by a cosine ramp-down until epoch 350. Equation 5 shows the sigmoid ramp-up formula:


η(t) = η_max · e^(−5(1−t)²),     (5)

where η_max is the highest learning rate and t ∈ [0, 1] represents the ratio between the current epoch and the total number of ramp-up epochs. Equation 6 presents the cosine ramp-down:


η(t) = (η_max/2) · (cos(πt) + 1),     (6)

where η_max is the highest learning rate and t is the ratio between the number of epochs elapsed after the ramp-up procedure and the total number of epochs expected for the training.
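The full schedule from Equations 5 and 6 can be sketched as a single function (the epoch counts come from the text; the exp(−5(1−t)²) ramp shape is the standard one from the self-ensembling literature, and `lr_max` is an illustrative placeholder, not the paper's tuned value):

```python
import math

def learning_rate(epoch, rampup=50, total=350, lr_max=1e-3):
    """Sigmoid ramp-up until `rampup` epochs (Eq. 5), then cosine ramp-down
    until `total` epochs (Eq. 6). `lr_max` is a placeholder value."""
    if epoch < rampup:
        t = epoch / rampup                      # ratio within the ramp-up phase
        return lr_max * math.exp(-5.0 * (1.0 - t) ** 2)
    t = (epoch - rampup) / (total - rampup)     # ratio within the ramp-down phase
    return 0.5 * lr_max * (math.cos(math.pi * t) + 1.0)
```

The learning rate rises monotonically toward `lr_max` at epoch 50 and decays smoothly to zero by epoch 350.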

For a fair comparison, no hyperparameter of the baseline model is changed in the adaptation scenario. The only changes are to parameters that affect only the domain adaptation training aspects. This leads to easier comparison and a realistic evaluation of the benefits of using domain adaptation.

4.5 Consistency loss

The consistency loss is one of the most important aspects of the Mean Teacher. If the measured difference between the teacher and student predictions is not representative enough to distill knowledge into the student model, the method will not work properly and training may even diverge. The function originally proposed for the Mean Teacher is the mean squared error (MSE):


ℒ_cons = (1/N) Σ_{i=1}^{N} (s_i − t_i)²,     (7)

where s and t are the flattened predictions from the student and teacher models, respectively.
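A sketch of the MSE consistency term over flattened predictions (plain Python lists for illustration):

```python
def mse_consistency(student_preds, teacher_preds):
    """Mean squared error between flattened student and teacher predictions,
    used as the consistency loss on unlabeled target-domain images."""
    n = len(student_preds)
    return sum((s - t) ** 2
               for s, t in zip(student_preds, teacher_preds)) / n
```

The loss is zero exactly when the student reproduces the teacher's (soft) outputs, which is the desirable minimum for a consistency term.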

Cross-entropy is another important function that we investigated. It is more commonly used as the main loss in classification tasks, but since it aims at matching distributions, it could potentially improve the gradients for the student:


ℒ_cons = − Σ_i t_i log s_i,     (8)

where s and t are the predictions from the student and teacher models, respectively.

We expanded the study to multiple loss functions for our scenario and found several other losses that could potentially be used as consistency losses.

Our initial experiments led us to weighted variants of MSE, since weighting could mitigate the class imbalance problem. However, this approach relies on thresholding the teacher predictions in order to define binary expected voxel values for the student. We found that defining both the correct weights and the threshold value was difficult and did not improve results.

The same problem persists with more complex losses such as the Focal Loss [Lin et al. (2018)] due to additional hyperparameters (in this case, α and γ).

We also tried two additional losses: the Dice Loss, already presented in Section 4.2, and the Tversky Loss [Salehi et al. (2017)]. The Tversky Loss is a variation of Dice that aims at mitigating the problem of class imbalance, common in medical imaging segmentation tasks.


T = Σ_i p_{0i} g_{0i} / (Σ_i p_{0i} g_{0i} + α Σ_i p_{0i} g_{1i} + β Σ_i p_{1i} g_{0i}),     (9)

where p_{0i} and p_{1i} represent the predicted probabilities of voxel i belonging to the spinal cord and to any other tissue, respectively, and g_{0i} and g_{1i} are the corresponding ground-truth values. The additional α and β hyperparameters adjust for the class imbalance. This also leads to the same problem of correctly setting more hyperparameters alongside the consistency weight value.

We later identified that both Dice and Tversky have a problem when used as consistency losses. Although they best represent the nature of the task and thus could potentially improve predictions, the loss values exhibited an undesirable behavior. This is due to the equation design: their main operation is based on multiplication, but the predictions are expected to belong to the interval [0, 1]. Both Dice and Tversky work when ground-truth labels are binary, but in this case, in which we have soft labels from the teacher, even when the student exactly matches the teacher output, the loss is not at its minimum.

Intuitively, given identical predictions p = g, the score should be maximal. For example, when p_i = 1 and g_i = 1, the score is 1, as expected. Contrastively, when p_i = 0.5 and g_i = 0.5, the score should also be maximal since the predictions match exactly, but instead it drops to 0.5. This issue grows for increasingly lower values of p_i and g_i.
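This behavior is easy to verify numerically with the sum-denominator soft dice score (the helper below is written for illustration):

```python
def dice_score(preds, targets):
    """Soft dice score: 2 * sum(p * g) / (sum(p) + sum(g))."""
    num = 2.0 * sum(p * g for p, g in zip(preds, targets))
    return num / (sum(preds) + sum(targets))

# With hard labels, a perfect match scores 1.0:
hard = dice_score([1.0, 0.0], [1.0, 0.0])    # -> 1.0
# With identical *soft* teacher/student outputs, the score is not maximal:
soft = dice_score([0.5, 0.0], [0.5, 0.0])    # -> 0.5
softer = dice_score([0.1, 0.0], [0.1, 0.0])  # -> ~0.1
```

Even though the student matches the teacher exactly in the last two cases, the score falls further from 1 as the soft values shrink, which is exactly the pathology described above.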

One way to overcome this issue is to threshold the teacher predictions. This generates hard labels, and the equation then works properly. However, we struggled to identify a suitable threshold value, as it drastically impacts how the network adapts, and it somewhat diminishes the benefits of using a distillation-based approach. An alternative would be to adapt the equation to properly handle soft labels (values between 0 and 1), maintaining the results of the original equation where it works properly and providing sound values where the original fails. We leave such alternative formulations for future work.

4.6 Batch normalization and group normalization in the domain adaptation scenario

Batch Normalization [Ioffe and Szegedy (2015)] is a method that aims to improve the training of deep neural networks through the stabilization of the distribution of layer inputs. Batch Normalization is nowadays pervasive in most deep learning architectures, allowing the use of large learning rates and helping with the convergence of deep networks.

Initially thought to help with the internal covariate shift (ICS) [Ioffe and Szegedy (2015)], Batch Normalization was recently found [Santurkar et al. (2018)] to smooth the optimization landscape of the network, manifested by its improvement of the Lipschitzness, or β-smoothness [Santurkar et al. (2018)], of both the loss and the gradients.

Batch Normalization works differently during training and inference. At training time, the normalization uses the batch statistics, while at inference time it uses the population statistics, usually estimated with moving averages over each batch during the training procedure. This procedure, however, is problematic for domain adaptation using Mean Teacher, given that multiple distributions are fed during training, causing the Batch Normalization statistics to be computed using both source and target domain data.

One approach that can be used to solve this issue is to use different batch statistics for the source and target domains, as in AdaBN [Li et al. (2016)]. Implementing this approach in modern frameworks is easy at training time, because all it requires is forwarding the batch for each domain separately, as done in [French et al. (2017)]. However, this approach will still use both source and target domain data to compute the running averages used during inference. One must therefore also keep separate running means for the estimation of the population statistics, which increases the complexity of training, especially in multi-GPU scenarios with small batch sizes, very common in segmentation tasks, where synchronization is essential.
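The effect of mixing domains in the normalization statistics can be illustrated with a toy sketch (synthetic Gaussian features stand in for network activations; the shift of 3.0 is an arbitrary illustration, not a value measured from our data):

```python
import numpy as np

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(256, 16))  # source-domain activations
target = rng.normal(3.0, 2.0, size=(256, 16))  # shifted target domain

# Naive training computes one set of statistics over the mixed batch,
# which matches neither domain.
mixed = np.concatenate([source, target])
print(mixed.mean(), mixed.std())

# AdaBN-style: keep separate statistics per domain.
print(source.mean(), source.std())
print(target.mean(), target.std())
```

Normalizing the target domain with the mixed statistics leaves its activations off-center and mis-scaled, which is precisely what the per-domain forwarding avoids.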

Besides the aforementioned issues, Batch Normalization also suffers from sub-optimal results when using small batch sizes [Wu and He (2018)], which are very common in segmentation tasks due to memory requirements. For these reasons, we used Group Normalization [Wu and He (2018)], an alternative to Batch Normalization that divides the channels into groups and computes the mean and variance within each group, independently of the batch size. Group Normalization works consistently better than Batch Normalization with small batch sizes and doesn't keep running averages for the population statistics, simplifying the training and inference procedures and providing better results for our scenario involving domain adaptation and segmentation tasks.
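A minimal NumPy sketch of the Group Normalization computation (without the learnable scale and shift parameters that a real layer would include):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize an (N, C, H, W) array per sample within channel groups.
    Statistics are independent of the batch size, and no running
    averages are needed at inference time."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

x = np.random.default_rng(0).normal(size=(2, 8, 4, 4))
y = group_norm(x, num_groups=4)
# Each (sample, group) slice now has ~zero mean and unit variance.
```

Because every statistic is computed within a single sample, the same code path serves training and inference, which is what removes the source/target statistics mixing problem in our setting.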

4.7 Hyperparameters for unsupervised domain adaptation

A problem usually faced by many techniques for unsupervised domain adaptation is hyperparameter selection (learning rate, consistency weight, etc.). In an unsupervised setting there is no labeled data from the target domain, so hyperparameters estimated on the source distribution alone can be completely different from those that would be estimated on the target distribution.

An alternative method to address this issue is reverse cross-validation [Zhong et al. (2010)], as also used by [Ganin et al. (2016)]. However, this technique adds more complexity to the validation process. We found that estimating the hyperparameters of the Mean Teacher approach on the source domain yielded robust results; this remains a limitation of our evaluation, though, as one could achieve better results for our proposed method by incorporating a better hyperparameter estimation procedure.

5 Materials

This section describes the datasets used in this work.

5.1 Spinal Cord Gray Matter Challenge dataset

The Spinal Cord Gray Matter Challenge [Prados et al. (2017)] dataset is a multi-center, multi-vendor and publicly-available MRI data collection comprising 80 healthy subjects, with 20 subjects from each center.

The mean age per center ranges from 28.3 to 44.3 years. Three different MRI systems were employed (Philips Achieva, Siemens Trio, Siemens Skyra) with different acquisition parameters. The voxel resolution of the dataset ranges from 0.25x0.25x2.5 mm to 0.5x0.5x5.0 mm. The dataset is split between training (40 subjects) and test (40 subjects), with the test set labels hidden (not publicly available). For each labeled slice in the dataset, 4 gold-standard segmentation masks were manually created by 4 independent expert raters (one per participating center). For more information regarding the dataset, such as the MRI parameters, please refer to [Prados et al. (2017)].

Because the Spinal Cord Gray Matter Challenge dataset contains data from all 4 centers in both the training and test sets, we used a non-standard split of the data in order to evaluate our technique in a domain adaptation scenario where the domain present in the test set is not contaminated by the training data domain. We therefore used centers 1 and 2 as the training set, center 3 as the validation set, and center 4 as the test set.

We used the unlabeled data from the challenge's center 4 test set (which doesn't have publicly-available labels) as the unlabeled data for the target domain, and used the center 4 training data (with labels) as the test set to evaluate the final performance of our model. We also resampled all samples of the dataset to a common space of 0.25x0.25 mm.

An overview of the dataset split is graphically shown in Figure 6.

Figure 6: An overview of the dataset split used for the training procedure. Each colored square represents a single subject of the dataset (containing multiple axial slices).

6 Experiments

We created several experiments to understand the behavior of different aspects of domain adaptation in the medical imaging domain. We also performed ablation studies and evaluated multiple metrics for each center.

6.1 Adapting on different centers

We trained on both centers 1 and 2 in a supervised manner, and then adapted the network on centers 3 and 4 separately. This way, we can observe three related but independent aspects of adaptation and semi-supervised learning:

  1. How the network changes its predictions on images from the source domain as images from different domains are presented;

  2. How the network adapts its predictions for the adapted domain after the domain adaptation;

  3. How an adapted network generalizes when presented with images that weren't used during training, neither as a supervised signal nor as an unsupervised adaptation component.

The results are presented in Table 1. We now aim at better understanding this behavior by individually answering the proposed questions.

Evaluation | Adaptation | Dice | mIoU | Recall | Precision | Specificity | Hausdorff
Center 3 | Baseline | 82.81 ± 0.33 | 71.05 ± 0.36 | 90.61 ± 0.63 | 77.09 ± 0.34 | 99.86 ± 0.00 | 2.14 ± 0.02
Center 3 | Center 3 | 84.72 ± 0.18 | 73.67 ± 0.28 | 87.43 ± 1.90 | 83.17 ± 1.62 | 99.91 ± 0.01 | 2.01 ± 0.03
Center 3 | Center 4 | 84.45 ± 0.14 | 73.30 ± 0.19 | 87.13 ± 1.77 | 82.92 ± 1.76 | 99.91 ± 0.01 | 2.02 ± 0.03
Center 4 | Baseline | 69.41 ± 0.27 | 53.89 ± 0.31 | 97.22 ± 0.11 | 54.95 ± 0.35 | 99.70 ± 0.00 | 2.50 ± 0.01
Center 4 | Center 3 | 73.27 ± 1.29 | 58.50 ± 1.57 | 94.92 ± 1.48 | 60.93 ± 2.51 | 99.77 ± 0.03 | 2.36 ± 0.06
Center 4 | Center 4 | 74.67 ± 1.03 | 60.22 ± 1.24 | 93.33 ± 1.96 | 63.62 ± 2.42 | 99.80 ± 0.02 | 2.29 ± 0.05
Table 1:

Evaluation results on different centers. The evaluation and adaptation columns represent, respectively, the centers where testing and adaptation data were collected. The numeric results show the mean and standard deviation over 10 runs (with independent random weight initializations). Highlighted values represent the best result for each center. All experiments were trained on both centers 1 and 2 simultaneously. Dice represents the Sørensen–Dice coefficient, and mIoU represents the mean Intersection over Union. Other metrics are self-explanatory.

For Question 1, we can observe the evaluation on centers 1 and 2. Both centers are included in the training set, and we want to observe whether additional unsupervised data from different domains (centers 3 or 4) could improve generalization on the original centers (1 and 2). For both adapted centers (3 and 4), the results in almost all metrics are superior to the baseline, except for recall, meaning that there is a positive change in prediction performance on the source domain after domain adaptation on different domains with unlabelled data.

Question 2 can be investigated by interpreting the results from the rows where the evaluation and adaptation centers coincide (both center 3, or both center 4). Both rows present the highest values in almost every metric, except for recall. This means that domain adaptation is working properly in the given scenario.

Question 3 can also be inferred from the results. By observing evaluation on center 3 with adaptation on center 4, and evaluation on center 4 with adaptation on center 3, we observe gains over the baseline in almost all metrics. We can infer that domain adaptation in this case helps generalization to unseen centers.

6.2 Different consistency losses

We executed multiple runs of the Mean Teacher algorithm to determine which consistency losses work best. We focused only on losses that do not contain any additional hyperparameters. For example, the Tversky Loss [Salehi et al. (2017)] presents a very similar approach to the Dice Loss, but with two additional hyperparameters (α and β). Thus, we excluded it to keep the comparison among the loss functions fair, as a fair comparison would require a much higher computational budget to jointly search over both its hyperparameters and the consistency weight values.

These constraints limit us to cross entropy, mean squared error, and the Dice loss, already described in Section 4. We believe, however, that this is an important aspect of proper domain adaptation and thus leave further investigation of other losses for future work.
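For reference, the three consistency losses can be sketched as follows for binary (sigmoid) outputs. These are simplified per-batch versions, not our exact training implementation; the Dice variant is included to show that it is not minimized on identical soft predictions, as discussed earlier:

```python
import numpy as np

def mse_consistency(student, teacher):
    # Mean squared error between the two probability maps.
    return float(np.mean((student - teacher) ** 2))

def ce_consistency(student, teacher, eps=1e-8):
    # Binary cross entropy with the teacher probabilities as soft targets.
    return float(-np.mean(teacher * np.log(student + eps)
                          + (1.0 - teacher) * np.log(1.0 - student + eps)))

def dice_consistency(student, teacher, eps=1e-8):
    # Soft Dice loss; note it is not zero even on identical soft inputs.
    return 1.0 - 2.0 * float(np.sum(student * teacher)) / (
        float(np.sum(student) + np.sum(teacher)) + eps)

s = np.array([0.90, 0.10, 0.80, 0.20])  # toy student sigmoid outputs
t = np.array([0.85, 0.15, 0.75, 0.25])  # toy teacher sigmoid outputs
print(mse_consistency(s, t), ce_consistency(s, t), dice_consistency(s, t))
```

MSE is zero exactly when student and teacher agree, which is one reason it behaves most predictably as a consistency term.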

6.3 Behavior of Dice predictions and thresholding

It is known that networks trained with the Dice loss usually produce predictions whose distribution is concentrated on the upper and lower bounds of the probability range, with very low entropy. As in [Perone and Cohen-Adad (2018b)], we also used a high threshold value of 0.99 for the Dice predictions in order to produce a balanced model.

We found, however, that our domain adaptation method also regularizes the network predictions, shifting the Dice probability distribution away from the probability bounds. For that reason, we used a lower Dice prediction threshold value of 0.9 instead of 0.99, which produced a more balanced model in terms of precision and recall.
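The precision/recall trade-off governed by this threshold can be sketched with toy probabilities (purely illustrative values, not taken from our data):

```python
import numpy as np

def precision_recall(prob, gt, thr):
    """Precision and recall of thresholded probabilities vs. a binary mask."""
    pred = prob >= thr
    tp = np.sum(pred & gt)
    precision = tp / max(int(pred.sum()), 1)
    recall = tp / max(int(gt.sum()), 1)
    return float(precision), float(recall)

prob = np.array([0.995, 0.92, 0.91, 0.93, 0.30])  # toy voxel probabilities
gt = np.array([True, True, True, False, False])   # toy ground truth

# A very high threshold keeps precision but sacrifices recall;
# lowering it rebalances the two.
print(precision_recall(prob, gt, 0.99))
print(precision_recall(prob, gt, 0.90))
```

When adaptation pushes probability mass away from the bounds, many true-positive voxels fall below a 0.99 cut, which is why the lower threshold yields a more balanced model.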

6.4 Training stability

For the task of unsupervised domain adaptation, it is important that the training is reasonably stable. Since in the most difficult scenarios there are no annotations for validating the adaptation, unstable training might produce sub-optimal adaptation results.

To evaluate training stability, we ran different consistency weights for each possible consistency loss and evaluated the difference between the best values found during training and the final results after 350 epochs. Table 2 summarizes the results of these experiments.

We can observe that cross entropy consistently fails for different weight values, but also achieves high Dice values in its best scenario during training. Cross entropy thus becomes a possible alternative to MSE when a few annotated images are available for validation in the target domain. Figure 7 shows how training diverges for the cross entropy loss after several epochs.

We can observe that both Dice and cross entropy have trouble stabilizing the training after achieving high results. MSE, however, tends to be more invariant to the consistency weight, thus being a robust approach when no annotated data is available at the target center.

As in [French et al. (2017)], we also experimented with confidence thresholding; however, we found no improvements by doing so.

Figure 7: Validation results at each epoch for the Teacher Model at Center 3 with Cross Entropy as the consistency loss function. The training was conducted on both centers 1 and 2 simultaneously and adapted on center 3. Best viewed in color.
Loss | Weight | Dice | mIoU | Recall | Precision | Specificity | Hausdorff
CE | 5 | 0.00 (85.50) | 0.00 (74.91) | 0.00 (95.01) | 0.00 (98.90) | 100.0 (100.00) | 0.00 (0.00)
CE | 10 | 0.00 (80.73) | 0.00 (69.54) | 0.00 (83.21) | 0.00 (98.78) | 100.0 (100.00) | 0.00 (0.00)
CE | 15 | 6.43 (37.03) | 4.89 (26.06) | 5.38 (77.05) | 17.34 (65.85) | 100.0 (100.00) | 0.28 (0.00)
CE | 20 | 2.30 (67.61) | 1.86 (52.55) | 2.09 (65.00) | 7.94 (96.57) | 100.0 (100.00) | 0.12 (0.03)
Dice | 5 | 76.76 (80.74) | 62.76 (68.16) | 97.88 (99.66) | 63.72 (72.50) | 99.71 (99.81) | 2.36 (2.16)
Dice | 10 | 4.77 (10.55) | 2.45 (5.64) | 96.25 (99.99) | 2.45 (5.85) | 79.59 (99.75) | 8.80 (2.57)
Dice | 15 | 2.30 (7.74) | 1.16 (4.12) | 99.95 (100.00) | 1.16 (4.62) | 55.07 (99.80) | 11.75 (2.50)
Dice | 20 | 1.79 (4.43) | 0.90 (2.27) | 99.99 (100.00) | 0.90 (2.30) | 42.02 (99.84) | 12.68 (2.43)
MSE | 5 | 83.70 (83.88) | 72.20 (72.46) | 91.24 (98.19) | 78.10 (78.57) | 99.87 (99.93) | 2.10 (2.00)
MSE | 10 | 84.38 (84.38) | 73.19 (73.19) | 90.15 (99.07) | 80.12 (80.12) | 99.88 (99.94) | 2.05 (1.89)
MSE | 15 | 84.59 (84.59) | 73.49 (73.50) | 89.19 (98.52) | 81.28 (81.28) | 99.89 (99.89) | 2.03 (2.03)
MSE | 20 | 84.50 (84.50) | 73.36 (73.37) | 90.36 (94.63) | 80.16 (80.16) | 99.88 (99.98) | 2.05 (1.46)
Table 2: Results on evaluation of center 3. All experiments were trained on both centers 1 and 2 simultaneously with unsupervised adaptation for center 3. Values inside parentheses represent the best validation result for each metric achieved during training. The remaining values represent the final result after 350 epochs.
Evaluation | Version | Dice | mIoU | Recall | Precision | Specificity | Hausdorff
Center 3 | Baseline | 83.06 | 71.36 | 90.98 | 77.24 | 99.86 | 2.13
Center 3 | EMA | 83.09 | 71.40 | 90.97 | 77.30 | 99.86 | 2.13
Center 4 | Baseline | 69.41 | 53.90 | 97.20 | 54.98 | 99.70 | 2.48
Center 4 | EMA | 69.50 | 54.00 | 97.19 | 55.09 | 99.71 | 2.48
Table 3: Results of the evaluation on different centers. Each evaluation center indicates from which center the testing data was collected. In this scenario, a model was trained and compared against its Polyak-averaged model (EMA). All experiments were trained on both centers 1 and 2 simultaneously.

7 Ablation Experiments

This section describes the ablation experiments performed to rule out evidence of external improvement factors.

7.1 Exponential moving average (EMA)

The improvement seen in Table 1 could also be explained by the introduction of the exponential moving average (EMA) during the training procedure, which averages and smooths the SGD trajectory. However, we want to establish that the improvement comes from the unlabeled data and not only from the exponential averaging component. Therefore, we performed an ablation experiment leaving the EMA active while setting the consistency weight to zero, thereby evaluating the impact of the exponential average without taking into account the unlabeled data used to enforce consistency.
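The EMA (Polyak averaging) teacher update at the core of the Mean Teacher can be sketched as follows (a minimal dictionary-of-scalars version; real implementations update parameter tensors in place, and the decay value 0.99 is an illustrative assumption):

```python
def ema_update(teacher, student, alpha=0.99):
    """Polyak/EMA update: teacher weights track a smoothed average
    of the student weights (the Mean Teacher update rule)."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k]
            for k in teacher}

teacher = {"w": 0.0}
student = {"w": 1.0}
for _ in range(100):
    teacher = ema_update(teacher, student)
# teacher["w"] converges towards the student weight as 1 - 0.99**n.
```

Setting the consistency weight to zero leaves this averaging active but removes any gradient signal from unlabeled data, which is exactly what the ablation isolates.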

We executed the same experimental setting as in Table 1, but with the consistency weight set to zero. The results are presented in Table 3 and show that the EMA model (teacher) presents no gains over the non-averaged model (the supervised baseline). This is arguably due to a poorly chosen EMA decay rate; however, the Mean Teacher, which heavily relies on the EMA model, was able to outperform purely supervised methods by a large margin.

8 Visualizing domain shift

We investigate how domain adaptation affects the prediction space of segmentation at different centers. By using the t-SNE [Maaten and Hinton (2008)] method, we are able to see the changes in the network's perception of unsupervised data. The data shown in the following charts wasn't presented during training.

We ran two baselines for this experiment. The first model was trained in a supervised manner following the same hyperparameters presented in Section 4.4. The other is an adaptation scenario where both centers 1 and 2 were used as supervised centers and center 3 as the adaptation target. The vectors projected with t-SNE represent the features from the network before the final sigmoid activation. This leads to an easier separation, since the predictions are not squashed between 0 and 1. Furthermore, projecting the predictions leads to a fairly simple interpretation of how the network's output deals with unseen or unsupervised data.

Both t-SNE executions had the learning rate set to 10 and the perplexity to 30, and ran for about 1000 iterations (we used the TensorBoard embedding projector, available at https://github.com/tensorflow/tensorboard). We noted that more iterations preserved the group structure but further compressed it. This made visualizing the centers harder, so we chose 1000 iterations as a good trade-off between identifying emerging groups and interpretability.
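A comparable projection can be sketched with scikit-learn's t-SNE (random features stand in for the pre-sigmoid network outputs; the perplexity is lowered from the 30 used above to suit the toy sample size):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for pre-sigmoid outputs from two different centers.
feats = np.concatenate([rng.normal(0.0, 1.0, (50, 32)),
                        rng.normal(4.0, 1.0, (50, 32))])

# Learning rate 10 matches the setting above; perplexity is reduced
# here because the toy sample is small.
emb = TSNE(n_components=2, perplexity=5.0, learning_rate=10.0,
           init="random", random_state=0).fit_transform(feats)
print(emb.shape)  # one 2-D point per projected prediction vector
```

Coloring each projected point by its center of origin is then enough to reproduce the kind of cluster plots shown in the figures below.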

The results from the supervised experiment are shown in Figure 8(a). As one can observe, there is a large separation between data from the centers presented during training (1 and 2) and the unseen centers (3 and 4). This shows that the network predictions differ severely according to the center to which the instance belongs.

(a) t-SNE algorithm execution on supervised learning scenario.
(b) t-SNE algorithm execution on domain adaptation scenario.
Figure 8: Execution of t-SNE algorithm for two different scenarios. Best viewed in color.

When adapting the network with unlabelled examples from a different domain, the predictions become more diffuse, at least for the centers presented during training. The results from the unsupervised adaptation experiment are shown in Figure 8(b). In this scenario, the centers presented with labeled data (centers 1 and 2) form clusters with the domains seen only in an unsupervised fashion (center 3) or not presented to the network in any way (center 4). A possible explanation for the clusters forming between centers 1 and 4, and between centers 2 and 3, is how close their voxel intensity distributions are. The original voxel distributions can be seen in Figure 2.

We further address this possibility in Figure 9. It can be seen that points in the prediction space tend to get closer according to their pixel distribution. This does not hold in the supervised version presented in Figure 8(a). We believe the reason is that by presenting images, even in an unsupervised fashion, the network was able to better map the data manifold. This can lead to better generalization and thus more controlled and trustworthy predictions on unseen centers. It also opens room for exploring error estimation based directly on the input distribution. We leave such possibilities for future work.

Figure 9: Expanded visualization of the t-SNE projection from the adaptation scenario in Figure 8(b). The chart in the middle represents the pixel distribution of each center. It can be observed how similar distributions tend to form clusters in the prediction space.

9 Conclusion and limitations

Variability in many medical imaging modalities is still a big problem for machine learning methods. The different parametrizations that can be used to acquire imaging data and the lack of standardized protocols and industry standards are pervasive across the entire field.

In this work, we showed that unsupervised domain adaptation can indeed be an opportunity to increase the performance of these models for medical imaging at multiple centers without depending on annotations, an expensive resource to acquire in medical imaging.

Through both the evaluation of multiple metrics and data analysis, we showed how self-ensembling methods can improve generalization on unseen domains by leveraging unlabeled data from different domains. In the ablation experiment, we observed how the Mean Teacher is able to leverage unlabeled data even without an exhaustive search of the hyperparameter space.

We also observed how the cross entropy loss, when used as the consistency loss function, failed to maintain training stability, with training diverging as the number of epochs increased. We discussed how this can lead to potential problems in more challenging multi-center scenarios. We also showed the issues that arise when using the Dice loss as the consistency loss.

We are also aware of the limitations of the present study, as we did not evaluate adversarial training methods for domain adaptation. Even considering that the Mean Teacher is currently the state-of-the-art method on many datasets, we believe that further evaluation of adversarial methods in the same realistic small data regime could further increase the significance of our contributions, and we thus leave this for future work.

Another limitation is the evaluation on the single task of gray matter segmentation, which can be expanded to other tasks in other domains. Increasing the number of centers alongside the number of tasks, to show whether the pattern found in this study persists in other scenarios, is an important missing contribution.

Further work in the field could pave the way for methods that can measure the risk of adaptation to particular centers or domains. This is an important step towards understanding the limitations of domain adaptation methods. We believe that the problems arising from the variability of medical imaging modalities require rethinking the strong assumptions that machine learning methods make; this nevertheless poses complex difficulties due to its foundational nature.

One important step towards assessing the limitations of current methods is to reinforce the importance of proper multi-domain evaluation in studies and medical imaging challenges, which rarely provide a test set from different domains containing the realistic variability found in real scenarios.

10 Source-code and dataset availability

In the spirit of Open Science and reproducibility, the source code used to reproduce the experiments and replicate the results of this work can be found in our repository: https://github.com/neuropoly/domainadaptation.

The dataset used in this work is also available on the Spinal Cord Gray Matter Segmentation Challenge website upon a data license agreement: http://cmictig.cs.ucl.ac.uk/niftyweb/program.php?p=CHALLENGE.

11 Acknowledgments

Funded by the Canada Research Chair in Quantitative Magnetic Resonance Imaging [950-230815], the Canadian Institute of Health Research [CIHR FDN-143263], the Canada Foundation for Innovation [32454, 34824], the Fonds de Recherche du Québec - Santé [28826], the Fonds de Recherche du Québec - Nature et Technologies [2015-PR-182754], the Natural Sciences and Engineering Research Council of Canada [435897-2013], the Canada First Research Excellence Fund (IVADO and TransMedTech) and the Quebec BioImaging Network [5886]. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nivel Superior – Brasil (CAPES) – Finance Code 001.


  • AlBadawy . (2018) albadawy2018deepAlBadawy, EA., Saha, A.  Mazurowski, MA.  2018. Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing. Medical physics4531150–1158.
  • Cao . (2018) cao2018didaCao, J., Katzir, O., Jiang, P., Lischinski, D., Cohen-Or, D., Tu, C.  Li, Y.  2018. DiDA: Disentangled Synthesis for Domain Adaptation Dida: Disentangled synthesis for domain adaptation. arXiv preprint arXiv:1805.08019.
  • Chen . (2018) Chen2018Chen, C., Dou, Q., Chen, H.  Heng, PA.  2018. Semantic-Aware Generative Adversarial Nets for Unsupervised Domain Adaptation in Chest X-ray Segmentation Semantic-Aware Generative Adversarial Nets for Unsupervised Domain Adaptation in Chest X-ray Segmentation.
  • Coupé . (2011) coupe2011patchCoupé, P., Manjón, JV., Fonov, V., Pruessner, J., Robles, M.  Collins, DL.  2011. Patch-based segmentation using expert priors: Application to hippocampus and ventricle segmentation Patch-based segmentation using expert priors: Application to hippocampus and ventricle segmentation. NeuroImage542940–954.
  • Dou . (2018) Dou2018Dou, Q., Ouyang, C., Chen, C., Chen, H.  Heng, PA.  2018. Unsupervised Cross-Modality Domain Adaptation of ConvNets for Biomedical Image Segmentations with Adversarial Loss Unsupervised Cross-Modality Domain Adaptation of ConvNets for Biomedical Image Segmentations with Adversarial Loss .
  • French . (2017) french2017selfFrench, G., Mackiewicz, M.  Fisher, M.  2017. Self-ensembling for visual domain adaptation Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208.
  • Ganin . (2016) ganin2016domainGanin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F.Lempitsky, V.  2016. Domain-adversarial training of neural networks Domain-adversarial training of neural networks. The Journal of Machine Learning Research1712096–2030.
  • Ghifary . (2016) ghifary2016deepGhifary, M., Kleijn, WB., Zhang, M., Balduzzi, D.  Li, W.  2016. Deep reconstruction-classification networks for unsupervised domain adaptation Deep reconstruction-classification networks for unsupervised domain adaptation. European Conference on Computer Vision European conference on computer vision ( 597–613).
  • Gros . (2018) Gros2018Gros, C., De Leener, B., Badji, A., Maranzano, J., Eden, D., Dupont, SM.Cohen-Adad, J.  2018may. Automatic segmentation of the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks Automatic segmentation of the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks.
  • He . (2016) He2015bHe, K., Zhang, X., Ren, S.  Sun, J.  2016.

    Delving deep into rectifiers: Surpassing human-level performance on imagenet classification Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.

    Proceedings of the IEEE International Conference on Computer Vision11-18-Dece1026–1034. 10.1109/ICCV.2015.123
  • Hinton . (2015) hinton2015distillingHinton, G., Vinyals, O.  Dean, J.  2015. Distilling the knowledge in a neural network Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Hoffman . (2017) hoffman2017cycadaHoffman, J., Tzeng, E., Park, T., Zhu, JY., Isola, P., Saenko, K.Darrell, T.  2017. Cycada: Cycle-consistent adversarial domain adaptation Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213.
  • Hou . (2016) hou2016patchHou, L., Samaras, D., Kurc, TM., Gao, Y., Davis, JE.  Saltz, JH.  2016. Patch-based convolutional neural network for whole slide tissue image classification Patch-based convolutional neural network for whole slide tissue image classification.

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Proceedings of the ieee conference on computer vision and pattern recognition ( 2424–2433).

  • Ioffe  Szegedy (2015) Ioffe2015Ioffe, S.  Szegedy, C.  2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37 Proceedings of the 32nd international conference on international conference on machine learning - volume 37 ( 448–456). JMLR.org. 10.1007/s13398-014-0173-7.2
  • Javanmardi  Tasdizen (2018) Javanmardi2018Javanmardi, M.  Tasdizen, T.  2018. DOMAIN ADAPTATION FOR BIOMEDICAL IMAGE SEGMENTATION USING ADVERSARIAL TRAINING Scientific Computing and Imaging Institute , University of Utah DOMAIN ADAPTATION FOR BIOMEDICAL IMAGE SEGMENTATION USING ADVERSARIAL TRAINING Scientific Computing and Imaging Institute , University of Utah. Isbi554–558.
  • Kamnitsas . (2017) kamnitsas2017unsupervisedKamnitsas, K., Baumgartner, C., Ledig, C., Newcombe, V., Simpson, J., Kane, A.others  2017. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. International Conference on Information Processing in Medical Imaging International conference on information processing in medical imaging ( 597–609).
  • Kingma  Ba (2015) Kingma2015aKingma, DP.  Ba, JL.  2015. Adam: a Method for Stochastic Optimization Adam: a Method for Stochastic Optimization. International Conference on Learning Representations 20151–15. http://doi.acm.org.ezproxy.lib.ucf.edu/10.1145/1830483.1830503
  • Lafarge . (2017) Lafarge2017Lafarge, MW., Pluim, JP., Eppenhof, KA., Moeskops, P.  Veta, M.  2017. Domain-adversarial neural networks to address the appearance variability of histopathology images Domain-adversarial neural networks to address the appearance variability of histopathology images.

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)10553 LNCS83–91.

  • Lai (2015) lai2015deepLai, M.  2015. Deep learning for medical image segmentation Deep learning for medical image segmentation. arXiv preprint arXiv:1505.02000.
  • Laine  Aila (2016) laine2016temporalLaine, S.  Aila, T.  2016. Temporal ensembling for semi-supervised learning Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
  • LeCun . (2015) LeCun2015aLeCun, Y., Bengio, Y., Hinton, G., Y., L., Y., B.  G., H.  2015. Deep learning Deep learning. Nature5217553436–444. 10.1038/nature14539
  • Li . (2016) li2016revisitingLi, Y., Wang, N., Shi, J., Liu, J.  Hou, X.  2016. Revisiting batch normalization for practical domain adaptation Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779.
  • Lin . (2018) lin2018focalLin, TY., Goyal, P., Girshick, R., He, K.  Dollár, P.  2018. Focal loss for dense object detection Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence.
  • Litjens . (2017) Litjens2017Litjens, G., Kooi, T., Bejnordi, BE., Setio, AAA., Ciompi, F., Ghafoorian, M.Sánchez, CI.  2017. A survey on deep learning in medical image analysis A survey on deep learning in medical image analysis. Medical Image Analysis4260–88. 10.1016/j.media.2017.07.005
  • Liu . (2018) Liu2018Liu, YC., Yeh, YY., Fu, TC., Wang, SD., Chiu, WC.  Wang, YCF.  2018. Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation. Proceedings - 31th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018.
  • Long . (2015) long2015fullyLong, J., Shelhamer, E.  Darrell, T.  2015. Fully convolutional networks for semantic segmentation Fully convolutional networks for semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition Proceedings of the ieee conference on computer vision and pattern recognition ( 3431–3440).
  • Maaten  Hinton (2008) maaten2008visualizingMaaten, Lvd.  Hinton, G.  2008. Visualizing data using t-SNE Visualizing data using t-sne. Journal of machine learning research9Nov2579–2605.
  • Madani . (2018) Madani2018Madani, A., Moradi, M., Karargyris, A.  Syeda-Mahmood, T.  2018. Semi-supervised learning with generative adversarial networks for chest x-ray classification with ability of data domain adaptation Semi-supervised learning with generative adversarial networks for chest x-ray classification with ability of data domain adaptation. IEEE 15th Symposium on Biomedical ImagingIsbi1038–1042. 10.1109/ISBI.2018.8363749
  • Mahmood . (2018) Mahmood2018Mahmood, F., Chen, R.  Durr, NJ.  2018. Unsupervised Reverse Domain Adaptation for Synthetic Medical Images via Adversarial Training Unsupervised Reverse Domain Adaptation for Synthetic Medical Images via Adversarial Training. IEEE Transactions on Medical ImagingPPc1. 10.1109/TMI.2018.2842767
  • Milletari, F., Navab, N., & Ahmadi, S.-A. (2016). V-Net: Fully convolutional neural networks for volumetric medical image segmentation. 2016 Fourth International Conference on 3D Vision (3DV) (pp. 565–571).
  • Neyshabur, B., Bhojanapalli, S., McAllester, D., & Srebro, N. (2017). Exploring generalization in deep learning. Advances in Neural Information Processing Systems (pp. 5947–5956).
  • Odena, A., Oliver, A., Raffel, C., Cubuk, E. D., & Goodfellow, I. (2018). Realistic evaluation of semi-supervised learning algorithms.
  • Oliver, A., Odena, A., Raffel, C., Cubuk, E. D., & Goodfellow, I. J. (2018). Realistic evaluation of semi-supervised learning algorithms. International Conference on Learning Representations, 1–15.
  • Perone, C. S., & Cohen-Adad, J. (2018a, September). Deep semi-supervised segmentation with weight-averaged consistency targets. DLMIA, MICCAI, 1–8. doi:10.1007/978-3-030-00889-5
  • Perone, C. S., & Cohen-Adad, J. (2018b). Spinal cord gray matter segmentation using deep dilated convolutions. Scientific Reports, 8(1). doi:10.1038/s41598-018-24304-3
  • Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.
  • Prados, F., Ashburner, J., Blaiotta, C., Brosch, T., Carballido-Gamio, J., Cardoso, M. J., … Cohen-Adad, J. (2017). Spinal cord grey matter segmentation challenge. NeuroImage, 152, 312–329. doi:10.1016/j.neuroimage.2017.03.010
  • Rajpurkar, P., Hannun, A. Y., Haghpanahi, M., Bourn, C., & Ng, A. Y. (2017). Cardiologist-level arrhythmia detection with convolutional neural networks. arXiv preprint.
  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. 1–8. doi:10.1007/978-3-319-24574-4_28
  • Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Cornell University Operations Research and Industrial Engineering.
  • Salehi, S. S. M., Erdogmus, D., & Gholipour, A. (2017). Tversky loss function for image segmentation using 3D fully convolutional deep networks. International Workshop on Machine Learning in Medical Imaging (pp. 379–387).
  • Sankaranarayanan, S., Balaji, Y., Castillo, C. D., & Chellappa, R. (2018). Generate to adapt: Aligning domains using generative adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018). doi:10.1109/CVPR.2017.316
  • Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? (No, it is not about internal covariate shift).
  • Sun, B., & Saenko, K. (2016). Deep CORAL: Correlation alignment for deep domain adaptation. European Conference on Computer Vision (pp. 443–450).
  • Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems (pp. 1195–1204).
  • Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., & Darrell, T. (2014). Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
  • Wang, M., & Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing.
  • Wu, Y., & He, K. (2018). Group normalization.
  • Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27 (Proceedings of NIPS), 1–9.
  • Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., & Savarese, S. (2018, June). Taskonomy: Disentangling task transfer learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zech, J. R., Badgeley, M. A., Liu, M., Costa, A. B., Titano, J. J., & Oermann, E. K. (2018). Confounding variables can degrade generalization performance of radiological deep learning models.
  • Zhang, Y., Miao, S., Mansi, T., & Liao, R. (2018). Task driven generative modeling for unsupervised domain adaptation: Application to X-ray image segmentation. 21–9. doi:10.1007/978-3-030-00934-2_67
  • Zhong, E., Fan, W., Yang, Q., Verscheure, O., & Ren, J. (2010). Cross validation framework to choose amongst models and datasets for transfer learning. Lecture Notes in Computer Science (Vol. 6323 LNAI, pp. 547–562). doi:10.1007/978-3-642-15939-8
  • Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint.