1 Introduction
In the past few years, the research community has witnessed the fast developmental pace of deep learning [LeCun et al. (2015)] approaches for unstructured data analysis, arguably establishing an important scientific milestone. Deep neural networks constitute a paradigm shift from traditional machine learning approaches for unstructured data. Whereas the latter rely on handcrafted feature engineering for improving learning over images, text, audio, and similar unstructured input spaces, deep neural networks are capable of automatically learning robust hierarchical features, in what is known today as representation learning. Deep learning approaches have achieved human-level performance on many tasks, sometimes surpassing it on applications such as natural image classification [He et al. (2016)] or arrhythmia detection from electrocardiograms [Rajpurkar et al. (2017)].
Due to its popularity and strong results in many domains, deep learning has attracted a lot of attention from the medical imaging community. A recent survey by Litjens et al. [Litjens et al. (2017)] analyzes more than 300 studies from the medical imaging domain and finds that deep neural networks have become pervasive throughout the entire field, with a significant increase in the number of published studies between 2015 and 2016. The survey also identifies image segmentation as the most addressed task, potentially due to the importance of quantifying anatomical structures and pathologies [Gros et al. (2018)], as opposed to less informative tasks such as classification of pathologies or detection of structures.
Deep neural networks are thus becoming the norm in the medical imaging field, though there are still several unsolved challenges that need to be properly addressed. For instance, one of the most well-known problems is the high sample complexity, or how much data deep learning requires to accurately learn and perform well on unseen images, which is linked to the concepts of model complexity and generalization, an active research topic in learning theory [Neyshabur et al. (2017)].
The large amount of data required to train deep neural networks can be partially mitigated with techniques such as transfer learning [Yosinski et al. (2014), Zamir et al. (2018)]. However, transfer learning is problematic in medical imaging because a large dataset is still required so the models can benefit from the inductive transfer process. Unlike natural images, where annotations can be easily and quickly provided by non-experts, medical images require careful and time-consuming analysis from trained experts such as radiologists.
Yet another challenge when deploying deep learning models to medical imaging analysis, and perhaps one of the most difficult to solve, is the so-called data distribution shift: variability inherent to the different imaging protocols can result in significantly different data distributions. Therefore, models trained under the empirical risk minimization (ERM) principle might fail to generalize to other domains because of the strong assumptions of ERM. ERM is the statistical learning principle behind many machine learning methods, and it offers good learning guarantees and bounds if its assumptions hold, such as the assumption that the training and test datasets come from the same domain. However, as we saw, this assumption is usually violated in real application scenarios.
When a deep learning model that assumes independent and identically distributed (iid) data is trained with images from one domain and then deployed on images from a different domain (e.g., a distinct center), which follow a distinct probability distribution, its performance degrades by a large margin. A concrete example of domain shift can be found in magnetic resonance imaging (MRI), where the same machine vendor using the same protocol for the same subject can produce different images. Variability is also much more salient between different centers, where there are differences in machine vendor, protocol, and resolution, among others. A visual example of inter-center differences in data distribution can be seen in Figure 1, where we show samples of different centers from the Gray Matter (GM) segmentation challenge dataset [Prados et al. (2017)]. In Figure 2, we show the voxel intensity distribution for the same dataset.
Although this distribution shift is very common in medical imaging, the problem is surprisingly ignored during the design of many challenges in the field. It is very common to have data from the same domain (same machine, protocol, etc.) in both training and test sets. However, this homogeneous data split often does not represent reality and in many cases will produce over-optimistic evaluation results. In realistic scenarios, it is very rare to have labeled data available from a new center before training a model, and it is very common to use a pre-trained model from a different domain on completely different data. Therefore, it is paramount that a proper evaluation avoids contaminating the test set with data from the same domain that is present in the training set; otherwise, one incurs the risk of the detrimental effects of inadequate evaluations [Zech et al. (2018)].
The task of learning a classifier or any other predictor under a shift between the training and the target/test distributions is known as “domain adaptation” (DA). In this work we extend a previously developed method [French et al. (2017)] for DA, based on the Mean Teacher [Tarvainen & Valpola (2017)] approach, to segmentation tasks, the most addressed task in medical imaging.
We provide the following contributions: we extend the unsupervised domain adaptation method using self-ensembling to the semantic segmentation task. To the best of our knowledge, this is the first time this method is used for semantic segmentation and also in the medical imaging domain. We explore many components of the model, such as different consistency losses, and we perform an extensive evaluation and ablation experiments on a realistic small-data-regime dataset from the magnetic resonance imaging (MRI) domain. We also provide visualizations to gain insight into the model dynamics for the unsupervised domain adaptation task.
This paper is organized as follows. In Section 2 we present related work, whereas in Section 3 we give a short formalization of the unsupervised domain adaptation task and its connection with semi-supervised learning. In Section 4 we detail our method in terms of model architecture and corresponding design decisions. In Section 5 we describe the dataset we use in our experiments and how we performed the data split for the domain adaptation scenario. In Section 6 we provide the experiment results, followed by an ablation study in Section 7. In Section 8 we provide visual insights regarding the model’s adaptation dynamics for multiple domains. Finally, in Section 9 we discuss our findings and the corresponding limitations of this work. In the spirit of open science and reproducibility, we also provide more information regarding data and source-code availability in Section 10.
2 Related Work
Deep learning based methods for segmentation in medical imaging have been extensively explored in recent years [Litjens et al. (2017)] and vary in the specifics of how they handle the task. Most of the initial work focused on patch-based segmentation [Coupé et al. (2011)], preceding the pioneering deep learning models. With the growing interest in deep learning for several computer vision tasks, the first attempts at using Convolutional Neural Networks (CNNs) for image segmentation were based on processing image patches through a sliding window, which yielded segmented patches. Those independently segmented patches were then concatenated to create the final segmented image [Lai (2015)]. The main drawbacks of this approach are its computational cost, since several forward passes are needed to generate the final result, and the inconsistency in predictions, which can be fixed by overlapping sliding windows.
Even though patch-wise methods are still being researched [Hou et al. (2016)] and have led to several advances in segmentation [Lai (2015)], the most common deep architecture for segmentation nowadays is the so-called Fully Convolutional Network (FCN) [Long et al. (2015)]. This architecture is based solely on convolutional layers, with the final result not depending on fully-connected layers. FCNs can provide a fully segmented image within a single forward step, with an output size that varies with the input tensor size. One of the most well-known FCNs for medical imaging is U-Net [Ronneberger et al. (2015)], which combines convolutional, downsampling, and upsampling operations with non-residual skip connections. We make use of U-Net throughout this work, aiming for generalizable conclusions. This architecture is further discussed in Section 4.3.
Deep Domain Adaptation (DDA), which is a field unrelated in essence to medical imaging, has been widely studied in recent years [Wang & Deng (2018)]. We can divide the literature on DDA as follows: (i) methods based on building domain-invariant feature spaces through auto-encoders [Ghifary et al. (2016)], adversarial training [Ganin et al. (2016)], GANs [Hoffman et al. (2017), Sankaranarayanan et al. (2018)], or disentanglement strategies [Liu et al. (2018), Cao et al. (2018)]; (ii) methods based on the analysis of higher-order statistics [Li et al. (2016), Sun & Saenko (2016)]; (iii) methods based on explicit discrepancy between source and target domains [Tzeng et al. (2014)]; and (iv) methods based on implicit discrepancy between domains, also known as self-ensembling [French et al. (2017), Tarvainen & Valpola (2017)].
In [Hoffman et al. (2017)], the authors train GANs with cycle-consistent loss functions [Zhu et al. (2017)] to remap the distribution from the source to the target dataset, thus creating target-domain-specific features for completing the task. In [Sankaranarayanan et al. (2018)], GANs are employed as a means of learning aligned embeddings for both domains. Similarly, disentangled representations for each domain have been proposed [Liu et al. (2018), Cao et al. (2018)] with the goal of generating a feature space capable of separating domain-dependent and domain-invariant information.
In [Li et al. (2016)], the authors propose changing parameters of the neural network layers for adapting domains by directly computing or optimizing higher-order statistics. More specifically, they propose an alternative to batch normalization called Adaptive Batch Normalization (AdaBN) that computes different statistics for the source and target domains, hence creating domain-invariant features that are normalized according to the respective domain. In a similar fashion, Deep CORAL [Sun & Saenko (2016)] provides a loss function for minimizing the difference between the covariances of target and source domain features.
Discrepancy-based methods pose a different approach to DDA. By directly minimizing the discrepancy between activations from the source and target domains, the network learns to generate reasonable predictions while incorporating information from the target domain. The seminal work of Tzeng et al. [Tzeng et al. (2014)] directly minimizes the discrepancy at a specific layer between labeled samples from the source set and unlabeled samples from the target set.
Implicit discrepancy-based methods such as self-ensembling [French et al. (2017)] have become widely used for unsupervised domain adaptation. Self-ensembling is based on the Mean Teacher network [Tarvainen & Valpola (2017)], which was first introduced for semi-supervised learning tasks. Due to the similarity between unsupervised domain adaptation and semi-supervised learning, very few adjustments need to be made to employ the method for the purposes of DDA. Mean Teacher optimizes a task loss and a consistency loss, the latter minimizing the discrepancy between predictions on the source and target dataset. We further detail how Mean Teacher works in Section 4.1.
There are few studies that report on the consequences of domain discrepancy for medical imaging by making use of the unsupervised domain adaptation literature. The work in [AlBadawy et al. (2018)] discusses the impact of deep learning models across different institutions, showing a statistically significant performance decrease in cross-institutional train-and-test protocols. A few studies attempt to directly approach domain adaptation in medical imaging through adversarial training [Kamnitsas et al. (2017), Chen et al. (2018), Zhang et al. (2018), Lafarge et al. (2017), Javanmardi & Tasdizen (2018), Dou et al. (2018)], some generating artificial images for leveraging training data [Mahmood et al. (2018), Madani et al. (2018)]. Nevertheless, to the best of our knowledge, we are the first to address the problem of domain shift in medical imaging segmentation by extending the unsupervised domain adaptation self-ensembling method to the semantic segmentation task.
3 Semi-Supervised Learning and Unsupervised Domain Adaptation
A common approach for leveraging training when few labeled examples are available is semi-supervised learning, which is defined as follows: given a labeled dataset and unlabeled data, each drawn from its own distribution, learn from both the labeled and the unlabeled data in order to improve either a supervised learning task (say, classification) or an unsupervised learning task (say, clustering).
Semi-supervised learning methods tend to perform well when unlabeled data actually come from the same distribution as the labeled data. This allows the learning algorithm to leverage its knowledge using unlabeled data, which usually make up the majority of examples. As promising as semi-supervised learning is, the assumption that the distribution of unlabeled data is similar to that of the labeled data often fails in real-world applications. We refer the reader to a thorough evaluation of semi-supervised learning methods and their downfalls in [Odena et al. (2018)].
It is very common for models to be applied in scenarios that are significantly different from those in which they were originally trained. Examples include different weather conditions for outdoor activity recognition, or different cities for training autonomous vehicles. Those changes in scenario shift the data distribution, harming the quality of the predictions in those cases where the model was not properly adapted to the desired condition.
The difference between the distributions of the examples used in the training and test sets is called domain shift, which is formally defined as follows. Consider a source dataset with input distribution $P(X_s)$ and label distribution $P(Y_s)$, as well as a target dataset with input distribution $P(X_t)$ and labels $P(Y_t)$, where $P(X_s) \neq P(X_t)$. Domain adaptation can be addressed via a supervised approach when labeled data from the target domain is available, or via unsupervised learning when only unlabeled data is available for the target domain.
When a method addresses the problem of domain adaptation using unlabeled data for the target domain, which is the most common and useful scenario, the task at hand is called unsupervised domain adaptation. Unsupervised domain adaptation methods assume that both input distributions $P(X_s)$ and $P(X_t)$ are available, while the label distribution $P(Y_s)$ is available and $P(Y_t)$ is not. In other words, only the source dataset provides labeled examples. Hence, the task is to leverage knowledge mainly from the target domain using the unlabeled data available in $P(X_t)$.
4 Method
This section details the base domain adaptation methods we are using for the task of medical imaging analysis. We further discuss the changes that are needed for allowing unsupervised domain adaptation on segmentation tasks instead of the typical classification scenario. We also detail the most important aspects one has to address regarding the segmentation of medical images.
4.1 Self-Ensembling and Mean Teacher
Self-Ensembling was originally conceived as a viable strategy for generating predictions on unlabeled data [Laine & Aila (2016)]. The predictions are combined to leverage the knowledge in unlabeled data and are then used as targets for semi-supervised learning. The original paper proposes two different models for self-ensembling. The first model, called the Π-model, employs a consistency loss between predictions on the same input. Each input from a batch is passed through a neural network twice with different augmentation parameters, yielding two different predictions. A squared difference between those predictions is minimized alongside the cross entropy for labeled examples. The second model, called Temporal Ensembling, works on the assumption that, as training progresses, averaging the predictions over time on unlabeled samples may yield a better approximation of the correct label. This pseudo label is then used as the target during training. A squared difference between the averaged predictions and the current one is minimized alongside the cross entropy for labeled examples. The network uses an exponential moving average (EMA) to update the generated targets every epoch, as follows:
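$$Z_t = \alpha\,Z_{t-1} + (1 - \alpha)\,z_t \tag{1}$$
where $z_t$ are the predictions at the current epoch, $Z_t$ are the accumulated targets, and $\alpha$ is the EMA weight.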
Self-Ensembling was later extended to directly combine model weights instead of predictions. This adaptation is called Mean Teacher [Tarvainen & Valpola (2017)]. Whereas Eq. (1) updates the target pseudo labels, Mean Teacher updates the model weights at each step, generating a somewhat improved model compared to the model without the EMA, a framework which is linked to Polyak-Ruppert averaging [Polyak & Juditsky (1992), Ruppert (1988)]. In this scenario, the EMA model is named the teacher and the standard model the student. The update function is as follows:
$$\theta'_t = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_t \tag{2}$$
where $\theta_t$ and $\theta'_t$ are the student and teacher weights at step $t$, and $\alpha$ is a hyperparameter that modulates the importance of the current model’s weights with respect to previous models. The best results are found when $\alpha$ is increased later in training. This is arguably because, as training progresses, the update ends up favoring the current model, so $\alpha$ should be larger to give more importance to previous models.
Each training step involves a loss component for both labeled and unlabeled data. All samples from a batch are evaluated by both the student and teacher models. Predictions from both models are compared via the consistency loss. The predictions on labeled data, however, are also compared to the ground truth, as traditionally performed in segmentation tasks, in what we call the task loss:
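$$\mathcal{L}(\theta) = \mathcal{L}_{task}(\theta) + \gamma\,\mathcal{L}_{cons}(\theta, \theta') + \lambda\,R(\theta) \tag{3}$$
where $\theta$ and $\theta'$ are the student and teacher weights, $\mathcal{L}_{task}$ is the task loss, $\mathcal{L}_{cons}$ is the consistency loss, $R$ is the regularization term, and $\gamma$ and $\lambda$ are the Lagrange multipliers that represent, respectively, the consistency and regularization weights. The hyperparameter $\gamma$ was empirically found to improve results when varied over time, given that in the earlier training steps the network is still generating poor results. The consistency weight follows a sigmoid ramp-up saturating at a given hyperparameter value.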
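As an illustration of the teacher update and of how the training objective is composed, the following is a minimal sketch in plain Python. The names `ema_update`, `total_loss`, `alpha`, `consistency_weight`, and `reg_weight` are illustrative assumptions rather than identifiers from the original implementation, and weights are represented as flat lists of floats for brevity.

```python
def ema_update(teacher_weights, student_weights, alpha):
    """One EMA step: the teacher keeps a fraction alpha of its own weights
    and takes the remaining (1 - alpha) from the current student."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_weights, student_weights)]


def total_loss(task_loss, consistency_loss, regularization,
               consistency_weight, reg_weight):
    """Weighted sum of the task, consistency, and regularization terms."""
    return (task_loss
            + consistency_weight * consistency_loss
            + reg_weight * regularization)


# Toy run: the student is held fixed, so the teacher drifts toward it.
student = [1.0, -2.0]
teacher = [0.0, 0.0]
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
```

With a larger `alpha` the teacher moves more slowly toward the student, which is the sense in which a larger value gives more importance to previous models.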
Mean Teacher also follows the dynamics of model distillation [Hinton et al. (2015)]. In that scenario, a trained model is used for predicting instances and its output is used as labels for another, smaller model. This is considered a good practice as soft labels tend to better represent the characteristics of the classes (e.g., the representation distance between a Siberian Husky and an Alaskan Malamute should arguably be smaller than the distance between a Siberian Husky and a Persian Cat). Unlike traditional distillation formulations, the Mean Teacher framework also uses the teacher model to generate labels for unlabeled data, and the teacher is a model of the same size that is simultaneously updated during training.
The Mean Teacher framework was also extended to unsupervised domain adaptation in [French et al. (2017)]. Among the proposed changes, the authors modify the data batches so that every batch consists of images from both the source and target domains. At each step, the student model evaluates images from the source domain and computes derivatives via a task loss based on the ground truth. The target domain images, which are unlabeled, are used to compute the consistency loss by comparing predictions from both student and teacher models. Unlike in its original formulation, the teacher model only has access to unlabeled examples (in this case, examples from the target domain). Each loss function is thus responsible for improving learning on a single domain. The task loss is evaluated by comparing the predictions against the ground truth for the labeled examples (source domain). For the consistency loss, MSE is often used to compare the predictions from the student and teacher models for the unlabeled examples (target domain).
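The assignment of each loss to a single domain can be sketched as follows. This is only a toy illustration on flattened prediction lists: the function names are hypothetical, and MSE is used as a stand-in task loss so that the example stays self-contained.

```python
def mse(a, b):
    """Mean squared error between two equally sized prediction lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def adaptation_losses(student_src_pred, src_labels,
                      student_tgt_pred, teacher_tgt_pred, task_loss_fn):
    """Return (task loss on the labeled source batch,
    consistency loss on the unlabeled target batch)."""
    task = task_loss_fn(student_src_pred, src_labels)
    consistency = mse(student_tgt_pred, teacher_tgt_pred)
    return task, consistency


# Toy flattened predictions; MSE stands in for the task loss here.
task, cons = adaptation_losses(
    student_src_pred=[0.9, 0.1], src_labels=[1.0, 0.0],
    student_tgt_pred=[0.6, 0.4], teacher_tgt_pred=[0.8, 0.2],
    task_loss_fn=mse,
)
```

Only the source half of the batch needs ground truth, while the target half contributes solely through the agreement between student and teacher.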
4.2 Adapting Mean Teacher for Segmentation Tasks
Both the original and adapted Mean Teacher versions for unsupervised domain adaptation rely on the cross-entropy classification cost. Considering we are not dealing with classification but with a segmentation task, we need to minimize a different loss function that takes into consideration the specificities of that task. Originally proposed in [Milletari et al. (2016)], the Dice loss generates reliable segmentation predictions due to its insensitivity to class imbalance:
$$\mathcal{L}_{dice} = -\frac{2\sum_{i} p_i\, g_i}{\sum_{i} p_i + \sum_{i} g_i} \tag{4}$$
where $p$ and $g$ are flattened predictions and ground truth values for an instance, respectively. Dice was kept as the task loss for both baseline and adaptation experiments. Note that the Dice loss is computed for the entire batch at once, unlike the typical strategy of averaging over samples when using cross-entropy, for instance.
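The difference between computing Dice over the whole batch and averaging it per sample can be seen in a small sketch. The exact form used here (negative sign, plain sums in the denominator, and a small `eps` for numerical safety) is an assumption chosen for illustration, as are all names.

```python
def dice_loss(pred, target, eps=1e-8):
    """Negative Dice coefficient over flattened values."""
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return -(2.0 * intersection + eps) / (denominator + eps)


# Two flattened samples: the first has foreground, the second is empty.
preds = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
masks = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

# Batch-level Dice: flatten every sample into one vector first.
flat_pred = [p for sample in preds for p in sample]
flat_mask = [g for sample in masks for g in sample]
batch_dice = dice_loss(flat_pred, flat_mask)

# Per-sample Dice averaged over the batch, shown only for contrast.
mean_dice = sum(dice_loss(p, g) for p, g in zip(preds, masks)) / len(preds)
```

In this toy batch the empty sample counts as a perfect match when scored on its own, so the per-sample average looks better than the batch-level value; computing Dice once over the flattened batch avoids that effect.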
A second problem when training the student and teacher models for segmentation tasks is the inconsistency introduced between their training samples when an affine transformation such as translation or rotation (or any other spatial-changing transformation) is applied with different parameters to the inputs of the teacher and student models. To solve that problem we used the same approach employed by [Perone & Cohen-Adad (2018a)], as shown in Figure 4. The augmentation in this case, a transformation of the input data governed by its transformation parameters (e.g., the rotation angle), is applied for the student model before feeding data into the model. For the teacher model, it is applied with the same parameters in a delayed fashion on the predictions of the teacher model, causing both predictions to be aligned for the consistency loss. This is possible because backpropagation takes place only for the student model; therefore, there is no need for differentiation through the delayed augmentation of the teacher model. An overview of the proposed method can be seen in Figure 3. Examples of images after data augmentation and their respective compensated ground truth are shown in Figure 5.
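The delayed augmentation can be sketched with a toy example. A horizontal flip stands in for the spatial transformation, and two pixel-wise scaling functions stand in for the student and teacher networks; all names and values are illustrative assumptions.

```python
def hflip(image):
    """Horizontally flip a 2D image stored as a list of rows."""
    return [list(reversed(row)) for row in image]


def student_model(image):
    # Toy pixel-wise "network": scales intensities into predictions.
    return [[0.5 * v for v in row] for row in image]


def teacher_model(image):
    # Toy teacher with slightly different weights.
    return [[0.4 * v for v in row] for row in image]


image = [[0.0, 1.0, 2.0],
         [3.0, 4.0, 5.0]]

# Student: augment the input, then predict.
student_pred = student_model(hflip(image))
# Teacher: predict on the original input, then apply the same
# augmentation to the prediction (the delayed step).
teacher_pred = hflip(teacher_model(image))

# Both predictions now share the flipped frame and can be compared.
consistency = sum(
    (s - t) ** 2
    for s_row, t_row in zip(student_pred, teacher_pred)
    for s, t in zip(s_row, t_row)
) / (len(image) * len(image[0]))
```

Because the flip on the teacher side is applied to an output that receives no gradient, it does not need to be differentiable.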
4.3 Model architecture employed
A U-Net [Ronneberger et al. (2015)] model architecture with 15 layers, group normalization [Wu & He (2018)] (discussed later), and dropout was employed for all experiments. Since U-Net is widely applied in biomedical imaging, we believe that our results would generalize to a wide spectrum of applications.
To produce a fair comparison, we followed the recommendations from [Oliver et al. (2018)] and kept the same model for the baseline and for our method, thus avoiding confounded comparisons. Even though the Mean Teacher method usually also acts as a regularizer of the model, we kept the same regularization weights for all comparisons. It is important to note, however, that the regularization can be adjusted to improve the results of the Mean Teacher even further.
4.4 Baseline employed
We conducted a hyperparameter search to find a good baseline model. This search yielded a mini-batch size of 12 and a dropout rate of 0.5. For training we used the Adam optimizer [Kingma & Ba (2015)] with penalty factor of and and . As for the learning rate, we used a sigmoid learning rate ramp-up until epoch 50 followed by a cosine ramp-down until epoch 350. Equation 5 shows the sigmoid ramp-up formula:
$$\eta(r) = \eta_{max}\,\exp\left(-5\,(1 - r)^2\right) \tag{5}$$
where $\eta_{max}$ is the highest learning rate and $r$ represents the ratio between the current epoch and the total ramp-up epochs. Equation 6 presents the cosine ramp-down:
$$\eta(r) = \eta_{max}\,\frac{\cos(\pi r) + 1}{2} \tag{6}$$
where $\eta_{max}$ is the highest learning rate and $r$ is the ratio between the number of epochs after the ramp-up procedure and the total number of epochs expected for the training.
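The learning rate schedule can be sketched as below. The functional forms follow the ramp functions commonly used with Mean Teacher implementations, namely `exp(-5 (1 - r)^2)` for the ramp-up and `0.5 (cos(pi r) + 1)` for the ramp-down; these forms, the constant 5, and the maximum learning rate of `1e-3` are assumptions made for illustration, not values confirmed by the text.

```python
import math


def sigmoid_rampup(ratio):
    """Ramp from exp(-5) up to 1 as ratio goes from 0 to 1 (assumed form)."""
    ratio = min(max(ratio, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - ratio) ** 2)


def cosine_rampdown(ratio):
    """Decay from 1 toward 0 as ratio goes from 0 to 1 (assumed form)."""
    ratio = min(max(ratio, 0.0), 1.0)
    return 0.5 * (math.cos(math.pi * ratio) + 1.0)


def learning_rate(epoch, lr_max, rampup_epochs=50, total_epochs=350):
    if epoch < rampup_epochs:
        return lr_max * sigmoid_rampup(epoch / rampup_epochs)
    # Ratio of epochs elapsed after the ramp-up to the total epoch count.
    return lr_max * cosine_rampdown((epoch - rampup_epochs) / total_epochs)


# lr_max is a hypothetical value chosen only for this example.
schedule = [learning_rate(e, lr_max=1e-3) for e in (0, 25, 50, 200, 350)]
```

The schedule peaks at epoch 50 and decays afterwards. With the ratio defined over the total number of epochs, the rate at epoch 350 is small but not exactly zero.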
For a fair comparison, no hyperparameter from the baseline model is changed in the adaptation scenario. The only changes are in parameters that affect only the domain adaptation aspects of training. This leads to an easier comparison and a realistic evaluation of the benefits of using domain adaptation.
4.5 Consistency loss
The consistency loss is one of the most important aspects of the Mean Teacher. If the measured difference between the predictions from the teacher and student is not representative for distilling the knowledge into the student model, the method will not work properly or training may even diverge. The originally proposed function for the Mean Teacher is the mean-squared error:
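$$\mathcal{L}_{mse} = \frac{1}{N}\sum_{i=1}^{N} \left(p_i - \tilde{p}_i\right)^2 \tag{7}$$
where $p$ and $\tilde{p}$ are flattened predictions from the student and teacher, respectively, and $N$ is the number of flattened values.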
Cross entropy is another important function that we investigated. It is more commonly used as the main task loss of classification tasks but as it aims at matching distributions, it could potentially improve the gradients for the student:
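$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N} \left[\tilde{p}_i \log p_i + (1 - \tilde{p}_i)\log(1 - p_i)\right] \tag{8}$$
where $p$ and $\tilde{p}$ are predictions from the student and teacher, respectively.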
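A cross-entropy consistency term with the teacher output as a soft target can be sketched as follows. The binary form, the clipping constant `eps`, and the function name are assumptions made for this illustration.

```python
import math


def soft_cross_entropy(student, teacher, eps=1e-7):
    """Binary cross entropy with teacher probabilities as soft targets."""
    total = 0.0
    for s, t in zip(student, teacher):
        s = min(max(s, eps), 1.0 - eps)  # keep log() finite
        total -= t * math.log(s) + (1.0 - t) * math.log(1.0 - s)
    return total / len(student)


teacher = [0.9, 0.2, 0.6]
matching = soft_cross_entropy(teacher, teacher)
mismatched = soft_cross_entropy([0.5, 0.5, 0.5], teacher)
```

The loss is lowest when the student reproduces the teacher, but with soft targets that minimum equals the entropy of the teacher predictions and is therefore not zero.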
We expanded the study for multiple loss functions in our scenario and we found several other losses that could potentially be used as consistency losses.
Our initial experiments led us to weighted variants of MSE, since they could mitigate the class imbalance problem. However, this approach relies on thresholding predictions from the teacher to define binary expected voxel values for the student. We found that defining both the correct weights and the threshold value was difficult and did not improve results.
The same problem persists with more complex losses such as the Focal Loss [Lin et al. (2018)] due to additional hyperparameters (in this case, $\alpha$ and $\gamma$).
We also tried two additional losses: the Dice Loss, already presented in Section 4, and the Tversky Loss [Salehi et al. (2017)]. The Tversky Loss is a variation of Dice that aims at mitigating the problem of class imbalance, common in medical imaging segmentation tasks:
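$$T(\alpha, \beta) = \frac{\sum_{i=1}^{N} p_{0i}\, g_{0i}}{\sum_{i=1}^{N} p_{0i}\, g_{0i} + \alpha \sum_{i=1}^{N} p_{0i}\, g_{1i} + \beta \sum_{i=1}^{N} p_{1i}\, g_{0i}} \tag{9}$$
where $p_{0i}$ and $p_{1i}$ represent the predicted probabilities of voxel $i$ belonging to the spinal cord and to any other tissue, respectively, and $g_{0i}$ and $g_{1i}$ are the corresponding target values. There are also additional $\alpha$ and $\beta$ hyperparameters that adjust for the class imbalance. This also leads to the same problem of correctly setting more hyperparameters alongside the consistency weight value.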
We later identified that both Dice and Tversky have a problem when used as consistency losses. Although they best represent the nature of the task and thus could potentially improve predictions, the values of the loss had an undesirable behavior. This is due to the design of the equations, as their main operation is based on multiplication while the target values are expected to be binary. Both Dice and Tversky tend to work when ground truth labels are binary, but in this case, in which we have soft labels from the teacher, the loss will not reach its minimum even when the teacher and the student outputs match identically.
Intuitively, given , . For example, when and , the numerator result is . Contrastively, when and , the score should increase, but instead goes to . This issue grows for increasingly lower values of and .
One way to overcome this issue is to threshold teacher predictions. This generates hard labels and the equation should work properly. However, we struggled to identify what a suitable threshold value should be, as it drastically impacts how the network adapts and it somewhat diminishes the benefits of using a distillation-based approach. An alternative would be to adapt the equation to properly handle soft labels (values between 0 and 1). This should maintain the results from the original equation when it works properly and provide sound values when the original fails. We leave such alternative equations for future work.
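The behavior with soft labels can be checked numerically with a small sketch. It assumes a Dice loss with a product in the numerator and plain sums in the denominator, returned with a negative sign; the function name and the toy values are illustrative.

```python
def dice_loss(pred, target, eps=1e-8):
    """Negative Dice coefficient over flattened values."""
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return -(2.0 * intersection + eps) / (denominator + eps)


hard = [1.0, 1.0, 0.0, 0.0]
soft = [0.5, 0.5, 0.5, 0.5]

# Student and teacher agree exactly in both cases.
loss_hard = dice_loss(hard, hard)
loss_soft = dice_loss(soft, soft)
```

Under these assumptions, identical binary maps reach the minimum of -1, whereas identical soft maps at 0.5 only reach about -0.5, even though the student matches the teacher perfectly.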
4.6 Batch normalization and group normalization in the domain adaptation scenario
Batch Normalization [Ioffe & Szegedy (2015)] is a method that aims to improve the training of deep neural networks by stabilizing the distribution of layer inputs. Batch Normalization is nowadays pervasive in most deep learning architectures, allowing the use of large learning rates and helping with the convergence of deep networks.
Initially thought to help with the internal covariate shift (ICS) [Ioffe & Szegedy (2015)], Batch Normalization was recently found [Santurkar et al. (2018)] to smooth the optimization landscape of the network, manifested by its improvement of the Lipschitzness, or smoothness [Santurkar et al. (2018)], of both the loss and gradients.
Batch Normalization works differently for training and inference. During training, the normalization uses the batch statistics, while at inference time it uses the population statistics, usually estimated with moving averages over each batch during the training procedure. This procedure, however, is problematic for domain adaptation using Mean Teacher, given that multiple distributions are fed during training, causing the Batch Normalization statistics to be computed using both source and target domain data.
One approach to solve that issue is to use different batch statistics for the source and target domains, as in AdaBN [Li et al. (2016)]. Implementing this approach in modern frameworks is easy during training because all it requires is forwarding the batch for each domain separately, as done in [French et al. (2017)]. However, this approach will still use both source and target domain data to compute the running averages used during inference. One must therefore also keep separate running means for the estimation of the population statistics, which increases the complexity of training, especially in a multi-GPU scenario with small batch sizes, which are very common in segmentation tasks and where synchronization is essential.
Besides the mentioned issues, Batch Normalization also suffers from suboptimal results when using small batch sizes [Wu & He (2018)], which are very common in segmentation tasks due to memory requirements. For these reasons, we used Group Normalization [Wu & He (2018)], an alternative to Batch Normalization that divides the channels into groups and computes the mean and variance within each group independently of the batch size. Group Normalization works consistently better than Batch Normalization with small batch sizes and does not keep running averages for the population statistics, simplifying the training and inference procedures and providing better results for our scenario involving domain adaptation and segmentation tasks.
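The core computation of Group Normalization for a single sample can be sketched in plain Python. This simplified version omits the learnable scale and shift parameters and assumes that the number of channels is divisible by `num_groups`; the data layout and names are illustrative.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize one sample's channels group by group.

    `channels` is a list of channels, each a flat list of activations.
    """
    size = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out


sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normalized = group_norm(sample, num_groups=2)
```

The statistics depend only on the sample being normalized, so nothing is shared across the batch or carried between training and inference, which is why source and target images cannot contaminate each other's normalization.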
4.7 Hyperparameters for unsupervised domain adaptation
A problem usually faced by many techniques for unsupervised domain adaptation is hyperparameter selection (such as the learning rate, consistency weight, etc.). In an unsupervised setting, there are no labeled data from the target domain, so hyperparameters estimated from the source distribution alone can be completely different from those that would be estimated from the target distribution.
An alternative that addresses this issue is reverse cross-validation [Zhong et al. (2010)], as also used by [Ganin et al. (2016)]. However, this technique adds more complexity to the validation process. We found that estimating the hyperparameters of the Mean Teacher approach on the source domain yielded robust results. This is nevertheless a limitation of our evaluation, as one could achieve better results for our proposed method by incorporating a better hyperparameter estimation procedure.
5 Materials
This section describes the dataset used in this work.
5.1 Spinal Cord Gray Matter Challenge dataset
The Spinal Cord Gray Matter Challenge [Prados et al. (2017)] dataset is a multi-center, multi-vendor, and publicly available MRI data collection comprising 80 healthy subjects, with 20 subjects from each center.
The demographics of the dataset ranges from a mean age of 28.3 up to 44.3 years old. Three different MRI systems were employed (Philips Achieva, Siemens Trio, Siemens Skyra) with different acquisition parameters. The voxel size resolution of the dataset ranges from 0.25x0.25x2.5 mm up to 0.5x0.5x5.0 mm. The dataset is split between training (40) and test (40) with the test set labels hidden (not publicly available). For each labeled slice in the dataset, 4 gold standard segmentation masks were manually created by 4 independent expert raters (one per participating center). For more information regarding dataset, such as the MRI parameters, please refer to the work from [Prados . (2017)].
Due to the fact that the Spinal Cord Gray Matter Challenge dataset contains data from all 4 centers both in training as well as in the test, we used a nonstandard split of the data in order to evaluate our technique on a domain adaptation scenario where the domain present in the test set didn’t contain contamination from the training data domain. Therefore, we used the centers 1 and 2 as the training set, the center 3 as the validation set and the center 4 as the test set.
We used the unlabeled data from the challenge center 4 test set (that doesn’t originally contains publiclyavailable labels) as the unlabeled data for the target domain and used the center 4 training data (with labels) as the test set to evaluate the final performance of our model. We also resampled all samples of the dataset to a common space of 0.25x0.25 mm.
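As a small worked example of the resampling step, the sketch below computes the in-plane shape a slice would have after resampling to 0.25x0.25 mm. The function name, the rounding rule and the example slice size are assumptions for illustration; the interpolation itself is not shown and the actual pipeline may differ.

```python
def resampled_shape(shape, spacing, target_spacing=(0.25, 0.25)):
    """Return the in-plane shape (pixels) after resampling to target_spacing (mm).

    The physical extent (pixels * spacing) is preserved up to rounding.
    """
    return tuple(
        max(1, round(n * s / t))
        for n, s, t in zip(shape, spacing, target_spacing)
    )


# Hypothetical slice of 100 x 100 pixels acquired at 0.5 x 0.5 mm.
coarse = resampled_shape((100, 100), (0.5, 0.5))
# A slice already at the target resolution keeps its shape.
native = resampled_shape((100, 100), (0.25, 0.25))
print(coarse, native)
```

Slices acquired at the coarsest in-plane resolution of the dataset (0.5 mm) double in size along each axis, while slices already at 0.25 mm are left unchanged.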
An overview of the dataset split is graphically shown in Figure 6.
6 Experiments
We designed several experiments to understand the behavior of different aspects of domain adaptation in the medical imaging domain. We also performed ablation studies and evaluated multiple metrics for each center.
6.1 Adapting on different centers
We maintained supervised training on both centers 1 and 2. We then adapted the network to centers 3 and 4 separately. This way, we can observe three related but independent aspects of adaptation and semi-supervised learning:
1. How the network changes its predictions on images from the source domain as images from different domains are presented;
2. How the network adapts its predictions for the adapted domain after the domain adaptation;
3. How an adapted network generalizes when presented with images that were used during training neither as a supervised signal nor as an unsupervised adaptation component.
The results are presented in Table 1. We now aim to better understand this behavior by individually answering the proposed questions.
| Evaluation | Adaptation | Dice | mIoU | Recall | Precision | Specificity | Hausdorff |
|---|---|---|---|---|---|---|---|
| Center 3 | Baseline | 82.81 ± 0.33 | 71.05 ± 0.36 | 90.61 ± 0.63 | 77.09 ± 0.34 | 99.86 ± 0.0 | 2.14 ± 0.02 |
| Center 3 | Center 3 | 84.72 ± 0.18 | 73.67 ± 0.28 | 87.43 ± 1.90 | 83.17 ± 1.62 | 99.91 ± 0.01 | 2.01 ± 0.03 |
| Center 3 | Center 4 | 84.45 ± 0.14 | 73.30 ± 0.19 | 87.13 ± 1.77 | 82.92 ± 1.76 | 99.91 ± 0.01 | 2.02 ± 0.03 |
| Center 4 | Baseline | 69.41 ± 0.27 | 53.89 ± 0.31 | 97.22 ± 0.11 | 54.95 ± 0.35 | 99.70 ± 0.00 | 2.50 ± 0.01 |
| Center 4 | Center 3 | 73.27 ± 1.29 | 58.50 ± 1.57 | 94.92 ± 1.48 | 60.93 ± 2.51 | 99.77 ± 0.03 | 2.36 ± 0.06 |
| Center 4 | Center 4 | 74.67 ± 1.03 | 60.22 ± 1.24 | 93.33 ± 1.96 | 63.62 ± 2.42 | 99.80 ± 0.02 | 2.29 ± 0.05 |

Table 1: Evaluation results on different centers. The evaluation and adaptation columns represent, respectively, the centers where testing and adaptation data were collected. The numeric results show the mean and standard deviation over 10 runs (with independent random weight initializations). Highlighted values represent the best at each center. All experiments were trained on both centers 1 and 2 simultaneously. Dice represents the Sørensen–Dice coefficient, mIoU represents the mean Intersection over Union. Other metrics are self-explanatory.
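For reference, the overlap metrics reported in the table can be computed from a binary prediction and a binary gold standard mask as in the sketch below. This is an illustrative sketch rather than the evaluation code of this work: masks are assumed to be flat lists of 0/1 values, empty denominators are mapped to 0.0 by convention, the IoU is computed for the foreground class only (any averaging behind mIoU is not shown), and the Hausdorff distance is omitted.

```python
def _ratio(numerator, denominator):
    # Convention assumed here: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(pred, truth):
    """Compute overlap metrics for two binary masks given as flat 0/1 lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return {
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "recall": _ratio(tp, tp + fn),
        "precision": _ratio(tp, tp + fp),
        "specificity": _ratio(tn, tn + fp),
    }


# Hypothetical 8-pixel masks: 2 true positives, 1 false positive,
# 1 false negative and 4 true negatives.
pred_mask = [1, 1, 1, 0, 0, 0, 0, 0]
true_mask = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred_mask, true_mask)
print(metrics)
```

The functions return fractions in [0, 1], whereas the table reports the same quantities as percentages.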
For Question 1, we can observe the evaluation on centers 1 and 2. Both centers are included in the training set, and we want to observe whether additional unsupervised data from different domains (centers 3 or 4) could improve generalization on the original centers (1 and 2). For both adapted centers (3 and 4), the results are better than the baseline in almost all metrics, except for recall, meaning that there is a positive change in prediction performance for the source domain after domain adaptation on different domains with unlabeled data.
Question 2 can be investigated by interpreting results from the rows with evaluation center and adaptation center both on center 3, and evaluation center and adaptation center both on center 4. Both rows present the highest values in almost every metric, except for recall. This means that domain adaptation is working properly for the given scenario.
Question 3 can also be answered from the results. By observing evaluation center 3 with adaptation on center 4, and evaluation center 4 with adaptation on center 3, we see gains over the baseline in almost all metrics. We can infer that domain adaptation in this case is helping generalization to unseen centers.
6.2 Different consistency losses
We executed multiple runs of the Mean teacher algorithm to determine which consistency losses work best. We focused only on losses that do not contain any additional hyperparameters. For example, the Tversky Loss [Salehi . (2017)] follows an approach very similar to the Dice Loss, but with two additional hyperparameters (α and β). We therefore excluded it to keep the comparison among loss functions fair, as a fair comparison would require much more computational time to combine both hyperparameters with the consistency weight values.
These constraints limit us to cross entropy, mean-squared error, and Dice loss, already described in Section 4. We believe, however, that this is an important aspect of proper domain adaptation and thus leave further investigation of other losses for future work.
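The three candidate consistency losses can be sketched as functions that compare the student and teacher predictions for the same unlabeled input. This is a minimal illustration under assumptions: predictions are flat lists of foreground probabilities, the function names and `eps` smoothing are hypothetical, and the exact formulations used in this work are the ones given in Section 4, which may differ in detail.

```python
import math


def mse_consistency(student, teacher):
    """Mean-squared error between student and teacher probabilities."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def cross_entropy_consistency(student, teacher, eps=1e-7):
    """Binary cross entropy using teacher probabilities as soft targets."""
    total = 0.0
    for s, t in zip(student, teacher):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(s) + (1.0 - t) * math.log(1.0 - s)
    return total / len(student)


def dice_consistency(student, teacher, eps=1e-7):
    """Soft Dice loss (1 - Dice coefficient) between the two predictions."""
    intersection = sum(s * t for s, t in zip(student, teacher))
    return 1.0 - (2.0 * intersection + eps) / (sum(student) + sum(teacher) + eps)


# Hypothetical predictions for four pixels.
teacher_pred = [0.9, 0.8, 0.1, 0.0]
student_pred = [0.7, 0.9, 0.2, 0.1]
for loss in (mse_consistency, cross_entropy_consistency, dice_consistency):
    print(loss.__name__, loss(student_pred, teacher_pred))
```

None of these functions introduces a hyperparameter beyond the consistency weight that multiplies the loss, which is the selection criterion stated above.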
6.3 Behavior of Dice predictions and thresholding
It is known that networks trained with the Dice loss usually produce predictions whose distribution is concentrated near the upper and lower bounds of the probability range, with very low entropy. As in [Perone CohenAdad (2018b)], we also used a high threshold value of 0.99 for the Dice predictions in order to produce a balanced model.
We found, however, that our domain adaptation method also regularizes the network predictions, shifting the Dice probability distribution away from the probability bounds. For that reason, we fixed the Dice prediction threshold at a lower value of 0.9 instead of 0.99, which produced a more balanced model in terms of precision and recall.
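The effect of the two thresholds can be illustrated with a toy example. The probabilities below are hypothetical and the use of `>=` (rather than `>`) at the boundary is an assumption; the sketch only shows how predictions that have moved away from the upper bound are lost at 0.99 but kept at 0.9.

```python
def binarize(probabilities, threshold):
    """Turn foreground probabilities into a binary mask."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predictions: regularization has pulled some foreground
# pixels slightly away from 1.0.
probs = [0.999, 0.95, 0.91, 0.50, 0.02]

strict_mask = binarize(probs, 0.99)
relaxed_mask = binarize(probs, 0.9)
print(strict_mask, relaxed_mask)
```

With the stricter threshold only the first pixel is kept as foreground, which would favor precision over recall for predictions distributed like these.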
6.4 Training stability
For the task of unsupervised domain adaptation, it is important that training remains reasonably stable. Since in the most difficult scenarios there are no annotations for validating the adaptation, unstable training might produce suboptimal adaptation results.
To evaluate training stability, we ran different consistency weights for each candidate consistency loss and evaluated the difference between the best values found during training and the final results after 350 epochs. Table 2 summarizes the results of these experiments.
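The stability measure described above reduces to comparing the best value reached during training with the value at the last epoch. A minimal sketch, assuming a per-epoch list of validation Dice scores (the function name and the example curve are hypothetical):

```python
def best_and_final(dice_per_epoch):
    """Return (final, best, gap) for a per-epoch metric where higher is better."""
    best = max(dice_per_epoch)
    final = dice_per_epoch[-1]
    return final, best, best - final


# Hypothetical curve of a run that peaks and then diverges.
diverging_run = [10.0, 60.0, 85.5, 40.0, 0.0]
# Hypothetical curve of a run that ends at its best value.
stable_run = [50.0, 70.0, 80.0, 84.0, 84.5]

print(best_and_final(diverging_run))
print(best_and_final(stable_run))
```

A large gap indicates that the final model cannot be trusted without target-domain annotations for early stopping, whereas a gap close to zero means the final checkpoint is as good as the best one.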
We can observe that cross entropy consistently fails for different weight values, but also achieves high Dice values in its best scenario during training. Cross entropy thus becomes a possible alternative to MSE when a few annotated images for validation are available in the target domain. Figure 7 shows how the training diverges for the cross entropy loss after several iterations.
We can observe that both Dice and cross entropy have trouble stabilizing the training after achieving high results. However, MSE tends to be more invariant to consistency weight, thus being a robust approach when no annotated data is available at the target center.
As in [French . (2017)], we also experimented with confidence thresholding; however, we found no improvements by doing so.
| Loss | Weight | Dice | mIoU | Recall | Precision | Specificity | Hausdorff |
|---|---|---|---|---|---|---|---|
| CE | 5 | 0.00 (85.50) | 0.00 (74.91) | 0.00 (95.01) | 0.00 (98.90) | 100.0 (100.00) | 0.00 (0.00) |
| CE | 10 | 0.00 (80.73) | 0.00 (69.54) | 0.00 (83.21) | 0.00 (98.78) | 100.0 (100.00) | 0.00 (0.00) |
| CE | 15 | 6.43 (37.03) | 4.89 (26.06) | 5.38 (77.05) | 17.34 (65.85) | 100.0 (100.00) | 0.28 (0.00) |
| CE | 20 | 2.30 (67.61) | 1.86 (52.55) | 2.09 (65.00) | 7.94 (96.57) | 100.0 (100.00) | 0.12 (0.03) |
| Dice | 5 | 76.76 (80.74) | 62.76 (68.16) | 97.88 (99.66) | 63.72 (72.50) | 99.71 (99.81) | 2.36 (2.16) |
| Dice | 10 | 4.77 (10.55) | 2.45 (5.64) | 96.25 (99.99) | 2.45 (5.85) | 79.59 (99.75) | 8.80 (2.57) |
| Dice | 15 | 2.30 (7.74) | 1.16 (4.12) | 99.95 (100.00) | 1.16 (4.62) | 55.07 (99.80) | 11.75 (2.50) |
| Dice | 20 | 1.79 (4.43) | 0.90 (2.27) | 99.99 (100.00) | 0.90 (2.30) | 42.02 (99.84) | 12.68 (2.43) |
| MSE | 5 | 83.7 (83.88) | 72.2 (72.46) | 91.24 (98.19) | 78.1 (78.57) | 99.87 (99.93) | 2.1 (2.00) |
| MSE | 10 | 84.38 (84.38) | 73.19 (73.19) | 90.15 (99.07) | 80.12 (80.12) | 99.88 (99.94) | 2.05 (1.89) |
| MSE | 15 | 84.59 (84.59) | 73.49 (73.50) | 89.19 (98.52) | 81.28 (81.28) | 99.89 (99.89) | 2.03 (2.03) |
| MSE | 20 | 84.5 (84.50) | 73.36 (73.37) | 90.36 (94.63) | 80.16 (80.16) | 99.88 (99.98) | 2.05 (1.46) |
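
| Evaluation | Version | Dice | mIoU | Recall | Precision | Specificity | Hausdorff |
|---|---|---|---|---|---|---|---|
| Center 3 | Baseline | 83.06 | 71.36 | 90.98 | 77.24 | 99.86 | 2.13 |
| Center 3 | EMA | 83.09 | 71.40 | 90.97 | 77.30 | 99.86 | 2.13 |
| Center 4 | Baseline | 69.41 | 53.90 | 97.20 | 54.98 | 99.70 | 2.48 |
| Center 4 | EMA | 69.50 | 54.00 | 97.19 | 55.09 | 99.71 | 2.48 |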
7 Ablation Experiments
This section describes the ablation experiments performed to rule out evidence of external improvement factors.
7.1 Exponential moving average (EMA)
The improvement seen in Table 1 could also be explained by the introduction of the exponential moving average (EMA) during the training procedure, which averages and smooths the SGD trajectory. However, we want to establish that the improvement comes from the unlabeled data and not only from the exponential average component. Therefore, we performed an ablation experiment by leaving the EMA active and setting the consistency weight to zero, thereby evaluating the impact of the exponential average without the unlabeled data used to enforce consistency.
We ran the same experimental setting as in Table 1 but with the consistency weight set to zero. The results are presented in Table 3 and show that the EMA model (teacher) presents no gains over the non-averaged model (the supervised baseline). This is arguably due to a poorly chosen EMA decay hyperparameter; however, the Mean teacher, which heavily relies on the EMA model, was able to outperform purely supervised methods by a large margin.
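The ablation can be summarized with a minimal sketch of the two ingredients involved: the EMA update of the teacher weights and the weighted sum of the supervised and consistency losses. The names `ema_update`, `total_loss` and `alpha`, as well as the decay value 0.9, are assumptions chosen for illustration and are not taken from the experimental code.

```python
def ema_update(teacher_weights, student_weights, alpha):
    """Move the teacher weights towards the student weights with decay alpha."""
    return [
        alpha * t + (1.0 - alpha) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


def total_loss(task_loss, consistency_loss, consistency_weight):
    """Supervised loss plus the weighted consistency term."""
    return task_loss + consistency_weight * consistency_loss


# Hypothetical weights: the teacher still tracks the student through the EMA.
teacher = ema_update([0.0, 0.0], [1.0, 2.0], alpha=0.9)

# Ablation setting: with a zero consistency weight the unlabeled data cannot
# influence the objective, whatever the consistency loss value is.
ablation_loss = total_loss(task_loss=0.5, consistency_loss=3.0, consistency_weight=0.0)
print(teacher, ablation_loss)
```

In this setting the teacher is just a smoothed copy of a purely supervised student, which isolates the contribution of the averaging from that of the unlabeled data.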
8 Visualizing domain shift
We investigate how domain adaptation affects the prediction space of segmentation at different centers. Using the t-SNE [Maaten Hinton (2008)] method, we are able to see the changes in the network’s perception of unsupervised data. The data shown in the following charts was not presented during training.
We ran two baselines for this experiment. The first model was trained in a supervised manner following the same hyperparameters presented in Section 4.4. The other is an adaptation scenario where both centers 1 and 2 were used as supervised centers and center 3 as the adaptation target. The vectors projected with t-SNE represent the features from the network before the final sigmoid activation. This leads to an easier separation because the predictions are not squashed between 0 and 1. Furthermore, projecting the predictions leads to a fairly simple interpretation of how the network’s output deals with unseen or unsupervised data.
Both t-SNE executions had the learning rate set to 10 and the perplexity set to 30, and ran for about 1000 iterations (we used the TensorBoard embedding projector, available at https://github.com/tensorflow/tensorboard). We noted that more iterations preserved the group structure but further compressed the groups. This made the centers hard to visualize, so we chose 1000 iterations as a good trade-off between identifying emerging groups and interpretability.
The results from the supervised experiment are shown in Figure 7(a). As one can observe, there is a big separation between data from centers presented during training (1 and 2) and centers unseen (3 and 4). This shows that the network predictions differ severely according to the center to which the instance belongs.
When adapting the network with unlabeled examples from a different domain, the predictions become more diffuse, at least for centers presented during training. The results from the unsupervised adaptation experiment are shown in Figure 7(b). In this scenario, centers presented with labeled data (centers 1 and 2) form clusters with domains seen only in an unsupervised fashion (3) or not presented to the network for training in any way (4). A possible explanation for the clusters forming between centers 1 and 4, and between centers 2 and 3, is how close their voxel intensity distributions are. The original voxel distribution can be seen in Figure 2.
We further address this possibility in Figure 9. It can be seen that points in prediction space tend to get closer according to their pixel distribution. This does not hold in the supervised version presented in Figure 7(a). We believe the reason is that by being presented with images, even in an unsupervised fashion, the network was able to better map the data manifold. This can lead to better generalization and thus more controlled and trustworthy predictions in unseen centers. It also opens room for exploring error estimation based directly on the input distribution. We leave such possibilities for future work.
9 Conclusion and limitations
Variability in many medical imaging modalities is still a big problem for machine learning methods. The different parametrizations that can be used to acquire imaging data and the lack of standardized protocols and industry standards are pervasive across the entire field.
In this work, we showed that unsupervised domain adaptation can indeed be an opportunity to increase the performance of these models for medical imaging at multiple centers without depending on annotations, an expensive resource to acquire in medical imaging.
Through both the evaluation of multiple metrics and data analysis, we showed how self-ensembling methods can improve generalization on unseen domains by leveraging unlabeled data from different domains. In the ablation experiment, we observed how Mean teacher is able to leverage the unlabeled data even without an exhaustive search of the hyperparameter space.
We also observed how the cross entropy loss, when used as the consistency loss function, failed to maintain training stability, with training diverging as the number of epochs increased. We discussed how this can lead to potential problems in more challenging scenarios for multiple centers. We also showed the issues that arise when using the Dice loss as the consistency loss.
We are also aware of the limitations of the present study, as we do not evaluate adversarial training methods for domain adaptation. Even considering that the Mean teacher is currently the state-of-the-art method on many datasets, we believe that further evaluations of it on the same realistic small data regime could further increase the significance of our contributions and thus we leave this for future work.
Another limitation is the single task evaluation of the gray matter segmentation, which can be expanded to other tasks in other domains. Increasing the number of centers alongside the number of tasks to show whether the pattern found in this study persists in other scenarios is an important missing contribution.
Further work in the field could pave the way for methods that can measure the risk of adaptation to particular centers or domains. This is an important step towards understanding the limitations of domain adaptation methods. We believe that the problems that arise from the variability of medical imaging modalities require rethinking the strong assumptions that machine learning methods make; however, this poses complex difficulties due to its foundational nature.
One important step towards assessing the limitations of current methods is to reinforce the importance of proper multi-domain evaluation in studies and medical imaging challenges, which rarely provide a test set from different domains containing the realistic variability found in real scenarios.
10 Source code and dataset availability
In the spirit of Open Science and reproducibility, the source code used to reproduce the experiments and replicate the results of this work can be found at our repository (https://github.com/neuropoly/domainadaptation).
The dataset used for this work is also available on the Spinal Cord Gray Matter Segmentation Challenge website upon a data license agreement (http://cmictig.cs.ucl.ac.uk/niftyweb/program.php?p=CHALLENGE).
11 Acknowledgments
Funded by the Canada Research Chair in Quantitative Magnetic Resonance Imaging [950230815], the Canadian Institute of Health Research [CIHR FDN143263], the Canada Foundation for Innovation [32454, 34824], the Fonds de Recherche du Québec  Santé [28826], the Fonds de Recherche du Québec  Nature et Technologies [2015PR182754], the Natural Sciences and Engineering Research Council of Canada [4358972013], the Canada First Research Excellence Fund (IVADO and TransMedTech) and the Quebec BioImaging Network [5886]. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nivel Superior – Brasil (CAPES) – Finance Code 001.
References
AlBadawy, EA., Saha, A. & Mazurowski, MA. 2018. Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing. Medical Physics, 45(3), 1150–1158.
Cao, J., Katzir, O., Jiang, P., Lischinski, D., Cohen-Or, D., Tu, C. & Li, Y. 2018. DiDA: Disentangled Synthesis for Domain Adaptation. arXiv preprint arXiv:1805.08019.
Chen, C., Dou, Q., Chen, H. & Heng, PA. 2018. Semantic-Aware Generative Adversarial Nets for Unsupervised Domain Adaptation in Chest X-ray Segmentation.
Coupé, P., Manjón, JV., Fonov, V., Pruessner, J., Robles, M. & Collins, DL. 2011. Patch-based segmentation using expert priors: Application to hippocampus and ventricle segmentation. NeuroImage, 54(2), 940–954.
Dou, Q., Ouyang, C., Chen, C., Chen, H. & Heng, PA. 2018. Unsupervised Cross-Modality Domain Adaptation of ConvNets for Biomedical Image Segmentations with Adversarial Loss.
French, G., Mackiewicz, M. & Fisher, M. 2017. Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., ... Lempitsky, V. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1), 2096–2030.
Ghifary, M., Kleijn, WB., Zhang, M., Balduzzi, D. & Li, W. 2016. Deep reconstruction-classification networks for unsupervised domain adaptation. European Conference on Computer Vision (pp. 597–613).
Gros, C., De Leener, B., Badji, A., Maranzano, J., Eden, D., Dupont, SM., ... Cohen-Adad, J. 2018 (May). Automatic segmentation of the spinal cord and intramedullary multiple sclerosis lesions with convolutional neural networks.
He, K., Zhang, X., Ren, S. & Sun, J. 2016. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision, 1118Dece, 1026–1034. 10.1109/ICCV.2015.123
Hinton, G., Vinyals, O. & Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Hoffman, J., Tzeng, E., Park, T., Zhu, JY., Isola, P., Saenko, K., ... Darrell, T. 2017. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213.
Hou, L., Samaras, D., Kurc, TM., Gao, Y., Davis, JE. & Saltz, JH. 2016. Patch-based convolutional neural network for whole slide tissue image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2424–2433).
Ioffe, S. & Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning, Volume 37 (pp. 448–456). JMLR.org. 10.1007/s1339801401737.2
Javanmardi, M. & Tasdizen, T. 2018. Domain Adaptation for Biomedical Image Segmentation Using Adversarial Training. ISBI, 554–558.
Kamnitsas, K., Baumgartner, C., Ledig, C., Newcombe, V., Simpson, J., Kane, A. and others. 2017. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. International Conference on Information Processing in Medical Imaging (pp. 597–609).
Kingma, DP. & Ba, JL. 2015. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations 2015, 1–15. http://doi.acm.org.ezproxy.lib.ucf.edu/10.1145/1830483.1830503
Lafarge, MW., Pluim, JP., Eppenhof, KA., Moeskops, P. & Veta, M. 2017. Domain-adversarial neural networks to address the appearance variability of histopathology images. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10553 LNCS, 83–91. 10.1007/9783319675589_10
Lai, M. 2015. Deep learning for medical image segmentation. arXiv preprint arXiv:1505.02000.
Laine, S. & Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
LeCun, Y., Bengio, Y. & Hinton, G. 2015. Deep learning. Nature, 521(7553), 436–444. 10.1038/nature14539
Li, Y., Wang, N., Shi, J., Liu, J. & Hou, X. 2016. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779.
Lin, TY., Goyal, P., Girshick, R., He, K. & Dollár, P. 2018. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Litjens, G., Kooi, T., Bejnordi, BE., Setio, AAA., Ciompi, F., Ghafoorian, M., ... Sánchez, CI. 2017. A survey on deep learning in medical image analysis. Medical Image Analysis, 42, 60–88. 10.1016/j.media.2017.07.005
Liu, YC., Yeh, YY., Fu, TC., Wang, SD., Chiu, WC. & Wang, YCF. 2018. Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation. Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018.
Long, J., Shelhamer, E. & Darrell, T. 2015. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431–3440).
Maaten, Lvd. & Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.
Madani, A., Moradi, M., Karargyris, A. & Syeda-Mahmood, T. 2018. Semi-supervised learning with generative adversarial networks for chest X-ray classification with ability of data domain adaptation. IEEE 15th Symposium on Biomedical Imaging (ISBI), 1038–1042. 10.1109/ISBI.2018.8363749
Mahmood, F., Chen, R. & Durr, NJ. 2018. Unsupervised Reverse Domain Adaptation for Synthetic Medical Images via Adversarial Training. IEEE Transactions on Medical Imaging, PP(c), 1. 10.1109/TMI.2018.2842767
Milletari, F., Navab, N. & Ahmadi, SA. 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. 3D Vision (3DV), 2016 Fourth International Conference on (pp. 565–571).
Neyshabur, B., Bhojanapalli, S., McAllester, D. & Srebro, N. 2017. Exploring generalization in deep learning. Advances in Neural Information Processing Systems (pp. 5947–5956).
Odena, A., Oliver, A., Raffel, C., Cubuk, ED. & Goodfellow, I. 2018. Realistic Evaluation of Semi-Supervised Learning Algorithms.
Oliver, AGB., Odena, AGB., Raffel, CGB., Cubuk, EGB. & Goodfellow, IJGB. 2018. Realistic Evaluation of Semi-Supervised Learning Algorithms. International Conference on Learning Representations, 1–15.
Perone, CS. & Cohen-Adad, J. 2018a (Sep). Deep semi-supervised segmentation with weight-averaged consistency targets. DLMIA MICCAI, 1–8. 10.1007/9783030008895
Perone, CS. & Cohen-Adad, J. 2018b. Spinal cord gray matter segmentation using deep dilated convolutions. Nature Scientific Reports, 8(1). 10.1038/s41598018243043
Polyak, BT. & Juditsky, AB. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.
Prados, F., Ashburner, J., Blaiotta, C., Brosch, T., Carballido-Gamio, J., Cardoso, MJ., ... Cohen-Adad, J. 2017. Spinal cord grey matter segmentation challenge. NeuroImage, 152, 312–329. 10.1016/j.neuroimage.2017.03.010
Rajpurkar, P., Hannun, AY., Haghpanahi, M., Bourn, C. & Ng, AY. 2017. Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks. arXiv preprint.
Ronneberger, O., Fischer, P. & Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. 1–8. 10.1007/9783319245744_28
Ruppert, D. 1988. Efficient estimations from a slowly convergent Robbins-Monro process. Cornell University Operations Research and Industrial Engineering.
Salehi, SSM., Erdogmus, D. & Gholipour, A. 2017. Tversky loss function for image segmentation using 3D fully convolutional deep networks. International Workshop on Machine Learning in Medical Imaging (pp. 379–387).
Sankaranarayanan, S., Balaji, Y., Castillo, CD. & Chellappa, R. 2018. Generate To Adapt: Aligning Domains using Generative Adversarial Networks. Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018. 10.1109/CVPR.2017.316
Santurkar, S., Tsipras, D., Ilyas, A. & Madry, A. 2018. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift).
Sun, B. & Saenko, K. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. European Conference on Computer Vision (pp. 443–450).
Tarvainen, A. & Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems (pp. 1195–1204).
Tzeng, E., Hoffman, J., Zhang, N., Saenko, K. & Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
Wang, M. & Deng, W. 2018. Deep Visual Domain Adaptation: A Survey. Neurocomputing.
Wu, Y. & He, K. 2018. Group Normalization.
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. 2014. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27 (Proceedings of NIPS), 27, 1–9.
Zamir, AR., Sax, A., Shen, W., Guibas, LJ., Malik, J. & Savarese, S. 2018 (June). Taskonomy: Disentangling Task Transfer Learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zech, JR., Badgeley, MA., Liu, M., Costa, AB., Titano, JJ. & Oermann, EK. 2018. Confounding variables can degrade generalization performance of radiological deep learning models.
Zhang, Y., Miao, S., Mansi, T. & Liao, R. 2018. Task Driven Generative Modeling for Unsupervised Domain Adaptation: Application to X-ray Image Segmentation. 2, 1–9. 10.1007/9783030009342_67
Zhong, E., Fan, W., Yang, Q., Verscheure, O. & Ren, J. 2010. Cross validation framework to choose amongst models and datasets for transfer learning. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (6323 LNAI, 547–562). 10.1007/9783642159398
Zhu, JY., Park, T., Isola, P. & Efros, AA. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint.