Many eye diseases can be revealed by the morphology of the optic disc (OD) and optic cup (OC). For instance, glaucoma is usually characterized by a large cup-to-disc ratio (CDR), the ratio of the vertical diameter of the cup to the vertical diameter of the disc. Currently, CDR is mainly determined by pathology specialists. However, accurate CDR calculation by human experts is extremely expensive. Furthermore, manual delineation of these structures introduces subjectivity as well as intra- and inter-observer variability. Therefore, it is essential to automate the calculation of CDR. OD and OC segmentation is commonly adopted to calculate the CDR automatically. Nevertheless, both OD and OC segmentation are challenging because of pathological lesions on the boundaries and regions that overlap with blood vessels.
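To make the CDR definition concrete, the following is a minimal sketch of how the ratio could be computed from two binary segmentation masks. The function names and the toy masks are illustrative assumptions and are not taken from the original work; masks are plain nested lists to keep the example self-contained.

```python
def vertical_diameter(mask):
    """Rows spanned between the top-most and bottom-most foreground pixel."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    return rows[-1] - rows[0] + 1 if rows else 0


def cup_to_disc_ratio(cup_mask, disc_mask):
    """Vertical cup diameter divided by vertical disc diameter."""
    disc = vertical_diameter(disc_mask)
    if disc == 0:
        raise ValueError("disc mask is empty")
    return vertical_diameter(cup_mask) / disc


# Toy masks (hypothetical): the disc spans rows 1..4, the cup spans rows 2..3.
disc_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
cup_mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
cdr = cup_to_disc_ratio(cup_mask, disc_mask)  # 2 rows / 4 rows
```

Because the ratio depends only on the vertical extents of the two masks, small boundary errors in either segmentation translate directly into CDR errors, which is why accurate OD and OC boundaries matter.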
Recently, deep learning based methods have been proposed to overcome these challenges, and some of them, e.g., M-Net, have demonstrated impressive results. Although these methods tend to perform well on well-annotated datasets, the segmentation performance of a trained network may degrade severely on datasets with different distributions, particularly for retinal fundus images captured with different imaging devices (e.g., different cameras, as illustrated in Fig. 1). The variance among data domains limits the deployment of deep learning in practice and impedes building a robust application for retinal fundus image parsing.
To tackle this challenge, existing works have mainly focused on minimizing the distance between the source and target domains to align the latent feature distributions of the different domains. However, adversarial discriminative learning usually suffers from training instability. Numerous methods have been studied to address this instability. Self-ensembling is one of them and has recently been applied to visual domain adaptation. In particular, gradient descent is used to train the student network, and the exponential moving average of the student network's weights is transferred to the teacher network after each training sample is processed. The mean square difference between the outputs of the student and the teacher is used as the unsupervised loss to train the student network.
In this paper, we propose a novel unsupervised domain adaptation framework, called Collaborative Feature Ensembling Adaptation (CFEA), to further overcome the challenges underlying domain shift. In particular, we take advantage of self-ensembling, which applies a time-dependent weighting to the unsupervised loss of each unlabeled sample, to stabilize adversarial discriminative learning. Most importantly, we apply the unsupervised adversarial loss not only to the output space but also to the input space or the intermediate representations of the network. Thus, from a complementary perspective, adversarial learning consistently provides varied model spaces and time-dependent weights to self-ensembling, which accelerates the learning of domain-invariant features and further stabilizes adversarial learning, forming a virtuous collaborative cycle within a unified framework.
The significant contributions of this paper are: (a) We propose CFEA, a novel unsupervised domain adaptation framework that exploits collaborative adversarial learning and self-ensembling for feature adaptation to tackle domain shift in a mutually beneficial and complementary manner, thus leading to a robust and accurate model. (b) We intensify feature adaptation by applying adversarial discriminative learning in two phases of the network, i.e., the intermediate representation space and the output space. (c) We evaluate the effectiveness of CFEA on the challenging task of unsupervised joint segmentation of the retinal OD and OC. Our CFEA model overcomes the performance degradation due to domain shift and outperforms the state-of-the-art methods.
2 Collaborative Feature Ensembling Adaptation
2.1 Problem Formulation
Unsupervised domain adaptation typically refers to the following scenario: given a labeled source domain dataset $X_s$ with distribution $P(X_s)$ and the corresponding label $Y_s$ with distribution $P(Y_s|X_s)$, as well as a target dataset $X_t$ with distribution $P(X_t)$ and unknown label with distribution $P(Y_t|X_t)$, where $P(X_s) \neq P(X_t)$, the goal is to train a model from both labeled data $X_s$ and unlabeled data $X_t$, with which the expected model distribution $P(\hat{Y}_t|X_t)$ is close to $P(Y_t|X_t)$.
2.2 Overview of the Proposed Method
As illustrated in Fig. 2, our framework mainly includes three networks, i.e., the source domain network (SN, in blue), the target domain student network (TSN, in gray) and the target domain teacher network (TTN, in orange). Although each network plays a distinct role in guiding the learning of domain-invariant representations, all of them interact with each other, benefit one another, and work collaboratively as a unified framework during an end-to-end training process. SN and TSN focus on supervised learning for labeled samples from the source domain ($X_s$) and adversarial discriminative learning for unlabeled samples from the target domain ($X_t$), respectively. More importantly, we allow SN and TSN to share the weights that are sequentially learned from both labeled and unlabeled samples. The labeled samples enable the network to learn accurate segmentation predictions, while the unlabeled ones bring unsupervised learning and further present a type of perturbation that regularizes the model training. Furthermore, TTN conducts the weight self-ensembling by replicating the average weights of the TSN instead of its predictions. TTN takes only unlabeled target images as input, and the mean square difference between TSN and TTN is then computed for the same target sample. Different data augmentations (e.g., adding Gaussian noise and random intensity or brightness scaling) are applied to the inputs of TSN and TTN to avoid a vanishing loss.
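The role of the different augmentations can be illustrated with a small sketch that produces two independently perturbed views of the same target image, one for the student and one for the teacher. The noise level, the scaling range and the function name `perturb` are assumptions chosen for illustration; the original text does not specify them.

```python
import random


def perturb(image, rng, noise_std=0.05, scale_range=(0.9, 1.1)):
    """Random intensity scaling plus Gaussian noise, clipped to [0, 1].

    `image` is a nested list of floats in [0, 1]; the parameter values are
    hypothetical defaults, not the settings used in the paper.
    """
    scale = rng.uniform(*scale_range)  # one brightness factor per image
    return [
        [min(1.0, max(0.0, scale * v + rng.gauss(0.0, noise_std))) for v in row]
        for row in image
    ]


target_image = [[0.0, 0.5], [1.0, 0.25]]
# Independent random streams give the student and the teacher different views
# of the same sample, so their outputs differ and the consistency loss is
# not trivially zero.
student_view = perturb(target_image, random.Random(1))
teacher_view = perturb(target_image, random.Random(2))
```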
U-Net, with its encoder-decoder structure, is employed as the backbone of each network. Since U-Net is one of the most successful segmentation frameworks in medical imaging, we expect that the results can easily generalize to other medical image analysis tasks.
2.3 Adversarial Discriminative Learning
We apply two discriminators, one at the encoder and one at the decoder of the networks, to achieve adversarial discriminative learning. Two adversarial loss functions are calculated between SN and TSN. Each loss calculation is performed in two steps in each training iteration: (1) train the target domain segmentation network to maximize the adversarial loss $L_{adv}$, thereby fooling the domain discriminator $D$ to maximize the probability of the source domain feature $P_s$ being classified as a target feature:
$$L_{adv}(P_s) = \mathbb{E}_{x_s \sim X_s} \log\big(1 - D(P_s)\big), \tag{1}$$
and (2) minimize the discriminator loss $L_d$:
$$L_d(P_s, P_t) = \mathbb{E}_{x_t \sim X_t} \log\big(D(P_t)\big) + \mathbb{E}_{x_s \sim X_s} \log\big(1 - D(P_s)\big), \tag{2}$$
where $P_t$ is the target domain feature.
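The two-step procedure above can be sketched numerically. The sketch assumes the standard log-loss form in which the discriminator outputs the probability that a feature comes from the source domain; this form, the function names and the example scores are assumptions made for illustration rather than the authors' implementation.

```python
import math

EPS = 1e-7  # keeps the logarithm finite when a score is exactly 0 or 1


def adversarial_loss(d_source):
    """Mean of log(1 - D(p_s)) over source features.

    The segmentation network maximizes this value, pushing D(p_s) toward 0
    so that source features are classified as target features.
    """
    return sum(math.log(1.0 - d + EPS) for d in d_source) / len(d_source)


def discriminator_loss(d_source, d_target):
    """Mean log D(p_t) plus mean log(1 - D(p_s)).

    The discriminator minimizes this value, pushing D(p_t) toward 0 and
    D(p_s) toward 1, i.e. toward separating the two domains.
    """
    target_term = sum(math.log(d + EPS) for d in d_target) / len(d_target)
    return target_term + adversarial_loss(d_source)


# Hypothetical discriminator scores in (0, 1).
confident = discriminator_loss(d_source=[0.9], d_target=[0.1])
confused = discriminator_loss(d_source=[0.5], d_target=[0.5])
```

Under these assumptions a discriminator that separates the domains well attains a lower discriminator loss than a confused one, while the segmentation network is rewarded when source features receive low source-probability scores.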
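2.4 Self-ensembling
In self-ensembling for domain adaptation, the training of the student model is iteratively improved by the task-specific loss and by an exponential moving average (EMA) model (the teacher) of the student model, which can be written as:
$$\phi'_t = \alpha \phi'_{t-1} + (1 - \alpha)\phi_t, \tag{3}$$
where $\phi_t$ and $\phi'_t$ denote the parameters of the student network and the teacher network, respectively, and $\alpha$ is the EMA decay rate.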
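The EMA relation between student and teacher can be illustrated with a few lines of Python. The dictionary-of-scalars representation and the decay value are simplifying assumptions; in practice the same update would be applied to every weight tensor of the network.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Toy weights (hypothetical). With alpha = 0.5 the teacher weight "w"
# moves 0.0 -> 0.5 -> 0.75 -> 0.875 toward the fixed student value 1.0.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.5)
```

A decay rate close to 1 makes the teacher change slowly, so it averages the student over many iterations; a smaller value makes the teacher track the student more closely.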
More specifically, at each iteration, a mini-batch of labeled source samples and a mini-batch of unlabeled target samples are drawn from the source domain and the target domain, respectively. Then, the EMA predictions and the base predictions are generated by the teacher model and the student model, respectively, with different augmentations applied to the target samples. Afterward, a mean-squared error (MSE) loss between the EMA predictions and the base predictions is calculated. Finally, the MSE loss together with the task-specific loss on the labeled source domain data is minimized to update the parameters of the student network. Since the teacher model is an improved model at each iteration, the MSE loss helps the student model learn from the unlabeled target domain images. Therefore, the student model and the teacher model can work collaboratively to achieve robust and accurate predictions.
2.5 CFEA Unsupervised Domain Adaptation
Unlike existing methods, our method appropriately integrates adversarial domain confusion and self-ensembling with an encoder-decoder architecture.
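Adversarial Feature Adaptation: Adversarial domain confusion is applied to both the encoded features and the decoded predictions between the source domain network (SN) and the target domain student network (TSN) to reduce the distribution differences. According to Eq. 1 and 2, this corresponds to the adversarial loss function $L_{adv}^{E}$ for the encoder output of SN and TSN, and the adversarial loss function $L_{adv}^{D}$ for the decoder output of SN and TSN:
$$L_{adv}^{E}(X_s) = \mathbb{E}_{x_s \sim X_s} \log\big(1 - D_E(P_{se})\big), \tag{4}$$
$$L_{adv}^{D}(X_s) = \mathbb{E}_{x_s \sim X_s} \log\big(1 - D_D(P_{sd})\big), \tag{5}$$
where $P_{se} \in \mathbb{R}^{W_e \times H_e \times C_e}$ and $P_{sd} \in \mathbb{R}^{W_d \times H_d \times C_d}$ are the encoder and decoder outputs, respectively. $W_d$ and $H_d$ are the width and height of the decoders' output; $C_d$ refers to the pixel categories of the segmentation result, which is three in our case. $W_e$, $H_e$, and $C_e$ are the width, height, and channel of the encoders' output. $D_E$ and $D_D$ are the discriminator networks for the encoder and decoder outputs, respectively.
The discriminator loss $L_{d}^{E}$ for the encoder feature and the discriminator loss $L_{d}^{D}$ for the decoder feature are as follows:
$$L_{d}^{E}(X_s, X_t) = \mathbb{E}_{x_t \sim X_t} \log\big(D_E(P_{te})\big) + \mathbb{E}_{x_s \sim X_s} \log\big(1 - D_E(P_{se})\big), \tag{6}$$
$$L_{d}^{D}(X_s, X_t) = \mathbb{E}_{x_t \sim X_t} \log\big(D_D(P_{td})\big) + \mathbb{E}_{x_s \sim X_s} \log\big(1 - D_D(P_{sd})\big), \tag{7}$$
where $P_{te}$ is the encoder output and $P_{td}$ is the decoder output of TSN.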
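Collaborative Adaptation with Self-ensembling: Self-ensembling is also applied to both the encoded features and the decoded predictions between TSN and the target domain teacher network (TTN). In this work, MSE is used for the self-ensembling. The MSE loss $L_{mse}^{E}$ between the encoder outputs of TSN and TTN, and the MSE loss $L_{mse}^{D}$ between the decoder outputs of TSN and TTN are as follows:
$$L_{mse}^{E}(X_t) = \mathbb{E}_{x_t \sim X_t} \left[ \frac{1}{M} \sum_{i=1}^{M} \big(p_i^{tse} - p_i^{tte}\big)^2 \right], \tag{8}$$
$$L_{mse}^{D}(X_t) = \mathbb{E}_{x_t \sim X_t} \left[ \frac{1}{N} \sum_{i=1}^{N} \big(p_i^{tsd} - p_i^{ttd}\big)^2 \right], \tag{9}$$
where $p_i^{tse}$, $p_i^{tsd}$, $p_i^{tte}$, and $p_i^{ttd}$ denote the $i$-th element of the flattened predictions ($P_{tse}$, $P_{tsd}$, $P_{tte}$, and $P_{ttd}$) of the student encoder, student decoder, teacher encoder, and teacher decoder, respectively. $M$ and $N$ are the number of elements in the encoder feature and decoder output, respectively.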
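As a concrete illustration of the consistency term, the sketch below flattens a student output and a teacher output and averages the squared element-wise differences. The helper names and the toy predictions are hypothetical; the same function would apply to encoder features or decoder predictions alike.

```python
def flatten(nested):
    """Flatten arbitrarily nested lists of numbers into one flat list."""
    if isinstance(nested, (int, float)):
        return [float(nested)]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def mse_consistency(student_out, teacher_out):
    """Mean squared difference between flattened student and teacher outputs."""
    s, t = flatten(student_out), flatten(teacher_out)
    if len(s) != len(t):
        raise ValueError("student and teacher outputs must have the same size")
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


# Toy 2x2 predictions (hypothetical values): two of the four elements
# differ by 0.2, so the loss is (0.04 + 0.04) / 4.
student_pred = [[0.2, 0.8], [0.6, 0.4]]
teacher_pred = [[0.0, 1.0], [0.6, 0.4]]
loss = mse_consistency(student_pred, teacher_pred)
```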
The same spatially challenging augmentation is used for both the teacher and the student at each iteration: the transformation is applied to the training sample of the student and to the predictions of the teacher, using the same transformation parameter.
Total Objective Function: Finally, we use the dice loss as the segmentation loss for labeled images from the source domain. Combining Eq. 4, 5, 6, 7, 8, and 9, the total loss (Eq. 10) is obtained as a weighted combination of the dice segmentation loss and the adaptation losses, where four weighting coefficients balance the losses. They are cross-validated in our experiments. Based on Eq. 10, we optimize a min-max problem (Eq. 11) over the source domain network and the target domain network with their trainable weights.
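A weighted combination of this kind can be sketched as follows. The sketch assumes that the four balancing coefficients weight the two adversarial terms and the two MSE terms and that the segmentation loss has unit weight; this assignment, the key names and all numeric values are assumptions for illustration, since the exact form of the total loss is not recoverable from the text here.

```python
def total_loss(seg, adv_enc, adv_dec, mse_enc, mse_dec, weights):
    """Segmentation loss plus weighted adversarial and consistency terms.

    `weights` maps the hypothetical keys "adv_enc", "adv_dec", "mse_enc"
    and "mse_dec" to their balancing coefficients.
    """
    return (
        seg
        + weights["adv_enc"] * adv_enc
        + weights["adv_dec"] * adv_dec
        + weights["mse_enc"] * mse_enc
        + weights["mse_dec"] * mse_dec
    )


# Hypothetical loss values and coefficients.
weights = {"adv_enc": 0.01, "adv_dec": 0.01, "mse_enc": 1.0, "mse_dec": 1.0}
value = total_loss(
    seg=0.3, adv_enc=-0.7, adv_dec=-0.6, mse_enc=0.02, mse_dec=0.01,
    weights=weights,
)
```

In an adversarial setup the discriminators are typically updated in a separate step with their own losses, so a scalar such as `value` would drive only the segmentation networks.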
3 Experiments and Results
Data: Extensive experiments are conducted on the REFUGE dataset (https://refuge.grand-challenge.org/Home/) to validate the effectiveness of the proposed method. The dataset includes 400 source domain retinal fundus images (supervised training dataset) acquired by a Zeiss Visucam 500 camera, as well as 400 labeled (testing dataset) and 400 additional unlabeled (unsupervised training dataset) target domain retinal fundus images collected by a Canon CR-2 camera. As different cameras are used, the source and target domain images have distinct appearances (e.g., color and texture). The optic disc and optic cup regions were carefully delineated by experts. All of the methods in this section are supervised by the annotations of the source domain and evaluated by the disc and cup dice indices (DI) and the cup-to-disc ratio (CDR) on the target domain.
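For reference, the dice index between a predicted mask and a ground-truth mask can be computed as in the minimal sketch below. It uses the standard definition, twice the overlap divided by the total foreground size; the function name, the toy masks and the convention for two empty masks are assumptions for illustration.

```python
def dice_index(pred, target):
    """Dice index 2|A ∩ B| / (|A| + |B|) for two binary masks (nested lists)."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(1 for a in p if a) + sum(1 for b in t if b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (hypothetical): each has 3 foreground pixels, 2 of them shared.
pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
di = dice_index(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```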
Data Preprocessing: First, we detect the center of the optic disc with a pre-trained disc-aware ensemble network, and then center and crop the optic disc regions, using one crop size for the supervised training dataset and another for the unsupervised training dataset and the test dataset. This is due to the different sizes of the images acquired by the two cameras. During training, all images are resized to a common small size to fit the network's receptive field.
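The cropping step can be sketched as below, assuming the disc center has already been estimated by some detector. The function name, the clamping behavior at the image border and the toy image are illustrative assumptions; the disc detection and the final resizing are not shown.

```python
def crop_around_center(image, center_row, center_col, size):
    """Crop a size x size window centered on the given pixel.

    The window is shifted as needed so that it stays inside the image.
    `image` is a nested list indexed as image[row][col].
    """
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds image size")
    top = min(max(center_row - size // 2, 0), height - size)
    left = min(max(center_col - size // 2, 0), width - size)
    return [row[left:left + size] for row in image[top:top + size]]


# Toy 6x6 image whose pixel value encodes its position (row * 10 + col).
image = [[r * 10 + c for c in range(6)] for r in range(6)]
patch = crop_around_center(image, center_row=2, center_col=3, size=4)
# A center near the border yields a window shifted back inside the image.
corner_patch = crop_around_center(image, center_row=5, center_col=5, size=4)
```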
U-Net is used for both the student and the teacher networks. All experiments are run with Python 2.7 and PyTorch on GeForce GTX TITAN GPUs.
Adaptation to different fundus cameras: We trained our CFEA on the source domain data acquired by the Zeiss Visucam 500 camera in a supervised manner and, simultaneously, on the target domain data acquired by the Canon CR-2 camera in an unsupervised manner. We then evaluated our fully trained segmentation network on the test dataset, which includes 400 retinal fundus images acquired by the Canon CR-2 camera. To demonstrate the effectiveness of our method, we also trained the segmentation network on source domain data only, in a supervised manner, and then tested it on the test data. In addition, we trained the baseline AdaptSegNet in the same way as our method. AdaptSegNet represents one of the state-of-the-art unsupervised domain adaptation methods for image segmentation and also applies adversarial learning for domain adaptation. The main result is shown in Table 1. The model trained on source data only fails completely on target data. The baseline achieves satisfactory results on target data. Compared with the baseline, our model consistently outperforms the state-of-the-art method for OD, OC, and CDR. These results indicate that the proposed framework is capable of overcoming domain shift, thus allowing us to build a robust and accurate model.
| Evaluation index | Source only | AdaptSegNet | CFEA (Ours) |
4 Discussions and Conclusions
In this work, we propose a novel method, CFEA, for unsupervised domain adaptation across a diversity of retinal fundus imaging cameras. Our CFEA framework collaboratively combines adversarial discriminative learning and self-ensembling to obtain domain-invariant features. Self-ensembling can stabilize the adversarial learning and prevent the network from getting stuck in a sub-optimal solution. From a complementary perspective, adversarial learning consistently provides varied model spaces and time-dependent weights to self-ensembling, which accelerates the learning of domain-invariant features and further stabilizes adversarial learning, forming a virtuous collaborative cycle within a unified framework. The mutual benefits of adversarial feature learning and weight ensembling during an end-to-end learning process lead to a robust and accurate model. Experimental results demonstrate the superiority of our network over the state-of-the-art method. Our framework incurs relatively higher computational costs during the training stage to help the segmentation network adapt to the target domain. However, in the testing stage, the computational cost is the same as that of a standard U-Net, as the images only need to go through the TTN. Our approach is general and can be easily extended to other unsupervised domain adaptation problems. In future work, we will conduct an extensive ablation study of the student and teacher networks and a verification study of weight sharing between the SN and TSN.
Research reported in this publication is partially supported by the National Science Foundation under Grant No. IIS-1564892, the University of Florida Informatics Institute Junior SEED Program (00129436), the University of Florida Informatics Institute SEED Funds, and the UF Clinical and Translational Science Institute, which is supported in part by the NIH National Center for Advancing Translational Sciences under award number UL1 TR001427. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health and the National Science Foundation.
- G. French, M. Mackiewicz, and M. Fisher (2017) Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208.
- H. Fu, J. Cheng, Y. Xu, D. W. K. Wong, J. Liu, and X. Cao (2018) Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. IEEE Transactions on Medical Imaging (TMI).
- S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
- O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241.
- A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204.
- Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7472–7481.
- E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 4.