Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision

04/16/2020 ∙ by Fei Pan, et al. ∙ KAIST Department of Mathematical Sciences ∙ CVPR 2020 Oral

Convolutional neural network-based approaches have achieved remarkable progress in semantic segmentation. However, these approaches heavily rely on annotated data, which are labor-intensive to collect. To cope with this limitation, automatically annotated data generated from graphics engines are used to train segmentation models. However, models trained on synthetic data are difficult to transfer to real images. To tackle this issue, previous works have considered directly adapting models from the source data to the unlabeled target data (to reduce the inter-domain gap). Nonetheless, these techniques do not consider the large distribution gap within the target data itself (the intra-domain gap). In this work, we propose a two-step self-supervised domain adaptation approach to minimize the inter-domain and intra-domain gaps together. First, we conduct inter-domain adaptation of the model; from this adaptation, we separate the target domain into an easy and a hard split using an entropy-based ranking function. Finally, to decrease the intra-domain gap, we propose to employ a self-supervised adaptation technique from the easy to the hard split. Experimental results on numerous benchmark datasets highlight the effectiveness of our method against existing state-of-the-art approaches. The source code is available at https://github.com/feipan664/IntraDA.git.


1 Introduction

Figure 1: We propose a two-step self-supervised domain adaptation technique for semantic segmentation. Previous works solely adapt the segmentation model from the source domain to the target domain. Our work also considers adapting from the clean map to the noisy map within the target domain.

Semantic segmentation aims at assigning each pixel in an image to a semantic class. Recently, convolutional neural network-based segmentation models [long2015fully, zhao2017pspnet] have achieved remarkable progress, leading to various applications in computer vision systems, such as autonomous driving [luc2017predicting, Zhang_2017_ICCV, lee2019visuomotor], robotics [milioto2018real, shvets2018automatic], and disease diagnosis [zhou2019collaborative, zhao2019data]. Training such a segmentation network requires large amounts of annotated data. However, collecting large-scale datasets with pixel-level annotations for semantic segmentation is difficult, since the process is expensive and labor-intensive. Recently, photorealistic data rendered from simulators and game engines [Richter_2016_ECCV, ros2016synthia] with precise pixel-level semantic annotations have been utilized to train segmentation networks. However, models trained on synthetic data are hardly transferable to real data due to the cross-domain difference [pmlr-v80-hoffman18a]. To address this issue, unsupervised domain adaptation (UDA) techniques have been proposed to align the distribution shift between the labeled source data and the unlabeled target data. For the particular task of semantic segmentation, adversarial learning-based UDA approaches have demonstrated effectiveness in aligning features at the image [murez2018image, pmlr-v80-hoffman18a] or output [tsai2019domain, tsai2018learning] level. More recently, the entropy of pixel-wise output predictions, proposed by [vu2019advent], has also been used for output-level alignment. Other approaches [Zou_2019_ICCV, zou2018unsupervised] involve generating pseudo labels for target data and refining them via an iterative self-training process. While many models consider the single-source-single-target adaptation setting, recent works [peng2019moment, zhao2018adversarial] have addressed the issue of multiple source domains, focusing on the multiple-source-single-target setting. Above all, previous works have mostly considered adapting models from the source data to the target data (inter-domain gap).

However, target data collected from the real world have diverse scene distributions; these distributions arise from various factors such as moving objects and weather conditions, which lead to a large gap within the target data itself (the intra-domain gap). For example, the noisy map and the clean map in the target domain, shown in Figure 1, are predictions made by the same model on different images. While previous studies focus solely on reducing the inter-domain gap, the problem of the intra-domain gap has received relatively little attention. In this paper, we present a two-step domain adaptation approach to minimize the inter-domain and intra-domain gaps. Our model consists of three parts, presented in Figure 2, namely, 1) an inter-domain adaptation module to close the inter-domain gap between the labeled source data and the unlabeled target data, 2) an entropy-based ranking system to separate the target data into an easy and a hard split, and 3) an intra-domain adaptation module to close the intra-domain gap between the easy and the hard split (using pseudo labels from the easy subdomain). For semantic segmentation, our proposed approach achieves strong performance against state-of-the-art approaches on benchmark datasets. Furthermore, our approach outperforms previous domain adaptation approaches for digit classification.

The Contributions of Our Work. First, we introduce the intra-domain gap among target data and propose an entropy-based ranking function to separate the target domain into an easy and a hard subdomain. Second, we propose a two-step self-supervised domain adaptation approach to minimize the inter-domain and intra-domain gaps together.

2 Related Works

Unsupervised Domain Adaptation. The goal of unsupervised domain adaptation is to align the distribution shift between the labeled source and the unlabeled target data. Recently, adversarial UDA approaches have shown great capability in learning domain-invariant features, even for complex tasks like semantic segmentation [vu2019advent, chen2019domain, tsai2019domain, tsai2018learning, saito2018maximum, pmlr-v80-hoffman18a, park2019preserving]. Adversarial UDA models for semantic segmentation usually involve two networks. One network is used as a generator to predict the segmentation maps of input images, which may come from either the source or the target domain. Given features from the generator, the second network functions as a discriminator that predicts the domain labels. The generator tries to fool the discriminator, so as to align the distribution shift of the features from the two domains. Besides feature-level alignment, other approaches try to align the domain shift at the image or output level. At the image level, CycleGAN [CycleGAN2017] was applied in [pmlr-v80-hoffman18a] to generate images for domain alignment. At the output level, [tsai2018learning] proposes an end-to-end model involving structured output alignment for the distribution shift. More recently, [vu2019advent] takes advantage of the entropy of pixel-wise predictions from the segmentation outputs to address the domain gap. While all previous studies exclusively consider aligning the inter-domain gap, our approach further minimizes the intra-domain gap. Thus, our technique can be combined with most existing UDA approaches for extra performance gains.

Uncertainty via Entropy. Uncertainty measurement has a strong connection with unsupervised domain adaptation. For instance, [vu2019advent] proposes minimizing the target entropy of the model outputs, either directly or via adversarial learning [tsai2018learning, pmlr-v80-hoffman18a], to close the domain gap for semantic segmentation. The entropy of the model outputs [wang2016cost] has also been used as a confidence measure for transferring samples across domains [su2020active]. We propose utilizing entropy to rank target images and separate them into an easy and a hard split.

Curriculum Domain Adaptation. Our work is also related to curriculum domain adaptation [sakaridis2018model, Zhang_2017_ICCV, dai2019adaptation], which deals with easy samples first. For curriculum domain adaptation on foggy scene understanding, [sakaridis2018model] proposes to adapt a semantic segmentation model from non-foggy images to synthetic light-fog images, and then to real heavy-fog images. To generalize this concept, [dai2019adaptation] decomposes the domain discrepancy into multiple smaller discrepancies by introducing unlabeled intermediate domains. However, these techniques require additional information to decompose the domains. To cope with this limitation, [Zhang_2017_ICCV] focuses on learning the global and local label distributions of images as a first task to regularize the model predictions in the target domain. In contrast, we propose a simpler, data-driven approach that identifies easy target samples based on an entropy ranking system.

Figure 2: The proposed self-supervised domain adaptation model contains the inter-domain generator $G_{inter}$ and discriminator $D_{inter}$, and the intra-domain generator $G_{intra}$ and discriminator $D_{intra}$. The model consists of three parts, namely, (a) an inter-domain adaptation, (b) an entropy-based ranking system, and (c) an intra-domain adaptation. In (a), given the labeled source and the unlabeled target data, $D_{inter}$ is trained to predict the domain label of the samples while $G_{inter}$ is trained to fool $D_{inter}$; $G_{inter}$ and $D_{inter}$ are optimized by minimizing the segmentation loss $\mathcal{L}^{inter}_{seg}$ and the adversarial loss $\mathcal{L}^{inter}_{adv}$. In (b), an entropy-based ranking function $R$ is used to separate all target data into an easy and a hard split. A hyperparameter $\lambda$ is introduced as the ratio of target images assigned to the easy split. In (c), an intra-domain adaptation is used to close the gap between the easy and the hard split. The segmentation predictions of easy-split data from $G_{inter}$ serve as pseudo labels. Given easy-split data with pseudo labels and hard-split data, $D_{intra}$ is trained to predict whether a sample is from the easy or the hard split, while $G_{intra}$ is trained to confuse $D_{intra}$; $G_{intra}$ and $D_{intra}$ are optimized using the intra-domain segmentation loss $\mathcal{L}^{intra}_{seg}$ and the adversarial loss $\mathcal{L}^{intra}_{adv}$.

3 Approach

Let $\mathcal{X}_s$ denote a source domain consisting of a set of images $x_s$ with their associated ground-truth $C$-class segmentation maps $y_s$; similarly, let $\mathcal{X}_t$ denote a target domain containing a set of unlabeled images $x_t$. In this section, a two-step self-supervised domain adaptation approach for semantic segmentation is introduced. The first step is the inter-domain adaptation, which is based on common UDA approaches [vu2019advent, tsai2018learning]. From this step, the pseudo labels and predicted entropy maps of the target data are generated, and an entropy-based ranking system clusters the target data into an easy and a hard split. The second step is the intra-domain adaptation, which consists in aligning the easy split (with pseudo labels) to the hard split, as shown in Figure 2. The proposed network consists of the inter-domain generator $G_{inter}$ and discriminator $D_{inter}$, and the intra-domain generator $G_{intra}$ and discriminator $D_{intra}$.

3.1 Inter-domain Adaptation

A sample $x_s$ from the source domain has an associated segmentation map $y_s$. Each entry $y_s^{(h,w)}$ of $y_s$ provides the label of pixel $(h,w)$ as a one-hot vector. The network $G_{inter}$ takes $x_s$ as input and generates a "soft-segmentation map" $p_s = G_{inter}(x_s)$. Each $C$-dimensional vector $p_s^{(h,w)}$ at pixel $(h,w)$ serves as a discrete distribution over the $C$ classes. Given $x_s$ with its ground-truth annotation $y_s$, $G_{inter}$ is optimized in a supervised way by minimizing the cross-entropy loss:

$$\mathcal{L}^{inter}_{seg}(x_s, y_s) = -\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{C} y_s^{(h,w,c)} \log p_s^{(h,w,c)} \qquad (1)$$

To close the inter-domain gap between the source and target domains, [vu2019advent] proposes to utilize entropy maps in order to align the distribution shift of the features. The assumption of [vu2019advent] is that trained models tend to produce over-confident (low-entropy) predictions for source-like images, and under-confident (high-entropy) predictions for target-like images. Due to its simplicity and effectiveness, [vu2019advent] is adopted in our work to conduct the inter-domain adaptation. The generator $G_{inter}$ takes a target image $x_t$ as input and produces the segmentation map $p_t = G_{inter}(x_t)$; the entropy map $I_t$ is formulated as:

$$I_t^{(h,w)} = -\sum_{c=1}^{C} p_t^{(h,w,c)} \log p_t^{(h,w,c)} \qquad (2)$$

To align the inter-domain gap, $D_{inter}$ is trained to predict the domain labels for the entropy maps, while $G_{inter}$ is trained to fool $D_{inter}$; the optimization of $G_{inter}$ and $D_{inter}$ is achieved via the following adversarial loss:

$$\mathcal{L}^{inter}_{adv}(x_s, x_t) = \sum_{h,w} \log\big(1 - D_{inter}(I_t^{(h,w)})\big) + \log\big(D_{inter}(I_s^{(h,w)})\big) \qquad (3)$$

where $I_s$ is the entropy map of $x_s$. The loss functions $\mathcal{L}^{inter}_{seg}$ and $\mathcal{L}^{inter}_{adv}$ are optimized to align the distribution shift between the source and target data. However, there remains a need for an efficient method to minimize the intra-domain gap. For this purpose, we propose to separate the target domain into an easy and a hard split and to conduct an intra-domain adaptation.
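To make the adversarial scheme concrete, the following is a minimal PyTorch sketch of one inter-domain training step in this style. The networks `G_inter` (segmentation net) and `D_inter` (a small fully convolutional discriminator), the optimizers, and the weight `lam_adv` are assumed to exist; for brevity the discriminator here sees a single-channel entropy map, whereas AdvEnt itself aligns weighted self-information maps.

```python
import torch
import torch.nn.functional as F

def entropy_map(logits):
    """Pixel-wise prediction entropy (Eq. 2) from raw segmentation logits."""
    p = F.softmax(logits, dim=1)                     # B x C x H x W
    return -(p * torch.log(p + 1e-30)).sum(dim=1)    # B x H x W

def inter_domain_step(G_inter, D_inter, opt_G, opt_D,
                      x_s, y_s, x_t, lam_adv=0.001):
    # (1) Supervised cross-entropy on source images (Eq. 1).
    logits_s = G_inter(x_s)
    loss_seg = F.cross_entropy(logits_s, y_s, ignore_index=255)

    # (2) Fool the discriminator: target entropy maps should look source-like (Eq. 3).
    I_t = entropy_map(G_inter(x_t)).unsqueeze(1)     # B x 1 x H x W
    d_t = D_inter(I_t)
    loss_adv = F.binary_cross_entropy_with_logits(d_t, torch.ones_like(d_t))

    opt_G.zero_grad()
    (loss_seg + lam_adv * loss_adv).backward()
    opt_G.step()

    # (3) Train the discriminator: source entropy maps -> 1, target -> 0.
    I_s = entropy_map(logits_s.detach()).unsqueeze(1)
    d_s, d_t = D_inter(I_s), D_inter(I_t.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s)) \
           + F.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t))
    opt_D.zero_grad()
    loss_d.backward()
    opt_D.step()
    return loss_seg.item(), loss_adv.item(), loss_d.item()
```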

3.2 Entropy-based Ranking

Target images collected from the real world have diverse distributions due to various weather conditions, moving objects, and shading. In Figure 2, some target prediction maps are clean¹ and others are very noisy, despite being generated by the same model. Since an intra-domain gap exists among target images, a straightforward solution is to decompose the target domain into small subdomains/splits. However, this remains challenging due to the lack of target labels. To build these splits, we take advantage of the entropy maps to determine the confidence levels of the target predictions. The generator $G_{inter}$ takes a target image $x_t$ as input to generate the prediction map $p_t$ and the entropy map $I_t$. On this basis, we adopt a simple yet effective ranking score:

¹A prediction map is "clean" when the prediction is confident and smooth.

$$R(x_t) = \frac{1}{HW}\sum_{h,w} I_t^{(h,w)} \qquad (4)$$

which is the mean value of the entropy map $I_t$. Given the ranking of scores from $R$, a hyperparameter $\lambda$ is introduced as the ratio separating the target images into an easy and a hard split. Let $x_{te}$ and $x_{th}$ denote a target image assigned to the easy and the hard split, respectively. To conduct the domain separation, we define $\lambda = |\mathcal{X}_{te}| / |\mathcal{X}_t|$, where $|\mathcal{X}_{te}|$ is the cardinality of the easy split and $|\mathcal{X}_t|$ is the cardinality of the whole target image set. To assess the influence of $\lambda$, we conduct an ablation study in Table 3. Note that we do not introduce a hyperparameter as a threshold value for the separation, because a threshold value would depend on the specific dataset; a ratio, in contrast, generalizes well to other datasets.
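A minimal sketch of this ranking-and-splitting step follows, assuming `G_inter` is the trained inter-domain generator and `target_loader` yields `(image, image_id)` pairs with batch size 1 (both names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_and_split(G_inter, target_loader, lam=0.67, device="cuda"):
    """Score every target image by mean prediction entropy (Eq. 4) and split."""
    G_inter.eval()
    scores = []                                       # (R(x_t), image id) pairs
    for img, img_id in target_loader:
        p = F.softmax(G_inter(img.to(device)), dim=1)
        ent = -(p * torch.log(p + 1e-30)).sum(dim=1)  # 1 x H x W entropy map
        scores.append((ent.mean().item(), img_id[0]))
    scores.sort(key=lambda s: s[0])                   # most confident first
    n_easy = int(lam * len(scores))
    easy = [img_id for _, img_id in scores[:n_easy]]
    hard = [img_id for _, img_id in scores[n_easy:]]
    return easy, hard
```

Sorting by score and cutting at the ratio `lam`, rather than thresholding the raw entropy, is what keeps the split dataset-independent, matching the argument above.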

3.3 Intra-domain Adaptation

Since no annotation is available for the easy split, directly aligning the gap between the easy and the hard split is infeasible. Instead, we propose to utilize the predictions from $G_{inter}$ as pseudo labels. Given an image $x_{te}$ from the easy split, we forward $x_{te}$ to $G_{inter}$ and obtain the "soft-segmentation map" $G_{inter}(x_{te})$, which we convert to a pseudo-label map $\hat{y}_{te}$ whose entries are one-hot vectors. With the aid of these pseudo labels, $G_{intra}$, whose soft prediction is $p_{te} = G_{intra}(x_{te})$, is optimized by minimizing the cross-entropy loss:

$$\mathcal{L}^{intra}_{seg}(x_{te}) = -\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{C} \hat{y}_{te}^{(h,w,c)} \log p_{te}^{(h,w,c)} \qquad (5)$$

To bridge the intra-domain gap between the easy and the hard split, we align the entropy maps of both splits. An image $x_{th}$ from the hard split is fed to the generator $G_{intra}$ to produce the segmentation map $p_{th}$ and the entropy map $I_{th}$. To close the intra-domain gap, the intra-domain discriminator $D_{intra}$ is trained to predict the split labels of the entropy maps $I_{te}$ and $I_{th}$: $I_{te}$ is from the easy split, and $I_{th}$ is from the hard split. $G_{intra}$ is trained to fool $D_{intra}$. The adversarial loss to optimize $G_{intra}$ and $D_{intra}$ is formulated as:

$$\mathcal{L}^{intra}_{adv}(x_{te}, x_{th}) = \sum_{h,w} \log\big(1 - D_{intra}(I_{th}^{(h,w)})\big) + \log\big(D_{intra}(I_{te}^{(h,w)})\big) \qquad (6)$$

Finally, our complete loss function combines all of the above losses, with weights $\lambda^{inter}_{adv}$ and $\lambda^{intra}_{adv}$ on the adversarial terms:

$$\mathcal{L}(x_s, x_t, x_{te}, x_{th}) = \mathcal{L}^{inter}_{seg} + \lambda^{inter}_{adv}\mathcal{L}^{inter}_{adv} + \mathcal{L}^{intra}_{seg} + \lambda^{intra}_{adv}\mathcal{L}^{intra}_{adv} \qquad (7)$$

and our objective is to learn a target model according to:

$$\min_{G_{inter},\,G_{intra}} \; \max_{D_{inter},\,D_{intra}} \; \mathcal{L}(x_s, x_t, x_{te}, x_{th}) \qquad (8)$$

Since our proposed model is a two-step self-supervised approach, it is difficult to minimize Eq. (8) in one training stage. Thus, we choose to minimize it in three stages. First, we train the inter-domain adaptation to optimize $G_{inter}$ and $D_{inter}$. Second, we generate target pseudo labels using $G_{inter}$ and rank all target images based on $R$. Finally, we train the intra-domain adaptation to optimize $G_{intra}$ and $D_{intra}$.
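Below is a hedged sketch of stages two and three, reusing `entropy_map` from the inter-domain sketch above; pseudo labels are generated once by `G_inter` after stage one, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_labels(G_inter, x_te):
    """Stage 2: convert G_inter's soft maps on easy-split images to hard labels."""
    return G_inter(x_te).argmax(dim=1)               # B x H x W class indices

def intra_domain_step(G_intra, D_intra, opt_G, opt_D,
                      x_te, y_hat_te, x_th, lam_adv=0.001):
    # Stage 3a: self-training on the easy split with pseudo labels (Eq. 5).
    logits_e = G_intra(x_te)
    loss_seg = F.cross_entropy(logits_e, y_hat_te)

    # Stage 3b: make hard-split entropy maps look easy-like to D_intra (Eq. 6).
    I_h = entropy_map(G_intra(x_th)).unsqueeze(1)
    d_h = D_intra(I_h)
    loss_adv = F.binary_cross_entropy_with_logits(d_h, torch.ones_like(d_h))

    opt_G.zero_grad()
    (loss_seg + lam_adv * loss_adv).backward()
    opt_G.step()

    # Stage 3c: discriminator sees easy-split maps as 1, hard-split maps as 0.
    I_e = entropy_map(logits_e.detach()).unsqueeze(1)
    d_e, d_h = D_intra(I_e), D_intra(I_h.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_e, torch.ones_like(d_e)) \
           + F.binary_cross_entropy_with_logits(d_h, torch.zeros_like(d_h))
    opt_D.zero_grad()
    loss_d.backward()
    opt_D.step()
```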

(a) GTA5 → Cityscapes
Method road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train mbike bike mIoU
Without adaptation [tsai2018learning] 75.8 16.8 77.2 12.5 21.0 25.5 30.1 20.1 81.3 24.6 70.3 53.8 26.4 49.9 17.2 25.9 6.5 25.3 36.0 36.6
ROAD [chen2018road] 76.3 36.1 69.6 28.6 22.4 28.6 29.3 14.8 82.3 35.3 72.9 54.4 17.8 78.9 27.7 30.3 4.0 24.9 12.6 39.4
AdaptSegNet [tsai2018learning] 86.5 36.0 79.9 23.4 23.3 23.9 35.2 14.8 83.4 33.3 75.6 58.5 27.6 73.7 32.5 35.4 3.9 30.1 28.1 42.4
MinEnt [vu2019advent] 84.2 25.2 77.0 17.0 23.3 24.2 33.3 26.4 80.7 32.1 78.7 57.5 30.0 77.0 37.9 44.3 1.8 31.4 36.9 43.1
AdvEnt [vu2019advent] 89.9 36.5 81.6 29.2 25.2 28.5 32.3 22.4 83.9 34.0 77.1 57.4 27.9 83.7 29.4 39.1 1.5 28.4 23.3 43.8
Ours 90.6 37.1 82.6 30.1 19.1 29.5 32.4 20.6 85.7 40.5 79.7 58.7 31.1 86.3 31.5 48.3 0.0 30.2 35.8 46.3
(b) SYNTHIA → Cityscapes
Method road sidewalk building wall* fence* pole* light sign veg sky person rider car bus mbike bike mIoU mIoU*
Without adaptation [tsai2018learning] 55.6 23.8 74.6 9.2 0.2 24.4 6.1 12.1 74.8 79.0 55.3 19.1 39.6 23.3 13.7 25.0 33.5 38.6
AdaptSegNet [tsai2018learning] 81.7 39.1 78.4 11.1 0.3 25.8 6.8 9.0 79.1 80.8 54.8 21.0 66.8 34.7 13.8 29.9 39.6 45.8
MinEnt [vu2019advent] 73.5 29.2 77.1 7.7 0.2 27.0 7.1 11.4 76.7 82.1 57.2 21.3 69.4 29.2 12.9 27.9 38.1 44.2
AdvEnt [vu2019advent] 87.0 44.1 79.7 9.6 0.6 24.3 4.8 7.2 80.1 83.6 56.4 23.7 72.7 32.6 12.8 33.7 40.8 47.6
Ours 84.3 37.7 79.5 5.3 0.4 24.9 9.2 8.4 80.0 84.1 57.2 23.0 78.0 38.1 20.3 36.5 41.7 48.9
(c) Synscapes → Cityscapes
Method road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train mbike bike mIoU
Without adaptation 81.8 40.6 76.1 23.3 16.8 36.9 36.8 40.1 83.0 34.8 84.9 59.9 37.7 78.4 20.4 20.5 7.8 27.3 52.5 45.3
AdaptSegNet [tsai2018learning] 94.2 60.9 85.1 29.1 25.2 38.6 43.9 40.8 85.2 29.7 88.2 64.4 40.6 85.8 31.5 43.0 28.3 30.5 56.7 52.7
Ours 94.0 60.0 84.9 29.5 26.2 38.5 41.6 43.7 85.3 31.7 88.2 66.3 44.7 85.7 30.7 53.0 29.5 36.5 60.2 54.2
Table 1: Semantic segmentation results on the Cityscapes validation set with models trained on GTA5 (a), SYNTHIA (b), and Synscapes (c). All results are generated from ResNet-101-based models. In the experiments of (a) and (b), AdvEnt [vu2019advent] is used as the framework for the inter-domain and intra-domain adaptation. In the experiment of (c), AdaptSegNet [tsai2018learning] is used as the framework for the inter-domain and intra-domain adaptation. mIoU* in (b) denotes the mean IoU over 13 classes, excluding the classes marked with *.

4 Experiments

In this section, we introduce the experimental details of the inter-domain and the intra-domain adaptation on semantic segmentation.

4.1 Datasets

In the semantic segmentation experiments, we adopt the setting of adaptation from the synthetic to the real domain. To conduct this series of tests, the synthetic datasets GTA5 [Richter_2016_ECCV], SYNTHIA [ros2016synthia], and Synscapes [wrenninge2018synscapes] are used as source domains, along with the real-world dataset Cityscapes [Cordts2016Cityscapes] as the target domain. Models are trained given labeled source data and unlabeled target data, and evaluated on the Cityscapes validation set.

  • GTA5: The synthetic dataset GTA5 [Richter_2016_ECCV] contains 24,966 synthetic images with a resolution of 1,914×1,052 and corresponding ground-truth annotations. These images are collected from a video game based on the urban scenery of Los Angeles. The automatically generated ground-truth annotations contain 33 categories. For training, we consider only the 19 categories compatible with the Cityscapes dataset [Cordts2016Cityscapes], as in previous works.

  • SYNTHIA: SYNTHIA-RAND-CITYSCAPES [ros2016synthia] is used as another synthetic dataset. It contains 9,400 fully annotated RGB images. During training, we consider the 16 categories common with the Cityscapes dataset. During evaluation, 16-class and 13-class subsets are used to evaluate the performance.

  • Synscapes: Synscapes [wrenninge2018synscapes] is a photorealistic synthetic dataset consisting of 25,000 fully annotated RGB images with a resolution of 1,440×720. Like Cityscapes, the ground-truth annotations contain 19 categories.

  • Cityscapes: As a dataset collected from the real world, Cityscapes [Cordts2016Cityscapes] provides 3,975 images with fine segmentation annotations. The 2,975 images of the Cityscapes training set are used for training, and the 500 images of the Cityscapes validation set are used to evaluate our model.

Evaluation. The semantic segmentation performance is evaluated for every category using the PASCAL VOC intersection-over-union metric, i.e., $\text{IoU} = \frac{TP}{TP + FP + FN}$ [everingham2015pascal], where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively.
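For reference, a small NumPy sketch of this per-class IoU, computed from a confusion matrix accumulated over all validation pixels (the 255 ignore label follows the Cityscapes convention; function names are illustrative):

```python
import numpy as np

def confusion(pred, label, num_classes, ignore=255):
    """Accumulate a C x C confusion matrix from flat prediction/label arrays."""
    mask = label != ignore
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_iou(conf):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c) for every class c."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp    # pixels predicted as c but labeled otherwise
    fn = conf.sum(axis=1) - tp    # pixels labeled c but predicted otherwise
    return tp / np.maximum(tp + fp + fn, 1)
```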

Implementation Details. In the experiments for GTA5 → Cityscapes and SYNTHIA → Cityscapes, we utilize the framework of AdvEnt [vu2019advent] to train $G_{inter}$ and $D_{inter}$ for the inter-domain adaptation; the backbone of $G_{inter}$ is a ResNet-101 architecture [he2016deep] with parameters pretrained on ImageNet [deng2009imagenet]; the input data are labeled source images and unlabeled target images. The model for the inter-domain adaptation is trained for 70,000 iterations. After training, $G_{inter}$ is used to generate the segmentation and entropy maps for all 2,975 images of the Cityscapes training set. Then, we utilize $R$ to obtain the ranking scores for all target images and separate them into the easy and the hard split based on $\lambda$. We conduct an ablation study for optimizing $\lambda$ in Table 3. For the intra-domain adaptation, $G_{intra}$ has the same architecture as $G_{inter}$, and $D_{intra}$ the same as $D_{inter}$; the input data are the 2,975 Cityscapes training images, with pseudo labels for the easy split. $G_{intra}$ is trained with pretrained ImageNet parameters and $D_{intra}$ from scratch, similar to AdvEnt. In addition to the experiments above, we also conduct the experiment for Synscapes → Cityscapes. For comparison with AdaptSegNet [tsai2018learning], we apply the framework of AdaptSegNet for the inter-domain and intra-domain adaptation in this experiment.

Similarly to [vu2019advent] and [tsai2018learning], we utilize the multi-level feature outputs from conv4 and conv5 for the inter-domain and intra-domain adaptation. To train $G_{inter}$ and $G_{intra}$, we apply an SGD optimizer [bottou2010large] with a learning rate of $2.5 \times 10^{-4}$, momentum $0.9$, and weight decay $5 \times 10^{-4}$. An Adam optimizer [kingma2014adam] with a learning rate of $10^{-4}$ is used for training $D_{inter}$ and $D_{intra}$.
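Assuming the networks from the earlier sketches, the optimizer setup would look like the following; the numeric values mirror the AdvEnt/AdaptSegNet training recipes and should be treated as reconstructed defaults rather than the paper's exact settings.

```python
import torch

# Segmentation generators: SGD, as reconstructed above.
opt_G = torch.optim.SGD(G_intra.parameters(), lr=2.5e-4,
                        momentum=0.9, weight_decay=5e-4)
# Discriminators: Adam with a small fixed learning rate.
opt_D = torch.optim.Adam(D_intra.parameters(), lr=1e-4, betas=(0.9, 0.99))
```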

Figure 3: Example results for GTA5 → Cityscapes. (a) and (d) are images from the Cityscapes validation set and the corresponding ground-truth annotations. (b) are the predicted segmentation maps from inter-domain adaptation alone [vu2019advent]. (c) are the predicted maps from our technique.
Figure 4: Examples of entropy maps from the hard split for GTA5 → Cityscapes. (a) are hard images from the Cityscapes training set. (b) and (c) are the entropy and segmentation maps predicted by the model trained solely with inter-domain adaptation [vu2019advent]. (d) are the improved segmentation results on the hard images from our model.
Model mIoU
AdvEnt [vu2019advent] 43.8
AdvEnt + intra-domain adaptation 45.1
AdvEnt + self-training 45.5
Ours 46.3
Ours + entropy normalization 47.0
Table 2: The self-training and intra-domain adaptation gains on GTA5 → Cityscapes.
GTA5 → Cityscapes
λ 0.0 0.5 0.6 0.67 0.7 1.0
mIoU 43.8 45.2 46.0 46.3 45.6 45.5
Table 3: The ablation study on the hyperparameter λ for separating the target domain into the easy and the hard split.

4.2 Results

GTA5. In Table 1 (a), we compare the segmentation performance of our method with other state-of-the-art methods [tsai2018learning, chen2018road, vu2019advent] on the Cityscapes validation set. For a fair comparison, the baseline model is adopted from DeepLab-v2 [chen2017deeplab] with a ResNet-101 backbone. Overall, our proposed method achieves 46.3% mean IoU. Compared to AdvEnt, the intra-domain adaptation from our method leads to a 2.5% improvement in mean IoU.

To highlight the relevance of the proposed intra-domain adaptation, we compare the contributions of the segmentation loss $\mathcal{L}^{intra}_{seg}$ and the adversarial loss $\mathcal{L}^{intra}_{adv}$ in Table 2. The baseline AdvEnt [vu2019advent] achieves 43.8% mIoU. Using AdvEnt + intra-domain adaptation, i.e., the intra-domain adversarial alignment without pseudo-label self-training, we obtain 45.1%, showing the effectiveness of adversarial learning for the intra-domain alignment. Applying AdvEnt + self-training with $\lambda = 1.0$ (all pseudo labels used for self-training, without the intra-domain adversarial loss), we achieve 45.5% mIoU, underlining the importance of employing pseudo labels. Finally, our proposed model (self-training + intra-domain alignment) achieves 46.3% mIoU.

Admittedly, complex scenes (containing many objects) might be categorized as "hard". To provide a more representative ranking, we adopt a normalization that divides the mean entropy by the number of predicted rare classes in the target image. For the Cityscapes dataset, we define these rare classes as wall, fence, pole, traffic light, traffic sign, terrain, rider, truck, bus, train, and motorbike. The entropy normalization helps move images with many objects to the easy split. With this normalization, our proposed model achieves 47.0% mIoU, as shown in Table 2. Our method still has limitations for some classes, e.g., the train class in Table 1 (a).
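A hedged sketch of this normalization follows; the train-id mapping for the rare classes uses the standard Cityscapes convention, and the function name is illustrative.

```python
# Cityscapes train ids for the rare classes listed above:
# wall=3, fence=4, pole=5, light=6, sign=7, terrain=9,
# rider=12, truck=14, bus=15, train=16, motorbike=17.
RARE_IDS = {3, 4, 5, 6, 7, 9, 12, 14, 15, 16, 17}

def normalized_score(mean_entropy, pred_ids):
    """Divide the mean entropy R(x_t) by the number of predicted rare classes."""
    n_rare = len(RARE_IDS.intersection(pred_ids.flatten().tolist()))
    return mean_entropy / max(n_rare, 1)
```

Cluttered scenes then need proportionally higher raw entropy before being ranked "hard", which is exactly the correction motivated above.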

In Figure 3, we provide visualizations of the segmentation maps produced by our technique. The segmentation maps generated by our model, trained with both inter-domain and intra-domain alignment, are more accurate than those of the baseline AdvEnt model, which is trained only with inter-domain alignment. A representative set of images belonging to the "hard" split is shown in Figure 4. After intra-domain alignment, we produce the segmentation maps shown in column (d). Compared with column (c), our model transfers better to the more difficult target images.

Analysis of Hyperparameter λ. We conduct a study to find a proper value for the hyperparameter λ in our GTA5 → Cityscapes experiment. In Table 3, different values of λ are used to set the decision boundary for the domain separation. When λ = 0.67, i.e., the ratio of easy to hard images is approximately 2:1, the model achieves its best performance of 46.3% mIoU on the Cityscapes validation set.

SYNTHIA. We use SYNTHIA as the source domain and present the evaluation results of the proposed method and state-of-the-art methods [tsai2018learning, vu2019advent] on the Cityscapes validation set in Table 1 (b). For a fair comparison, we also adopt the same DeepLab-v2 model with the ResNet-101 architecture. Our method is evaluated on both the 16-class and 13-class baselines, achieving 41.7% and 48.9% mean IoU, respectively. As shown in Table 1 (b), our model is significantly more accurate on the car and motorbike classes than existing techniques. The reason is that we apply the intra-domain adaptation to further narrow the domain gap.

Synscapes. The only work that we have found using the Synscapes dataset is [tsai2018learning]. Thus we use AdaptSegNet [tsai2018learning] as our baseline model. For a fair comparison, we only consider using the vanilla GAN in our experiments. With inter-domain and intra-domain adaptation, our model achieves 54.2% mIoU, which is 1.5% higher than AdaptSegNet, as shown in Table 1 (c).

4.3 Discussion

Theoretical Analysis. Comparing (a) and (b) in Table 1, adaptation from GTA5 to Cityscapes is more effective than from SYNTHIA to Cityscapes. We believe the reason is that GTA5 contains street-scene images more similar to Cityscapes than the other synthetic datasets do. We also provide a theoretical analysis. Let $\mathcal{H}$ denote the hypothesis class, and let $\mathcal{S}$ and $\mathcal{T}$ be the source and the target domain. The theory of [ben2007analysis] bounds the expected error on the target domain as $\epsilon_{\mathcal{T}}(h) \le \epsilon_{\mathcal{S}}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T}) + \lambda^{*}$, where $\epsilon_{\mathcal{S}}(h)$ is the expected error on the source domain, $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$ is a distance measuring the domain divergence, and $\lambda^{*}$ is considered a constant in normal cases. Therefore, $\epsilon_{\mathcal{T}}(h)$ is upper bounded by $\epsilon_{\mathcal{S}}(h)$ and $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$ in our case. Our proposed model minimizes $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$ by using the inter-domain and the intra-domain alignment together. If $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$ is large, the higher upper bound in the first stage of the inter-domain adaptation affects our entropy ranking system and the subsequent intra-domain adaptation. Therefore, our model is less effective when the domain gap is large. With respect to limitations, our model's performance is affected by both $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$ and $\epsilon_{\mathcal{S}}(h)$. Firstly, a larger divergence between the source and target domains leads to a higher $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$ and thus a higher upper bound on the error, making our model less effective. Secondly, $\epsilon_{\mathcal{S}}(h)$ can be very high when a small neural network is used; in such a case, our model would also be less effective.

Model MNIST → USPS USPS → MNIST SVHN → MNIST
Source only 82.2±0.8 69.6±3.8 67.1±0.6
ADDA [tzeng2017adversarial] 89.4±0.2 90.1±0.8 76.0±1.8
CyCADA [pmlr-v80-hoffman18a] 95.6±0.2 96.5±0.1 90.4±0.4
Ours 95.8±0.1 97.8±0.1 95.1±0.3
Table 4: Experimental results of adaptation across digit datasets (classification accuracy, %).

Digit Classification. Our model can also be applied to the digit classification task. We consider the adaptation shifts MNIST → USPS, USPS → MNIST, and SVHN → MNIST. Our model is trained using the training sets: MNIST with 60,000 images, USPS with 7,291 images, and standard SVHN with 73,257 images. The proposed model is evaluated on the standard test sets: MNIST with 10,000 images and USPS with 2,007 images. In the digit classification task, $G_{inter}$ and $G_{intra}$ serve as classifiers with the same architecture, based on a variant of the LeNet architecture. For the inter-domain adaptation, we utilize the framework of CyCADA [pmlr-v80-hoffman18a] to train $G_{inter}$ and $D_{inter}$. In the ranking stage, we utilize $G_{inter}$ to generate the predictions of all target data and compute their ranking scores using the entropy-based function of Eq. (4). The hyperparameter $\lambda$ is used in the same ratio-based way to separate the target data in all experiments. Our network for the intra-domain adaptation is also based on CyCADA [pmlr-v80-hoffman18a]. As shown in Table 4, our proposed model achieves 95.8±0.1% accuracy on MNIST → USPS, 97.8±0.1% on USPS → MNIST, and 95.1±0.3% on SVHN → MNIST, outperforming the baseline model CyCADA [pmlr-v80-hoffman18a].

5 Conclusion

In this paper, we present a self-supervised domain adaptation approach to minimize the inter-domain and intra-domain gaps simultaneously. We first train the model using inter-domain adaptation based on existing approaches. Secondly, we produce target-image entropy maps and use an entropy-based ranking function to split the target domain. Lastly, we conduct the intra-domain adaptation to further narrow the domain gap. We conduct extensive experiments on synthetic-to-real adaptation in traffic scenarios. Our model can be combined with existing domain adaptation approaches, and experimental results show that it outperforms existing adaptation algorithms.

Acknowledgments

This research was partially supported by the Shared Sensing for Cooperative Cars Project funded by Bosch (China) Investment Ltd. This work was also partially supported by the Korea Research Fellowship Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2015H1D3A1066564).

References