Multi-source Domain Adaptation for Semantic Segmentation

10/27/2019, by Sicheng Zhao, et al.

Simulation-to-real domain adaptation for semantic segmentation has been actively studied for various applications such as autonomous driving. Existing methods mainly focus on a single-source setting, which cannot easily handle a more practical scenario of multiple sources with different distributions. In this paper, we propose to investigate multi-source domain adaptation for semantic segmentation. Specifically, we design a novel framework, termed Multi-source Adversarial Domain Aggregation Network (MADAN), which can be trained in an end-to-end manner. First, we generate an adapted domain for each source with dynamic semantic consistency while aligning at the pixel-level cycle-consistently towards the target. Second, we propose sub-domain aggregation discriminator and cross-domain cycle discriminator to make different adapted domains more closely aggregated. Finally, feature-level alignment is performed between the aggregated domain and target domain while training the segmentation network. Extensive experiments from synthetic GTA and SYNTHIA to real Cityscapes and BDDS datasets demonstrate that the proposed MADAN model outperforms state-of-the-art approaches. Our source code is released at: https://github.com/Luodian/MADAN.


1 Introduction

Semantic segmentation assigns a semantic label (e.g. car, cyclist, pedestrian, road) to each pixel in an image. This core computer vision task plays a crucial role in many applications, ranging from autonomous driving Geiger et al. (2012) and robotic control Hong et al. (2018) to medical imaging Çiçek et al. (2016) and fashion recommendation Jaradat (2017). With the advent of deep learning, especially convolutional neural networks (CNNs), several end-to-end approaches have been proposed for semantic segmentation Long et al. (2015a); Liu et al. (2015); Zheng et al. (2015); Lin et al. (2016); Yu and Koltun (2016); Badrinarayanan et al. (2017); Zhao et al. (2017); Chen et al. (2017); Wang et al. (2018); Zhou et al. (2019). Although these methods have achieved promising results, they suffer from some limitations. On the one hand, training these methods requires large-scale labeled data with pixel-level annotations, which is prohibitively expensive and time-consuming to obtain. For example, it takes about 90 minutes to label each image in the Cityscapes dataset Cordts et al. (2016). On the other hand, they cannot generalize their learned knowledge well to new domains, because of the presence of domain shift or dataset bias Torralba and Efros (2011); Wu et al. (2019).

To sidestep the cost of data collection and annotation, unlimited amounts of synthetic labeled data can be created from simulators like CARLA and GTA-V Richter et al. (2016); Ros et al. (2016); Yue et al. (2018), thanks to the progress in graphics and simulation infrastructure. To mitigate the gap between different domains, domain adaptation (DA) or knowledge transfer techniques have been proposed Patel et al. (2015), with both theoretical analysis Ben-David et al. (2010); Gopalan et al. (2014); Louizos et al. (2015); Tzeng et al. (2017) and algorithm design Pan and Yang (2010); Glorot et al. (2011); Jhuo et al. (2012); Becker et al. (2013); Ghifary et al. (2015); Long et al. (2015b); Hoffman et al. (2018). Besides the traditional task loss on the labeled source domain, deep unsupervised domain adaptation (UDA) methods are generally trained with an additional loss to deal with domain shift, such as a discrepancy loss Long et al. (2015b); Sun et al. (2016, 2017); Zhuo et al. (2017), an adversarial loss Goodfellow et al. (2014); Bousmalis et al. (2017); Liu and Tuzel (2016); Zhu et al. (2017); Zhao et al. (2018b); Russo et al. (2018); Sankaranarayanan et al. (2018a); Hu et al. (2018); Hoffman et al. (2018); Zhao et al. (2019), or a reconstruction loss Ghifary et al. (2015, 2016); Bousmalis et al. (2016). Current simulation-to-real DA methods for semantic segmentation Hoffman et al. (2016); Zhang et al. (2017); Peng et al. (2017); Chen et al. (2018); Sankaranarayanan et al. (2018b); Zhang et al. (2018); Hoffman et al. (2018); Dundar et al. (2018); Zhu et al. (2018); Wu et al. (2018); Yue et al. (2019) all focus on the single-source setting and do not consider a more practical scenario in which the labeled data are collected from multiple sources with different distributions. Simply combining the different sources into one and directly employing single-source DA may not perform well, since images from different source domains may interfere with each other during the learning process Riemer et al. (2019).

Early efforts on multi-source DA (MDA) used shallow models Sun et al. (2015); Duan et al. (2009); Sun et al. (2011); Duan et al. (2012a); Chattopadhyay et al. (2012); Duan et al. (2012b); Yang et al. (2007); Schweikert et al. (2009); Xu and Sun (2012); Sun and Shi (2013). Recently, some multi-source deep UDA methods have been proposed which only focus on image classification Xu et al. (2018); Zhao et al. (2018a); Peng et al. (2018). Directly extending these MDA methods from classification to segmentation may not perform well due to the following reasons. (1) Segmentation is a structured prediction task, the decision function of which is more involved than classification because it has to resolve the predictions in an exponentially large label space Zhang et al. (2017); Tsai et al. (2018). (2) Current MDA methods mainly focus on feature-level alignment, which only aligns high-level information. This may be enough for coarse-grained classification tasks, but it is obviously insufficient for fine-grained semantic segmentation, which performs pixel-wise prediction. (3) These MDA methods only align each source and target pair. Although different sources are matched towards the target, there may exist significant mis-alignment across different sources.

To address the above challenges, in this paper we propose a novel framework, termed Multi-source Adversarial Domain Aggregation Network (MADAN), which consists of Dynamic Adversarial Image Generation, Adversarial Domain Aggregation, and Feature-aligned Semantic Segmentation. First, for each source, we generate an adapted domain using a Generative Adversarial Network (GAN) Goodfellow et al. (2014) with a cycle-consistency loss Zhu et al. (2017), which enforces pixel-level alignment between source images and target images. To preserve the semantics before and after image translation, we propose a novel semantic consistency loss that minimizes the KL divergence between the source predictions of a pretrained segmentation model and the adapted predictions of a dynamic segmentation model. Second, instead of training a classifier for each source domain Xu et al. (2018); Peng et al. (2018), we propose a sub-domain aggregation discriminator to directly make different adapted domains indistinguishable, and a cross-domain cycle discriminator to discriminate between the images from each source and the images transferred from the other sources. In this way, different adapted domains can be better aggregated into a more unified domain. Finally, the segmentation model is trained on the aggregated domain, while enforcing feature-level alignment between the aggregated domain and the target domain.

In summary, our contributions are three-fold. (1) We propose to perform domain adaptation for semantic segmentation from multiple sources. To the best of our knowledge, this is the first work on multi-source structured domain adaptation. (2) We design a novel framework termed MADAN for MDA in semantic segmentation. Besides feature-level alignment, pixel-level alignment is further considered by generating an adapted domain for each source cycle-consistently with a novel dynamic semantic consistency loss. A sub-domain aggregation discriminator and a cross-domain cycle discriminator are proposed to better align different adapted domains. (3) We conduct extensive experiments from synthetic GTA Richter et al. (2016) and SYNTHIA Ros et al. (2016) to real Cityscapes Cordts et al. (2016) and BDDS Yu et al. (2018) datasets, and the results demonstrate the effectiveness of the proposed MADAN model.

2 Problem Setup

We consider the unsupervised MDA scenario, in which there are multiple labeled source domains $S_1, S_2, \dots, S_M$, where $M$ is the number of sources, and one unlabeled target domain $T$. In the $i$-th source domain $S_i$, suppose $X_i = \{\mathbf{x}_i^j\}_{j=1}^{N_i}$ and $Y_i = \{\mathbf{y}_i^j\}_{j=1}^{N_i}$ are the observed data and corresponding labels drawn from the source distribution $p_i(\mathbf{x}, \mathbf{y})$, where $N_i$ is the number of samples in $S_i$. In the target domain $T$, let $X_T = \{\mathbf{x}_T^j\}_{j=1}^{N_T}$ denote the target data drawn from the target distribution $p_T(\mathbf{x}, \mathbf{y})$ without label observation, where $N_T$ is the number of target samples. Unless otherwise specified, we have two assumptions: (1) homogeneity, i.e. $\mathbf{x}_i^j, \mathbf{x}_T^j \in \mathbb{R}^{d}$, indicating that the data from different domains are observed in the same image space but with different distributions; (2) closed set, i.e. $\mathbf{y}_i^j, \mathbf{y}_T^j \in \mathcal{C}$, where $\mathcal{C}$ is the label set, which means that all the domains share the same space of classes. Based on covariate shift and concept drift Patel et al. (2015), we aim to learn an adaptation model that can correctly predict the labels of a sample from the target domain, trained on $\{(X_i, Y_i)\}_{i=1}^{M}$ and $X_T$.

Figure 1: The framework of the proposed Multi-source Adversarial Domain Aggregation Network (MADAN). The colored solid arrows represent generators, while the black solid arrows indicate the segmentation network $F$. The dashed arrows correspond to different losses.

3 Multi-source Adversarial Domain Aggregation Network

In this section, we introduce the proposed Multi-source Adversarial Domain Aggregation Network (MADAN) for semantic segmentation adaptation. The framework is illustrated in Figure 1, which consists of three components: Dynamic Adversarial Image Generation (DAIG), Adversarial Domain Aggregation (ADA), and Feature-aligned Semantic Segmentation (FSS). DAIG aims to generate adapted images from source domains to the target domain from the perspective of visual appearance while preserving the semantic information with a dynamic segmentation model. In order to reduce the distances among the adapted domains and thus generate a more aggregated unified domain, ADA is proposed, including Cross-domain Cycle Discriminator (CCD) and Sub-domain Aggregation Discriminator (SAD). Finally, FSS learns the domain-invariant representations at the feature-level in an adversarial manner. Table 1 compares MADAN with several state-of-the-art DA methods.

3.1 Dynamic Adversarial Image Generation

The goal of DAIG is to make images from different source domains visually similar to the target images, as if they were drawn from the target domain distribution. To this end, for each source domain $S_i$, we introduce a generator $G_{S_i \to T}$ mapping to the target in order to generate adapted images that fool $D_{T_i}$, a pixel-level adversarial discriminator. $D_{T_i}$ is trained simultaneously with each $G_{S_i \to T}$ to distinguish real target images $X_T$ from adapted images $G_{S_i \to T}(X_i)$. The corresponding GAN loss function is:

$$\mathcal{L}_{GAN}^{i}(G_{S_i \to T}, D_{T_i}) = \mathbb{E}_{\mathbf{x}_T \sim X_T}\big[\log D_{T_i}(\mathbf{x}_T)\big] + \mathbb{E}_{\mathbf{x}_i \sim X_i}\big[\log\big(1 - D_{T_i}(G_{S_i \to T}(\mathbf{x}_i))\big)\big]. \quad (1)$$
pixel feat sem cycle multi aggr one fine
ADDA Tzeng et al. (2017)
CycleGAN Zhu et al. (2017)
PixelDA Bousmalis et al. (2017)
SBADA Russo et al. (2018)
GTA-GAN Sankaranarayanan et al. (2018a)
DupGAN Hu et al. (2018)
CyCADA Hoffman et al. (2018)
DCTN Xu et al. (2018)
MDAN Zhao et al. (2018a)
MMN Peng et al. (2018)
MADAN (ours)
Table 1: Comparison of the proposed MADAN model with several state-of-the-art domain adaptation methods. The full names of each property from the second to the last columns are pixel-level alignment, feature-level alignment, semantic consistency, cycle consistency, multiple sources, domain aggregation, one task network, and fine-grained prediction, respectively.

Since the mapping $G_{S_i \to T}$ is highly under-constrained Goodfellow et al. (2014), we employ an inverse mapping $G_{T \to S_i}$ as well as a cycle-consistency loss Zhu et al. (2017) to enforce $G_{T \to S_i}(G_{S_i \to T}(\mathbf{x}_i)) \approx \mathbf{x}_i$ and vice versa. Similarly, we introduce a discriminator $D_{S_i}$ to classify $X_i$ from $G_{T \to S_i}(X_T)$, with the following GAN loss:

$$\mathcal{L}_{GAN}^{i}(G_{T \to S_i}, D_{S_i}) = \mathbb{E}_{\mathbf{x}_i \sim X_i}\big[\log D_{S_i}(\mathbf{x}_i)\big] + \mathbb{E}_{\mathbf{x}_T \sim X_T}\big[\log\big(1 - D_{S_i}(G_{T \to S_i}(\mathbf{x}_T))\big)\big]. \quad (2)$$

The cycle-consistency loss Zhu et al. (2017), which ensures that the learned mappings $G_{S_i \to T}$ and $G_{T \to S_i}$ are cycle-consistent and thereby prevents them from contradicting each other, is defined as:

$$\mathcal{L}_{cyc}^{i}(G_{S_i \to T}, G_{T \to S_i}) = \mathbb{E}_{\mathbf{x}_i \sim X_i}\big[\lVert G_{T \to S_i}(G_{S_i \to T}(\mathbf{x}_i)) - \mathbf{x}_i \rVert_1\big] + \mathbb{E}_{\mathbf{x}_T \sim X_T}\big[\lVert G_{S_i \to T}(G_{T \to S_i}(\mathbf{x}_T)) - \mathbf{x}_T \rVert_1\big]. \quad (3)$$
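To make the pixel-level objectives concrete, below is a minimal PyTorch sketch of the per-source GAN and cycle-consistency terms in Eqs. (1)-(3). The generator and discriminator modules (G_s2t, G_t2s, D_t, D_s), the binary-cross-entropy-with-logits formulation, and the cycle weight lambda_cyc are illustrative assumptions rather than the exact networks and weighting used in our implementation.

```python
import torch
import torch.nn.functional as F


def gan_losses(D, real, fake):
    """Non-saturating GAN losses for one discriminator D.

    Returns (discriminator_loss, generator_loss) for a batch of real and
    generated (fake) images.
    """
    d_real = D(real)
    d_fake = D(fake.detach())  # detach so the generator is not updated here
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    g_out = D(fake)
    g_loss = F.binary_cross_entropy_with_logits(g_out, torch.ones_like(g_out))
    return d_loss, g_loss


def pixel_level_losses(G_s2t, G_t2s, D_t, D_s, x_src, x_tgt, lambda_cyc=10.0):
    """Per-source pixel-level objective combining Eqs. (1)-(3)."""
    fake_tgt = G_s2t(x_src)  # adapted source image, should fool D_t  (Eq. (1))
    fake_src = G_t2s(x_tgt)  # reverse mapping, should fool D_s       (Eq. (2))
    d_t_loss, g_s2t_loss = gan_losses(D_t, x_tgt, fake_tgt)
    d_s_loss, g_t2s_loss = gan_losses(D_s, x_src, fake_src)
    # Cycle consistency (Eq. (3)): each image should survive a round trip.
    cyc = F.l1_loss(G_t2s(fake_tgt), x_src) + F.l1_loss(G_s2t(fake_src), x_tgt)
    gen_loss = g_s2t_loss + g_t2s_loss + lambda_cyc * cyc
    disc_loss = d_t_loss + d_s_loss
    return gen_loss, disc_loss
```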

The adapted images are expected to contain the same semantic information as the original source images, but semantic consistency is only partially constrained by the cycle-consistency loss. The semantic consistency loss in CyCADA Hoffman et al. (2018) was proposed to better preserve semantic information: $\mathbf{x}_i$ and $G_{S_i \to T}(\mathbf{x}_i)$ are both fed into a segmentation model $F_i$ pretrained on $S_i$. However, since $\mathbf{x}_i$ and $G_{S_i \to T}(\mathbf{x}_i)$ are from different domains, employing the same segmentation model $F_i$ to obtain the segmentation results and then computing the semantic consistency loss may be detrimental to image generation. Ideally, the adapted images should be fed into a network trained on the target domain, which is infeasible since target domain labels are not available in UDA. Instead of employing $F_i$ on $G_{S_i \to T}(\mathbf{x}_i)$, we propose to dynamically update a network $F_A$, which takes $G_{S_i \to T}(\mathbf{x}_i)$ as input, so that its optimal input domain (the domain that the network performs best on) gradually changes from that of $S_i$ to that of $T$. We employ the task segmentation model $F$ trained on the adapted domain as $F_A$, i.e. $F_A = F$, which has two advantages: (1) the adapted domain becomes the optimal input domain of $F_A$, and as $F$ is trained to perform better on the target domain, the semantic loss after $F_A$ promotes $G_{S_i \to T}$ to generate images that are closer to the target domain at the pixel level; (2) since $F_A$ and $F$ can share parameters, no additional training or memory space is introduced, which is quite efficient. The proposed dynamic semantic consistency (DSC) loss is:

$$\mathcal{L}_{DSC}^{i}(G_{S_i \to T}, F) = \mathbb{E}_{\mathbf{x}_i \sim X_i}\,\mathrm{KL}\big(F_i(\mathbf{x}_i)\,\|\,F(G_{S_i \to T}(\mathbf{x}_i))\big), \quad (4)$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the KL divergence between two distributions.
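As a rough illustration of Eq. (4), the sketch below computes the KL term between the frozen source-pretrained segmenter $F_i$ and the dynamically updated task segmenter $F$ on the adapted image; the function and module names, and the assumption that both networks return raw logits of shape (N, C, H, W), are ours.

```python
import torch
import torch.nn.functional as F


def dsc_loss(F_src_pretrained, F_task, G_s2t, x_src):
    """Dynamic semantic consistency loss of Eq. (4).

    Compares the pixel-wise class distribution predicted by the frozen
    source-pretrained segmenter on the original image with the distribution
    predicted by the dynamically updated task segmenter on the adapted image.
    """
    with torch.no_grad():
        # Reference distribution from the segmenter pretrained on the source domain.
        p_src = F.softmax(F_src_pretrained(x_src), dim=1)
    # Prediction of the dynamic task segmenter on the adapted (translated) image.
    log_p_adapted = F.log_softmax(F_task(G_s2t(x_src)), dim=1)
    # KL(p_src || p_adapted), summed over classes and pixels, averaged over the batch.
    return F.kl_div(log_p_adapted, p_src, reduction="batchmean")
```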

3.2 Adversarial Domain Aggregation

We can train a separate segmentation model for each adapted domain and combine the different predictions with specific weights for target images Xu et al. (2018); Peng et al. (2018), or we can simply combine all adapted domains together and train one model Zhao et al. (2018a). In the first strategy, it is challenging to determine the weights for the different adapted domains. Moreover, each target image needs to be fed into all segmentation models at inference time, which is rather inefficient. For the second strategy, since the alignment space is high-dimensional, although the adapted domains are relatively well aligned with the target, they may be significantly mis-aligned with each other. In order to mitigate this issue, we propose adversarial domain aggregation to make different adapted domains more closely aggregated with two kinds of discriminators. One is the sub-domain aggregation discriminator (SAD), which is designed to directly make the different adapted domains indistinguishable. For the $i$-th adapted domain, a discriminator $D_{A_i}$ is introduced with the following loss function:

$$\mathcal{L}_{SAD}^{i} = \mathbb{E}_{\mathbf{x}_i \sim X_i}\big[\log D_{A_i}(G_{S_i \to T}(\mathbf{x}_i))\big] + \sum_{j \neq i}\mathbb{E}_{\mathbf{x}_j \sim X_j}\big[\log\big(1 - D_{A_i}(G_{S_j \to T}(\mathbf{x}_j))\big)\big]. \quad (5)$$

The other is the cross-domain cycle discriminator (CCD). For each source domain $S_i$, we transfer the images from the adapted domains back to $S_i$ using $G_{T \to S_i}$ and employ a discriminator $D'_{S_i}$ to classify $X_i$ from $G_{T \to S_i}(G_{S_j \to T}(X_j))$, which corresponds to the following loss function:

$$\mathcal{L}_{CCD}^{i} = \mathbb{E}_{\mathbf{x}_i \sim X_i}\big[\log D'_{S_i}(\mathbf{x}_i)\big] + \sum_{j \neq i}\mathbb{E}_{\mathbf{x}_j \sim X_j}\big[\log\big(1 - D'_{S_i}(G_{T \to S_i}(G_{S_j \to T}(\mathbf{x}_j)))\big)\big]. \quad (6)$$
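For concreteness in the two-source case used in our experiments, below is one possible PyTorch sketch of the sub-domain aggregation discriminator, following the one-vs-rest reading of Eq. (5); the binary formulation, label-flipping generator objective, and module names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F


def sad_losses(D_agg, adapted_1, adapted_2):
    """Sub-domain aggregation discriminator losses for two adapted domains.

    D_agg is trained to tell which source an adapted image was generated from,
    while the generators are trained with flipped labels so that the two
    adapted domains become indistinguishable. Returns (disc_loss, gen_loss).
    """
    d1 = D_agg(adapted_1.detach())
    d2 = D_agg(adapted_2.detach())
    disc_loss = (F.binary_cross_entropy_with_logits(d1, torch.ones_like(d1))
                 + F.binary_cross_entropy_with_logits(d2, torch.zeros_like(d2)))
    # Generator side: flip the labels so D_agg cannot separate the two domains.
    g1 = D_agg(adapted_1)
    g2 = D_agg(adapted_2)
    gen_loss = (F.binary_cross_entropy_with_logits(g1, torch.zeros_like(g1))
                + F.binary_cross_entropy_with_logits(g2, torch.ones_like(g2)))
    return disc_loss, gen_loss
```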

Please note that a more sophisticated combination of the different discriminators' losses, which better aggregates the domains with larger distances, might further improve the performance. We leave this as future work and plan to explore this direction by dynamically weighting the loss terms and incorporating prior domain knowledge of the sources.

3.3 Feature-aligned Semantic Segmentation

After adversarial domain aggregation, the adapted images of different domains are more closely aggregated and aligned. Meanwhile, the semantic consistency loss in dynamic adversarial image generation ensures that the semantic information, i.e. the segmentation labels, is preserved before and after image translation. Suppose the images of the unified aggregated domain are $X_A = \bigcup_{i=1}^{M} G_{S_i \to T}(X_i)$ with corresponding labels $Y_A = \bigcup_{i=1}^{M} Y_i$. We can then train a task segmentation model $F$ based on $X_A$ and $Y_A$ with the following cross-entropy loss:

$$\mathcal{L}_{task}(F) = -\mathbb{E}_{(\mathbf{x}_A, \mathbf{y}_A) \sim (X_A, Y_A)} \sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} \mathbb{1}_{[c = \mathbf{y}_A^{(h,w)}]} \log\big(\sigma\big(F(\mathbf{x}_A)^{(c,h,w)}\big)\big), \quad (7)$$

where $C$ is the number of classes, $H$ and $W$ are the height and width of the adapted images, $\sigma$ is the softmax function, $\mathbb{1}$ is an indicator function, and $F(\mathbf{x}_A)^{(c,h,w)}$ is the value of $F(\mathbf{x}_A)$ at index $(c,h,w)$.

Further, we impose a feature-level alignment between $X_A$ and $X_T$, which can improve the segmentation performance of $F$ on the target domain during inference. We introduce a discriminator $D_F$ to achieve this goal. The feature-level GAN loss is defined as:

$$\mathcal{L}_{feat}(F, D_F) = \mathbb{E}_{\mathbf{x}_A \sim X_A}\big[\log D_F(f(\mathbf{x}_A))\big] + \mathbb{E}_{\mathbf{x}_T \sim X_T}\big[\log\big(1 - D_F(f(\mathbf{x}_T))\big)\big], \quad (8)$$

where $f(\cdot)$ denotes the output of the last convolution layer (i.e. a feature map) of the encoder in $F$.
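A minimal sketch of the feature-level alignment in Eq. (8): a discriminator on the encoder's last feature map separates aggregated-domain features from target features, and the encoder is trained adversarially to fool it. The encoder handle, module names, and binary-cross-entropy-with-logits form are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def feature_alignment_losses(encoder, D_feat, x_aggregated, x_target):
    """Feature-level GAN loss of Eq. (8) on the encoder's last feature maps.

    D_feat learns to separate aggregated-domain features from target features,
    while the encoder of the segmentation network learns to make them
    indistinguishable. Returns (disc_loss, encoder_loss).
    """
    feat_agg = encoder(x_aggregated)
    feat_tgt = encoder(x_target)
    d_agg = D_feat(feat_agg.detach())
    d_tgt = D_feat(feat_tgt.detach())
    disc_loss = (F.binary_cross_entropy_with_logits(d_agg, torch.ones_like(d_agg))
                 + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    # Adversarial term for the encoder: make target features look like aggregated ones.
    e_tgt = D_feat(feat_tgt)
    encoder_loss = F.binary_cross_entropy_with_logits(e_tgt, torch.ones_like(e_tgt))
    return disc_loss, encoder_loss
```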

3.4 MADAN Learning

The proposed MADAN learning framework utilizes adaptation techniques including pixel-level alignment, cycle-consistency, semantic consistency, domain aggregation, and feature-level alignment. Combining all these components, the overall objective loss function of MADAN is:

$$\mathcal{L}_{MADAN} = \sum_{i=1}^{M}\Big(\mathcal{L}_{GAN}^{i}(G_{S_i \to T}, D_{T_i}) + \mathcal{L}_{GAN}^{i}(G_{T \to S_i}, D_{S_i}) + \mathcal{L}_{cyc}^{i} + \mathcal{L}_{DSC}^{i} + \mathcal{L}_{SAD}^{i} + \mathcal{L}_{CCD}^{i}\Big) + \mathcal{L}_{task}(F) + \mathcal{L}_{feat}(F, D_F). \quad (9)$$

The training process corresponds to solving for a target model $F^{*}$ according to the optimization:

$$F^{*} = \arg\min_{F}\min_{G}\max_{D}\;\mathcal{L}_{MADAN}(F, G, D), \quad (10)$$

where $G$ and $D$ represent all the generators and discriminators in Eq. (9), respectively.
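Putting the pieces together, training alternates between discriminator updates and generator/segmenter updates, as sketched below; the optimizer grouping, the dictionary-based loss interface, and the exact alternation schedule are illustrative assumptions rather than a prescription of our training code.

```python
def madan_training_step(batch, models, optimizers, compute_losses):
    """One alternating optimization step for Eq. (10).

    `compute_losses(batch, models)` is assumed to return a dict with two scalar
    tensors: "disc" (all discriminator terms, with generator outputs detached)
    and "gen" (the objective minimized by the generators and the segmentation
    network). This is a schematic outline, not the exact schedule we use.
    """
    # 1) Update all discriminators (D in Eq. (10)).
    disc_loss = compute_losses(batch, models)["disc"]
    optimizers["discriminators"].zero_grad()
    disc_loss.backward()
    optimizers["discriminators"].step()

    # 2) Re-evaluate the objective against the updated discriminators and update
    #    the generators and the segmentation network (G and F in Eq. (10)).
    gen_loss = compute_losses(batch, models)["gen"]
    optimizers["generators_and_segmenter"].zero_grad()
    gen_loss.backward()
    optimizers["generators_and_segmenter"].step()
    return disc_loss.item(), gen_loss.item()
```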

4 Experiments

In this section, we first introduce the experimental settings and then compare the segmentation results of the proposed MADAN and several state-of-the-art approaches both quantitatively and qualitatively, followed by some ablation studies.

Standards Method road sidewalk building wall fence pole t-light t-sign vegetation sky person rider car bus m-bike bicycle mIoU

Source-only GTA 54.1 19.6 47.4 3.3 5.2 3.3 0.5 3.0 69.2 43.0 31.3 0.1 59.3 8.3 0.2 0.0 21.7
SYNTHIA 3.9 14.5 45.0 0.7 0.0 14.6 0.7 2.6 68.2 68.4 31.5 4.6 31.5 7.4 0.3 1.4 18.5
GTA+SYNTHIA 44.0 19.0 60.1 11.1 13.7 10.1 5.0 4.7 74.7 65.3 40.8 2.3 43.0 15.9 1.3 1.4 25.8
GTA-only DA FCN Wld Hoffman et al. (2016) 70.4 32.4 62.1 14.9 5.4 10.9 14.2 2.7 79.2 64.6 44.1 4.2 70.4 7.3 3.5 0.0 27.1
CDA Zhang et al. (2017) 74.8 22.0 71.7 6.0 11.9 8.4 16.3 11.1 75.7 66.5 38.0 9.3 55.2 18.9 16.8 14.6 28.9
ROAD Chen et al. (2018) 85.4 31.2 78.6 27.9 22.2 21.9 23.7 11.4 80.7 68.9 48.5 14.1 78.0 23.8 8.3 0.0 39.0
AdaptSeg Tsai et al. (2018) 87.3 29.8 78.6 21.1 18.2 22.5 21.5 11.0 79.7 71.3 46.8 6.5 80.1 26.9 10.6 0.3 38.3
CyCADA Hoffman et al. (2018) 85.2 37.2 76.5 21.8 15.0 23.8 22.9 21.5 80.5 60.7 50.5 9.0 76.9 28.2 4.5 0.0 38.7
DCAN Wu et al. (2018) 82.3 26.7 77.4 23.7 20.5 20.4 30.3 15.9 80.9 69.5 52.6 11.1 79.6 21.2 17.0 6.7 39.8
FCN Wld Hoffman et al. (2016) 11.5 19.6 30.8 4.4 0.0 20.3 0.1 11.7 42.3 68.7 51.2 3.8 54.0 3.2 0.2 0.6 20.2
CDA Zhang et al. (2017) 65.2 26.1 74.9 0.1 0.5 10.7 3.7 3.0 76.1 70.6 47.1 8.2 43.2 20.7 0.7 13.1 29.0
ROAD Chen et al. (2018) 77.7 30.0 77.5 9.6 0.3 25.8 10.3 15.6 77.6 79.8 44.5 16.6 67.8 14.5 7.0 23.8 36.2
CyCADA Hoffman et al. (2018) 66.2 29.6 65.3 0.5 0.2 15.1 4.5 6.9 67.1 68.2 42.8 14.1 51.2 12.6 2.4 20.7 29.2
SYNTHIA-only DA DCAN Wu et al. (2018) 79.9 30.4 70.8 1.6 0.6 22.3 6.7 23.0 76.9 73.9 41.9 16.7 61.7 11.5 10.3 38.6 35.4
Source-combined DA CyCADA Hoffman et al. (2018) 82.8 35.8 78.2 17.5 15.1 10.8 6.1 19.4 78.6 77.2 44.5 15.3 74.9 17.0 10.3 12.9 37.3
MDAN Zhao et al. (2018a) 64.2 19.7 63.8 13.1 19.4 5.5 5.2 6.8 71.6 61.1 42.0 12.0 62.7 2.9 12.3 8.1 29.4
Multi-source DA MADAN (Ours) 86.2 37.7 79.1 20.1 17.8 15.5 14.5 21.4 78.5 73.4 49.7 16.8 77.8 28.3 17.7 27.5 41.4
Oracle-Train on Tgt FCN Long et al. (2015a) 96.4 74.5 87.1 35.3 37.8 36.4 46.9 60.1 89.0 89.8 65.6 35.9 76.9 64.1 40.5 65.1 62.6
Table 2: Comparison with the state-of-the-art DA methods for semantic segmentation from GTA and SYNTHIA to Cityscapes. The best class-wise IoU and mIoU trained on the source domains are emphasized in bold (similar below).
Standards Method road sidewalk building wall fence pole t-light t-sign vegetation sky person rider car bus m-bike bicycle mIoU

Source-only GTA 50.2 18.0 55.1 3.1 7.8 7.0 0.0 3.5 61.0 50.4 19.2 0.0 58.1 3.2 19.8 0.0 22.3
SYNTHIA 7.0 6.0 50.5 0.0 0.0 15.1 0.2 2.4 60.3 85.6 16.5 0.5 36.7 3.3 0.0 3.5 17.1
GTA+SYNTHIA 54.5 19.6 64.0 3.2 3.6 5.2 0.0 0.0 61.3 82.2 13.9 0.0 55.5 16.7 13.4 0.0 24.6
GTA-only DA CyCADA Hoffman et al. (2018) 77.9 26.8 68.8 13.0 19.7 13.5 18.2 22.3 64.2 84.2 39.0 22.6 72.0 11.5 15.9 2.0 35.7
SYNTHIA-only DA CyCADA Hoffman et al. (2018) 55 13.8 45.2 0.1 0.0 13.2 0.5 10.6 63.3 67.4 22.0 6.9 52.5 10.5 10.4 13.3 24.0
Source-combined DA CyCADA Hoffman et al. (2018) 61.5 27.6 72.1 6.5 2.8 15.7 10.8 18.1 78.3 73.8 44.9 16.3 41.5 21.1 21.8 25.9 33.7
MDAN Zhao et al. (2018a) 35.9 15.8 56.9 5.8 16.3 9.5 8.6 6.2 59.1 80.1 24.5 9.9 53.8 11.8 2.9 1.6 25.0
Multi-source DA MADAN (Ours) 60.2 29.5 66.6 16.9 10.0 16.6 10.9 16.4 78.8 75.1 47.5 17.3 48.0 24.0 13.2 17.3 36.3
Oracle-Train on Tgt FCN Long et al. (2015a) 91.7 54.7 79.5 25.9 42.0 23.6 30.9 34.6 81.2 91.6 49.6 23.5 85.4 64.2 28.4 41.1 53.0
Table 3: Comparison with the state-of-the-art DA methods for semantic segmentation from GTA and SYNTHIA to BDDS. The best class-wise IoU and mIoU are emphasized in bold.

4.1 Experimental Settings

Datasets. In our adaptation experiments, we use synthetic GTA Richter et al. (2016) and SYNTHIA Ros et al. (2016) datasets as the source domains and real Cityscapes Cordts et al. (2016) and BDDS Yu et al. (2018) datasets as the target domains.

Baselines. We compare MADAN with the following methods. (1) Source-only, i.e. training on the source domains and testing on the target domain directly, which can be viewed as a lower bound of DA. (2) Single-source DA, i.e. performing multi-source DA by applying single-source DA methods, including FCN Wld Hoffman et al. (2016), CDA Zhang et al. (2017), ROAD Chen et al. (2018), AdaptSeg Tsai et al. (2018), CyCADA Hoffman et al. (2018), and DCAN Wu et al. (2018). (3) Multi-source DA, i.e. extending a single-source DA method to the multi-source setting, including MDAN Zhao et al. (2018a). For comparison, we also report the results of an oracle setting, where the segmentation model is both trained and tested on the target domain. For the source-only and single-source DA standards, we employ two strategies: (1) single-source, i.e. performing adaptation on each single source; (2) source-combined, i.e. combining all source domains into a traditional single source. For MDAN, we extend the original classification network to our segmentation task.

Evaluation Metric. Following Hoffman et al. (2016); Zhang et al. (2017); Hoffman et al. (2018); Yue et al. (2019), we employ mean intersection-over-union (mIoU) to evaluate the segmentation results. In the experiments, we take the 16 classes shared by GTA and SYNTHIA that are compatible with Cityscapes and BDDS for all mIoU evaluations.
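For reference, a generic sketch of the mIoU computation over the 16 evaluated classes via a confusion matrix; the ignore_index convention (255) is an assumption, and this is not the exact evaluation script behind the reported numbers.

```python
import numpy as np


def mean_iou(preds, labels, num_classes=16, ignore_index=255):
    """Mean intersection-over-union over `num_classes` classes.

    `preds` and `labels` are integer class maps of the same shape; pixels whose
    label equals `ignore_index` are excluded from the confusion matrix.
    """
    mask = labels != ignore_index
    p, l = preds[mask], labels[mask]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(l * num_classes + p, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)
    return iou.mean()
```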

Implementation Details. Although MADAN could be trained in an end-to-end manner, due to constrained hardware resources, we train it in three stages. First, we train two CycleGANs (9 residual blocks for the generator and 4 convolution layers for the discriminator) Zhu et al. (2017) without the semantic consistency loss, and then train an FCN on the adapted images with the corresponding labels from the source domains. Second, after updating the dynamic segmentation model with the FCN trained above, we generate adapted images using CycleGAN with the proposed DSC loss in Eq. (4) and aggregate the different adapted domains using SAD and CCD. Finally, we train an FCN on the newly adapted images in the aggregated domain with feature-level alignment. The above stages are trained iteratively.

We choose FCN Long et al. (2015a) as our semantic segmentation network, and, as the VGG family of networks is commonly used in reporting DA results, we use VGG-16 Simonyan and Zisserman (2015) as the FCN backbone. The weights of the feature extraction layers are initialized from models trained on ImageNet Deng et al. (2009). The network is implemented in PyTorch and trained with the Adam optimizer Kingma and Ba (2015) using a batch size of 8 and an initial learning rate of 1e-4. All images are resized and then cropped during the training of the pixel-level adaptation, which runs for 20 epochs. SAD and CCD are frozen in the first 5 and 10 epochs, respectively.
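The following sketch wires up the hyperparameters stated above (Adam with an initial learning rate of 1e-4, and SAD/CCD frozen for the first 5 and 10 of the 20 epochs, respectively); the single shared optimizer, parameter grouping, and freezing mechanism are illustrative assumptions.

```python
import itertools
import torch


def configure_training(generators, discriminators, sad_discs, ccd_discs, segmenter):
    """Optimizer and epoch-dependent freezing schedule described above.

    A single Adam optimizer with the stated initial learning rate is assumed;
    the batch size of 8 would be set on the DataLoader (not shown).
    """
    params = itertools.chain(
        *(m.parameters() for m in list(generators) + list(discriminators)
          + list(sad_discs) + list(ccd_discs) + [segmenter])
    )
    optimizer = torch.optim.Adam(params, lr=1e-4)

    def set_epoch(epoch):
        # SAD discriminators stay frozen for the first 5 epochs, CCD for the first 10.
        for m in sad_discs:
            m.requires_grad_(epoch >= 5)
        for m in ccd_discs:
            m.requires_grad_(epoch >= 10)

    return optimizer, set_epoch
```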

4.2 Comparison with State-of-the-art

The performance comparisons between the proposed MADAN model and the other baselines, including source-only, single-source DA, and multi-source DA, as measured by class-wise IoU and mIoU are shown in Table 2 and Table 3. From the results, we have the following observations:

Figure 2: Qualitative semantic segmentation result from GTA and SYNTHIA to Cityscapes. From left to right are: (a) original image, (b) ground truth annotation, (c) source only from GTA, (d) CycleGANs on GTA and SYNTHIA, (e) +CCD+DSC, (f) +SAD+DSC, (g) +CCD+SAD+DSC, and (h) +CCD+SAD+DSC+Feat (MADAN).
Figure 3: Visualization of image translation. From left to right are: (a) original source image, (b) CycleGAN, (c) CycleGAN+DSC, (d) CycleGAN+CCD+DSC, (e) CycleGAN+SAD+DSC, (f) CycleGAN+CCD+SAD+DSC, and (g) target Cityscapes image. The top two rows and bottom rows are GTA → Cityscapes and SYNTHIA → Cityscapes, respectively.

(1)

The source-only method that directly transfers the segmentation models trained on the source domains to the target domain obtains the worst performance in most adaptation settings. This is expected, because the joint probability distributions of observed images and labels differ significantly between the sources and the target due to domain shift. Without domain adaptation, direct transfer cannot handle this domain gap well. Simply combining different source domains performs better than each single source, which indicates the superiority of multiple sources over a single source despite the domain shift among the sources.

Source Method road sidewalk building wall fence pole t-light t-sign vegetation sky person rider car bus m-bike bicycle mIoU

CycleGAN+SC 85.6 30.7 74.7 14.4 13.0 17.6 13.7 5.8 74.6 69.9 38.2 3.5 72.3 5.0 3.6 0.0 32.7
CycleGAN+DSC 76.6 26.0 76.3 17.3 18.8 13.6 13.2 17.9 78.8 63.9 47.4 14.8 72.2 24.1 19.8 10.8 38.1
CyCADA w/ SC 85.2 37.2 76.5 21.8 15.0 23.8 21.5 22.9 80.5 60.7 50.5 9.0 76.9 28.2 9.8 0.0 38.7
GTA CyCADA w/ DSC 84.1 27.3 78.3 21.6 18.0 13.8 14.1 16.7 78.1 66.9 47.8 15.4 78.7 23.4 22.3 14.4 40.0
CycleGAN+SC 64.0 29.4 61.7 0.3 0.1 15.3 3.4 5.0 63.4 68.4 39.4 11.5 46.6 10.4 2.0 16.4 27.3
CycleGAN + DSC 68.4 29.0 65.2 0.6 0.0 15.0 0.1 4.0 75.1 70.6 45.0 11.0 54.9 18.2 3.9 26.7 30.5
CyCADA w/ SC 66.2 29.6 65.3 0.5 0.2 15.1 4.5 6.9 67.1 68.2 42.8 14.1 51.2 12.6 2.4 20.7 29.2
SYNTHIA CyCADA w/ DSC 69.8 27.2 68.5 5.8 0.0 11.6 0.0 2.8 75.7 58.3 44.3 10.5 68.1 22.1 11.8 32.7 31.8
Table 4: Comparison between the proposed dynamic semantic consistency (DSC) loss in MADAN and the original SC loss in Hoffman et al. (2018) on Cityscapes. The better mIoU for each pair is emphasized in bold.
Source Method road sidewalk building wall fence pole t-light t-sign vegetation sky person rider car bus m-bike bicycle mIoU

CycleGAN+SC 62.1 20.9 59.2 6.0 23.5 12.8 9.2 22.4 65.9 78.4 34.7 11.4 64.4 14.2 10.9 1.9 31.1
CycleGAN+DSC 74.4 23.7 65.0 8.6 17.2 10.7 14.2 19.7 59.0 82.8 36.3 19.6 69.7 4.3 17.6 4.2 32.9
CyCADA w/ SC 68.8 23.7 67.0 7.5 16.2 9.4 11.3 22.2 60.5 82.1 36.1 20.6 63.2 15.2 16.6 3.4 32.0
GTA CyCADA w/ DSC 70.5 32.4 68.2 10.5 17.3 18.4 16.6 21.8 65.6 82.2 38.1 16.1 73.3 20.8 12.6 3.7 35.5
CycleGAN+SC 50.6 13.6 50.5 0.2 0.0 7.9 0.0 0.0 63.8 58.3 21.6 7.8 50.2 1.8 2.2 19.9 21.8
CycleGAN + DSC 57.3 13.4 56.1 2.7 14.1 9.8 7.7 17.1 65.5 53.1 11.4 1.4 51.4 13.9 3.9 8.7 22.5
CyCADA w/ SC 49.5 11.1 46.6 0.7 0.0 10.0 0.4 7.0 61.0 74.6 17.5 7.2 50.9 5.8 13.1 4.3 23.4
SYNTHIA CyCADA w/ DSC 55 13.8 45.2 0.1 0.0 13.2 0.5 10.6 63.3 67.4 22.0 6.9 52.5 10.5 10.4 13.3 24.0
Table 5: Comparison between the proposed dynamic semantic consistency (DSC) loss in MADAN and the original SC loss in Hoffman et al. (2018) on BDDS. The better mIoU for each pair is emphasized in bold.

(2) Comparing source-only with single-source DA respectively on GTA and SYNTHIA, it is clear that all adaptation methods perform better, which demonstrates the effectiveness of domain adaptation in semantic segmentation. Comparing the results of CyCADA in single-source and source-combined settings, we can conclude that simply combining different source domains and performing single-source DA may result in performance degradation.

(3) MADAN achieves the highest mIoU among all adaptation methods, benefiting from the joint consideration of pixel-level and feature-level alignments, cycle-consistency, dynamic semantic consistency, domain aggregation, and multiple sources. MADAN also significantly outperforms source-combined DA, in which domain shift also exists among the different sources. By bridging this gap, multi-source DA can boost the adaptation performance. On the one hand, compared to single-source DA Hoffman et al. (2016); Zhang et al. (2017); Chen et al. (2018); Tsai et al. (2018); Hoffman et al. (2018); Wu et al. (2018), MADAN utilizes more useful information from multiple sources. On the other hand, other multi-source DA methods Xu et al. (2018); Zhao et al. (2018a); Peng et al. (2018) only consider feature-level alignment, which may be enough for coarse-grained tasks, e.g. image classification, but is obviously insufficient for fine-grained tasks, e.g. semantic segmentation, a pixel-wise prediction task. In addition, we consider pixel-level alignment with a dynamic semantic consistency loss and further aggregate the different adapted domains.

(4) The oracle method that is trained on the target domain performs significantly better than the others. However, training this model requires the ground-truth segmentation labels from the target domain, which are unavailable in UDA settings. We can deem this performance an upper bound of UDA. Obviously, a large performance gap still exists between all adaptation algorithms and the oracle method, requiring further efforts on DA.

Visualization. The qualitative semantic segmentation results are shown in Figure 2. We can clearly see that after adaptation by the proposed method, the visual segmentation results are improved notably. We also visualize the results of pixel-level alignment from GTA and SYNTHIA to Cityscapes in Figure 3. We can see that with our final proposed pixel-level alignment method (f), the styles of the images are close to Cityscapes while the semantic information is well preserved.

4.3 Ablation Study

First, we compare the proposed dynamic semantic consistency (DSC) loss in MADAN with the original semantic consistency (SC) loss in CyCADA Hoffman et al. (2018). As shown in Table 4 and Table 5, DSC achieves better results for all simulation-to-real adaptations. Having demonstrated its value, we employ the DSC loss in subsequent experiments.

Second, we incrementally investigate the effectiveness of different components in MADAN on both Cityscapes and BDDS. The results are shown in Table 6 and Table 7. We can observe that: (1) both domain aggregation methods, i.e. SAD and CCD, can obtain better performance by making different adapted domains more closely aggregated, while SAD outperforms CCD; (2) adding the DSC loss could further improve the mIoU score, again demonstrating the effectiveness of DSC; (3) feature-level alignments also contribute to the adaptation task; (4) the modules are orthogonal to each other to some extent, since adding each one of them does not introduce performance degradation.

Method road sidewalk building wall fence pole t-light t-sign vegetation sky person rider car bus m-bike bicycle mIoU

Baseline 74.9 27.6 67.5 9.1 10.0 12.8 1.4 13.6 63.0 47.1 41.7 13.5 60.8 22.4 6.0 8.1 30.0
+SAD 79.7 33.2 75.9 11.8 3.6 15.9 8.6 15.0 74.7 78.9 44.2 17.1 68.2 24.9 16.7 14.0 36.4
+CCD 82.1 36.3 69.8 9.5 4.9 11.8 12.5 15.3 61.3 54.1 49.7 10.0 70.7 9.7 19.7 12.4 33.1
+SAD+CCD 82.7 35.3 76.5 15.4 19.4 14.1 7.2 13.9 75.3 74.2 50.9 19.0 66.5 26.6 16.3 6.7 37.5
+SAD+DSC 83.1 36.6 78.0 23.3 12.6 11.8 3.5 11.3 75.5 74.8 42.2 17.9 72.2 27.2 13.8 10.0 37.1
+CCD+DSC 86.8 36.9 78.6 16.2 8.1 17.7 8.9 13.7 75.0 74.8 42.2 18.2 74.6 22.5 22.9 12.7 38.1
+SAD+CCD+DSC 84.2 35.1 78.7 17.1 18.7 15.4 15.7 24.1 77.9 72.0 49.2 17.1 75.2 24.1 18.9 19.2 40.2
+SAD+CCD+DSC+Feat 86.2 37.7 79.1 20.1 17.8 15.5 14.5 21.4 78.5 73.4 49.7 16.8 77.8 28.3 17.7 27.5 41.4
Table 6: Ablation study on different components in MADAN on Cityscapes. Baseline denotes using pixel-level alignment with cycle-consistency, +SAD denotes using the sub-domain aggregation discriminator, +CCD denotes using the cross-domain cycle discriminator, +DSC denotes using the dynamic semantic consistency loss, and +Feat denotes using feature-level alignment.
Method road sidewalk building wall fence pole t-light t-sign vegetation sky person rider car bus m-bike bicycle mIoU

Baseline 31.3 17.4 55.4 2.6 12.9 12.4 6.5 18.0 63.2 79.9 21.2 5.6 44.1 14.2 6.1 11.7 24.6
+SAD 58.9 18.7 61.8 6.4 10.7 17.1 20.3 17.0 67.3 83.7 21.1 6.7 66.6 22.7 4.5 14.9 31.2
+CCD 52.7 13.6 63.0 6.6 11.2 17.8 21.5 18.9 67.4 84.0 9.2 2.2 63.0 21.6 2.0 14.0 29.3
+SAD+CCD 61.6 20.2 61.7 7.2 12.1 18.5 19.8 16.7 64.2 83.2 25.9 7.3 66.8 22.2 5.3 14.9 31.8
+SAD+DSC 60.2 29.5 66.6 16.9 10.0 16.6 10.9 16.4 78.8 75.1 47.5 17.3 48.0 24.0 13.2 17.3 34.3
+CCD+DSC 61.5 27.6 72.1 6.5 12.8 15.7 10.8 18.1 78.3 73.8 44.9 16.3 41.5 21.1 21.8 15.9 33.7
+SAD+CCD+DSC 64.6 38.0 75.8 17.8 13.0 9.8 5.9 4.6 74.8 76.9 41.8 24.0 69.0 20.4 23.7 11.3 35.3
+SAD+CCD+DSC+Feat 69.1 36.3 77.9 21.5 17.4 13.8 4.1 16.2 76.5 76.2 42.2 16.4 56.3 22.4 24.5 13.5 36.3
Table 7: Ablation study on different components in MADAN on BDDS.

4.4 Discussions

Computation cost. Since the proposed framework deals with a harder problem, i.e. multi-source domain adaptation, more modules are used to align the different sources, which results in a larger model. In our experiments, MADAN is trained on 4 NVIDIA Tesla P40 GPUs for 40 hours using two source domains, which is about twice the training time of a single source. However, MADAN does not introduce any additional computation during inference, which is the primary concern in real industrial applications, e.g. autonomous driving.

On the poorly performing classes. There are two main reasons for the poor performance on certain classes (e.g. fence and pole): 1) lack of images containing these classes and 2) structural differences of objects between simulation images and real images (e.g. the trees in simulation images are much taller than those in real images). Generating more images for different classes and improving the diversity of objects in the simulation environment are two promising directions for us to explore in future work that may help with these problems.

5 Conclusion

In this paper, we studied multi-source domain adaptation for semantic segmentation from synthetic data to real data. A novel framework, termed Multi-source Adversarial Domain Aggregation Network (MADAN), is designed with three components. For each source domain, we generated adapted images with a novel dynamic semantic consistency loss. Further, we proposed a sub-domain aggregation discriminator and a cross-domain cycle discriminator to better aggregate the different adapted domains. Together with other techniques such as pixel- and feature-level alignments as well as cycle-consistency, MADAN achieves 15.6%, 1.6%, 4.1%, and 12.0% mIoU improvements compared with the best source-only, the best single-source DA, source-combined DA, and other multi-source DA baselines, respectively, on Cityscapes from GTA and SYNTHIA, and 11.7%, 0.6%, 2.6%, and 11.3% on BDDS. For further studies, we plan to investigate multi-modal DA, such as using both image and LiDAR data, to further boost the adaptation performance. Improving the computational efficiency of MADAN, with techniques such as neural architecture search, is another direction worth investigating.

Acknowledgments

This work is supported by Berkeley DeepDrive and the National Natural Science Foundation of China (No. 61701273).

References

  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. Cited by: §1.
  • [2] C. J. Becker, C. M. Christoudias, and P. Fua (2013) Non-linear domain adaptation with boosting. In Advances in Neural Information Processing Systems, pp. 485–493. Cited by: §1.
  • [3] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine learning 79 (1-2), pp. 151–175. Cited by: §1.
  • [4] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3722–3731. Cited by: §1, Table 1.
  • [5] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. In Advances in Neural Information Processing Systems, pp. 343–351. Cited by: §1.
  • [6] R. Chattopadhyay, Q. Sun, W. Fan, I. Davidson, S. Panchanathan, and J. Ye (2012) Multisource domain adaptation and its application to early detection of fatigue. ACM Transactions on Knowledge Discovery from Data 6 (4), pp. 18. Cited by: §1.
  • [7] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §1.
  • [8] Y. Chen, W. Li, and L. Van Gool (2018) Road: reality oriented adaptation for semantic segmentation of urban scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7892–7901. Cited by: §1, §4.1, §4.2, Table 2.
  • [9] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger (2016) 3D u-net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 424–432. Cited by: §1.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §1, §1, §4.1.
  • [11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.1.
  • [12] L. Duan, I. W. Tsang, D. Xu, and T. Chua (2009) Domain adaptation from multiple sources via auxiliary classifiers. In International Conference on Machine Learning, pp. 289–296. Cited by: §1.
  • [13] L. Duan, D. Xu, and S. Chang (2012) Exploiting web images for event recognition in consumer videos: a multiple source domain adaptation approach. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1338–1345. Cited by: §1.
  • [14] L. Duan, D. Xu, and I. W. Tsang (2012) Domain adaptation from multiple sources: a domain-dependent regularization approach. IEEE Transactions on Neural Networks and Learning Systems 23 (3), pp. 504–518. Cited by: §1.
  • [15] A. Dundar, M. Liu, T. Wang, J. Zedlewski, and J. Kautz (2018) Domain stylization: a strong, simple baseline for synthetic to real image domain adaptation. arXiv:1807.09384. Cited by: §1.
  • [16] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §1.
  • [17] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi (2015) Domain generalization for object recognition with multi-task autoencoders. In IEEE International Conference on Computer Vision, pp. 2551–2559. Cited by: §1.
  • [18] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li (2016) Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pp. 597–613. Cited by: §1.
  • [19] X. Glorot, A. Bordes, and Y. Bengio (2011) Domain adaptation for large-scale sentiment classification: a deep learning approach. In International Conference on Machine Learning, pp. 513–520. Cited by: §1.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1, §1, §3.1.
  • [21] R. Gopalan, R. Li, and R. Chellappa (2014) Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11), pp. 2288–2302. Cited by: §1.
  • [22] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pp. 1994–2003. Cited by: §1, §3.1, Table 1, §4.1, §4.1, §4.2, §4.3, Table 2, Table 3, Table 4, Table 5.
  • [23] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) Fcns in the wild: pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649. Cited by: §1, §4.1, §4.1, §4.2, Table 2.
  • [24] Z. Hong, Y. Chen, H. Yang, S. Su, T. Shann, Y. Chang, B. H. Ho, C. Tu, T. Hsiao, H. Hsiao, et al. (2018) Virtual-to-real: learning to control in visual semantic segmentation. In International Joint Conference on Artificial Intelligence, pp. 4912–4920. Cited by: §1.
  • [25] L. Hu, M. Kan, S. Shan, and X. Chen (2018) Duplex generative adversarial network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1498–1507. Cited by: §1, Table 1.
  • [26] S. Jaradat (2017) Deep cross-domain fashion recommendation. In ACM Conference on Recommender Systems, pp. 407–410. Cited by: §1.
  • [27] I. Jhuo, D. Liu, D. Lee, and S. Chang (2012) Robust visual domain adaptation with low-rank reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2168–2175. Cited by: §1.
  • [28] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §4.1.
  • [29] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid (2016) Efficient piecewise training of deep structured models for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3194–3203. Cited by: §1.
  • [30] M. Liu and O. Tuzel (2016) Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 469–477. Cited by: §1.
  • [31] Z. Liu, X. Li, P. Luo, C. Loy, and X. Tang (2015) Semantic image segmentation via deep parsing network. In IEEE International Conference on Computer Vision, pp. 1377–1385. Cited by: §1.
  • [32] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §1, §4.1, Table 2, Table 3.
  • [33] M. Long, Y. Cao, J. Wang, and M. Jordan (2015) Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97–105. Cited by: §1.
  • [34] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel (2015) The variational fair autoencoder. arXiv:1511.00830. Cited by: §1.
  • [35] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359. Cited by: §1.
  • [36] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa (2015) Visual domain adaptation: a survey of recent advances. IEEE Signal Processing Magazine 32 (3), pp. 53–69. Cited by: §1, §2.
  • [37] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2018) Moment matching for multi-source domain adaptation. arXiv:1812.01754. Cited by: §1, §1, §3.2, Table 1, §4.2.
  • [38] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko (2017) Visda: the visual domain adaptation challenge. arXiv:1710.06924. Cited by: §1.
  • [39] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In European Conference on Computer Vision, pp. 102–118. Cited by: §1, §1, §4.1.
  • [40] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro (2019) Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, Cited by: §1.
  • [41] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3234–3243. Cited by: §1, §1, §4.1.
  • [42] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo (2018) From source to target and back: symmetric bi-directional adaptive gan. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8099–8108. Cited by: §1, Table 1.
  • [43] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2018) Generate to adapt: aligning domains using generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8503–8512. Cited by: §1, Table 1.
  • [44] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa (2018) Learning from synthetic data: addressing domain shift for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3752–3761. Cited by: §1.
  • [45] G. Schweikert, G. Rätsch, C. Widmer, and B. Schölkopf (2009) An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In Advances in Neural Information Processing Systems, pp. 1433–1440. Cited by: §1.
  • [46] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §4.1.
  • [47] B. Sun, J. Feng, and K. Saenko (2016) Return of frustratingly easy domain adaptation. In AAAI Conference on Artificial Intelligence, pp. 2058–2065. Cited by: §1.
  • [48] B. Sun, J. Feng, and K. Saenko (2017) Correlation alignment for unsupervised domain adaptation. In Domain Adaptation in Computer Vision Applications, pp. 153–171. Cited by: §1.
  • [49] Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye (2011) A two-stage weighting framework for multi-source domain adaptation. In Advances in Neural Information Processing Systems, pp. 505–513. Cited by: §1.
  • [50] S. Sun and H. Shi (2013) Bayesian multi-source domain adaptation. In International Conference on Machine Learning and Cybernetics, Vol. 1, pp. 24–28. Cited by: §1.
  • [51] S. Sun, H. Shi, and Y. Wu (2015) A survey of multi-source domain adaptation. Information Fusion 24, pp. 84–92. Cited by: §1.
  • [52] A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1521–1528. Cited by: §1.
  • [53] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: §1, §4.1, §4.2, Table 2.
  • [54] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2962–2971. Cited by: §1, Table 1.
  • [55] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell (2018) Understanding convolution for semantic segmentation. In IEEE Winter Conference on Applications of Computer Vision, pp. 1451–1460. Cited by: §1.
  • [56] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer (2019) Squeezesegv2: improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In IEEE International Conference on Robotics and Automation, pp. 4376–4382. Cited by: §1.
  • [57] Z. Wu, X. Han, Y. Lin, M. Gokhan Uzunbas, T. Goldstein, S. Nam Lim, and L. S. Davis (2018) Dcan: dual channel-wise alignment networks for unsupervised scene adaptation. In European Conference on Computer Vision, pp. 518–534. Cited by: §1, §4.1, §4.2, Table 2.
  • [58] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin (2018) Deep cocktail network: multi-source unsupervised domain adaptation with category shift. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3964–3973. Cited by: §1, §1, §3.2, Table 1, §4.2.
  • [59] Z. Xu and S. Sun (2012) Multi-source transfer learning with multi-view adaboost. In International Conference on Neural information processing, pp. 332–339. Cited by: §1.
  • [60] J. Yang, R. Yan, and A. G. Hauptmann (2007) Cross-domain video concept detection using adaptive svms. In ACM International Conference on Multimedia, pp. 188–197. Cited by: §1.
  • [61] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, Cited by: §1.
  • [62] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell (2018) BDD100K: a diverse driving video database with scalable annotation tooling. arXiv:1805.04687. Cited by: §1, §4.1.
  • [63] X. Yue, B. Wu, S. A. Seshia, K. Keutzer, and A. L. Sangiovanni-Vincentelli (2018) A lidar point cloud generator: from a virtual world to autonomous driving. In ACM International Conference on Multimedia Retrieval, pp. 458–464. Cited by: §1.
  • [64] X. Yue, Y. Zhang, S. Zhao, A. Sangiovanni-Vincentelli, K. Keutzer, and B. Gong (2019) Domain randomization and pyramid consistency: simulation-to-real generalization without accessing target domain data. In IEEE International Conference on Computer Vision, Cited by: §1, §4.1.
  • [65] Y. Zhang, P. David, and B. Gong (2017) Curriculum domain adaptation for semantic segmentation of urban scenes. In IEEE International Conference on Computer Vision, pp. 2020–2030. Cited by: §1, §1, §4.1, §4.1, §4.2, Table 2.
  • [66] Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei (2018) Fully convolutional adaptation networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6810–6818. Cited by: §1.
  • [67] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon (2018) Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pp. 8568–8579. Cited by: §1, §3.2, Table 1, §4.1, §4.2, Table 2, Table 3.
  • [68] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890. Cited by: §1.
  • [69] S. Zhao, C. Lin, P. Xu, S. Zhao, Y. Guo, R. Krishna, G. Ding, and K. Keutzer (2019) CycleEmotionGAN: emotional semantic consistency preserved cyclegan for adapting image emotions. In AAAI Conference on Artificial Intelligence, pp. 2620–2627. Cited by: §1.
  • [70] S. Zhao, X. Zhao, G. Ding, and K. Keutzer (2018) EmotionGAN: unsupervised domain adaptation for learning discrete probability distributions of image emotions. In ACM International Conference on Multimedia, pp. 1319–1327. Cited by: §1.
  • [71] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr (2015) Conditional random fields as recurrent neural networks. In IEEE International Conference on Computer Vision, pp. 1529–1537. Cited by: §1.
  • [72] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba (2019) Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127 (3), pp. 302–321. Cited by: §1.
  • [73] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: §1, §1, §3.1, Table 1, §4.1.
  • [74] X. Zhu, H. Zhou, C. Yang, J. Shi, and D. Lin (2018) Penalizing top performers: conservative loss for semantic segmentation adaptation. In European Conference on Computer Vision, pp. 568–583. Cited by: §1.
  • [75] J. Zhuo, S. Wang, W. Zhang, and Q. Huang (2017) Deep unsupervised convolutional domain adaptation. In ACM International Conference on Multimedia, pp. 261–269. Cited by: §1.