Self-Adversarial Disentangling for Specific Domain Adaptation

08/08/2021 ∙ by Qianyu Zhou, et al. ∙ SenseTime Corporation, Shanghai Jiao Tong University, Deakin University

Domain adaptation aims to bridge the domain shifts between the source and target domains. These shifts may span different dimensions such as fog, rainfall, etc. However, recent methods typically do not consider explicit prior knowledge about a specific dimension, which leads to less desirable adaptation performance. In this paper, we study a practical setting called Specific Domain Adaptation (SDA) that aligns the source and target domains in a demand-specific dimension. Within this setting, we observe that the intra-domain gap induced by different domainness (i.e., numerical magnitudes of this dimension) is crucial when adapting to a specific domain. To address the problem, we propose a novel Self-Adversarial Disentangling (SAD) framework. In particular, given a specific dimension, we first enrich the source domain by introducing a domainness creator that provides additional supervisory signals. Guided by the created domainness, we design a self-adversarial regularizer and two loss functions to jointly disentangle the latent representations into domainness-specific and domainness-invariant features, thus mitigating the intra-domain gap. Our method can be easily adopted as a plug-and-play framework and does not introduce any extra cost at inference time. We achieve consistent improvements over state-of-the-art methods in both object detection and semantic segmentation tasks.


I Introduction

Over the past several years, deep neural networks have brought impressive advances in many computer vision tasks, such as object detection [16, 17, 45] and semantic segmentation [36, 4, 5, 71, 11]. However, a model trained in a source domain suffers from serious performance degradation when applied to a novel domain, which limits its generalization ability in complicated real-world scenarios. Annotating a large-scale dataset for each new domain is cost-expensive and time-consuming. Unsupervised domain adaptation (UDA) has emerged to reduce the domain shifts between the source and target domains, showing promising results on object detection [6, 47, 23, 76, 63, 3, 62, 51, 26, 24, 53, 70, 72, 19] and semantic segmentation [75, 39, 25, 55, 33, 29, 40, 61, 2, 67, 37, 66, 43, 65, 60, 58, 7, 38, 79, 78, 77, 56, 20].

Fig. 1: (a) Previous UDA methods do not leverage explicit prior knowledge to perform domain adaptation on a demand-specific dimension, and (b) they cannot generalize well to a target domain with different unknown domainness. (c) Our method narrows the intra-domain gap induced by different domainness. Different domainness values indicate different numerical magnitudes of a specific domain dimension.

The domain shifts may span different dimensions such as fog, rainfall, Field of View (FoV), etc. A practical scenario is to align the source and the target domain in a demand-specific dimension, e.g., from sunny images to foggy images. However, existing methods can hardly handle such cases elegantly, mainly because they do not consider any explicit prior knowledge about the specific domain shifts. As a result, the model lacks a clear target dimension and is optimized without the aforementioned prior knowledge. This under-constrained training process limits the performance when adapting the model to a specific dimension. Meanwhile, the intra-domain gaps, which commonly exist among domains in the same dimension but with different domainness, have been left out of consideration in previous research. As shown in Fig. 1, different domainness values indicate different numerical magnitudes of a specific domain dimension. Narrowing such intra-domain gaps is crucial when adapting to a specific dimension.

In this paper, we refer to the problem explained above as Specific Domain Adaptation (SDA), a realistic and practical setting for domain adaptation. It aims to generalize a model across a specific domain dimension, e.g., different FoVs (the FoV dimension), so that the model can be broadly applied in real-world applications. For example, in autonomous driving, models trained on sunny days should have the ability to generalize to specific rainy or foggy scenarios.

To address SDA, we propose an innovative method, referred to as Self-Adversarial Disentangling (SAD). It tackles the problem by disentangling the latent representations into domainness-invariant features and domainness-specific features in a specific dimension. In contrast to domain-invariant features, which may still entangle some noise factors, a domainness-invariant feature is irrelevant to the domainness magnitude in the target domain. The advantage of transferring domainness-invariant features is that we can capture the generalizations across different domainness to narrow the intra-domain gaps, which operates at a finer granularity than directly transferring domain-invariant features.

Our framework consists of two key components, i.e., the Domainness Creator (DC) and the Self-Adversarial Regularizer (SAR), for domainness generation and disentanglement, respectively. According to the given domain shift, we first enrich the source domain with DC. It not only diversifies the source domain but also provides additional supervisory signals for the subsequent feature disentangling. Guided by the created domainness, we design the SAR and introduce a domainness-specific loss and a domainness-invariant loss for SAR to jointly supervise the disentangling of the latent representations into domainness-specific and -invariant features. With the domainness-specific loss, our SAR can classify the predicted domainness with supervisory signals from DC. Penalized by the domainness-invariant loss, our SAR can fully learn domainness-invariant representations. Thus, we can mitigate the intra-domain gap induced by different domainness. To sum up, our SAD framework works in a disentangling sense, which enables the model to learn domainness-invariant features in an adversarial manner, i.e., via two opposite loss functions.

Our method is applicable and flexible in most real-world cases. We verify the proposed method under various domain dimensions, including cross-fog (Cityscapes [8] to Foggy Cityscapes [49], Cityscapes [8] to RTTS [32], Cityscapes [8] to Foggy Zurich++ [48, 49]), cross-rain (Cityscapes [8] to RainCityscapes [27]) and cross-FoV adaptation (Virtual KITTI [13] to CKITTI [15, 8]). The target domain has either single or multiple domainness values. Extensive experiments demonstrate the strong generalization ability of our method. Without bells and whistles, our method yields remarkable improvements over existing methods in both object detection and semantic segmentation, on both synthetic and real datasets. In particular, we achieve 45.2 mAP on the widely used benchmark of Cityscapes [8] to Foggy Cityscapes [49].

The contributions of this paper are summarized as follows. 1) We study the problem of Specific Domain Adaptation (SDA), a realistic and practical setting for domain adaptation, and propose a novel self-adversarial disentangling framework that leverages explicit prior domain knowledge to learn domainness-invariant features. 2) We present a domainness creator for specifically enriching the source domain and providing explicit supervisory signals. 3) We design a self-adversarial regularizer to mitigate the intra-domain gaps, together with a domainness-specific loss and a domainness-invariant loss to facilitate the training. 4) We conduct comprehensive experiments to demonstrate the effectiveness of our method on both object detection and semantic segmentation. Our method can be easily integrated into existing approaches as a plug-and-play framework and introduces no extra cost during the inference phase.

Fig. 2: Overview of the proposed Self-Adversarial Disentangling (SAD). Our Domainness Creator (DC) not only generates a diversified source image with random domainness, but also provides additional supervisory signals for guiding the self-adversarial learning. The encoders E_sp and E_inv extract the domainness-specific representations F_sp and the domainness-invariant representations F_inv, respectively. With the guidance of the generated domainness, E_sp, E_inv and SAR (Self-Adversarial Regularizer) work in an adversarial manner, i.e., via two opposite loss functions, to disentangle the latent representations into F_sp and F_inv.

II Related Work

Unsupervised Domain Adaptation. UDA aims to generalize a model learned from a labeled source domain to another unlabeled target domain. In the field of UDA, a group of approaches has shown promising results in object detection [6, 47, 23, 76, 63, 3, 62, 51, 26, 24, 53, 70, 72, 19] and semantic segmentation [75, 39, 25, 55, 33, 29, 40, 61, 2, 67, 37, 66, 43, 65, 60, 58, 7, 38, 79, 78, 77, 56, 20, 74, 73]. The current mainstream approaches for these two tasks include adversarial learning [23, 47, 76, 55, 40, 57], self-training [78, 79, 10, 11] and self-ensembling [12, 7, 44, 73, 42, 54, 41, 1, 64]. Despite the gratifying progress, little attention has been paid to performing domain adaptation along a specifically demanded dimension by introducing explicit prior knowledge about the domain shifts, with the exception of [53]. Prior DA [53] is the only work that builds on a similar motivation to ours, using weather-specific prior knowledge obtained from the image formation. However, it designs a prior-adversarial loss and acts in a completely different manner from ours. Prior DA [53] only explored the weather prior in the cross-fog and cross-rain scenarios. We follow the same setting as [53] by knowing the domain dimension in advance, which makes the experimental comparison fully fair.

Domain Diversification. Domain Diversification (DD) aims to diversify the source domain into various distinctive domains with random augmentation. Kim et al. [30] designed a DD-MRL method that uses a GAN [18] to diversify the source domain. Similarly, DRPC [68] and LTIR [29] proposed to diversify the texture of the source images and to learn texture-invariant representations. Our method differs from these methods in several aspects. Firstly, they require large computation costs and cannot be trained end-to-end during the adaptation procedure. Instead, our method is lightweight and online, relying on a transformation algorithm in DC. Secondly, GAN-based approaches tend to produce artifacts on urban-scene datasets, leading to severe semantic inconsistency. In contrast, we do not use any feature interpolation operation for reconstruction and merely use a simple yet very effective parametric modeling.

Disentangled Learning. Disentangled learning has been widely studied in other communities, e.g., image translation [28, 31] and few-shot learning [46, 50]. A few works have recently extended it to domain adaptation by disentangling the latent representations into domain-specific and domain-invariant features to realize effective domain alignment. Liu et al. proposed a model of cross-domain representation disentanglement (CDRD) [35] based on the GAN [18] framework. Chang et al. designed a domain invariant structure extraction (DISE) framework [2] to disentangle the latent encodings into domain-invariant structure and domain-specific texture representations for domain-adaptive semantic segmentation. Nevertheless, these methods can hardly capture the generalizations across different domainness within the same target domain to narrow the intra-domain gap.

Multi Domain-invariant Representation Learning. The most relevant work to ours in the field of UDA is multi domain-invariant representation learning (MRL) [30]. MRL applies a multi-domain discriminator to learn indistinguishable features among different domains. Similarly, Wang et al. [59] proposed a multi-domain discriminator that models the encoding-conditioned domain index distribution to tackle continuously indexed domain adaptation (CIDA). Our method is quite different from these methods. Firstly, these approaches do not leverage prior domain knowledge in a specific dimension to facilitate multi-domain representation learning. In contrast, we model a specific dimension of the domain shift and leverage the generated domainness as supervisory signals to guide the feature learning. Secondly, the multi-domain discriminators are not actually reconstructing the domainness values, and they neglect the intra-domain gaps within the target domain induced by different domainness. Instead, guided by the domainness-invariant and domainness-specific loss functions, our SAR works in a completely different manner to narrow the intra-domain gaps.

III Methodology

We focus on the problem of Specific Domain Adaptation (SDA) in both object detection and semantic segmentation, where we have access to source data with labels and target data without labels. Fig. 2 shows the overview of our framework. Our core idea is to disentangle the latent representations into domainness-invariant and domainness-specific features in a specific dimension. The target domain has either single or multiple domainness values.

III-A Domainness Creator

We design the Domainness Creator (DC) as a transformation algorithm for images. DC receives a source image as input and outputs a processed image by adding a random domainness in a specific dimension. Meanwhile, DC provides a supervisory signal, i.e., the label z of the created domainness value, for guiding the self-adversarial learning. Due to the variations of domainness values enabled by DC, a model trained on the domainness-diversified dataset is able to learn domainness-invariant representations for feature alignment. The domainness value is a scalar; in the FoV dimension, for example, it is derived from the FoV angle θ along one image axis.

Example of DC in the FoV dimension. Taking FoV as an example, we show the process of the FoV transformation given a selected target FoV θ' in Fig. 3, where O is the optical center of the camera and C is the focal point. f denotes the focal length. W and W' represent the original width and the new width before and after the transformation:

W' = W \cdot \tan(\theta'/2) / \tan(\theta/2),    (1)

where the FoV is reduced from θ to θ' during the process and the domainness label is denoted as z. If the chosen dimension is fog thickness, DC follows the algorithms in [49] to diversify the source image.
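To make the FoV transformation concrete, the sketch below shows one possible implementation of DC for the FoV dimension: a center crop whose width follows Eq. (1), followed by a resize back to the original resolution. The function name, the horizontal-only crop, and the way the domainness label is discretized into bins are illustrative assumptions rather than the authors' exact implementation.

import numpy as np
from PIL import Image

def fov_domainness_creator(img: Image.Image, theta: float, theta_new: float, num_bins: int = 5):
    """Illustrative DC for the FoV dimension: shrink the horizontal FoV from
    theta to theta_new (radians) by center-cropping, then return the diversified
    image together with a discretized domainness label."""
    assert 0.0 < theta_new <= theta
    w, h = img.size
    # Eq. (1): with a fixed focal length, the visible width scales with tan(FoV/2).
    new_w = int(round(w * np.tan(theta_new / 2.0) / np.tan(theta / 2.0)))
    left = (w - new_w) // 2
    cropped = img.crop((left, 0, left + new_w, h))
    # Resize back so every diversified image keeps the original resolution.
    cropped = cropped.resize((w, h), Image.BILINEAR)
    # Discretize the domainness (here: relative FoV reduction) into num_bins ranges.
    domainness = 1.0 - theta_new / theta          # 0 = unchanged, close to 1 = strongly reduced
    label = min(int(domainness * num_bins), num_bins - 1)
    return cropped, label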

Remark 1: The intuitions behind DC and data augmentation are totally different. Data augmentation only diversifies the source images, whereas the proposed DC additionally provides supervision signals to guide the training of SAR, which is the critical part of the method. DC indeed facilitates the disentanglement for learning domainness-invariant representations. We also show in the experiments that the two components are complementary.

Remark 2: Difference from existing GAN-based domain diversification. Compared with previous domain diversification [30] and domain randomization [68] approaches, we do not use a GAN-based architecture to produce translated images in our implementation. The main reasons are twofold. Firstly, these methods require large computation resources, especially for urban-scene images, and they cannot be trained in an end-to-end manner. Secondly, the reconstruction of the feature encodings inevitably leads to pixel-wise distortions and semantic inconsistency. In comparison, our transformation in DC is online and allows end-to-end training, because we do not use any feature interpolation operation and just use a simple yet effective mathematically modeled transformation.

Fig. 3: The FoV transform. O is the optical center of the camera and C is the focal point. f is the focal length; W and W' represent the original width and the new width before and after the transformation, respectively. The FoV is reduced from θ to θ' after the process.

III-B Self-Adversarial Regularizer

Guided by the generated domainness, SAR is designed to disentangle the latent representations into the domainness-specific feature F_sp and the domainness-invariant feature F_inv, in order to mitigate the intra-domain gap. E_sp and E_inv denote the domainness-specific encoder and the domainness-invariant encoder, respectively. F_sp and F_inv share the same dimensions; the channel number is 19/11 for segmentation and 512/1024 for detection, respectively.

Intra-domain adaptation.

As shown in Fig. 2, the processed image is fed into the encoders E_sp and E_inv to obtain the latent feature maps F_sp and F_inv. Either F_sp or F_inv is forwarded into SAR to predict a domainness value, ẑ_sp or ẑ_inv, one at a time. SAR is supervised jointly by the designed domainness-specific loss L_sp and domainness-invariant loss L_inv (see below for the design of these two losses). With the former loss L_sp, our SAR can classify the predicted domainness with supervisory signals from DC. Penalized by the latter loss L_inv, our SAR can fully learn domainness-invariant representations, thus mitigating the intra-domain gap induced by different domainness. In essence, the encoders and SAR are complementary and work in an adversarial manner (i.e., with two opposite losses) to perform the specific domain adaptation. We illustrate the network details and the two loss functions below.

Network architecture of SAR.

Note that we use the same SAR architecture for both the detection and segmentation tasks. Our SAR takes only one feature map, F_sp or F_inv, at a time as input. It first downsamples the whole feature map and then flattens the downsampled feature map. After two fully-connected layers with a ReLU activation, we obtain the predicted domainness value ẑ_sp or ẑ_inv, as shown in Fig. 2. In practice, we use RoI Align [21] to downsample the whole feature map before predicting the domainness value. We discretize the continuous domainness values into K numbers (representing ranges) for better experimental results. The domainness labels z are therefore one-hot vectors with K dimensions, the predictions ẑ_sp and ẑ_inv are K-dimensional probability vectors, and u is a K-dimensional vector of the uniform distribution.
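As a reference, the following PyTorch sketch outlines an SAR head consistent with the description above (downsample, flatten, two fully-connected layers with a ReLU, K-way output). The pooled size, hidden width and class name are illustrative assumptions, and adaptive average pooling stands in for RoI Align over the whole feature map.

import torch
import torch.nn as nn

class SelfAdversarialRegularizer(nn.Module):
    """Illustrative SAR head: one feature map in, K-way domainness logits out."""

    def __init__(self, in_channels: int, num_bins: int, pooled_size: int = 7, hidden: int = 256):
        super().__init__()
        # Stand-in for RoI Align over the whole feature map.
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)
        self.fc1 = nn.Linear(in_channels * pooled_size * pooled_size, hidden)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(hidden, num_bins)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.pool(feat)                # (B, C, p, p)
        x = torch.flatten(x, start_dim=1)  # (B, C * p * p)
        x = self.relu(self.fc1(x))
        return self.fc2(x)                 # logits over the K domainness bins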

Domainness-specific loss.

On one hand, with the generated domainness as a supervisory signal, SAR needs to enhance its discriminability for classifying the diversified images with different domainness more accurately. We define the domainness-specific loss as a cross-entropy loss for optimizing the features from the encoder E_sp:

L_sp = - \sum_{k=1}^{K} z_k \log \hat{z}^{sp}_{k},    (2)

where z is the one-hot vector of the generated domainness and ẑ_sp is the domainness predicted by SAR from F_sp.

Domainness-invariant loss.

On the other hand, SAR needs to maximize the discrepancy between the domainness-invariant feature F_inv and the domainness-specific feature F_sp. We define the domainness-invariant loss as the KL-divergence between the predicted domainness and a uniform distribution:

L_inv = D_{KL}(\hat{z}^{inv} \| u) = \sum_{k=1}^{K} \hat{z}^{inv}_{k} \log(\hat{z}^{inv}_{k} / u_k),    (3)

where u is sampled from a uniform distribution (i.e., u_k = 1/K for every k), ẑ_inv is the domainness predicted by SAR from F_inv, and D_KL denotes the KL-divergence.

By jointly minimizing the domainness-invariant loss L_inv and the domainness-specific loss L_sp, which act in two inverse directions, SAR can fully learn the domainness-invariant features that capture the generalizations across different domainness, thus narrowing the intra-domain gap.
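The interplay of the two losses can be sketched as follows, assuming the SAR head shown above and the cross-entropy / KL-to-uniform forms of Eqs. (2) and (3); the variable names are illustrative and this is not the authors' released code. In practice, L_sp would update E_sp together with SAR, while L_inv would penalize E_inv, which realizes the two opposite optimization directions described above.

import torch
import torch.nn.functional as F

def sar_losses(sar, feat_sp, feat_inv, domainness_label, num_bins):
    """Compute the domainness-specific loss on F_sp and the domainness-invariant loss on F_inv."""
    # Eq. (2): classify the created domainness from the domainness-specific feature.
    logits_sp = sar(feat_sp)
    loss_sp = F.cross_entropy(logits_sp, domainness_label)

    # Eq. (3): push the domainness prediction made from F_inv towards the uniform
    # distribution, i.e. make F_inv uninformative about the domainness.
    probs_inv = F.softmax(sar(feat_inv), dim=1)
    uniform = torch.full_like(probs_inv, 1.0 / num_bins)
    # F.kl_div(input, target) computes sum(target * (log(target) - input)); with
    # input = log(u) and target = probs_inv this yields KL(z_hat_inv || u) as in Eq. (3).
    loss_inv = F.kl_div(uniform.log(), probs_inv, reduction="batchmean")

    return loss_sp, loss_inv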

Remark 1: Whether the parameters of E_sp and E_inv are shared or not. E_sp and E_inv are two encoders that use the same architecture but do not share weights, because they are penalized by different loss functions. The former is penalized by L_sp, while the latter is trained under the guidance of L_inv, the adversarial loss L_adv (Eq. (4)) and the task loss L_task (Eq. (5)).

Remark 2: Comparing with GAN architectures. Existing GAN-based architectures utilize multi-domain discriminators [30, 59] to distinguish the domainness (called the domain index in their work). In such an adversarial framework, the discriminators are not actually predicting the domainness, but rather making the latent encodings unable to predict it. Because it is trained in an adversarial way, the encoder transforms the input before outputting its encoding, thereby removing the information related to domainness. However, the encoder cannot fully learn the domainness-invariant feature due to the lack of prior knowledge about the domain shifts. In comparison, our proposed framework acts in a completely different manner. Firstly, we use two separate encoders, E_sp and E_inv, the former extracting the domainness-specific feature F_sp and the latter extracting the domainness-invariant feature F_inv. Secondly, with the guidance of the generated domainness as supervisory signals, our SAR truly reconstructs the domainness, aiming to distinguish the domainness accurately. Thirdly, our SAD framework works in two opposite directions in a disentangling sense, which enables the model to learn domainness-invariant features and alleviate the intra-domain gap.

III-C End-to-End Training and Inference

In this section, we briefly introduce the inter-domain adaptation and the task loss, and formulate an overall loss function for end-to-end training. We then explain the inference phase.

Inter-domain adaptation.

Without loss of generality, we employ an adversarial framework [14] for the inter-domain adaptation. As shown in Fig. 2, the processed source images and the target images are fed into the encoder E_inv, which is encouraged to learn F_inv. The latent encodings should confuse a domain discriminator D in distinguishing the features extracted from the source and target domains. This is achieved by min-maximizing an adversarial loss:

L_adv = \mathbb{E}_{x_s}[\log D(E_inv(x_s))] + \mathbb{E}_{x_t}[\log(1 - D(E_inv(x_t)))],    (4)

where x_s and x_t denote the processed source images and the target images, respectively; the discriminator D maximizes L_adv, while the encoder E_inv minimizes it so as to confuse D.
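Since [14] realizes this min-max game with a gradient reversal layer (GRL), one common way to implement Eq. (4) is sketched below; the layer and the binary discriminator interface are generic stand-ins rather than the exact architecture used in this work.

import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

def adversarial_loss(discriminator, feat_src, feat_tgt, lambd=1.0):
    """Eq. (4)-style alignment: D learns to separate source and target features,
    while the reversed gradients push the encoder to confuse D."""
    pred_src = discriminator(grad_reverse(feat_src, lambd))
    pred_tgt = discriminator(grad_reverse(feat_tgt, lambd))
    loss = F.binary_cross_entropy_with_logits(pred_src, torch.ones_like(pred_src)) + \
           F.binary_cross_entropy_with_logits(pred_tgt, torch.zeros_like(pred_tgt))
    return loss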

Task loss. In this work, taking Faster-RCNN [45] as an example, we use a Region Proposal Network (RPN) to generate Regions of Interest (RoIs). The network then localizes and classifies the regions to obtain semantic labels and locations. The task network is optimized with a multi-task loss function:

L_task = L_rpn + \lambda_1 L_cls + \lambda_2 L_reg,    (5)

where the RPN loss L_rpn, the classification loss L_cls and the regression loss L_reg remain the same as in [45]. The loss weights λ_1 and λ_2 are set to 1.0 by default.

Total loss. During training, all the modules are jointly trained with the backbone in an end-to-end manner. The total loss is the weighted sum of the aforementioned loss functions:

L_total = L_task + \lambda_adv L_adv + \lambda_sar (L_sp + L_inv),    (6)

where λ_adv and λ_sar are the weighting coefficients for the adversarial loss L_adv and the SAR losses, respectively. We use the original weighting ratio of [6, 47, 51, 63, 55, 40, 61] to balance L_task and L_adv.
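Putting the pieces together, one end-to-end training step could look like the sketch below, reusing the helper functions from the earlier sketches; the module interfaces and the default loss weights are placeholders that follow the reconstruction of Eq. (6) rather than the authors' released code.

def training_step(batch, dc, enc_sp, enc_inv, sar, discriminator, task_head,
                  optimizer, lambda_adv=0.1, lambda_sar=0.1, num_bins=5):
    """One illustrative SAD training step on a (source image, source labels, target image) batch."""
    src_img, src_gt, tgt_img = batch
    # Diversify the source image and obtain the domainness label from DC.
    div_img, dom_label = dc(src_img)

    feat_sp = enc_sp(div_img)          # domainness-specific features
    feat_inv_src = enc_inv(div_img)    # domainness-invariant features (source)
    feat_inv_tgt = enc_inv(tgt_img)    # domainness-invariant features (target)

    loss_sp, loss_inv = sar_losses(sar, feat_sp, feat_inv_src, dom_label, num_bins)
    loss_adv = adversarial_loss(discriminator, feat_inv_src, feat_inv_tgt)
    loss_task = task_head(feat_inv_src, src_gt)   # detection or segmentation loss

    # Eq. (6): weighted sum of the task, adversarial and SAR losses (placeholder weights).
    total = loss_task + lambda_adv * loss_adv + lambda_sar * (loss_sp + loss_inv)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()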

Inference phase. In the inference phase, we only need the domainness-invariant encoder E_inv together with the task network to make predictions. In other words, all other modules, including DC, SAR and the domain discriminator D, are removed in the inference stage, leading to no extra cost in prediction. Besides, our method can be plugged into various existing cross-domain detection/segmentation methods. Thus, our framework is flexible and generalizable, and it does not depend on a specific UDA framework for feature alignment.

Methods Venue person rider car truck bus train motor bicycle mAP
Source-Only [45] NeurIPS’15 26.9 38.2 35.6 18.3 32.4 9.6 25.8 28.6 26.9
DA-Faster [6] CVPR’18 29.2 40.4 43.4 19.7 38.3 28.5 23.7 32.7 32.0
SCDA [76] CVPR’19 33.5 38.0 48.5 26.5 39.0 23.3 28.0 33.6 33.8
DD-MRL [30] CVPR’19 31.8 40.5 51.0 20.9 41.8 34.3 26.6 32.4 34.9
MAF [23] ICCV’19 28.2 39.5 43.9 23.8 39.9 33.3 29.2 33.9 34.0
ART-PSA [72] CVPR’20 34.0 46.9 52.1 30.8 43.2 29.9 34.7 37.4 38.6
ICR-CCR [62] CVPR’20 32.9 43.8 49.2 27.2 45.1 36.4 30.3 34.6 37.4
CST [70] ECCV’20 32.7 44.4 50.1 21.7 45.6 25.4 30.1 36.8 35.9
ATF [24] ECCV’20 34.6 47.0 50.0 23.7 43.3 38.7 33.4 38.8 38.7
Prior DA [53] ECCV’20 36.4 47.3 51.7 22.8 47.6 34.1 36.0 38.7 39.3
GPA [63] CVPR’20 32.9 46.7 54.1 24.7 45.7 41.1 32.4 38.7 39.5
Ours (with GPA [63]) - 38.3 47.2 58.8 34.9 57.7 48.3 35.7 42.0 45.2
(a) Cityscapes to Foggy Cityscapes (single-domainness).
Methods car bus person motor bicycle mAP
Source-Only 39.8 11.7 46.6 19.0 37.0 30.9
DCPDN [69] 39.5 12.9 48.7 19.7 37.5 31.6
Grid-Dehaze [34] 25.4 10.9 29.7 13.0 21.4 20.0
DA-Faster [6] 43.7 16.0 42.5 18.3 32.8 30.7
SWDA [47] 44.2 16.6 40.1 23.2 41.3 33.1
Ours (with [6]) 45.0 15.9 42.0 22.2 38.4 32.7
Ours (with [47]) 47.0 16.6 41.5 27.2 43.2 35.1
(b) Cityscapes to RTTS (multi-domainness).
Comparisons Car AP Gain
DA-Faster [6] 45.1 2.6
Ours (with [6]) 47.7
SWDA [47] 49.0 1.7
Ours (with [47]) 50.7
SCL [51] 49.5 1.8
Ours (with [51]) 51.3
(c) Virtual KITTI to CKITTI (multi-domainness)
TABLE I: Cross-fog (a,b) and cross-FoV (c) adaptation of object detection.
Methods person rider car truck bus motor bicycle mAP Gain
DA-Faster [6] 22.9 55.2 43.4 3.9 58.8 15.2 30.0 32.8 6.4
Ours (with DA-Faster [6]) 26.3 60.1 52.6 13.0 60.3 27.0 34.9 39.2
SWDA [47] 23.8 52.1 46.4 9.6 68.2 16.0 32.8 35.6 3.4
Ours (with SWDA [47]) 25.9 56.0 52.5 8.1 56.0 29.4 33.1 39.0
SCL [51] 27.0 57.9 50.3 10.0 67.9 13.9 33.9 37.3 4.2
Ours (with SCL [51]) 29.3 61.0 52.7 19.2 68.2 26.2 34.1 41.5
TABLE II: Cross-rain adaptation of object detection from Cityscapes to RainCityscapes (multi-domainness).

IV Experiments

In this section, we first evaluate our framework on object detection under various domain dimensions, including cross-fog adaptation, cross-rain adaptation and cross-FoV adaptation. In addition, we extend our method to semantic segmentation to verify its scalability and applicability. Finally, we conduct ablation studies to validate each component of our method. Our method is applicable and flexible in most real-world cases, which we demonstrate with thorough experiments.

IV-A Datasets

Cityscapes → Foggy Cityscapes.

This is a widely-used benchmark for cross-domain object detection. Cityscapes  [8] is a dataset focused on autonomous driving, which consists of 2,975 images in the training set, and 500 images in the validation set. Foggy Cityscapes [49] is a synthetic foggy dataset which simulates fog on real scenes. The annotations and data split in Foggy Cityscapes are inherited from Cityscapes.

Cityscapes → RTTS.

RTTS [32] is the largest available dataset for object detection under real-world hazy conditions. It contains 4,807 unannotated and 4,322 annotated real-world hazy images covering most traffic and driving scenarios with 7 kinds of fogs.

Cityscapes → Foggy Zurich++.

Foggy Zurich++ is a real-world dataset collected in foggy weather conditions for segmentation. We use all 3,768 unannotated images of Foggy Zurich [49] as the training set and mix the validation sets of Foggy Driving [48] and Foggy Zurich [49] as the validation set. Following [8], Foggy Zurich++ is labeled with 19 classes.

Cityscapes → RainCityscapes.

RainCityscapes [27] renders Cityscapes images with synthetic rain. Each clear image is rendered with 12 types of rain patterns, including 4 types of drop sizes which we use as our domainness. The annotations are the same as those of Cityscapes. We use this benchmark in cross-domain detection.

Virtual KITTI → CKITTI.

We use this benchmark in both detection and segmentation. Virtual KITTI [13] is a photo-realistic synthetic dataset, which contains 21,260 images. It is designed to mimic the conditions of KITTI dataset and has similar scene layouts, camera viewpoints and image resolution to KITTI dataset. CKITTI is a real-world dataset depicting several urban driving scenarios with 5 different kinds of FoVs, which is a mixed dataset of Cityscapes [8] and KITTI [15]. We use the 10,456 images as the training set and 700 images as the validation set.

IV-B Implementation Details

Object detection.

In our implementation, we follow the training protocol [6, 47, 51, 63] of the Faster-RCNN network. We resize the images of both the source and the target domain to a 600-pixel height in all experiments, as suggested by [6, 47, 51]. Following the aforementioned papers, we use the VGG16 [52] model pre-trained on ImageNet [9] as the backbone of DA-Faster [6], SWDA [47] and SCL [51], and ResNet50 [22] as the backbone of GPA [63]. We set the learning rate to 0.001 for the first 50k iterations and 0.0001 for the remaining iterations. The other parameters are set by following the original papers [6, 47, 51, 63].

Semantic segmentation.

Following common UDA protocols [55, 40, 61], we employ DeepLab-v2 [4] with a ResNet-101 backbone [22] in our implementation. The backbone is pre-trained on ImageNet [9]. We reproduce the well-known AdaptSegNet [55], CLAN [40] and SIM [61] as our baselines. For our DeepLab-v2 network, we use Adam as the optimizer, and the learning rate is decreased during training with a polynomial decay schedule.
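For reference, the polynomial learning-rate decay mentioned above can be written as below; the base learning rate and decay power shown are placeholders, since the exact values are not stated here.

def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float) -> float:
    """Polynomial learning-rate decay commonly paired with DeepLab-v2 training."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Hypothetical usage with Adam; 2.5e-4 and 0.9 are placeholder values, not the paper's.
# for it in range(max_iter):
#     for group in optimizer.param_groups:
#         group["lr"] = poly_lr(2.5e-4, it, max_iter, power=0.9)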

IV-C Domain Adaptation for Object Detection

In this section, we present the results in three dimensions, i.e., cross-fog, cross-rain and cross-FoV adaptation, to show the effectiveness of our approach. We achieve consistent gains on both synthetic and real datasets.

Cross-fog adaptation.

To validate the generalization capability on the cross-fog adaptation, we perform two experiments, where the target domain includes single and multiple domainness values, respectively.

Single domainness within the target domain: In this experiment, we adapt from Cityscapes [8] to Foggy Cityscapes [49]. Table I (a) presents the comparison results with state-of-the-art cross-domain detection methods on eight categories. Source-only indicates that the baseline Faster RCNN [45] is trained with only source-domain data. From the table, we can observe that our method (with GPA [63]) outperforms the state-of-the-art methods by 5.7 mAP. The published Prior-DA [53] builds on a similar motivation to ours by using weather-specific prior knowledge obtained from the image formation; it designs a prior-adversarial loss and acts in a completely different manner. Also, [53] knows the dimension in advance, which confirms the fairness of this setting. Our method outperforms Prior-DA [53] by 5.9 mAP.

Fig. 4: Qualitative results of cross-domain object detection in the Cityscapes [8] → Foggy Cityscapes [49] set-up. The two columns plot (a) the predictions of the GPA [63] baseline and (b) the predictions of Ours (with GPA [63]). The bounding boxes are colored based on the detector confidence using the shown color map. As we can see from the results, the proposed method produces high-confidence predictions and detects more objects in the images.
Method road building pole light sign vegetation terrain sky car truck guard rail mIoU Gain
AdaptSegNet [55] 88.0 80.6 11.1 17.4 28.4 80.3 29.2 85.2 82.1 29.7 27.5 50.8 1.8
Ours (with AdaptSegNet [55]) 88.4 81.0 9.7 18.9 30.5 80.9 39.1 86.2 83.6 32.6 27.5 52.6
CLAN [40] 88.2 80.0 6.0 17.9 26.7 79.3 36.1 85.7 82.4 28.5 12.3 49.4 1.1
Ours (with CLAN [40]) 88.1 79.9 9.9 19.6 25.3 80.2 38.5 85.9 82.5 29.2 16.4 50.5
SIM [61] 87.3 81.2 16.3 16.1 28.3 81.6 37.6 87.2 82.6 29.3 18.3 51.4 1.8
Ours (with SIM [61]) 86.7 81.9 15.7 17.7 31.7 82.3 48.2 86.6 81.9 32.3 20.4 53.2
TABLE III: Cross-FoV adaptation of semantic segmentation from Virtual KITTI to CKITTI (multi-domainness).
Method mIoU Gain
AdaptSegNet [55] 29.4 5.8
Ours (with AdaptSegNet [55]) 35.2
CLAN [40] 26.8 4.7
Ours (with CLAN [40]) 31.5
SIM [61] 27.0 4.1
Ours (with SIM [61]) 31.1
TABLE IV: Cross-Fog adaptation of semantic segmentation from Cityscapes to Foggy Zurich++ (multi-domainness).

Taking a closer look at the per-category performance in Table I (a), our approach achieves the highest AP on most categories. This phenomenon illustrates the effectiveness of Self-Adversarial Disentangling across different classes during cross-domain detection.

Multiple domainness within the target domain: In this experiment, we adapt from Cityscapes [8] to the RTTS dataset [32]. Multi-domainness means there exist 7 kinds of fog in the RTTS dataset. The comparison results with the state-of-the-art methods are reported in Table I (b). As for the image dehazing approaches, which dehaze the target domain and then transfer the domain knowledge, DCPDN [69] improves the Faster RCNN performance by 0.7 mAP. However, Grid-Dehaze [34] does not help the Faster RCNN baseline and even results in worse performance. Table I (b) shows that our method can effectively boost the performance by integrating it into DA-Faster RCNN [6] and SWDA [47]: we successfully boost the mAP by 2.0 and 2.0, respectively. The benefits of our approach lie in two aspects: (1) our method can be easily adopted as a plug-and-play framework that enables end-to-end training with no extra cost at inference time; (2) our approach can not only address the single-domainness problem but also tackle more complicated scenarios where multiple domainness values exist in the target domain.

Cross-FoV adaptation.

To validate the generalization capability of the proposed method, we also conduct an experiment on the FoV dimension, adapting from Virtual KITTI [13] to CKITTI [8, 15]. The adaptation results are reported in Table I (c). Despite the 5 different FoVs in the dataset, our method consistently achieves improvements. By plugging it into the current state-of-the-art methods, i.e., DA-Faster [6], SWDA [47] and SCL [51], our method brings a 2.6, 1.7 and 1.8 increase in car AP, respectively.

Cross-rain adaptation.

We conduct experiments from Cityscapes [8] to RainCityscapes [27]. Table II shows the results of adapting the model to rainy scenarios. We reproduce DA-Faster RCNN [6], SWDA [47] and SCL [51] in the same setting. We can see that our method significantly improves the mAP by 6.4, 3.4 and 4.2, respectively, when integrating it into the existing UDA methods.

Fig. 5: Qualitative results of cross-domain semantic segmentation in the Virtual KITTI [13] → CKITTI [15, 8] (11 classes) set-up. The four columns plot (a) RGB input images, (b) ground truth, (c) the predictions of the AdaptSegNet [55] baseline and (d) the predictions of Ours (with AdaptSegNet [55]).

IV-D Domain Adaptation for Semantic Segmentation

In addition to the above experiments on cross-domain object detection, we also conduct experiments on cross-domain semantic segmentation to show the scalability of our method. Specifically, we conduct cross-FoV adaptation and cross-fog adaptation on semantic segmentation.

Cross-fog adaptation.

In this experiment, we adapt from Cityscapes [8] to Foggy Zurich++ [49, 48] to perform cross-fog adaptation, where multiple fog thicknesses exist in the target domain. As shown in Table IV, our method outperforms the state-of-the-art methods [55, 40, 61] by 5.8, 4.7 and 4.1 mIoU, respectively.

Our method can also handle cases where a domainness value is never seen during training, and we have verified this with experiments. As shown in Table IV, Foggy Zurich++ contains real fog rather than synthetic fog, which means the domainness in the validation set is unknown and does not appear in the training set. Our method works well on this dataset, which proves its generalization ability.

Cross-FoV adaptation.

In this experiment, we perform specific domain adaptation given the FoV gap. We choose Virtual KITTI [13] as the source domain and CKITTI [15, 8] as the target domain. The comparison results are listed in Table III. Compared with AdaptSegNet [55], CLAN [40] and SIM [61], our method respectively yields an increase of 1.8, 1.1 and 1.8 mIoU, which indicates the effectiveness of the proposed SAD in the semantic segmentation task and shows its good scalability.

Fig. 6: Examples of diversified source images produced by our Domainness Creator with different FoV angles θ.

IV-E Ablation Studies

In this section, we perform ablation experiments to investigate the effect of each component and provide more insights into our method.

Baseline Diversificator Adaptor mIoU Gain
[55] - - 29.4 -
- MRL [30] 29.9 0.5
- CIDA [59] 30.0 0.6
- Ours (SAR) 30.8 1.4
DD [30] - 32.3 2.9
Ours (DC) - 33.5 4.1
DD [30] MRL [30] 33.7 4.3
Ours (DC) MRL [30] 34.0 4.6
Ours (DC) CIDA [59] 34.2 4.8
Ours (DC) Ours (SAR) 35.2 5.8
TABLE V: Ablation on Cityscapes to Foggy Zurich++.

Comparisons to related work. Table V shows the comparisons to the relevant works [30, 59] from Cityscapes [8] to Foggy Zurich++ [49, 48] under the same baseline. When using MRL [30] or CIDA [59] as the adaptor, the baseline merely achieves a limited improvement of 0.5 or 0.6 mIoU. In contrast, our SAR contributes a performance gain of 1.4 mIoU. As for the diversificator, the GAN-based DD [30] only brings a 2.9 mIoU gain over the baseline, while our DC boosts the baseline by 4.1 mIoU.

The main reasons are twofold. (1) Previous GAN-based methods [30, 59] do not utilize supervisory signals from DC to fully learn the domainness-invariant feature. (2) They neglect the intra-domain gap induced by different domainness. Instead, our method not only leverages the prior supervisory signals but also mitigates the intra-domain gap across different domainness. Incorporating DC and SAR into the same framework boosts the mIoU by 5.8 over the baseline. This confirms the effectiveness of our proposed DC and SAR, and supports the claim in Section III-B that our SAD framework is superior to GAN-based alternatives.

Effects of different components. Table VI summarizes the effects of different design components on Cityscapes [8] → Foggy Cityscapes [49]. The GPA [63] baseline achieves 39.5 mAP. By adding DC and SAR sequentially, we boost the mAP by an additional 3.0 and 2.7, achieving 42.5 and 45.2, respectively. These improvements in object detection show the effects of the individual components of our proposed approach. They also reveal that the two components are complementary and together significantly promote the performance.

GPA [63] DC SAR mAP Gain
✓ - - 39.5 -
✓ ✓ - 42.5 3.0
✓ ✓ ✓ 45.2 5.7
TABLE VI: Ablation on Cityscapes to Foggy Cityscapes.
Ablations mIoU
Baseline (AdaptSegNet [55]) 29.4
Ours (w/o L_sp) 34.0
Ours (w/ L_sp) 35.2
TABLE VII: Ablation of the domainness-specific loss L_sp.
Fig. 7: Parameter analysis on the hyper-parameter λ_sar.

Effects of loss functions. Table VII shows the ablation of the domainness-specific loss L_sp when adapting from Cityscapes to Foggy Zurich++ for segmentation. The full framework with both DC and SAR achieves 35.2 mIoU. By removing the domainness-specific loss during training, the overall performance drops by 1.2 mIoU. In addition, the domainness-invariant loss L_inv is critical for learning the domainness-invariant representations in the intra-domain adaptation and cannot be removed. This shows that our SAR (the regularizer) needs to be trained under the guidance of both loss functions, i.e., L_sp and L_inv; therefore, we cannot remove either of them.

IV-F Parameter Analysis

In this section, we investigate the sensitivity of the hyper-parameter λ_sar, which balances the domain adaptation process. In Fig. 7, we plot the performance curve of models trained with different λ_sar values in the Cityscapes → Foggy Cityscapes object detection setting. The highest mAP on the target domain is achieved around a moderate value of λ_sar, which means that this weighting among the loss functions benefits domain adaptation the most. We simply use the same λ_sar in all experiments to show the robustness of our method in different settings. Note that we use the original weighting ratio of [6, 47, 51, 63, 55, 40, 61] to balance L_task and L_adv.

IV-G Qualitative Results

Fig. 4 visualizes the qualitative results of cross-domain object detection on two benchmarks, Cityscapes [8] → Foggy Cityscapes [49] and Cityscapes [8] → RTTS [32], respectively. As we can see from the pictures, the proposed method produces high-confidence predictions and detects more objects when plugged into the current state-of-the-art methods, e.g., GPA [63] and SWDA [47].

Fig. 5 shows the qualitative results of cross-domain semantic segmentation from the Virtual KITTI dataset [13] to CKITTI [15, 8]. With the aid of our proposed Self-Adversarial Disentangling framework, the models are able to produce correct predictions at a high level of confidence, e.g., when plugging it into AdaptSegNet [55]. As we can see from the figures, our method enables good performance on most categories, e.g., the 'vegetation', 'terrain', 'car', 'truck' and 'traffic sign' classes.

Fig. 6 shows the output of the Domainness Creator when receiving a source image, given the FoV gap. We obtain a series of diversified images with different FoV angles θ; from left to right, the figure displays the processed images with progressively varied θ. Due to the increased variations of domainness, a model trained on this domainness-diversified dataset is able to learn the domainness-invariant representation for feature alignment.

V Conclusion

In this paper, we studied specific domain adaptation and proposed self-adversarial disentangling to learn the domainness-invariant feature in a specific dimension. The domainness creator aims to enrich the source domain and to provide additional supervisory signals for fully learning the domainness-invariant feature. The self-adversarial regularizer and two losses are introduced to narrow the intra-domain gap induced by different domainness. Extensive experiments validate our method on object detection and semantic segmentation under various domain-shift settings. Our method can be easily integrated into state-of-the-art architectures to attain considerable performance gains.

Acknowledgment

This work is supported by the National Key Research and Development Program of China (No. 2019YFC1521104), the National Natural Science Foundation of China (No. 61972157), Zhejiang Lab (No. 2020NB0AB01), the Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0102) and the Shanghai Science and Technology Commission (No. 21511101200). The author Qianyu Zhou is supported by the Wu Wenjun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University.

References

  • [1] N. Araslanov and S. Roth (2021) Self-supervised augmentation consistency for adapting semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15384–15394. Cited by: §II.
  • [2] W. Chang, H. Wang, W. Peng, and W. Chiu (2019) All about structure: adapting structural information across domains for boosting semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900–1909. Cited by: §I, §II, §II.
  • [3] C. Chen, Z. Zheng, X. Ding, Y. Huang, and Q. Dou (2020) Harmonizing transferability and discriminability for adapting object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8869–8878. Cited by: §I, §II.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848. Cited by: §I, §IV-B.
  • [5] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §I.
  • [6] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3339–3348. Cited by: §I, §II, §III-C, (a)a, (b)b, (c)c, TABLE II, §IV-B, §IV-C, §IV-C, §IV-C, §IV-F.
  • [7] J. Choi, T. Kim, and C. Kim (2019) Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6830–6840. Cited by: §I, §II.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proc. CVPR, pp. 3213–3223. Cited by: §I, Fig. 4, Fig. 5, §IV-A, §IV-A, §IV-A, §IV-C, §IV-C, §IV-C, §IV-C, §IV-D, §IV-D, §IV-E, §IV-E, §IV-G, §IV-G.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §IV-B, §IV-B.
  • [10] Z. Feng, Q. Zhou, G. Cheng, X. Tan, J. Shi, and L. Ma (2020) Semi-supervised semantic segmentation via dynamic self-training and classbalanced curriculum. arXiv preprint arXiv:2004.08514 1 (2), pp. 5. Cited by: §II.
  • [11] Z. Feng, Q. Zhou, Q. Gu, X. Tan, G. Cheng, X. Lu, J. Shi, and L. Ma (2020) DMT: dynamic mutual training for semi-supervised learning. arXiv preprint arXiv:2004.08514. Cited by: §I, §II.
  • [12] G. French, M. Mackiewicz, and M. Fisher (2018) Self-ensembling for visual domain adaptation. In Proceedings of the International Conference on Learning Representations, Cited by: §II.
  • [13] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig (2016) Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4340–4349. Cited by: §I, Fig. 5, §IV-A, §IV-C, §IV-D, §IV-G.
  • [14] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp. 1180–1189. Cited by: §III-C.
  • [15] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. IJRR 32 (11), pp. 1231–1237. Cited by: §I, Fig. 5, §IV-A, §IV-C, §IV-D, §IV-G.
  • [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §I.
  • [17] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §I.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Vol. 27. Cited by: §II, §II.
  • [19] Q. Gu, Q. Zhou, M. Xu, Z. Feng, G. Cheng, X. Lu, J. Shi, and L. Ma (2021) PIT: position-invariant transform for cross-fov domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §I, §II.
  • [20] S. Guo, Q. Zhou, Y. Zhou, Q. Gu, J. Tang, Z. Feng, and L. Ma (2021) Label-free regional consistency for image-to-image translation. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §I, §II.
  • [21] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §III-B.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-B, §IV-B.
  • [23] Z. He and L. Zhang (2019) Multi-adversarial faster-rcnn for unrestricted object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6668–6677. Cited by: §I, §II, (a)a.
  • [24] Z. He and L. Zhang (2020) Domain adaptive object detection via asymmetric tri-way faster-rcnn. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pp. 309–324. Cited by: §I, §II, (a)a.
  • [25] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In International conference on machine learning, pp. 1989–1998. Cited by: §I, §II.
  • [26] C. Hsu, Y. Tsai, Y. Lin, and M. Yang (2020) Every pixel matters: center-aware feature alignment for domain adaptive object detector. In European Conference on Computer Vision, pp. 733–748. Cited by: §I, §II.
  • [27] X. Hu, C. Fu, L. Zhu, and P. Heng (2019) Depth-attentional features for single-image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8022–8031. Cited by: §I, §IV-A, §IV-C.
  • [28] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision, pp. 172–189. Cited by: §II.
  • [29] M. Kim and H. Byun (2020) Learning texture invariant representation for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12975–12984. Cited by: §I, §II, §II.
  • [30] T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12456–12465. Cited by: §II, §II, §III-A, §III-B, (a)a, §IV-E, §IV-E, TABLE V.
  • [31] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pp. 35–51. Cited by: §II.
  • [32] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang (2018) Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing 28 (1), pp. 492–505. Cited by: §I, §IV-A, §IV-C, §IV-G.
  • [33] Y. Li, L. Yuan, and N. Vasconcelos (2019) Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6936–6945. Cited by: §I, §II.
  • [34] X. Liu, Y. Ma, Z. Shi, and J. Chen (2019) Griddehazenet: attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7314–7323. Cited by: (b)b, §IV-C.
  • [35] Y. Liu, Y. Yeh, T. Fu, S. Wang, W. Chiu, and Y. F. Wang (2018) Detach and adapt: learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8867–8876. Cited by: §II.
  • [36] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §I.
  • [37] Z. Lu, Y. Yang, X. Zhu, C. Liu, Y. Song, and T. Xiang (2020) Stochastic classifiers for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9111–9120. Cited by: §I, §II.
  • [38] Y. Luo, P. Liu, T. Guan, J. Yu, and Y. Yang (2019) Significance-aware information bottleneck for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6778–6787. Cited by: §I, §II.
  • [39] Y. Luo, P. Liu, L. Zheng, T. Guan, J. Yu, and Y. Yang (2021) Category-level adversarial adaptation for semantic segmentation using purified features. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document Cited by: §I, §II.
  • [40] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang (2019) Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2507–2516. Cited by: §I, §II, §III-C, §IV-B, §IV-D, §IV-D, §IV-F, TABLE III, TABLE IV.
  • [41] L. Melas-Kyriazi and A. K. Manrai (2021) PixMatch: unsupervised domain adaptation via pixelwise consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12435–12445. Cited by: §II.
  • [42] V. Olsson, W. Tranheden, J. Pinto, and L. Svensson (2021) ClassMix: segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1369–1378. Cited by: §II.
  • [43] F. Pan, I. Shin, F. Rameau, S. Lee, and I. S. Kweon (2020) Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. In Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision, pp. 3764–3773. Cited by: §I, §II.
  • [44] C. S. Perone, P. Ballester, R. C. Barros, and J. Cohen-Adad (2019) Unsupervised domain adaptation for medical imaging segmentation with self-ensembling. NeuroImage 194, pp. 1–11. Cited by: §II.
  • [45] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Vol. 28, pp. 91–99. Cited by: §I, §III-C, (a)a, §IV-C.
  • [46] K. Ridgeway and M. C. Mozer (2018) Learning deep disentangled embeddings with the f-statistic loss. Cited by: §II.
  • [47] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2019) Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6956–6965. Cited by: §I, §II, §III-C, (b)b, (c)c, TABLE II, §IV-B, §IV-C, §IV-C, §IV-C, §IV-F, §IV-G.
  • [48] C. Sakaridis, D. Dai, S. Hecker, and L. Van Gool (2018) Model adaptation with synthetic and real data for semantic dense foggy scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 687–704. Cited by: §I, §IV-A, §IV-D, §IV-E.
  • [49] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126 (9), pp. 973–992. Cited by: §I, §III-A, Fig. 4, §IV-A, §IV-A, §IV-C, §IV-D, §IV-E, §IV-E, §IV-G.
  • [50] T. R. Scott, K. Ridgeway, and M. C. Mozer (2018) Adapted deep embeddings: a synthesis of methods for k-shot inductive transfer learning. Cited by: §II.
  • [51] Z. Shen, H. Maheshwari, W. Yao, and M. Savvides (2019) Scl: towards accurate domain adaptive object detection via gradient detach based stacked complementary losses. arXiv preprint arXiv:1911.02559. Cited by: §I, §II, §III-C, (c)c, TABLE II, §IV-B, §IV-C, §IV-C, §IV-F.
  • [52] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §IV-B.
  • [53] V. A. Sindagi, P. Oza, R. Yasarla, and V. M. Patel (2020) Prior-based domain adaptive object detection for hazy and rainy conditions. In European Conference on Computer Vision, pp. 763–780. Cited by: §I, §II, (a)a, §IV-C.
  • [54] W. Tranheden, V. Olsson, J. Pinto, and L. Svensson (2021) DACS: domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1379–1389. Cited by: §II.
  • [55] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7472–7481. Cited by: §I, §II, §III-C, Fig. 5, §IV-B, §IV-D, §IV-D, §IV-F, §IV-G, TABLE III, TABLE IV, TABLE V, TABLE VII.
  • [56] Y. Tsai, K. Sohn, S. Schulter, and M. Chandraker (2019) Domain adaptation for structured output via discriminative patch representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §I, §II.
  • [57] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2517–2526. Cited by: §II.
  • [58] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) DADA: depth-aware domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7363–7372. Cited by: §I, §II.
  • [59] H. Wang, H. He, and D. Katabi (2020) Continuously indexed domain adaptation. In The International Conference on Machine Learning, Cited by: §II, §III-B, §IV-E, §IV-E, TABLE V.
  • [60] H. Wang, T. Shen, W. Zhang, L. Duan, and T. Mei (2020) Classes matter: A fine-grained adversarial approach to cross-domain semantic segmentation. In European conference on computer vision, Vol. 12359, pp. 642–659. Cited by: §I, §II.
  • [61] Z. Wang, M. Yu, Y. Wei, R. Feris, J. Xiong, W. Hwu, T. S. Huang, and H. Shi (2020) Differential treatment for stuff and things: a simple unsupervised domain adaptation method for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12635–12644. Cited by: §I, §II, §III-C, §IV-B, §IV-D, §IV-D, §IV-F, TABLE III, TABLE IV.
  • [62] C. Xu, X. Zhao, X. Jin, and X. Wei (2020) Exploring categorical regularization for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11724–11733. Cited by: §I, §II, (a)a.
  • [63] M. Xu, H. Wang, B. Ni, Q. Tian, and W. Zhang (2020) Cross-domain detection via graph-induced prototype alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12355–12364. Cited by: §I, §II, §III-C, (a)a, Fig. 4, §IV-B, §IV-C, §IV-E, §IV-F, TABLE VI.
  • [64] Y. Xu, B. Du, L. Zhang, Q. Zhang, G. Wang, and L. Zhang (2019) Self-ensembling attention networks: addressing domain shift for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5581–5588. Cited by: §II.
  • [65] J. Yang, R. Xu, R. Li, X. Qi, X. Shen, G. Li, and L. Lin (2020) An adversarial perturbation oriented domain adaptation approach for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12613–12620. Cited by: §I, §II.
  • [66] Y. Yang, D. Lao, G. Sundaramoorthi, and S. Soatto (2020) Phase consistent ecological domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9011–9020. Cited by: §I, §II.
  • [67] Y. Yang and S. Soatto (2020) Fda: fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4085–4095. Cited by: §I, §II.
  • [68] X. Yue, Y. Zhang, S. Zhao, A. Sangiovanni-Vincentelli, K. Keutzer, and B. Gong (2019) Domain randomization and pyramid consistency: simulation-to-real generalization without accessing target domain data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2100–2110. Cited by: §II, §III-A.
  • [69] H. Zhang and V. M. Patel (2018) Densely connected pyramid dehazing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3194–3203. Cited by: (b)b, §IV-C.
  • [70] G. Zhao, G. Li, R. Xu, and L. Lin (2020) Collaborative training between region proposal localization and classification for domain adaptive object detection. In European Conference on Computer Vision, pp. 86–102. Cited by: §I, §II, (a)a.
  • [71] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §I.
  • [72] Y. Zheng, D. Huang, S. Liu, and Y. Wang (2020) Cross-domain object detection through coarse-to-fine feature adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13766–13775. Cited by: §I, §II, (a)a.
  • [73] Q. Zhou, Z. Feng, Q. Gu, G. Cheng, X. Lu, J. Shi, and L. Ma (2020) Uncertainty-aware consistency regularization for cross-domain semantic segmentation. arXiv preprint arXiv:2004.08878. Cited by: §II.
  • [74] Q. Zhou, Z. Feng, Q. Gu, J. Pang, G. Cheng, X. Lu, J. Shi, and L. Ma (2021) Context-aware mixup for domain adaptive semantic segmentation. arXiv preprint arXiv:2108.03557. Cited by: §II.
  • [75] W. Zhou, Y. Wang, J. Chu, J. Yang, X. Bai, and Y. Xu (2020) Affinity space adaptation for semantic segmentation across domains. IEEE Transactions on Image Processing 30, pp. 2549–2561. Cited by: §I, §II.
  • [76] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin (2019) Adapting object detectors via selective cross-domain alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 687–696. Cited by: §I, §II, (a)a.
  • [77] X. Zhu, H. Zhou, C. Yang, J. Shi, and D. Lin (2018) Penalizing top performers: conservative loss for semantic segmentation adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 568–583. Cited by: §I, §II.
  • [78] Y. Zou, Z. Yu, B. Kumar, and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pp. 289–305. Cited by: §I, §II.
  • [79] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang (2019) Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5982–5991. Cited by: §I, §II.