Decoupled Adaptation for Cross-Domain Object Detection

10/06/2021 ∙ by Junguang Jiang, et al. ∙ Tsinghua University

Cross-domain object detection is more challenging than object classification since multiple objects exist in an image and the location of each object is unknown in the unlabeled target domain. As a result, when we adapt features of different objects to enhance the transferability of the detector, the features of the foreground and the background are easily confused, which may hurt the discriminability of the detector. Besides, previous methods focused on category adaptation but ignored another important part of object detection, i.e., the adaptation of bounding box regression. To this end, we propose D-adapt, namely Decoupled Adaptation, to decouple the adversarial adaptation from the training of the detector. We also fill the gap in regression domain adaptation for object detection by introducing a bounding box adaptor. Experiments show that D-adapt achieves state-of-the-art results on four cross-domain object detection tasks and yields 17% and 21% relative improvement on the benchmark datasets Clipart1k and Comic2k in particular.




1 Introduction

The object detection task has aroused great interest due to its wide applications. In the past few years, the development of deep neural networks has boosted the performance of object detectors (detection; RCNN; yolo). While these detectors have achieved excellent performance on the benchmark datasets (pascal; coco), object detection in the real world still faces challenges from the large variance in viewpoints, object appearance, backgrounds, illumination, image quality, etc. Such domain shifts have been observed to cause a significant performance drop (wild). Thus, some works use domain adaptation (transfer_survey) to transfer a detector from a source domain, where sufficient training data is available, to a target domain where only unlabeled data is available (wild; swda). This technique successfully improves the performance of the detector on the target domain. However, the improvement of domain adaptation in object detection remains relatively mild compared with that in object classification.

The inherent challenges come from three aspects. Data challenge: what to adapt in the adversarial training is unknown. Instance feature adaptation at the object level (Figure 1(a)) might confuse the features of the foreground and the background since the generated proposals may not be true objects and many true objects might be missing (see error analysis in Figure 5). Global feature adaptation at the image level (Figure 1(b)) is likely to mix up features of different objects since each input image contains multiple objects in numerous combinations. Local feature adaptation at the pixel level (Figure 1(c)) can alleviate domain shift when the shift is primarily low-level, yet it struggles when the domains differ at the semantic level. Architecture challenge: while the above adaptation methods encourage domain-invariant features, the discriminability of features might deteriorate in the adaptation process (bsp; harmonizing). However, discriminability is very important for the detector, since it needs to locate and classify objects at the same time. While self-training (self-iccv) does not hurt the discriminability of features, it suffers from confirmation bias, especially when the domain shift is large. Loss challenge: previous methods mainly explored category adaptation and ignored regression adaptation, which is difficult but important for detection performance.

(a) Instance adapt
(b) Global adapt
(c) Local adapt
(d) Decouple adapt
Figure 1: Comparisons among techniques. Most previous methods can be categorized into instance adaptation, global adaptation, or local adaptation, which perform adaptation on the features of the detector. In decoupled adaptation, the adversarial adaptors are decoupled from the detector, and different adaptors are also decoupled. Decouple means that different parts have independent model parameters, independent input data distributions and independent training losses. Different parts are coordinated into some relationships through data rather than gradients, e.g., different adaptors form a cascading relationship while the detector and the adaptors form a self-feedback relationship.

To overcome these challenges, we propose a general framework, D-adapt, namely Decoupled Adaptation. Since adversarial alignment directly on the features of the detector might hurt its discriminability (architecture challenge), we decouple the adversarial adaptation from the training of the detector by introducing a parameter-independent category adaptor (see Figure 1(d)). To fill the gap in regression domain adaptation for cross-domain detection (loss challenge), we introduce another bounding box adaptor that is decoupled from both the detector and the category adaptor. To tackle the data challenge, we propose to adjust the object-level data distribution for specific adaptation tasks. This can be easily achieved by D-adapt, in which different adaptors can have completely different input data distributions. For example, in the category adaptation step, we encourage the input proposals to have IoU (the Intersection-over-Union between the proposal and the ground-truth instance) close to 0 or 1 to better satisfy the low-density separation assumption, while in the bounding box adaptation step, we encourage the input proposals to have IoU within a prescribed range to ease the optimization of the bounding box localization task.

The contributions of this work are three-fold. (1) We introduce the D-adapt framework for cross-domain object detection, which is general for both two-stage and single-stage detectors. (2) We propose an effective method to adapt the bounding box localization task, which is ignored by existing methods but is crucial for achieving superior final performance. (3) We conduct extensive experiments and validate that our method achieves state-of-the-art performance on four object detection tasks, and yields 17% and 21% relative improvement on Clipart1k and Comic2k.

2 Related Work

Generic domain adaptation for classification.

Domain adaptation is proposed to overcome the distribution shift across domains. In the classification setting, most of the domain adaptation methods are based on Moment Matching or Adversarial Adaptation. Moment Matching methods align distributions by minimizing the distribution discrepancy in the feature space. Taking the same spirit as Generative Adversarial Networks (GAN), Adversarial Adaptation (DANN; CDAN) introduces a domain discriminator to distinguish the source from the target, then the feature extractor is encouraged to fool the discriminator and learn domain-invariant features. However, directly applying these methods to object detection yields unsatisfactory results. The difficulty is that an image in object detection usually contains multiple objects, thus the features of an image can have complex multimodal structures (every-pixel; dual; harmonizing), making the image-level feature alignment problematic (dual; every-pixel).

Generic domain adaptation for regression.

Most domain adaptation methods designed for classification do not work well on regression tasks since the regression space is continuous with no clear decision boundary (RegDA). Some specific regression algorithms are proposed, including importance weighting (Yamada2013DomainAF) or learning invariant representations (5640675; NIKZADLANGERODI2020106447). RSD (DAR_ICML_21) defines a geometrical distance for learning transferable representations and disparity discrepancy (MDD) proposes an upper bound for the distribution distance in the regression problems. Yet previous methods are mainly tested on simple tasks while this paper extends domain adaptation to the object localization tasks.

Domain adaptation for object detection.

DA-Faster (wild) performs feature alignment at both image-level and instance-level. SWDA (swda) proposes that strong alignment of the local features is more effective than strong alignment of the global features. Hsu et al. (every-pixel) carry out center-aware alignment by paying more attention to foreground pixels. HTCN (harmonizing) calibrates the transferability of feature representations hierarchically. Zheng et al. (coarse) propose to extract foreground regions and adopt coarse-to-fine feature adaptation. ATF (triway) introduces an asymmetric tri-way approach to account for the differences in labeling statistics between domains. CRDA (exploring) and MCAR (dual) use multi-label classification as an auxiliary task to regularize the features. However, although the auxiliary task of outputting domain-invariant features to fool a domain discriminator, used in most of the aforementioned methods, can improve the transferability, it also impairs the discriminability of the detector. In contrast, we decouple the adversarial adaptation and the training of the detector, thus the adaptors can specialize in transfer between domains, and the detector can focus on improving the discriminability while enjoying the transferability brought by the adaptors.

Self-training with pseudo labels.

Pseudo-labeling (self3), which leverages the model itself to obtain labels on unlabeled data, is widely used in self-training. To generate reliable pseudo labels, temporal ensembling (temporal) maintains an exponential moving average prediction for each sample, while the mean-teacher (meanteacher) averages model weights at different training iterations to get a teacher model. Deep mutual learning (deepmutual) trains a pool of student models with supervision from each other. FixMatch (fixmatch) uses the model’s predictions on weakly-augmented images to generate pseudo-labels for the strongly-augmented ones. Unbiased Teacher (unbiased) introduces the teacher-student paradigm to Semi-Supervised Object Detection (SS-OD). Recent works (progressive; robust; self-iccv) utilize self-training in cross-domain object detection and take the most confident predictions as pseudo labels. MTOR (teacherdetection) uses the mean teacher framework and UMT (umt) adopts distillation and CycleGAN (CycleGAN) in self-training. However, self-training suffers from the problem of confirmation bias (confrimation; curriculum): the performance of the student will be limited by that of the teacher. Although pseudo labels are also used in our proposed D-adapt, they are generated from adaptors that have independent parameters and different tasks from the detector, thereby alleviating the confirmation bias caused by the overly tight relationship in self-training.

3 Proposed Method

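In supervised object detection, we have a labeled source domain $\mathcal{D}_s = \{(X_s^i, B_s^i, Y_s^i)\}_{i=1}^{N_s}$, where $X_s^i$ is the image, $B_s^i$ is the bounding box coordinates, and $Y_s^i$ is the categories. The detector $G^{det}$ is trained with $\mathcal{L}_s^{det}$, which consists of four losses in Faster RCNN (Faster): the RPN classification loss $\mathcal{L}_{cls}^{rpn}$, the RPN regression loss $\mathcal{L}_{reg}^{rpn}$, the RoI classification loss $\mathcal{L}_{cls}^{roi}$ and the RoI regression loss $\mathcal{L}_{reg}^{roi}$,

$$\mathcal{L}_s^{det} = \mathbb{E}_{(X_s, B_s, Y_s) \sim \mathcal{D}_s} \left[ \mathcal{L}_{cls}^{rpn} + \mathcal{L}_{reg}^{rpn} + \mathcal{L}_{cls}^{roi} + \mathcal{L}_{reg}^{roi} \right]. \tag{1}$$

In cross-domain object detection, there exists another unlabeled target domain $\mathcal{D}_t = \{X_t^i\}_{i=1}^{N_t}$ that follows a different distribution from $\mathcal{D}_s$. The objective of $G^{det}$ is to improve the performance on $\mathcal{D}_t$.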

3.1 D-adapt Framework

To deal with the architecture challenge mentioned in Section 1, we propose the D-adapt framework, which has three steps: (1) decouple the original cross-domain detection problem into several sub-problems; (2) design adaptors to solve each sub-problem; (3) coordinate the relationships between different adaptors and the detector.

Since adaptation might hurt the discriminability of the detector, we decouple the category adaptation from the training of the detector by introducing a parameter-independent category adaptor (see Figure 1(d)). The adversarial adaptation is only performed on the features of the category adaptor, and thus will not hurt the detector’s ability to locate objects. To fill the gap in regression domain adaptation for object detection, we need to perform adaptation on the bounding box regression. Yet the feature visualization in Figure 6(c) reveals that features that contain both category and location information do not have an obvious cluster structure, and adversarial alignment might hurt their discriminability. Besides, the common category adaptation methods are also not effective on regression tasks (RegDA), thus we decouple category adaptation and bounding box adaptation to keep them from interfering with each other. Sections 3.2 and 3.3 introduce the design of the category adaptor and the box adaptor in detail. In this section, we assume that these two adaptors are already obtained.

To coordinate the adaptation on different tasks, we maintain a cascading relationship between the adaptors. In the cascading structure, the later adaptors can utilize the information obtained by the previous adaptors for better adaptation, e.g. in the box adaptation step, the category adaptor will select foreground proposals to facilitate the training of the box adaptor. Compared with the multi-task learning relationship where we need to balance the weights of different adaptation losses carefully, the cascade relationship greatly reduces the difficulty of hyper-parameter selection since each adaptor has only one adaptation loss. Since the adaptors are specifically designed for cross-domain tasks, their predictions on the target domain can serve as pseudo labels for the detector. On the other hand, the detector generates proposals to train the adaptors and higher-quality proposals can improve the adaptation performance (see Table 5 for details). And this enables the self-feedback relationship between the detector and the adaptors.

For a good initialization of this self-feedback loop, we first pre-train the detector $G^{det}$ on the source domain with the source detection loss $\mathcal{L}_s^{det}$. Using the pre-trained $G^{det}$, we can derive two new data distributions, the source proposal distribution $\mathcal{D}_s^{prop}$ and the target proposal distribution $\mathcal{D}_t^{prop}$. Each proposal consists of a crop of the image $x$ (we use uppercase letters to represent the whole image and lowercase letters to represent an object instance), its corresponding bounding box $b^{det}$, predicted category $y^{det}$ and the class confidence $c^{det}$. We can annotate each source-domain proposal $x_s$ with a ground truth bounding box $b_s^{gt}$ and category label $y_s^{gt}$, similar to labeling each RoI in Fast RCNN (Fast), and then use these labels to train the adaptors. In turn, for each target proposal $x_t$, the adaptors will provide a category pseudo label $y_t^{cls}$ and a box pseudo label $b_t^{reg}$ to train the RoI heads,

$$\mathcal{L}_t^{det} = \mathbb{E}_{x_t \sim \mathcal{D}_t^{prop}} \left[ \mathcal{L}_{cls}^{roi}(x_t, y_t^{cls}) + \mathbb{1}_{fg}(y_t^{cls}) \cdot \mathcal{L}_{reg}^{roi}(x_t, b_t^{reg}) \right], \tag{2}$$

where $\mathbb{1}_{fg}$ is a function that indicates whether it is a foreground class. Note that the regression loss is activated only for foreground anchors. After obtaining a better detector by optimizing Equation 2, we can generate higher-quality proposals, which facilitate better category adaptation and bounding box adaptation. This process can iterate multiple times and the detailed optimization procedures are summarized in Algorithm 1.

input : Source domain $\mathcal{D}_s$ and target domain $\mathcal{D}_t$, number of iterations $T$
output : Cross-domain object detector $G^{det}$
initialize the object detector $G^{det}$ by optimizing with $\mathcal{L}_s^{det}$;
for $t = 1$ to $T$ do
          generate proposals $\mathcal{D}_s^{prop}$ and $\mathcal{D}_t^{prop}$ for each sample in $\mathcal{D}_s$ and $\mathcal{D}_t$ by $G^{det}$;
          for each mini-batch in $\mathcal{D}_s^{prop}$ and $\mathcal{D}_t^{prop}$ do
                   train the category adaptor;
          end for
          generate category label $y_t^{cls}$ for each proposal in $\mathcal{D}_t^{prop}$;
          generate foreground proposals $\mathcal{D}_s^{fg}$ and $\mathcal{D}_t^{fg}$ from $\mathcal{D}_s^{prop}$ and $\mathcal{D}_t^{prop}$;
          for each mini-batch in $\mathcal{D}_s^{fg}$ and $\mathcal{D}_t^{fg}$ do
                   train the bounding box adaptor;
          end for
          generate bounding box label $b_t^{reg}$ for each proposal in $\mathcal{D}_t^{fg}$;
          train the object detector $G^{det}$ by optimizing with $\mathcal{L}_t^{det}$;
end for
Algorithm 1 D-adapt Training Pipeline.
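The data flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' implementation: the function `run_d_adapt`, the `steps` dictionary, the proposal keys (`y_gt`, `y_cls`, `b_reg`), the `"bg"` label and all toy stand-ins are hypothetical names, and the "detector" here is just an integer counting training rounds.

```python
def run_d_adapt(detector, source, target, num_iters, steps):
    """Data flow of the pipeline; `steps` maps names to user-supplied callables."""
    detector = steps["pretrain"](detector, source)  # source-only initialization
    for _ in range(num_iters):
        # The current detector produces proposals on both domains.
        src_props = steps["propose"](detector, source)
        tgt_props = steps["propose"](detector, target)
        # Category adaptor: trained on proposals, then labels target proposals.
        classify = steps["fit_category"](src_props, tgt_props)
        for p in tgt_props:
            p["y_cls"] = classify(p)
        # Cascade: only foreground proposals reach the box adaptor.
        src_fg = [p for p in src_props if p["y_gt"] != "bg"]
        tgt_fg = [p for p in tgt_props if p["y_cls"] != "bg"]
        regress = steps["fit_box"](src_fg, tgt_fg)
        for p in tgt_fg:
            p["b_reg"] = regress(p)
        # Self-feedback: pseudo labels train the detector for the next round.
        detector = steps["finetune"](detector, tgt_props)
    return detector


# Toy stand-ins (all hypothetical) that only record the order of calls.
calls = []

def _pretrain(det, data):
    calls.append("pretrain")
    return det + 1

def _propose(det, images):
    calls.append("propose")
    return [{"image": im, "y_gt": "car" if i % 2 == 0 else "bg"}
            for i, im in enumerate(images)]

def _fit_category(src_props, tgt_props):
    calls.append("fit_category")
    return lambda p: "car" if p["image"].endswith("0") else "bg"

def _fit_box(src_fg, tgt_fg):
    calls.append("fit_box")
    return lambda p: (0.0, 0.0, 1.0, 1.0)

def _finetune(det, tgt_props):
    calls.append("finetune")
    return det + 1

toy_steps = {"pretrain": _pretrain, "propose": _propose,
             "fit_category": _fit_category, "fit_box": _fit_box,
             "finetune": _finetune}
final_detector = run_d_adapt(0, ["s0", "s1"], ["t0", "t1"], 2, toy_steps)
print(final_detector, calls)
```

The sketch makes the two relationships explicit: the adaptors are cascaded (the box adaptor only sees proposals the category adaptor marked as foreground), while the detector and the adaptors exchange data, not gradients.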

Note that our D-adapt framework does not introduce any computational overhead in the inference phase, since the adaptors are independent of the detector and can be removed during detection. Also, D-adapt does not depend on a specific detector, thus the detector can be replaced by SSD (ssd), RetinaNet (retina), or other detectors.

3.2 Category Adaptation

D-adapt decouples category adaptation from the training of the detector to avoid hurting its discriminability. The goal of category adaptation is to use labeled source-domain proposals $\mathcal{D}_s^{prop}$ to obtain a relatively accurate classification of the unlabeled target-domain proposals $\mathcal{D}_t^{prop}$. Although generic domain adaptation methods, such as DANN (DANN), can be adopted, this task has its own data challenge: the input data distribution does not satisfy the low-density separation assumption well, i.e., the Intersection-over-Union of a proposal and a foreground instance may be any value between 0 and 1 (see Figure 2(a)), which will impede the domain adaptation (RegDA). Recall that in standard object detection, proposals whose IoU falls between the background and foreground thresholds are removed to discretize the input space and ease the optimization of the classification. Yet this can hardly be used in the domain adaptation problem since we cannot obtain the ground truth IoU for target proposals.
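To make the discretization concrete, the following minimal sketch computes the IoU of a proposal with a ground-truth box and buckets it as background, foreground or undefined. The thresholds `bg_max=0.3` and `fg_min=0.7` are placeholder values chosen for illustration, not values taken from this paper, and the helper names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def bucket(value, bg_max=0.3, fg_min=0.7):
    """Hard discretization with placeholder thresholds (assumed values)."""
    if value >= fg_min:
        return "foreground"
    if value < bg_max:
        return "background"
    return "undefined"  # the ambiguous middle region


gt = (0.0, 0.0, 10.0, 10.0)
proposals = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
labels = [bucket(iou(p, gt)) for p in proposals]
print(labels)
```

On the source domain this bucketing is possible because the ground truth is known; on the target domain `gt` is unavailable, which is why a confidence-based soft weighting is needed instead.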

(a) IoU distribution of proposals
(b) Discretization
(c) Architecture of the category adaptor
Figure 2: Category adaptation (best viewed in color). (a) The IoU distribution of the proposals from Foggy Cityscapes. When we increase the confidence threshold from 0 to 0.9, undefined proposals (proposals whose IoU lies between the background and foreground thresholds) will decrease. (b) Proposals with lower confidence will be assigned a lower weight in the adaptation to discretize the feature space in a soft way. (c) The discriminator is trained to separate the source-domain proposals from the target-domain proposals for each class independently, while the feature extractor is encouraged to fool the discriminator.

To overcome the data challenge, we use the confidence of each proposal to discretize the input space in a soft way, i.e., when a proposal has a high confidence of being the foreground or background, it should have a higher weight $w(c)$ in the adaptation training, and vice versa (see Figure 2(b)). This will reduce the participation of proposals that are neither foreground nor background and improve the discreteness of the input space in the sense of probability. We also resample background proposals and add them into $\mathcal{D}_s^{prop}$ and $\mathcal{D}_t^{prop}$ to further increase the discreteness. Then the optimization objective of the discriminator $D$ is,

$$\max_{D} \mathcal{L}_{adv}^{cls} = \mathbb{E}_{x_s \sim \mathcal{D}_s^{prop}} w(c_s) \log D(f_s, g_s) + \mathbb{E}_{x_t \sim \mathcal{D}_t^{prop}} w(c_t) \log \left[ 1 - D(f_t, g_t) \right], \tag{3}$$

where both the feature representation $f = F^{cls}(x)$ and the category prediction $g = G^{cls}(f)$ are fed into the domain discriminator $D$ (see Figure 2(c)). This encourages features to be aligned in a conditional way (CDAN), and thus prevents most target proposals from being aligned to the dominant category on the source domain. The objective of the feature extractor $F^{cls}$ is to separate different categories on the source domain and learn domain-invariant features to fool the discriminator,

$$\min_{F^{cls}, G^{cls}} \mathbb{E}_{(x_s, y_s^{gt}) \sim \mathcal{D}_s^{prop}} \mathcal{L}_{CE}(g_s, y_s^{gt}) + \lambda^{cls} \mathcal{L}_{adv}^{cls}, \tag{4}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss and $\lambda^{cls}$ is the trade-off between the source risk and the domain adversarial loss. After obtaining the adapted classifier, we can generate the category pseudo label $y_t^{cls}$ for each proposal $x_t \in \mathcal{D}_t^{prop}$.
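The effect of confidence weighting on the discriminator objective can be illustrated with a small numeric sketch. Everything here is an assumption made for illustration: the weight `abs(2c - 1)` is a hypothetical stand-in (the weight function actually used is not reproduced here), `c` is taken to be a foreground probability, and `d` is the discriminator's output probability that a proposal comes from the source domain.

```python
import math


def confidence_weight(c):
    """Hypothetical soft weight: 1 for confident fg/bg, 0 for c = 0.5."""
    return abs(2.0 * c - 1.0)


def weighted_adv_objective(src, tgt):
    """Confidence-weighted domain-discrimination objective (to be maximized).

    src, tgt: lists of (confidence, d) pairs, where d in (0, 1) is the
    discriminator's estimated probability of "source domain".
    """
    s = sum(confidence_weight(c) * math.log(d) for c, d in src) / len(src)
    t = sum(confidence_weight(c) * math.log(1.0 - d) for c, d in tgt) / len(tgt)
    return s + t


# Confident proposals contribute fully; an ambiguous one (c = 0.5) contributes
# nothing, no matter how badly the discriminator classifies it.
confident = weighted_adv_objective([(1.0, 0.5)], [(0.0, 0.5)])
ambiguous = weighted_adv_objective([(0.5, 0.01)], [(0.5, 0.99)])
print(confident, ambiguous)
```

Because ambiguous proposals receive a weight near zero, the adversarial game is played mostly on proposals that are clearly foreground or clearly background, which is the soft discretization described above.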

3.3 Bounding Box Adaptation

D-adapt decouples bounding box adaptation from category adaptation to avoid interference between them. The objective of box adaptation is to utilize labeled source-domain foreground proposals $\mathcal{D}_s^{fg}$ to obtain bounding box labels of the unlabeled target-domain proposals $\mathcal{D}_t^{fg}$. Recall that in object detection, the regression loss is activated only for foreground anchors and is disabled otherwise (Fast), thus we only adapt the foreground proposals when training the bounding box regressor. Since the ground truth labels of target-domain proposals are unknown, we use the predictions obtained in the category adaptation step, i.e., the category pseudo labels $y_t^{cls}$, to select the target-domain foreground proposals.

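Following RCNN (RCNN), we adopt a class-specific bounding-box regressor, which predicts the bounding box regression offsets $d^k$ for each of the $K$ foreground classes, indexed by $k$. On the source domain, we have the ground truth category and bounding box label for each proposal, thus we use the smooth $L_1$ loss to train the regressor,

$$\mathcal{L}_s^{reg} = \mathbb{E}_{(x_s, b_s^{det}, b_s^{gt}, y_s^{gt}) \sim \mathcal{D}_s^{fg}} \sum_{k=1}^{K} \mathbb{1}(k = y_s^{gt}) \, \mathrm{smooth}_{L_1}\!\left(d_s^{k} - t_s\right), \tag{5}$$

where $d_s^k$ is the regression prediction for class $k$, $y_s^{gt}$ is the ground truth category, and $t_s$ is the ground truth bounding box offsets calculated from $b_s^{det}$ and $b_s^{gt}$. However, it is hard to obtain a satisfactory regressor with $\mathcal{L}_s^{reg}$ alone on the target domain due to the domain shift.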
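The source-domain regression loss can be illustrated as follows. This sketch assumes the standard R-CNN offset parameterization (center shifts normalized by proposal size, log-scale width and height ratios) and a smooth L1 loss with transition point `beta=1.0`; the function names and the per-class dictionary layout are hypothetical.

```python
import math


def box_offsets(proposal, gt):
    """R-CNN style regression targets from a proposal to its ground-truth box.

    Boxes are (x1, y1, x2, y2); returns (tx, ty, tw, th).
    """
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    px, py = proposal[0] + 0.5 * pw, proposal[1] + 0.5 * ph
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gx, gy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def class_specific_loss(pred_per_class, gt_class, target):
    """Only the offsets predicted for the ground-truth class are penalized."""
    pred = pred_per_class[gt_class]
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


target = box_offsets((0.0, 0.0, 10.0, 10.0), (2.0, 0.0, 12.0, 10.0))
predictions = {"car": (0.0, 0.0, 0.0, 0.0), "bus": (9.0, 9.0, 9.0, 9.0)}
loss = class_specific_loss(predictions, "car", target)
print(target, loss)
```

The wildly wrong `"bus"` prediction does not affect the loss, which mirrors the indicator over the ground-truth category in the class-specific regressor.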

To tackle this loss challenge, we propose an IoU disparity discrepancy method based on the latest domain adaptation theory (MDD). As shown in Figure 3(a), we introduce an adversarial regressor to maximize its disparity with the main regressor on the target domain. We adopt the Generalized Intersection over Union (GIoU) instead of the smooth $L_1$ loss to define the disparity between the two bounding boxes because GIoU is always bounded, while the smooth $L_1$ loss is unbounded and will lead to a numerical explosion during maximization. Then the optimization objective of the adversarial regressor is


Note that the GIoU loss on the source domain is only defined on the box corresponding to the ground truth category, and that on the target domain is only defined on the box associated with the predicted category. Equation 6 guides the adversarial regressor to predict correctly on the source domain while making as many mistakes as possible on the target domain (Figure 3(b)). Then the feature extractor is encouraged to output domain-invariant features to avoid such cases,


where a trade-off hyper-parameter balances the source risk and the adversarial loss. After obtaining the adapted regressor, we can generate a box pseudo label for each target-domain foreground proposal.
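The boundedness argument for GIoU can be checked numerically. The sketch below implements the usual GIoU definition (IoU minus the fraction of the smallest enclosing box not covered by the union), which lies in (-1, 1], and compares it with a plain L1 distance between box coordinates. Helper names are hypothetical and boxes are assumed to have positive area.

```python
def area(b):
    """Area of a box (x1, y1, x2, y2); empty or inverted boxes give 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou(a, b):
    """Generalized IoU of two boxes with positive area."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    enclose = area((min(a[0], b[0]), min(a[1], b[1]),
                    max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (enclose - union) / enclose


def l1_distance(a, b):
    """Sum of absolute coordinate differences; grows without bound."""
    return sum(abs(p - q) for p, q in zip(a, b))


box = (0.0, 0.0, 1.0, 1.0)
far = (100.0, 0.0, 101.0, 1.0)
very_far = (1000.0, 0.0, 1001.0, 1.0)
print(giou(box, far), l1_distance(box, far))
print(giou(box, very_far), l1_distance(box, very_far))
```

Pushing the second box ten times farther multiplies the L1 distance by ten, whereas GIoU only moves closer to -1. An adversary maximizing an L1-type disparity can therefore increase it indefinitely, while a GIoU-based disparity saturates.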

(a) Architecture of the bounding box adaptor
(b) Minimax on IoU
Figure 3: Bounding box adaptation (best viewed in color). The box adaptor has three parts: a feature generator, a regressor and an adversarial regressor. The adversarial regressor learns to maximize the target disparity by moving the two predicted boxes far from each other, while the feature generator learns to minimize the target disparity by making the two predicted boxes overlap as much as possible.

4 Experiments

4.1 Datasets

The following six object detection datasets are used: Pascal VOC (pascal), Clipart (progressive), Comic (progressive), Sim10k (sim10k), Cityscapes (cityscapes) and FoggyCityscapes (foggy). Pascal VOC contains 20 categories of common real-world objects. Clipart contains 1k images and shares 20 categories with Pascal VOC. Comic2k contains 1k training images and 1k test images, sharing 6 categories with Pascal VOC. Sim10k contains images with bounding boxes of the car category, rendered by the gaming engine Grand Theft Auto. Both Cityscapes and FoggyCityscapes have training images and validation images with 8 object categories. Following (swda), we evaluate the domain adaptation performance of different methods on the following four domain adaptation tasks, VOC-to-Clipart, VOC-to-Comic2k, Sim10k-to-Cityscapes, Cityscapes-to-FoggyCityscapes, and report the mean average precision (mAP) at a fixed IoU threshold.

4.2 Implementation Details

Stage 1: Source-domain pre-training. In the basic experiments, Faster-RCNN (Faster) with ResNet-101 (resnet) or VGG-16 (vgg) as the backbone is adopted and pre-trained on the source domain with a learning rate of for 12k iterations.

Stage 2: Category adaptation. The category adaptor has the same backbone as the detector but a simple classification head. It’s trained for iterations using SGD optimizer with an initial learning rate of , momentum , and a batch size of for each domain. The discriminator is a three-layer fully connected networks following DANN (DANN). is kept for all experiments. is when , when and otherwise.

Stage 3: Bounding box adaptation. The box adaptor has the same backbone as the detector but a simple regression head (a two-layer convolutional network). The training hyper-parameters (learning rate, batch size, etc.) are the same as those of the category adaptor. is kept for all experiments.

Stage 4: Target-domain pseudo-label training. The detector is trained on the target domain for iterations, with an initial learning rate of and reducing to exponentially.

The adaptors and the detector are trained in an alternating way for iterations. We perform all experiments on public datasets using a 1080Ti GPU. Code will be available at

Method aero bcycle bird boat bottle bus car cat chair cow table dog hrs bike prsn plnt sheep sofa train tv mAP
Source Only 35.6 52.5 24.3 23.0 20.0 43.9 32.8 10.7 30.6 11.7 13.8 6.0 36.8 45.9 48.7 41.9 16.5 7.3 22.9 32.0 27.8
DA-Faster wild 15.0 34.6 12.4 11.9 19.8 21.1 23.2 3.1 22.1 26.3 10.6 10.0 19.6 39.4 34.6 29.3 1.0 17.1 19.7 24.8 19.8
BDC-Faster swda 20.2 46.4 20.4 19.3 18.7 41.3 26.5 6.4 33.2 11.7 26.0 1.7 36.6 41.5 37.7 44.5 10.6 20.4 33.3 15.5 25.6
WST-BSR wst-bsr 28.0 64.5 23.9 19.0 21.9 64.3 43.5 16.4 42.0 25.9 30.5 7.9 25.5 67.6 54.5 36.4 10.3 31.2 57.4 43.5 35.7
SWDA swda 26.2 48.5 32.6 33.7 38.5 54.3 37.1 18.6 34.8 58.3 17.0 12.5 33.8 65.5 61.6 52.0 9.3 24.9 54.1 49.1 38.1
MAF maf 38.1 61.1 25.8 43.9 40.3 41.6 40.3 9.2 37.1 48.4 24.2 13.4 36.4 52.7 57.0 52.5 18.2 24.3 32.9 39.3 36.8
SCL scl 44.7 50.0 33.6 27.4 42.2 55.6 38.3 19.2 37.9 69.0 30.1 26.3 34.4 67.3 61.0 47.9 21.4 26.3 50.1 47.3 41.5
CRDA exploring 28.7 55.3 31.8 26.0 40.1 63.6 36.6 9.4 38.7 49.3 17.6 14.1 33.3 74.3 61.3 46.3 22.3 24.3 49.1 44.3 38.3
ICR-CCR exploring 28.7 55.3 31.8 26.0 40.1 63.6 36.6 9.4 38.7 49.3 17.6 14.1 33.3 74.3 61.3 46.3 22.3 24.3 49.1 44.3 38.3
HTCN harmonizing 33.6 58.9 34.0 23.4 45.6 57.0 39.8 12.0 39.7 51.3 21.1 20.1 39.1 72.8 63.0 43.1 19.3 30.1 50.2 51.8 40.3
ATF triway 41.9 67.0 27.4 36.4 41.0 48.5 42.0 13.1 39.2 75.1 33.4 7.9 41.2 56.2 61.4 50.6 42.0 25.0 53.1 39.1 42.1
Unbiased unbiased 30.9 51.8 27.2 28.0 31.4 59.0 34.2 10.0 35.1 19.6 15.8 9.3 41.6 54.4 52.6 40.3 22.7 28.8 37.8 41.4 33.6
D-adapt 49.3 63.8 40.1 34.4 49.5 87.3 51.2 33.3 47.5 59.1 27.9 22.4 42.3 73.9 68.2 49.1 24.9 35.1 58.9 64.6 49.1
Table 1: Results from PASCAL VOC to Clipart (ResNet101).

4.3 Comparison with State-of-the-Arts

Adaptation between dissimilar domains.

We first show experiments on dissimilar domains using the Pascal VOC Dataset as the source domain and Clipart as the target domain. Table 1 shows that our proposed method outperforms the state-of-the-art method by 7.0 points on mAP. Figure 4 presents some qualitative results in the target domain. We also compare with Unbiased Teacher (unbiased), the state-of-the-art method in semi-supervised object detection, which generates pseudo labels on the target domain from the teacher model. Due to the large domain shift, the prediction from the teacher detection model is unreliable, thus it does not perform well. In contrast, our method alleviates the confirmation bias problem by generating pseudo labels from different models (adaptors).

Method bike bird car cat dog prsn mAP
Source Only 32.5 12.0 21.1 10.4 12.4 29.9 19.7
DA-Faster wild 31.1 10.3 15.5 12.4 19.3 39.0 21.2
SWDA swda 36.4 21.8 29.8 15.1 23.5 49.6 29.4
MCAR dual 47.9 20.5 37.4 20.6 24.5 53.6 33.5
Instance Adapt 39.5 17.7 26.5 27.3 22.4 48.4 30.3
Global Adapt 31.9 15.7 30.3 21.3 17.1 37.9 25.7
D-adapt 52.4 25.4 42.3 43.7 25.7 53.5 40.5
Oracle 42.2 35.3 31.9 46.2 40.9 70.9 44.6
Table 2: Results from VOC to Comic.

We also use Comic2k as the target domain, which has a very different style from Pascal VOC and a lot of small objects (few methods report results on this dataset because of its difficulty). As shown in Table 2, both image-level and instance-level feature adaptation will fall into the dilemma of transferability and discriminability, and do not work well on this difficult dataset. In contrast, our method effectively solves this problem by decoupling the adversarial adaptation from the training of the detector and improves mAP by 7.0 compared with the state-of-the-art.
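The relative improvements quoted for Clipart1k and Comic2k follow from the mAP values in Tables 1 and 2. The short calculation below uses the D-adapt rows and the strongest prior method in each table (ATF with 42.1 on Clipart, MCAR with 33.5 on Comic2k); the helper name `relative_gain` is ours.

```python
def relative_gain(ours, baseline):
    """Relative improvement over a baseline, in percent."""
    return (ours - baseline) / baseline * 100.0


clipart_gain = relative_gain(49.1, 42.1)  # D-adapt vs. ATF, Table 1
comic_gain = relative_gain(40.5, 33.5)    # D-adapt vs. MCAR, Table 2
print(f"Clipart: {clipart_gain:.1f}%  Comic2k: {comic_gain:.1f}%")
```

Both absolute gains are 7.0 mAP points. Because the Comic2k baseline is lower, the same absolute gain yields a larger relative gain (about 21% versus about 17%).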

Figure 4: Qualitative results on the target domain.

Adaptation from synthetic to real images.

We use Sim10k as the source domain and Cityscapes as the target domain. Following (swda), we evaluate on the validation split of Cityscapes and report the AP on car. Table 3 shows that our method surpasses all other methods.

Method Backbone AP on Car
Source Only VGG-16 34.6
DA-Faster wild 38.9
BDC-Faster swda 31.8
SWDA swda 40.1
MAF maf 41.1
Selective DA selective_da 43.0
CDN CDN 49.3
HTCN* harmonizing 42.5
ATF triway 42.8
CADA every-pixel 49.0
MeGA-CDA mega 44.8
UMT* umt 43.1
D-adapt 50.3
Oracle 69.7
Source-only ResNet101 41.8
CADA every-pixel 51.2
D-adapt 53.2
Oracle 70.4
Table 3: Sim10k to Cityscapes.
Method Backbone prsn rider car truck bus train mcycle bcycle mAP
Source only VGG-16 25.1 32.7 31.0 12.5 23.9 9.1 23.7 29.1 23.4
DA-Faster wild 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 27.7
BDC-Faster swda 26.4 37.2 42.4 21.2 29.2 12.3 22.6 28.9 27.5
SW-DA swda 36.2 35.3 43.5 30.0 29.9 42.3 32.6 24.5 34.3
Selective DA selective_da 33.5 38.0 48.5 26.5 39.0 23.3 28.0 33.6 33.8
DD-MRL* diversify 30.8 40.5 44.3 27.2 38.4 34.5 28.4 32.2 34.5
CADA every-pixel 41.9 38.7 56.7 22.6 41.5 26.8 24.6 35.5 36.0
ATF triway 34.6 47.0 50.0 23.7 43.3 38.7 33.4 38.8 38.7
MCAR dual 32.0 42.1 43.9 31.3 44.1 43.4 37.4 36.6 38.8
HTCN* harmonizing 33.2 47.5 47.9 31.6 47.4 40.9 32.3 37.1 39.8
D-adapt 44.3 48.1 54.6 28.6 34.4 28.5 33.8 41.1 39.2
D-adapt* 44.9 54.2 61.7 25.6 36.3 24.7 37.3 46.1 41.3
Oracle 47.4 40.8 66.8 27.2 48.2 32.4 31.2 38.3 41.5
Source-only ResNet101 33.8 34.8 39.6 18.6 27.9 6.3 18.2 25.5 25.6
CADA every-pixel 41.5 43.6 57.1 29.4 44.9 39.7 29.0 36.1 40.2
D-adapt 42.8 48.4 56.8 31.5 42.8 37.4 35.2 42.4 42.2
D-adapt* 40.8 47.1 57.5 33.5 46.9 41.4 33.6 43.0 43.0
Oracle 44.7 43.9 64.7 31.5 48.8 44.0 31.0 36.7 43.2
Table 4: Results from Cityscapes to Foggy Cityscapes.

Adaptation between similar domains.

We perform adaptation from Cityscapes to FoggyCityscapes and report the results in Table 4 (* denotes that the method utilizes CycleGAN to perform source-to-target translation). Note that since the two domains are relatively similar, the performance of adaptation is already close to the oracle results.

4.4 Ablation Studies

In this part, we analyze the performance of both the detector and the adaptors. Let $n_{ij}$ be the number of proposals of class $i$ predicted as class $j$, $t_i$ be the total number of proposals of class $i$, and $N$ be the number of classes (including the background). From these quantities we compute the metric that measures the overall performance of the category adaptor. We use the intersection-over-union between the predicted bounding boxes and the ground truth boxes to measure the performance of the bounding box adaptor. All ablations are performed on VOC-to-Clipart and the number of iterations is kept at 1 for a fair comparison.

Ablation on the category adaptation.

IoU threshold 0.05 0.3 0.5 0.7
36.1 38.2 46.7 51.4
Table 5: Effect of proposals’ quality.

Table 6(a) shows the effectiveness of several specific designs mentioned in Section 3.2. Among them, the weight mechanism has the greatest impact, indicating the necessity of the low-density separation assumption in the adversarial adaptation. To verify this, we assume that the ground truth IoU of each proposal is known, and then select only the proposals with IoU greater than a certain threshold when we train the category adaptor. Table 5 shows that as the IoU threshold of the foreground proposals increases from 0.05 to 0.7, the accuracy of the category adaptor increases from 36.1 to 51.4, which shows the importance of the low-density separation assumption.

metric ours w/o condition w/o bg proposals w/o weight w/o adaptor
38.2 36.9 - 36.6 33.6 25.1 17.2 12.6
mAP 43.5 41.7 - 41.7 38.8 36.5 33.3 28.0
(a) Category adaptation
metric Ours w/o DD w/o adaptor
0.631 0.598 0.531
mAP 45.0 44.4 43.5
(b) Spatial Adaptation
metric standard way ours
0.1 0.3 0.5 0.7 0.1
17.2 17.6 17.1 16.3 38.2
mAP 38.9 37.3 35.9 34.4 43.5
(c) Training strategy
Table 6: Ablations on PASCAL VOC to Clipart. Note that no bounding box adaptation is adopted in (a) and (c) for a fair comparison. (a) Category adaptation. w/o condition: use a class-independent discriminator. w/o bg proposals: no background proposals added to source domain or target domain or neither. w/o weight: remove the weight mechanism in Equation 3. w/o adaptor: remove the category adaptation step and directly use the labels generated from detector on the target domain as pseudo labels. (b) Spatial Adaptation. w/o DD: remove the disparity discrepancy in Equation 6. w/o adaptor: remove the bounding box adaptation step and only trains the classification branch of the detector. (c) Training strategy. In the standard training, if the confidence threshold increases, the number of false negatives will increase, otherwise the number of false positives will increase.

Ablation on the bounding box adaptation.

Table 6(b) illustrates that minimizing the disparity discrepancy improves the performance of the box adaptor, and that bounding box adaptation improves the performance of the detector in the target domain. The gain brought by box adaptation is consistent: for example, when , it can still improve the mAP from to .

Ablation on the training strategy with pseudo-labels.

In Equation 2, losses are only calculated on the regions where the proposals are located, and those anchor areas overlapping with the proposals are ignored. Here, we compare this strategy with the common practice in self-training: filter out bounding boxes with low confidence, then label each proposal that overlaps with the remaining boxes. Although the category labels of these bounding boxes are also generated by the category adaptor, the accuracy of the resulting proposals is low (see Table 6(c)). In contrast, our strategy is more conservative, and both the on the proposals and the final mAP of the detector are higher.
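The weakness of confidence filtering can be illustrated with a toy calculation. The sketch below is not taken from the paper: the detections, their scores, the `correct` flags, and the helper `count_errors` are hypothetical, and the thresholds are only chosen to resemble those in the training-strategy ablation.

```python
def count_errors(detections, threshold, num_objects):
    """Return (false_positives, false_negatives) after confidence filtering.

    detections: list of dicts with a confidence "score" and a "correct" flag.
    num_objects: number of ground-truth objects; each correct detection is
    assumed (for this toy example) to match a distinct object.
    """
    kept = [d for d in detections if d["score"] >= threshold]
    true_pos = sum(1 for d in kept if d["correct"])
    false_pos = len(kept) - true_pos
    false_neg = num_objects - true_pos
    return false_pos, false_neg


# Hypothetical pseudo-label candidates on one target-domain image.
detections = [
    {"score": 0.90, "correct": True},
    {"score": 0.60, "correct": True},
    {"score": 0.40, "correct": False},
    {"score": 0.20, "correct": True},
    {"score": 0.15, "correct": False},
]

# Sweep the confidence threshold and record the two error counts.
results = {
    t: count_errors(detections, t, num_objects=3)
    for t in (0.1, 0.3, 0.5, 0.7)
}
```

In this toy sweep, no single threshold removes both error types: a low threshold admits false positives, while a high one discards true objects that are then treated as background.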

4.5 Analysis

Method aero bcycle bird boat bottle bus car cat chair cow table dog hrs bike prsn plnt sheep sofa train tv mAP
Source Only 30.1 40.8 21.7 15.3 28.4 51.6 33.1 13.1 34.5 14.2 29.6 16.2 21.4 53.1 37.4 30.3 6.9 24.8 31.8 42.1 28.8
Global Adapt 33.2 43.4 23.8 24.5 43.4 54.9 36.5 6.5 36.0 19.1 26.4 13.0 23.6 49.4 52.6 39.8 5.8 27.6 39.1 54.1 32.6
Local Adapt 31.0 28.3 26.2 18.2 42.2 53.5 33.6 18.4 37.2 33.2 28.7 14.3 33.4 54.6 48.7 40.4 6.8 30.4 42.1 48.1 33.4
Global + Local 37.5 50.4 25.3 28.8 45.0 51.7 45.9 16.9 38.2 31.9 24.2 12.6 26.4 48.7 53.4 44.5 5.5 28.2 45.7 53.5 35.7
D-adapt 50.8 67.7 42.5 37.6 50.8 58.7 57.1 29.6 48.1 58.5 29.4 22.4 43.5 67.0 64.3 52.7 28.1 37.4 60.0 64.9 48.6
Table 7: Results from PASCAL VOC to Clipart (RetinaNet, ResNet101).
Figure 5: Error analysis.

Error Analysis.

Figure 5 shows the percentage of each error type for each model on VOC to Clipart, following (tide). The main errors in the target domain are Miss (ground truth regarded as background) and Cls (classified incorrectly). Loc errors (classified correctly but localized incorrectly) are slightly less frequent but still cannot be ignored, especially after category adaptation, which implies the necessity of box adaptation in object detection. Category adaptation effectively reduces the proportion of Cls errors while increasing that of Loc errors, so it is reasonable to cascade the box adaptor after the category adaptor.
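The three error types can be made concrete with a simplified categorization routine. This is a rough sketch in the spirit of the cited analysis, not its actual implementation: the function names, the foreground and background IoU thresholds of 0.5 and 0.1, the catch-all "Other" bucket, and the toy boxes are assumptions, and duplicate detections are not handled.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def analyze(preds, gts, fg_thr=0.5, bg_thr=0.1):
    """Count Correct / Cls / Loc / Other detections and Miss ground truths.

    preds and gts are lists of (box, label) pairs. Thresholds are assumed
    values; duplicate detections are counted as correct for simplicity.
    """
    counts = {"Correct": 0, "Cls": 0, "Loc": 0, "Other": 0, "Miss": 0}
    covered = set()  # indices of ground truths explained by some detection
    for box, label in preds:
        best_same, same_idx = 0.0, None
        best_other, other_idx = 0.0, None
        for idx, (gt_box, gt_label) in enumerate(gts):
            overlap = iou(box, gt_box)
            if gt_label == label and overlap > best_same:
                best_same, same_idx = overlap, idx
            elif gt_label != label and overlap > best_other:
                best_other, other_idx = overlap, idx
        if best_same >= fg_thr:
            counts["Correct"] += 1
            covered.add(same_idx)
        elif best_other >= fg_thr:
            counts["Cls"] += 1  # well localized, wrong class
            covered.add(other_idx)
        elif best_same >= bg_thr:
            counts["Loc"] += 1  # right class, poorly localized
            covered.add(same_idx)
        else:
            counts["Other"] += 1
    counts["Miss"] = len(gts) - len(covered)  # ground truths left as background
    return counts


# Toy image: four objects, three detections (one correct, one Cls, one Loc).
gts = [((0, 0, 10, 10), "cat"), ((20, 20, 30, 30), "dog"),
       ((40, 40, 50, 50), "bird"), ((60, 60, 70, 70), "car")]
preds = [((0, 0, 10, 10), "cat"), ((20, 20, 30, 30), "cat"),
         ((40, 40, 50, 43), "bird")]
counts = analyze(preds, gts)
```

Under these assumptions, a detection that becomes correctly classified but remains poorly localized moves from the Cls bucket to the Loc bucket, which is consistent with the shift described above.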

Beyond Faster-RCNN.

As shown in Table 7, our method also applies to the one-stage detector RetinaNet (retina): on VOC to Clipart, D-adapt reaches 48.6 mAP, compared with 28.8 for the source-only model.
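The mAP column of Table 7 is the mean of the 20 per-class APs, which can be checked directly from the reported rows. The snippet below only re-uses the Source Only and D-adapt numbers from Table 7; the variable and function names are illustrative.

```python
# Class names and per-class AP values copied from Table 7.
classes = ("aero bcycle bird boat bottle bus car cat chair cow "
           "table dog hrs bike prsn plnt sheep sofa train tv").split()

source_only = [30.1, 40.8, 21.7, 15.3, 28.4, 51.6, 33.1, 13.1, 34.5, 14.2,
               29.6, 16.2, 21.4, 53.1, 37.4, 30.3, 6.9, 24.8, 31.8, 42.1]
d_adapt = [50.8, 67.7, 42.5, 37.6, 50.8, 58.7, 57.1, 29.6, 48.1, 58.5,
           29.4, 22.4, 43.5, 67.0, 64.3, 52.7, 28.1, 37.4, 60.0, 64.9]


def mean_ap(per_class_ap):
    """mAP as the unweighted mean of per-class average precision."""
    return sum(per_class_ap) / len(per_class_ap)


map_source = round(mean_ap(source_only), 1)
map_d_adapt = round(mean_ap(d_adapt), 1)

# Per-class change in AP from Source Only to D-adapt.
gains = {c: round(d - s, 1) for c, s, d in zip(classes, source_only, d_adapt)}
largest_gain = max(gains, key=gains.get)
smallest_gain = min(gains, key=gains.get)
```

The recomputed means agree with the mAP values reported in the table, and the per-class dictionary makes it easy to see which categories benefit most from adaptation.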

Feature visualization.

We use t-SNE (tsne) to visualize, in Figures 6(a)-6(b), the representations for the task VOC to Comic2k (6 classes) produced by category adaptor with and category adaptor with . The source and target are well aligned in the latter, which indicates that it learns domain-invariant features. We also extract box features from the detector to obtain Figures 6(c)-6(d). We find that the features of the detector do not have an obvious cluster structure, even on the source domain. The reason is that the features of the detector contain both category information and location information. Thus adversarial adaptation applied directly to the detector will hurt its discriminability, while our method achieves better performance through decoupled adaptation.

(a) Adaptor ()
(b) Adaptor ()
(c) Baseline (mAP:19.7)
(d) Ours (mAP:40.5)
Figure 6: T-SNE visualization of features. (a) and (b) are features from the category adaptor. (c) and (d) are features from the Faster RCNN. (Orange: VOC; Blue: Comic2k).

5 Discussion and Conclusion

Our method achieved considerable improvement on several benchmark datasets for domain adaptation. In actual deployment, the detection performance can be further boosted by employing stronger adaptors without introducing any computational overhead since the adaptors can be removed during inference. It is also possible to extend the D-adapt framework to other detection tasks, e.g., instance segmentation and keypoint detection, by cascading more specially designed adaptors. We hope D-adapt will be useful for the wider application of detection tasks.


Appendix A More qualitative results.

Figures 7-10 give more qualitative results on Faster RCNN.

Figure 7: Qualitative results on VOC to Clipart.
Figure 8: Qualitative results on VOC to Comic.
Figure 9: Qualitative results on Sim10k to Cityscapes.
Figure 10: Qualitative results on Cityscapes to Foggy Cityscapes.