
Self-Guided Adaptation: Progressive Representation Alignment for Domain Adaptive Object Detection

by   Zongxian Li, et al.

Unsupervised domain adaptation (UDA) has achieved unprecedented success in improving the cross-domain robustness of object detection models. However, existing UDA methods largely ignore the instantaneous data distribution during model learning, which can deteriorate the feature representation under large domain shift. In this work, we propose a Self-Guided Adaptation (SGA) model, which targets aligning feature representations and transferring object detection models across domains while considering the instantaneous alignment difficulty. The core of SGA is to calculate "hardness" factors for sample pairs, indicating the domain distance in a kernel space. With the hardness factor, the proposed SGA adaptively indicates the importance of samples and assigns them different constraints. Guided by hardness factors, Self-Guided Progressive Sampling (SPS) is implemented in an "easy-to-hard" way during model adaptation. Using multi-stage convolutional features, SGA is further aggregated to fully align hierarchical representations of detection models. Extensive experiments on commonly used benchmarks show that SGA outperforms state-of-the-art methods by significant margins, while demonstrating effectiveness under large domain shift.



1 Introduction

Convolutional neural networks (CNNs) [23] have become a prevalent model for computer vision tasks such as image classification [23, 42] and object detection [14, 38, 28]. Nevertheless, many CNN models, particularly CNN-based object detectors, require a large amount of annotated training data, which is costly and time-consuming to collect. Transferring detection models trained on a label-rich domain (publicly annotated datasets) to an unlabeled domain (real-world scenarios) in an unsupervised way has therefore attracted increasing interest recently [6, 39, 50, 47].

Figure 1: The Self-Guided Adaptation (SGA) approach adaptively samples image pairs around the domain boundary, so that the adaptation procedure is implemented in a progressive "easy-to-hard" manner (top). In contrast, conventional approaches randomly select samples while ignoring the sample distribution during adaptation (bottom).

As the core of transfer learning, unsupervised domain adaptation (UDA) has been extensively explored for model and feature transfer [34, 33, 29, 46]. Early UDA methods mainly focused on image recognition, and have recently shifted to object detection [6, 50]. The well-known Faster R-CNN [38] has been made domain adaptive by using adversarial learning to align features between source and target domains. Subsequently, [39] and [50] attempted to reduce domain discrepancy at the global and region levels, respectively.

In a broad view, most UDA methods achieve cross-domain feature alignment rooted in adversarial learning, but largely ignore the sampling strategy for each mini-batch when optimizing domain adaptation models. It is therefore unreasonable to align feature representations under a fixed constraint without considering the instantaneous domain shift. Accordingly, immutable sampling strategies, which assume that each sample is of equal importance for adaptation, are implausible. The sample distribution is dynamic: some source domain samples have already fallen into the target domain (easy-to-align), while others remain far from it (hard-to-align), Fig. 1. In addition, [50, 6, 4] attempted to align feature distributions on generated region candidates. Nevertheless, directly aligning at the instance level is implausible, as it is hard to generate precise region proposals under large domain shift. Recent research [39] attempted to solve the problem by introducing the Focal Loss [26] to weight and reduce the impact of hard samples. Nevertheless, the weights applied to training samples are still fixed and cannot adapt to changes of domain shift and variations of the sample distribution.

In this study, we propose a Self-Guided Adaptation (SGA) model with a Self-Guided Progressive Sampling (SPS) strategy, which targets aligning feature representations and transferring detection models across domains while considering the instantaneous alignment difficulty. Our SGA with SPS is inspired by the self-paced curriculum [20], which simulates the learning process of humans and gradually proceeds from easy to complex samples, progressively aligning representations with full respect to the instantaneous domain shift.

The progressive representation alignment is implemented with adversarial learning by fully considering the hardness of sample pairs (one from the source domain and the other from the target domain) to be aligned. The hardness for each sample pair is defined on the feature distance between its two samples in a Reproducing Kernel Hilbert Space (RKHS). According to the hardness, we dynamically adjust the constraint for adversarial learning and implement domain adaptation in a progressive “easy-to-hard” manner. During the learning procedure, the model tends to align easy sample pairs at early iterations and gradually shifts to hard ones at later iterations, Fig. 1 (upper).

The contributions of this work are summarized as follows:

  • A Self-Guided Adaptation model (SGA), which implements the representation alignment dynamically by fully leveraging the instantaneous sample hardness defined in the Reproducing Kernel Hilbert Space (RKHS).

  • A Self-Guided Progressive Sampling (SPS) strategy based on the instantaneous sample hardness, which is able to leverage instantaneous sample distances for progressive sampling and representation alignment.

  • State-of-the-art performances on commonly used benchmarks and significant effectiveness over various domain shift settings.

Figure 2: Self-Guided Adaptation (SGA) model. In the first step, the Maximum Mean Discrepancy is calculated over each mini-batch in the RKHS, which indicates the alignment hardness of a pair of samples (x^s, x^t) from the source and target domains. In the second step, a hardness-guided loss is designed for the domain discriminator D, with which we learn the domain-adaptive feature generator G in a progressive and adversarial manner.

2 Related Work

In this section, we first review UDA methods from a general perspective. We then review domain adaptive object detection approaches.

Unsupervised Domain Adaptation (UDA). UDA aims to minimize the performance drop when transferring a model trained on a label-rich domain/dataset to an unlabelled domain/dataset. In the past few years, UDA has been extensively explored across broad fields of computer vision, including object classification [25, 32, 35, 29], object detection [6, 39, 50, 22], and person re-identification [9, 49, 36]. The key to UDA is aligning the feature distributions of source and target domains. To this end, theoretical analyses of domain/dataset shift are given in [34, 33, 29], by measuring the feature distance between different domains/datasets [11, 29].

Based on this analysis, one line of UDA methods aligns feature representations by minimizing the domain distance. Maximum Mean Discrepancy (MMD) [15], a domain distance metric, was proposed to minimize the domain shift in a Reproducing Kernel Hilbert Space (RKHS) [29, 30, 7]. Another line of methods [17, 3, 27] attempted to reduce the domain shift by taking advantage of adversarial discrimination to confuse source and target domains while aligning feature distributions. The representative CyCADA [17] transferred samples across domains at both the pixel and feature levels. Domain confusion losses [12, 1] were designed to learn domain-invariant features. Saito et al. [40] aligned distributions of source and target domains by maximizing the discrepancy of classifiers' outputs. In addition, training adaptation models with pseudo labels has received increasing attention; a recent method achieved progressive alignment by assigning pseudo-labels to easy samples with respect to the intra-class distribution variance.

Domain Adaptive Object Detection. UDA has attracted renewed interest in the object detection area since 2018 [6], with the key idea of aligning the feature distributions between source and target domains. Saito et al. pointed out that more emphasis should be put on images that are globally similar and proposed the strong-weak distribution alignment [39].

Different from image classification, which considers a holistic image, domain adaptive object detection focuses on local regions [50]. Strongly matching the entire distributions of source and target images to each other at the image level may fail, as domains have distinct scene layouts and different combinations of objects. A Domain Adaptive Faster R-CNN (DA-Faster R-CNN) [6] was proposed to minimize the discrepancy between two domains by exploring both image- and instance-level domain classifiers in an adversarial manner. A similar motivation was used to align feature representations across domains on enlarged positive regions [50]. Mean Teacher with object relations [4] was also considered, which addressed adaptive detection from the viewpoint of graph-structured consistency. Moreover, reducing the domain gap by bridging an intermediate domain between the source and target domains was adopted by [22, 18].

Many existing works extensively investigated global- or region-level representation alignment, but unfortunately ignored the instantaneous alignment difficulty and sampling strategy during model learning, which can deteriorate the feature representation under a large domain shift. In this work, we propose a Self-Guided Adaptation strategy, which targets aligning representations and transferring models across domains while considering instantaneous sample distances. This different yet novel perspective makes our approach complementary to many existing domain adaptive approaches.

3 The Proposed Approach

In this section, we first describe the Self-Guided Adaptation (SGA) model based on sample hardness and adversarial learning, Fig. 2. We then detail the Self-Guided Progressive Sampling (SPS) for progressive representation alignment based on the proposed SGA. The entire procedure of representation alignment is organized by simulating the learning process of humans, gradually proceeding from easy to hard samples. The two-stage object detector, Faster R-CNN, is employed as the base detector, and the alignment is operated on the backbone network, which includes features from three convolutional stages.

3.1 Self-Guided Adaptation Model

Figure 3: Illustration of the proposed approach for domain adaptive object detection. Three Self-Guided Adaptation (SGA) models are applied on three convolutional stages for representation alignment. The Self-Guided Progressive Sampling (SPS) adaptively selects samples so that the entire adaptation procedure is implemented in a progressive easy-to-hard manner. Markers indicate selected and discarded samples.

In the UDA setting, a pair of samples is composed of a labeled image x^s with full supervision (i.e., bounding boxes and categories) from the source domain S, and an unlabeled image x^t without any supervision from the target domain T. The objective of SGA is to perform model adaptation from S to T while considering the instantaneous domain shift. To fulfill this purpose, a hardness factor h is defined for each sample pair based on the Maximum Mean Discrepancy (MMD) distance in each mini-batch.

Self-Guided Hardness. As a classical metric for comparing distributions based on a Reproducing Kernel Hilbert Space (RKHS) [15], denoted as H, MMD has been widely used for minimizing domain shift in the field of transfer learning. By embedding distributions into an infinite-dimensional feature space, it preserves all of the statistical features of arbitrary distributions while allowing one to compare and manipulate distributions using the Hilbert space inner product [44].

Considering two distributions S and T, which respectively represent the source and target domains, MMD is defined as

$$\mathrm{MMD}(\mathcal{S}, \mathcal{T}) = \sup_{f \in \mathcal{F}} \Big( \mathbb{E}_{x^s \sim \mathcal{S}}\big[f(x^s)\big] - \mathbb{E}_{x^t \sim \mathcal{T}}\big[f(x^t)\big] \Big), \tag{1}$$

where F denotes the unit ball of functions in an RKHS. As shown in Fig. 2 (left), two sample images are fed to a feature extractor (G) to extract features, based on which the instantaneous MMD is computed in each learning iteration.

Denote X^s = {x_i^s}_{i=1}^{n_s} and X^t = {x_j^t}_{j=1}^{n_t} as the output features in a mini-batch, where H is a Hilbert space with inner product ⟨·,·⟩_H and corresponding norm ‖·‖_H. The empirical estimate of MMD can be rewritten as

$$\widehat{\mathrm{MMD}}(X^s, X^t) = \Big\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t) \Big\|_{\mathcal{H}}, \tag{2}$$

where φ(·) represents the kernel distance mapping φ: X → H [15], and n_s and n_t denote the numbers of source and target samples in a batch, respectively. According to [2], the vector-matrix multiplication form of the squared MMD is calculated as

$$\widehat{\mathrm{MMD}}^2(X^s, X^t) = \frac{1}{n_s^2}\sum_{i=1}^{n_s}\sum_{j=1}^{n_s} k(x_i^s, x_j^s) - \frac{2}{n_s n_t}\sum_{i=1}^{n_s}\sum_{j=1}^{n_t} k(x_i^s, x_j^t) + \frac{1}{n_t^2}\sum_{i=1}^{n_t}\sum_{j=1}^{n_t} k(x_i^t, x_j^t), \tag{3}$$

where k(x, x') = ⟨φ(x), φ(x')⟩_H is a Radial Basis Function (RBF) kernel.

With Eq. 3, we calculate the MMD-based hardness h for each sample pair between the source and target domains. This hardness is used as a loss to minimize the domain shift, as well as a self-guided metric that determines the constraint on the representation alignment between sample pairs. Sample pairs with a large MMD distance are hard to align, and vice versa; we then adaptively assign losses to sample pairs in a self-guided manner.
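As an illustration, the squared-MMD hardness of Eq. 3 can be sketched in a few lines of NumPy (a minimal sketch; the function names and the kernel bandwidth sigma are our own choices, not from the paper):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise RBF kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)).
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd_hardness(feat_s, feat_t, sigma=1.0):
    # Squared MMD between source and target feature batches (Eq. 3):
    # mean self-similarities minus twice the mean cross-similarity.
    n_s, n_t = len(feat_s), len(feat_t)
    return (rbf_kernel(feat_s, feat_s, sigma).sum() / n_s ** 2
            - 2.0 * rbf_kernel(feat_s, feat_t, sigma).sum() / (n_s * n_t)
            + rbf_kernel(feat_t, feat_t, sigma).sum() / n_t ** 2)
```

Identical batches give a hardness of zero, and the value grows with the distance between the two feature distributions, matching the intended "easy vs. hard" interpretation.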

Hardness-guided Adaptation. The adversarial learning framework is constructed by combining a feature generator (G) and a domain discriminator (D) with a Gradient Reverse Layer (GRL) module [11]. G is used to extract domain features, while D is required to predict the probability of the domain label d. D is trained to distinguish source and target samples, while G is trained to deceive D through the reversed gradient. In this way, G tends to learn feature representations that cover both the source and target domains.

Based on the adversarial learning framework, we construct a hardness-guided model for representation alignment, as illustrated in Fig. 2 (right). To introduce sample hardness into the adversarial procedure, we first replace the Cross-Entropy domain classification loss of D with the Focal Loss [39], which assigns larger weights to easier samples (close to the domain boundary) and smaller weights to harder ones (far away from the domain boundary)¹. Denoting p as discriminator D's estimated probability, the hardness-guided Focal Loss is defined as

¹A sample pair is considered easy if it is hard to classify under the domain adaptation setting, which is the opposite of general classification.


$$\mathcal{L}_{FL}(p, d) = -(1 - p_t)^{h}\,\log(p_t), \tag{4}$$

The hardness factor h, defined in the RKHS, is calculated by Eq. 3 and constrains adversarial learning dynamically. d denotes the domain label, which is assigned 1 for a sample from the source domain and 0 otherwise. p_t equals the model's estimated probability p when d = 1 and 1 - p otherwise. Specifically, for a pair of training samples (x^s, x^t), the domain adaptive adversarial loss function is formulated as

$$\mathcal{L}_{adv}(x^s, x^t) = \mathcal{L}_{FL}\big(D(G(x^s)), 1\big) + \mathcal{L}_{FL}\big(D(G(x^t)), 0\big), \tag{5}$$

which describes an adversarial learning procedure under the constraint of the sample hardness h. With h estimated in the RKHS, SGA adaptively indicates the importance of samples and assigns them different constraints.
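A minimal sketch of the hardness-guided loss, under our reading that the hardness h serves as the focal modulating exponent (the function and symbol names here are ours, not the paper's):

```python
import numpy as np

def hardness_focal_loss(p, d, h):
    # Hardness-guided focal loss.
    # p: discriminator's estimated probability that the sample is from the source domain.
    # d: domain label (1 = source, 0 = target).
    # h: hardness factor from the MMD distance, used as the modulating
    #    exponent (an assumption of this sketch).
    p_t = p if d == 1 else 1.0 - p
    return -((1.0 - p_t) ** h) * np.log(p_t + 1e-12)

def adversarial_loss(p_src, p_tgt, h):
    # Pairwise adversarial loss: source labeled 1, target labeled 0.
    return hardness_focal_loss(p_src, 1, h) + hardness_focal_loss(p_tgt, 0, h)
```

With a GRL in place, minimizing this loss trains D to separate domains while the reversed gradient pushes G to confuse them; well-classified (easy) pairs receive a smaller weight.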

Accordingly, the total loss function for a domain adaptive detector over a batch of samples is

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\Big( \mathcal{L}_{det}(x_i^s) + \mathcal{L}_{adv}(x_i^s, x_i^t) + \beta\, \mathcal{L}_{h}(x_i^s, x_i^t) \Big), \tag{6}$$

where N denotes the number of samples in a mini-batch, L_det denotes the loss function of Faster R-CNN over training samples in the source domain, and L_h denotes the loss on sample hardness, defined as L_h = h. β is a regularization factor which is experimentally determined and kept fixed for all experiments.

3.2 Progressive Representation Alignment

The SGA model is defined for samples in each mini-batch. In what follows, we further propose a sampling strategy to construct mini-batches for progressive representation learning and alignment.

Self-Guided Progressive Sampling (SPS). In conventional transfer learning approaches, sample pairs are randomly selected from the source and target domains, without considering the difficulty of alignment or the sample distribution, as shown in the second row of Fig. 1. We argue that this sampling strategy is implausible, as selected samples can have very large domain distances that are difficult to align at early iterations.

Motivated by the self-paced learning paradigm [24], we propose to train the adaptation model in an "easy-to-hard" way. Sample pairs that are "easy-to-align" are selected with higher priority in the early training iterations, while harder sample pairs are selected later. To measure the alignment difficulty of training sample pairs, we reuse the average "hardness" calculated across the SGA modules.

Specifically, we define a sampling strategy by introducing an adaptive threshold τ on the "hardness". In each training iteration, sample pairs whose average "hardness" is smaller than τ are selected for model optimization. We first train the model for a pre-epoch, record the "hardness" of each iteration, and sort the values; the median "hardness" is selected as the initial τ. We then retrain the model with this initial τ. After each training epoch, τ is updated to the new median of the sorted "hardness" values recorded during the previous epoch, which means that τ keeps decreasing as the model adapts, and more samples are automatically included in training in a self-guided manner. Accordingly, the sampled loss function is defined as

$$\mathcal{L}_{total} = s \cdot \mathcal{L}, \qquad s = \begin{cases} 1, & \bar{h} \le \tau \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$
Input: input pair (x^s, x^t); initial threshold τ
for e = 1, ..., E do  (E is the number of training epochs)
       for i = 1, ..., I do  (I is the number of steps)
             Estimate hardness h by Eq. 3;
             Calculate adversarial loss by Eq. 5;
             Calculate mini-batch loss by Eq. 6;
             if h̄ ≤ τ then
                    Calculate total loss by Eq. 7;
Algorithm 1: Progressive Representation Alignment

where s determines whether or not a sample pair is selected for alignment: s = 1 if the average hardness satisfies h̄ ≤ τ, and s = 0 otherwise, where h̄ refers to the average estimated hardness across the SGA modules. In this way, easy pairs containing instances with the same categories or similar appearance are selected in the early iterations, as indicated in Fig. 3.
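The SPS bookkeeping described above (median of recorded hardness as the threshold, per-epoch updates, and the selection indicator s) can be sketched as follows (function names are ours):

```python
import numpy as np

def select_pair(avg_hardness, threshold):
    # Indicator s of Eq. 7: keep a pair whose average hardness is below τ.
    return 1 if avg_hardness <= threshold else 0

def update_threshold(recorded_hardness):
    # After each epoch, τ becomes the median of the hardness values
    # recorded during the previous epoch.
    return float(np.median(recorded_hardness))
```

Easy pairs pass the gate immediately, while harder pairs enter training only as the recorded hardness (and hence the median threshold) evolves.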

Methods          H-G  H-L  SPS | bicycle  bird   car   cat   dog  person | mAP (%)
Source-Only       -    -    -  |  69.4   47.1  39.2  33.5  21.4   58.1  |  44.78
DA-Faster [6]     -    -    -  |  75.2   40.6  48.0  31.5  20.6   60.0  |  45.98
WST-BSR [21]      -    -    -  |  75.6   45.8  49.3  34.1  30.3   64.1  |  49.87
SW-Faster [39]    -    -    -  |  82.3   55.9  46.5  32.7  35.5   66.7  |  53.27
SW-Faster*        -    -    -  |  67.2   55.8  48.1  39.1  32.4   64.4  |  51.17
Baseline-A        -    -    -  |  69.7   49.1  47.2  28.3  21.7   60.5  |  46.08
Baseline-B        -    -    -  |  66.4   53.7  43.8  37.9  31.9   65.3  |  49.83
SGA-G             ✓    -    -  |  79.5   48.4  48.6  38.1  38.3   64.2  |  52.85
SGA-L             ✓    ✓    -  |  81.2   54.9  48.7  37.9  37.8   66.0  |  54.42
SGA-S (Avg)       ✓    ✓    ✓  |  87.7   54.1  48.5  38.3  37.4   65.2  |  55.20
SGA-S (Best)      ✓    ✓    ✓  |  88.5   54.5  49.1  38.0  37.2   64.9  |  55.30
Table 1: Ablation studies and comparison on the Pascal VOC → WaterColor task. SW-Faster* denotes our reproduction of [39]. SGA-S is the complete implementation of our proposed method; Baseline-A, Baseline-B, SGA-G, and SGA-L are trained for ablation studies. H-G, H-L, and SPS denote the proposed hardness-guided adversarial loss, hardness loss, and Self-Guided Progressive Sampling, respectively.

Implementation. Based on the SGA model and the SPS strategy, we implement domain adaptive object detection within the Faster R-CNN framework, Fig. 3. Given an image from the source domain, convolutional features are first extracted by the backbone network. A region proposal network (RPN) is used to generate proposals, and ROI pooling is used to extract features for object classification and localization. The model is trained by optimizing the object detection loss in the source domain. The objective of the proposed Self-Guided Adaptation is to transfer the supervised detection model from the source to the target domain without using any annotation from the target domain. We apply three SGA modules corresponding to three stages of features, each with an independent domain classifier at a different stage of the backbone network, Fig. 3. The three SGA modules are simultaneously optimized with respect to their domain classifiers.

During each learning iteration, a mini-batch of samples is selected with the proposed SPS and used to adapt the learned detection model to the target domain by aligning feature representations of the source and target domains in a self-guided manner. The entire learning procedure is summarized in Algorithm 1.

Note that the proposed domain adaptation procedure is performed on feature maps instead of region proposals. The reason is that the MMD-based hardness constraint helps to find image pairs with similar content that contain objects from the same categories, to a certain extent. This alleviates the mismatch of representation adaptation across object categories and enables our approach to avoid relying on region proposals, which greatly simplifies the model adaptation procedure. Furthermore, directly aligning at the instance level may fail, since it is hard for the RPN to generate precise region candidates when there is a large domain shift on feature maps that are not yet well aligned.
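The multi-stage aggregation can be sketched as follows, assuming globally average-pooled stage features and a single-pair RBF kernel distance per stage (both choices, and all names, are ours):

```python
import numpy as np

def pooled(feat):
    # Global-average-pool a (C, H, W) feature map to a C-dim vector.
    return feat.reshape(feat.shape[0], -1).mean(axis=1)

def pair_hardness(v_s, v_t, sigma=1.0):
    # Single-pair squared MMD with an RBF kernel:
    # k(s,s) + k(t,t) - 2*k(s,t) = 2 - 2*k(s,t).
    k_st = np.exp(-np.sum((v_s - v_t) ** 2) / (2.0 * sigma ** 2))
    return 2.0 - 2.0 * k_st

def multi_stage_hardness(stages_s, stages_t, sigma=1.0):
    # Average the per-stage hardness values over the three backbone stages;
    # this mean is what SPS compares to the threshold τ.
    return float(np.mean([pair_hardness(pooled(fs), pooled(ft), sigma)
                          for fs, ft in zip(stages_s, stages_t)]))
```

Each stage contributes its own hardness from its own SGA module, and the mean serves as the single selection criterion for the pair.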

4 Experiments

Experiments are conducted on four domain shift tasks, including Pascal VOC [10] → WaterColor [19], Cityscapes [8] → FoggyCityscape [41], Cityscape → Detrac-Night [31], and KITTI [13] → Cityscape, which have rich variations in domain shift caused by illumination conditions, camera views, image styles, etc. We compare the proposed approach with state-of-the-art methods, and extensive ablation studies are conducted to validate the effectiveness of each proposed component.

4.1 Experiments Settings

Faster R-CNN (ResNet101 [16]-based) pre-trained on ImageNet is employed as the base detector in all experiments. While training the domain adaptation network, the inputs are a pair of images, including an annotated image from the source domain and an unannotated image from the target domain. The network is trained with a learning rate of 0.001 for the first 50,000 iterations, decreased to 0.0001 for the following iterations. All experiments are implemented using the widely used PyTorch framework [37]. Unless otherwise noted, we report both the average and the best mean Average Precision (mAP) observed from 70,000 to 100,000 iterations for evaluation and fair comparison.

Methods            H-G  H-L  SPS | bus  cycle  car  bike  prsn  rider  train  truck | mAP (%)
Source-Only         -    -    -  | 24.5  28.7  36.1  19.9  25.7  33.4  10.5  19.8  |  24.83
SW-Faster [39]      -    -    -  | 36.2  35.3  43.5  30.0  29.9  42.3  32.6  24.5  |  34.29
SW-Faster*          -    -    -  | 36.9  36.1  42.9  31.9  29.1  43.2  31.8  25.3  |  34.65
Mean-Teacher [4]    -    -    -  | 30.6  41.4  44.0  21.9  38.6  40.6  28.3  35.6  |  35.13
S-CDA [50]          -    -    -  | 33.5  38.0  48.5  26.5  39.0  23.3  28.0  33.6  |  33.80
Baseline-A          -    -    -  | 25.9  31.7  38.2  22.6  27.5  24.1  28.4  26.1  |  28.07
Baseline-B          -    -    -  | 34.7  33.9  42.4  26.4  27.8  42.1  15.7  23.5  |  30.81
SGA-G               ✓    -    -  | 43.7  32.5  44.1  25.6  29.6  43.8  32.1  23.4  |  34.35
SGA-L               ✓    ✓    -  | 46.6  33.1  43.8  22.7  30.2  44.3  37.3  26.1  |  35.51
SGA-S (Avg)         ✓    ✓    ✓  | 47.4  34.7  44.2  25.9  30.6  43.5  40.7  25.8  |  36.60
SGA-S (Best)        ✓    ✓    ✓  | 51.6  35.1  44.5  26.4  31.9  43.2  41.3  29.5  |  37.94
Table 2: Comparison of SGA with state-of-the-art methods and ablation studies on the Cityscapes → FoggyCityscape task. SW-Faster* denotes our reproduction of [39] using the ResNet101-based backbone.
 Methods         H-G  H-L  SPS | AP on car (%)
 Source-Only      -    -    -  |  42.02
 DA-Faster [6]    -    -    -  |  44.51
 SW-Faster [39]   -    -    -  |  46.43
 Baseline-A       -    -    -  |  44.12
 Baseline-B       -    -    -  |  45.35
 SGA-G            ✓    -    -  |  46.87
 SGA-L            ✓    ✓    -  |  47.88
 SGA-S (Avg)      ✓    ✓    ✓  |  48.59
 SGA-S (Best)     ✓    ✓    ✓  |  49.67
Table 3: Comparison of our SGA with state-of-the-art methods as well as ablation studies on the daytime → night-time task. DA-Faster and SW-Faster denote our reproductions of the two methods using the released code.

4.2 Domain Adaptive Object Detection

We compare the proposed SGA approach with a number of state-of-the-art works on adaptation-based object detection over four domain adaptation tasks, as listed in Tables 1-4. For each task, a Source-Only model is trained using the annotated source domain images without any adaptation. Five detection models are trained, including: (1) Baseline-A: DA-Faster R-CNN [6] with only an image-level domain classifier trained with the Cross-Entropy Loss; (2) Baseline-B: DA-Faster R-CNN with only an image-level domain classifier trained with the Focal Loss (the modulating factor is fixed at 5); (3) SGA-G: the proposed SGA with hardness-guided adaptation modules and the adversarial loss only (the first two terms in Eq. 6); (4) SGA-L: the SGA with both the hardness-guided adversarial loss and the hardness loss (all terms in Eq. 6); (5) SGA-S: the SGA incorporating all of our designed hardness-guided adversarial adaptation loss, hardness loss, and Self-Guided Progressive Sampling. Visualizations of object detection results for all domain adaptation settings are given in our supplementary document.

Natural Images to Artistic Images: In this domain adaptation task, we use the training and validation splits of Pascal VOC 2007 and 2012 [10] as the source domain dataset and WaterColor [19] as the target domain dataset. The source dataset Pascal VOC consists of around 15,000 images, while WaterColor is collected from the Behance website [48] and consists of 1,000 artistic images sharing the same 6 categories with Pascal VOC. Images in the two domains have very different styles, which brings a great challenge to domain adaptive object detection.

As Table 1 shows, the Source-Only model, trained on source domain images without any adaptation, does not perform well when applied to target domain images. For this task, our proposed SGA-S obtains a superior mAP of 55.20% by sampling images from the source and target domains progressively and training the model in a self-guided manner, outperforming Baseline-A by up to 9.12% (from 46.08% to 55.20%) and state-of-the-art methods by large margins.

Illumination Changes: Illumination changes widely exist among images collected under different conditions and in different environments. They are among the most widely observed domain shifts and often lead to a clear performance drop. We evaluate the proposed approaches on two typical illumination change scenarios, namely, normal weather → foggy weather and daytime → night-time.

For the normal weather → foggy weather task, we adopt Cityscape [8] as the source dataset and FoggyCityscape [41] as the target dataset. Cityscape images capture almost all common traffic objects, and FoggyCityscape is generated from Cityscape by adding fog noise. Both datasets consist of 2,975 training images and 500 validation images. Table 2 shows the comparison on this task: the proposed SGA-S achieves superior performance over state-of-the-art methods (mAP: 36.60% average & 37.94% best), demonstrating the effectiveness of our method in handling dramatic weather condition changes. For a fair comparison, we update the backbone of SW-Faster [39] from VGG16 [43] to ResNet101, denoted as SW-Faster* in Table 2.

Methods          H-G  H-L  SPS |  K→C  |  C→K
 Source-Only      -    -    -  | 30.22 | 53.55
 DA-Faster [6]    -    -    -  | 36.69 | 60.92
 S-CDA [50]       -    -    -  | 42.70 |   -
 Baseline-A       -    -    -  | 38.58 | 64.73
 Baseline-B       -    -    -  | 39.75 | 68.17
 SGA-G            ✓    -    -  | 39.81 | 68.92
 SGA-L            ✓    ✓    -  | 41.32 | 69.55
 SGA-S (Avg)      ✓    ✓    ✓  | 42.04 | 70.71
 SGA-S (Best)     ✓    ✓    ✓  | 43.07 | 71.43
Table 4: Comparison of SGA with state-of-the-art methods as well as ablation studies on the KITTI ↔ Cityscapes task. Following the setting in [6], we report the AP on car in both adaptation directions, i.e., K→C and C→K.

For the daytime → night-time task, Cityscape [8] is adopted as the source dataset and Detrac-Night [31] is used as the target dataset. Detrac-Night is re-sampled from the UA-Detrac [31] dataset, where images are captured at night-time at different locations. From UA-Detrac, 3,500 images are selected for training and 500 images for testing. Table 3 shows the experimental results obtained by training the model for 30,000 iterations. For the state-of-the-art methods in [6, 39], we run their released code for fair comparisons. [39] achieves very promising performance as it adopts a strong-and-weak alignment approach to capture domain-invariant features. In comparison, the proposed SGA-S achieves superior performance (mAP: 48.59% average & 49.67% best).

Domain shifts widely exist when images are collected with different cameras of different resolutions positioned at different viewpoints, even when the image style and illumination conditions are similar [45]. Following the settings in [6], we conduct experiments for adaptation between KITTI and CityScape, where images have very different resolutions and are collected by different cameras. The results are obtained by training the model for 30,000 iterations.

As Table 4 shows, the proposed SGA-S significantly outperforms DA-Faster R-CNN [6] in both adaptation directions. In addition, the region-based method [50] achieves performance comparable to our approach for the K→C adaptation. The close performance is largely attributed to the similar image styles of the two domains, where the misalignment is more about differences between corresponding image regions.

Methods  Stage-1  Stage-2  Stage-3 | mAP (Avg) | mAP (Best)
SGA-S       ✓        -        -    |   35.12   |   36.58
SGA-S       ✓        ✓        -    |   36.14   |   37.22
SGA-S       ✓        ✓        ✓    |   36.60   |   37.94
Table 5: Evaluation of the number of SGA modules on the Cityscape → FoggyCityscapes task.

4.3 Ablation Study

We perform ablation studies over the four domain adaptation tasks, as shown in Tables 1-4. Additionally, we investigate the separate contribution of each SGA module, as listed in Table 5.

Tables 1-4 show that the proposed hardness-guided adversarial loss, hardness loss, and Self-Guided Progressive Sampling consistently improve performance. By training the model in an adversarial way with respect to the estimated hardness factor, SGA-G outperforms the baselines by large margins in all four adaptation tasks. Take Pascal VOC → WaterColor as an example. Baseline-A, trained with the classic cross-entropy loss, obtains a marginal mAP improvement (+1.3%) compared with Faster R-CNN (Source-Only), which is trained on source domain images only. This indicates that directly aligning representations without considering the instantaneous domain shift is implausible. By capturing instantaneous domain shifts and assigning different losses to respective sample pairs, our SGA-G outperforms Baseline-A and Baseline-B by 6.77% and 3.02% (from 46.08% and 49.83% to 52.85%), respectively.

The hardness loss, which aims to minimize the domain shift, clearly improves detection performance across all four domain adaptation tasks. The ablation model SGA-L with the hardness loss outperforms SGA-G by 0.63% (for Cityscape → KITTI) to 1.57% (for Pascal VOC → WaterColor) in mAP. The clear performance improvements are largely attributed to the hardness factor estimated in an RKHS, which helps preserve statistical features and close the domain gap by jointly learning domain-invariant features between the source and target domains while minimizing the domain shift. The complete system SGA-S achieves the highest detection performance across all four domain adaptation tasks when we further sample image pairs from the source and target domains progressively and train the model in a self-guided manner. The performance improvement over SGA-L ranges from 0.71% (for daytime → night-time) to 1.22% (for CityScape → KITTI). This clearly demonstrates the advantage of aligning samples from easy to hard progressively in a purely self-guided manner while training adaptive detection networks.

Table 5 shows the separate contribution of each SGA module on the Cityscapes→FoggyCityscapes domain shift. We observe that performance increases gradually as the SGA modules are incorporated one by one, which backs up the effectiveness of aggregating hierarchical representations.

4.4 Model Analysis

Figure 4: Model analysis: (a) MMD-based hardness values for Baseline-A and our approaches; (b) domain confusion degree for Baseline-A and our approaches; (c) error analysis with respect to correct detections, background errors, and mislocalization, based on the 1,500 most confident detections; (d) recall versus IoU threshold based on the top-300 regions generated by the RPN.

To further explore the effectiveness of the proposed approaches, we analyze both the training procedure and the detection results from the following aspects, using the Pascal VOC→WaterColor and Cityscapes→FoggyCityscapes tasks.

Qualitative Analysis on Hardness in RKHS: We compute the mean values of the hardness estimated by Baseline-A and by the proposed methods, and plot them in Fig. 4(a). The estimated MMD-based hardness between the source and target domains keeps decreasing as the proposed approach is applied. The domain shift declines faster when the model is optimized with the “hardness” loss term. Moreover, SGA-S further stabilizes this decrease by selecting sample pairs in an “easy-to-hard” manner.
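The MMD quantity behind Fig. 4(a) can be estimated in closed form from two feature batches with a Gaussian kernel. The following minimal sketch uses a biased estimator with a fixed bandwidth `sigma`; both choices are assumptions standing in for the paper's exact RKHS formulation:

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between two feature batches
    under a Gaussian RBF kernel; usable as a "hardness" proxy.

    x: (n, d) source features; y: (m, d) target features.
    sigma is an assumed bandwidth, not a value from the paper.
    """
    def k(a, b):
        # Pairwise squared distances, then the Gaussian kernel.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())
```

Identical batches yield zero, and batches drawn from distant distributions yield a strictly positive value, which is the behavior the hardness curves in Fig. 4(a) track over training.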

Domain Confusion Degree: We define this degree to analyze the domain-confusion capacity of the proposed method. Specifically, an image from the Source (Target) domain is considered confused when the domain classifier wrongly classifies it as a Target (Source) image. In Fig. 4(b), it can be clearly observed that our approach better confuses samples from the source and target domains.
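As an illustrative sketch, such a confusion degree can be computed as the fraction of images the domain classifier assigns to the wrong domain. The 0.5 decision threshold and the function name are our assumptions, not necessarily the paper's exact criterion:

```python
import numpy as np

def confusion_degree(probs, labels):
    """Fraction of images assigned to the wrong domain.

    probs: (N,) predicted probability of the 'target' domain.
    labels: (N,) ground-truth domain, 0 = source, 1 = target.
    A source image counts as confused when probs > 0.5,
    a target image when probs <= 0.5 (threshold is an assumption).
    """
    pred = (probs > 0.5).astype(int)
    return float((pred != labels).mean())
```

A perfectly confused (i.e. fully aligned) feature space drives this value toward 0.5, while a classifier that separates the domains cleanly drives it toward 0.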

Error Analysis on Top-ranked Detections: We further diagnose the proposed approaches by analyzing the detection errors among the 1,500 top-ranked detections. Following the protocol in [6], detection errors are categorized into three types: correct, mislocalization, and background error (please refer to [6] for precise definitions of the three error types). As Fig. 4(c) shows, correct detections increase gradually as our proposed approaches are incorporated one by one. At the same time, mislocalization drops significantly when the self-guided components are incorporated.
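A minimal sketch of this categorization follows. The IoU cutoffs (0.5 for correct, 0.3 for mislocalization) reflect the convention commonly used with this protocol, but the helper names and exact thresholds here should be read as assumptions rather than the definitions in [6]:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def categorize(det, gts, hi=0.5, lo=0.3):
    """Label one detection as 'correct', 'mislocalization', or
    'background' by its best overlap with any same-class ground truth.
    Thresholds hi/lo are illustrative assumptions."""
    best = max((iou(det, g) for g in gts), default=0.0)
    if best >= hi:
        return "correct"
    if best >= lo:
        return "mislocalization"
    return "background"
```

Running this over the 1,500 top-ranked detections and counting each label reproduces the kind of breakdown plotted in Fig. 4(c).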

Recall Rate Analysis: We evaluate recall versus overlap for the 300 top-ranked proposals generated by the RPN, as shown in Fig. 4(d). The plot shows that the proposed methods contribute substantially to generating high-quality region candidates at the RPN stage. Specifically, at an IoU threshold of 0.7, SGA-S achieves 33.54% recall, outperforming Baseline-A by 5.24 points, which strongly validates the effectiveness of hierarchical feature alignment at the global feature level.
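Recall at a given IoU threshold is simply the fraction of ground-truth boxes matched by at least one proposal. The sketch below (with its own small IoU helper; all names are hypothetical) illustrates the measurement behind Fig. 4(d):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(proposals, gts, thresh=0.7):
    """Fraction of ground-truth boxes covered by at least one
    proposal with IoU >= thresh."""
    if not gts:
        return 0.0
    hit = sum(any(box_iou(p, g) >= thresh for p in proposals) for g in gts)
    return hit / len(gts)
```

Sweeping `thresh` from 0.5 to 0.9 over the top-300 RPN proposals yields the recall-versus-IoU curve reported in the analysis.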

5 Conclusion

In this paper, we propose a Self-Guided Adaptation (SGA) method that targets aligning feature representations and transferring object detection models across domains in an adversarial way while considering the instantaneous domain shift. To measure the domain shift, we design a “hardness” factor for each sample pair in each mini-batch, indicating the domain distance in a kernel space. The hardness factor is further used as a metric to select training samples and achieve progressive representation alignment. With the proposed SGA and SPS, we achieve robust and effective domain adaptive object detection, improving over state-of-the-art methods by significant margins. The research in this paper not only demonstrates the effectiveness of SGA for domain adaptive object detection, but also provides fresh insight into general UDA problems.


  • [1] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand (2014) Domain-adversarial neural networks. CoRR abs/1412.4446. Cited by: §2.
  • [2] K. M. Borgwardt, A. Gretton, M. J. Rasch, H. Kriegel, B. Schölkopf, and A. J. Smola (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22 (14), pp. e49–e57. Cited by: §3.1.
  • [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In IEEE CVPR, pp. 95–104. Cited by: §2.
  • [4] Q. Cai, Y. Pan, C. Ngo, X. Tian, L. Duan, and T. Yao (2019) Exploring object relation in mean teacher for cross-domain detection. In IEEE CVPR, pp. 11457–11466. Cited by: §1, §2, Table 2.
  • [5] C. Chen, W. Xie, W. Huang, Y. Rong, X. Ding, Y. Huang, T. Xu, and J. Huang (2019) Progressive feature alignment for unsupervised domain adaptation. In IEEE CVPR, pp. 627–636. Cited by: §2.
  • [6] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In IEEE CVPR, pp. 3339–3348. Cited by: §1, §1, §1, §2, §2, §2, Table 1, §4.2, §4.2, §4.2, §4.2, §4.4, Table 3, Table 4, footnote 2.
  • [7] S. Chopra, S. Balakrishnan, and R. Gopalan (2013) Dlid: deep learning for domain adaptation by interpolating between domains. In ICML workshop on challenges in representation learning, Vol. 2. Cited by: §2.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In IEEE CVPR, pp. 3213–3223. Cited by: §4.2, §4.2, §4.
  • [9] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In IEEE CVPR, pp. 994–1003. Cited by: §2, §4.1.
  • [10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. Int. J. Comp. Vis. 88 (2), pp. 303–338. Cited by: §4.2, §4.
  • [11] Y. Ganin and V. Lempitsky (2014) Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495. Cited by: §2, §3.1.
  • [12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky (2016) Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, pp. 59:1–59:35. Cited by: §2.
  • [13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: §4.
  • [14] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE CVPR, pp. 580–587. Cited by: §1.
  • [15] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. J. Mach. Learn. Res. 13 (Mar), pp. 723–773. Cited by: §2, §3.1, §3.1.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE CVPR, pp. 770–778. Cited by: §4.1.
  • [17] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML, pp. 1994–2003. Cited by: §2.
  • [18] H. Hsu, C. Yao, Y. Tsai, W. Hung, H. Tseng, M. Singh, and M. Yang (2019) Progressive domain adaptation for object detection. In IEEE CVPR Workshops, pp. 1–5. Cited by: §2.
  • [19] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In IEEE CVPR, pp. 5001–5009. Cited by: §4.2, §4.
  • [20] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann (2015) Self-paced curriculum learning. In AAAI, pp. 2694–2700. Cited by: §1.
  • [21] S. Kim, J. Choi, T. Kim, and C. Kim (2019) Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. In IEEE ICCV, pp. 6092–6101. Cited by: Table 1.
  • [22] T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In IEEE CVPR, pp. 12456–12465. Cited by: §2, §2.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §1.
  • [24] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In NeurIPS, pp. 1189–1197. Cited by: §3.2.
  • [25] W. Li, Z. Xu, D. Xu, D. Dai, and L. Van Gool (2017) Domain generalization and adaptation using low rank exemplar svms. IEEE Trans. Patt. Anal. and Machine Intel. 40 (5), pp. 1114–1127. Cited by: §2.
  • [26] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE ICCV, pp. 2980–2988. Cited by: §1.
  • [27] M. Liu and O. Tuzel (2016) Coupled generative adversarial networks. In NeurIPS, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 469–477. Cited by: §2.
  • [28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §1.
  • [29] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791. Cited by: §1, §2, §2.
  • [30] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In NeurIPS, pp. 136–144. Cited by: §2.
  • [31] S. Lyu, M. Chang, D. Du, W. Li, Y. Wei, M. Del Coco, P. Carcagnì, A. Schumann, B. Munjal, D. Choi, et al. (2018) UA-detrac 2018: report of avss2018 & iwt4s challenge on advanced traffic monitoring. In IEEE AVSS, pp. 1–6. Cited by: §4.2, §4.
  • [32] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto (2017) Unified deep supervised domain adaptation and generalization. In IEEE ICCV, pp. 5715–5725. Cited by: §2.
  • [33] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang (2010) Domain adaptation via transfer component analysis. IEEE Trans. Neural Net. 22 (2), pp. 199–210. Cited by: §1, §2.
  • [34] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Trans. Know. and Data Engi. 22 (10), pp. 1345–1359. Cited by: §1, §2.
  • [35] P. Panareda Busto and J. Gall (2017) Open set domain adaptation. In IEEE ICCV, pp. 754–763. Cited by: §2.
  • [36] L. Pang, Y. Wang, Y. Song, T. Huang, and Y. Tian (2018) Cross-domain adversarial feature learning for sketch re-identification. In ACMMM, pp. 609–617. Cited by: §2.
  • [37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • [38] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §1, §1.
  • [39] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2019) Strong-weak distribution alignment for adaptive object detection. In IEEE CVPR, pp. 6956–6965. Cited by: §1, §1, §1, §2, §2, §3.1, Table 1, §4.2, §4.2, Table 2, Table 3.
  • [40] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE CVPR, Cited by: §2.
  • [41] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. Int, J. Comp. Vis. 126 (9), pp. 973–992. Cited by: §4.2, §4.
  • [42] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [43] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §4.2.
  • [44] L. Song, K. Fukumizu, and A. Gretton (2013) Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine 30 (4), pp. 98–111. Cited by: §3.1.
  • [45] A. Torralba, A. A. Efros, et al. (2011) Unbiased look at dataset bias. In IEEE CVPR, Vol. 1, pp. 7. Cited by: §4.2.
  • [46] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In IEEE CVPR, pp. 4068–4076. Cited by: §1.
  • [47] X. Wang, Z. Cai, D. Gao, and N. Vasconcelos (2019) Towards universal object detection by domain attention. In IEEE CVPR, pp. 7289–7298. Cited by: §1.
  • [48] M. J. Wilber, C. Fang, H. Jin, A. Hertzmann, J. Collomosse, and S. Belongie (2017) Bam! the behance artistic media dataset for recognition beyond photography. In IEEE ICCV, pp. 1202–1211. Cited by: §4.2.
  • [49] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. In IEEE CVPR, pp. 598–607. Cited by: §2.
  • [50] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin (2019) Adapting object detectors via selective cross-domain alignment. In IEEE CVPR, pp. 687–696. Cited by: §1, §1, §1, §2, §2, §4.2, Table 2, Table 4.