Learning a Domain Classifier Bank for Unsupervised Adaptive Object Detection

by Sanli Tang, et al.

In real applications, object detectors based on deep networks still face a large domain gap between the labeled training data and the unlabeled testing data. To reduce the gap, recent techniques align the image-level or instance-level features between the source and the unlabeled target domains. However, these methods remain suboptimal, mainly because they ignore the category information of object instances. To tackle this issue, we develop a fine-grained domain alignment approach with a well-designed domain classifier bank that achieves instance-level alignment with respect to object categories. Specifically, we first employ the mean teacher paradigm to generate pseudo labels for unlabeled samples. Then we implement class-level domain classifiers and group them together into a domain classifier bank, in which each domain classifier is responsible for aligning the features of a specific class. We assemble the bare object detector with the proposed fine-grained domain alignment mechanism into an adaptive detector, and optimize it with a crossed adaptive weighting mechanism. Extensive experiments on three popular transfer benchmarks demonstrate the effectiveness of our method, which achieves new state-of-the-art results.




1 Introduction

Deep neural networks have shown great power on various tasks [10, 16, 21], but they rely heavily on large amounts of labelled data. In the real world, annotating a large-scale dataset is costly, especially for object detection tasks. Thus, training a model on a label-rich dataset (source domain) and then transferring it to unlabelled data (target domain), namely unsupervised domain adaptation (UDA), is a promising solution [27, 23]. For example, automatically annotated vehicles from a self-driving simulation system such as GTA can be used to improve vehicle detection performance in the real world.

Early research aims at shrinking the domain gap [39] by aligning the model’s activation responses to data from both the source and the target domains [7, 23, 36]. Inspired by adversarial training techniques [9, 35] in image classification, two recent methods [4, 28] directly combined bare detectors with domain classifiers to extract image-level or instance-level domain-invariant features, and achieved significant results.

Figure 1: Illustration of domain alignment. (a) shows instance-level feature alignment, in which all object instances in the source or target domain are treated as a single class. (b) shows class-level alignment, which considers the category labels of object instances. The hollow and solid circles correspond to instances in the target and source domains, respectively. The arrows represent the aligning directions.

However, the image-level domain alignment (ImDA) strategy, such as [28], takes no account of the significant differences in object number, size and even layout across domains. It can only be treated as a coarse solution with very limited effect. Though the instance-level domain alignment (InDA) strategy, such as [4], addresses these issues of ImDA, it remains suboptimal because it does not consider category information. To be specific, it is unreasonable to train domain classifiers by regarding all instances from the source/target domain as the same class, since this prevents the model from drawing a clear distinction among different object categories. As a result, the detector is easily confused. Figure 1 (a) illustrates InDA, in which each instance from the target domain is aligned to its closest instance from the source domain, which inevitably leads to misalignment between different instance categories.

Considering the above issues, a more promising way is to align instance features according to their ground-truth or pseudo labels. This means that objects detected with higher confidence in the target domain should receive more attention when aligning instance features with respect to their categories. For example, in Figure 1 (b), the target objects predicted as class-2 (brown hollow circles) should be aligned to the class-2 objects in the source domain (brown solid circles). We refer to this as class-level domain alignment. To achieve class-level alignment for all classes, a group of domain classifiers can be established as a domain classifier bank, in which each classifier takes charge of aligning the features of a specific class. In this way, the closest pairs of instances between the source and target domains are objects that share the same class from the perspective of the detector. Note that the predictions of the domain classifiers also reveal the effect of feature alignment: the better the alignment, the more confused the predictions of the domain classifiers, as addressed in [37]. Therefore, for well-aligned features, the pseudo labels can be given more weight when training the detector on unlabeled data from the target domain, which further enhances detection performance.

In this paper, we propose a fine-grained unsupervised domain adaptation method for object detection, which consists of a domain classifier bank integrated with a teacher-student framework. Concretely, we group the class-level domain classifiers together to form a bank, named DCBank, in which each classifier is responsible for aligning the features of a specific class between the source and target domains. Since images in the target domain are unlabeled, mean teacher [33] is employed to provide pseudo labels, i.e. the locations and classes of the objects. The generated pseudo labels are used to train the DCBank. We integrate the bare object detector, mean teacher and the DCBank into an unsupervised adaptive detection framework named MDBank, and optimize it with a crossed adaptive weighting mechanism. The crossed adaptive weights are calculated from the prediction confidence of the detector and the entropy of the DCBank, and they improve both the detector and the DCBank.

The contributions are summarized as follows: (1) We address the class-level domain adaptation problem, and develop the domain classifier bank mechanism to align instance-level features according to their categories. (2) We assemble a bare detector with mean teacher as well as the designed DCBank into an adaptive detection framework. The whole framework is jointly optimized with a crossed weighting strategy which can improve both the detector and DCBank. (3) Extensive experiments on three popular datasets demonstrate the effectiveness of our method.

2 Related works

2.1 Object Detection

Object detectors based on deep neural networks can be roughly divided into two categories: two-stage and one-stage. Faster R-CNN [26] is a representative two-stage detector, in which a Region Proposal Network (RPN) provides object proposals, i.e. coarse bounding boxes and the probabilities of their belonging to the foreground. The cropped and resized features are then fed into a classifier in the second stage to predict their categories and refined locations. A series of improvements [3, 11, 17] based on Faster R-CNN have also been explored to further boost performance. Among one-stage detectors, YOLO [24] directly regresses the bounding boxes and the per-category confidences, achieving competitive performance with high efficiency. SSD [20] aims to increase the detection rate of objects at different scales, especially small ones, by predicting from multiple feature maps at different resolutions. After that, [19, 25, 34] further advanced one-stage detectors by revising the network structure or applying delicate training techniques.

2.2 Domain Adaptation

Many studies [7, 30, 36] on domain adaptation strive to bridge the gap between the source and target domains. Earlier works tried to minimize a statistically defined discrepancy between the two domains, e.g. the Maximum Mean Discrepancy (MMD) [1, 22, 38] or the CORAL distance [31, 32]. Recent methods [35] based on adversarial training align the feature distributions by fooling domain classifiers that are trained to distinguish image features from different domains. [37] proposed transferable attention, which assigns different weights to feature maps according to the prediction confidence of the domain classifiers. Derived from Mean Teacher [33] in semi-supervised learning, self-ensembling [7] was proposed to extract domain-invariant features by minimizing the discrepancy between the outputs of the teacher and the student under augmented inputs. All of the above methods were examined on image classification tasks.

Recently, researchers have started to pay attention to domain adaptation in object detection. In general, existing methods can be summarized as three types of domain alignment: input-level, feature-level and output-level. (1) Input-level alignment techniques usually adopt generative models to directly transfer input images from the source domain to the target domain while keeping the labels unchanged [13]. The generated labeled images can then be used to train a detector in a fully supervised manner. (2) For feature-level alignment, [4] aligned both the top backbone features and the instance features by adversarial training, where the instance-level domain classifier treats all object features from the same domain as a single class. Strong-weak domain adaptation (SWDA) [28] argued that precisely matching global features is likely to hurt performance when the domain gap is large, and adopted a weak image-level domain classifier based on focal loss [19] to align features. (3) For output-level alignment, mean teacher with object relation (MTOR) [2] applied three kinds of consistency regularization based on two relational graphs in the teacher and student networks, which showed the promise of the self-ensembling framework for UDA object detection. [15] addressed the UDA detection problem by training a detector on the target domain with noisy object bounding boxes. Here, we focus on feature-level domain alignment.

Unlike previous feature-level alignment methods, which align domain features regardless of their classes, in this paper we try to achieve fine-grained instance-level domain alignment with respect to instance categories.

3 Preliminary Work

We build the proposed framework MDBank based on Faster R-CNN and mean teacher, as illustrated in Figure 2. Notice that we select Faster R-CNN [26] as the bare detector for fair comparison with previous methods [2, 4, 28].

Faster R-CNN detector. Faster R-CNN [26] is a two-stage detector, which consists of a feature extractor backbone, a region proposal network (RPN) and a region convolutional neural network (RCNN). For an input image, the backbone first computes the image-level features, and the RPN produces the object proposals. The instance-level features are then obtained according to the proposals, and the RCNN predicts the bounding boxes b and the category probabilities p.

Mean Teacher in Faster R-CNN. Mean teacher [33] is used to provide relatively robust pseudo labels for unlabelled samples. The teacher has the same network structure as the student. At each iteration, its parameters are updated as a moving average of the student’s parameters, where a moving average factor controls the update speed of the teacher. Following [2], the proposals from the teacher are fed into the RCNN of both the student and the teacher detectors.
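The moving average update of the teacher can be sketched as follows. This is a minimal illustration that assumes the standard exponential moving average form used by mean teacher; the function name `ema_update`, the dictionary representation of parameters and the value of `alpha` are illustrative assumptions, not values taken from the paper.

```python
def ema_update(teacher, student, alpha):
    """Return updated teacher parameters as a moving average.

    teacher, student: dicts mapping parameter names to floats
    (hypothetical stand-ins for network weights).
    alpha: moving average factor in [0, 1]; larger values make the
    teacher change more slowly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Illustrative values only; the factor used in the paper is not shown here.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
teacher = ema_update(teacher, student, alpha=0.5)
print(teacher)
```

With a factor close to 1 the teacher becomes a slowly moving ensemble of past student weights, which is what makes its pseudo labels comparatively stable.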

Figure 2: The Faster R-CNN with mean teacher framework. For data from the source domain, it is trained in a supervised routine by minimizing the detecting objectives in Faster R-CNN [26]. For unlabeled data from the target domain, it is trained by optimizing the consistency regularization between the teacher’s and the student’s prediction. Teacher detector shares its proposals with the student when training on the unlabeled data.

For the labeled data from the source domain, the normal supervised routine in [26] is applied to train the student detector by minimizing the supervised detection objective. For the unlabeled data from the target domain, the teacher model is used to obtain object proposals and their pseudo labels. The augmented input is then fed into the student detector along with the teacher’s proposals to obtain the student’s predictions. Thus, the consistency regularization of mean teacher (Equation 1) combines two terms: the consistency objectives for classification and for bounding box regression between the teacher and the student detector.
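As a rough sketch of this consistency regularization, the snippet below sums a classification term and a box regression term over proposals shared by the teacher and the student. It assumes a mean squared error for both terms and an unweighted sum; the names `mse` and `consistency_loss` are hypothetical.

```python
def mse(u, v):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def consistency_loss(teacher_preds, student_preds):
    """Consistency between teacher and student on shared proposals.

    Each prediction is a (class_probs, box) pair; entry i of both
    lists refers to the same proposal, because the teacher shares
    its proposals with the student.
    """
    n = len(teacher_preds)
    cls_term = sum(
        mse(t_cls, s_cls)
        for (t_cls, _), (s_cls, _) in zip(teacher_preds, student_preds)
    ) / n
    reg_term = sum(
        mse(t_box, s_box)
        for (_, t_box), (_, s_box) in zip(teacher_preds, student_preds)
    ) / n
    return cls_term + reg_term


teacher_out = [([1.0, 0.0], [0.0, 0.0, 1.0, 1.0])]
student_out = [([0.0, 1.0], [0.0, 0.0, 1.0, 1.0])]
print(consistency_loss(teacher_out, student_out))
```

Because both networks see the same regions, the loss compares predictions for the same objects rather than for independently proposed boxes.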

4 Methodology

Figure 3: The architecture of MDBank for UDA detection tasks. Based on the Faster R-CNN detector and mean teacher, the teacher detector shares its proposals with the student to further align the instance-level features in the same regions. DCBank consists of a group of domain classifiers that perform class-level feature alignment in an adversarial learning manner. The crossed adaptive weighting mechanism is applied to the instance-level features, linking the consistency regularization in mean teacher and the adversarial objective in DCBank. The confidence from the second stage of the teacher detector is used as a gate function for training the DCBank module, while the entropies from the DCBank weight the consistency objective of the different categories. The labels ’w-ent’ and ’w-mse’ abbreviate weighted cross-entropy and weighted mean square error, respectively.

In this section, we describe the proposed MDBank framework, shown in Figure 3, in detail.

4.1 Domain Classifier Bank

The existing adaptive detection methods [4, 28] used a single domain classifier to align instance-level features of different labels, which might prevent the detectors from distinguishing their categories. To align the features according to their categories, a group of domain classifiers are established and each of them is responsible for aligning features of a specific class.

Formally, we establish the domain classifier bank (DCBank) as a set of domain classifiers, one for each object category plus one for the background category. The classifiers do not share parameters with each other. They are trained to distinguish whether an instance-level feature comes from the source domain or the target domain. For an instance-level feature from the source domain with a known category label, only the domain classifier of that category in the DCBank is activated to align the region features. For an instance-level feature from the target domain, whose category is unknown, the prediction of the teacher model serves as the pseudo label that activates the specific domain classifiers to align the region features, as illustrated in Figure 3. For example, the category with the maximum predicted confidence can be regarded as the pseudo label, so that only the corresponding domain classifier is used for instance-level feature alignment. A softer and more robust way is to select several domain classifiers simultaneously for domain alignment according to the uncertainty of the teacher model (detailed in Section 4.2).

According to the labels of the instance-level features, the domain classifiers are trained by minimizing a domain classification objective, in which an activation function decides which domain classifier is trained. The instance-level and class-level feature alignment is then achieved by adversarial training.


That is, the domain classifiers in DCBank try to distinguish the domain label of the instance-level features conditioned on their ground-truth or pseudo labels, while the feature extractors are trained to generate domain-invariant features that fool those classifiers. Following the Gradient Reversal Layer (GRL) [8], in which the signs of the backward gradients are flipped, the adversarial loss in Equation 3 can be easily implemented by adding a GRL on the instance-level features before the domain classifier bank module.
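The behaviour of a gradient reversal layer can be sketched without any deep learning framework. The class below is a hand-written illustration with explicit `forward` and `backward` methods; in practice this would be expressed through the automatic differentiation mechanism of the framework in use, and the `scale` argument is an assumption added for illustration.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # Hypothetical scaling factor applied to the reversed gradient.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged to the domain classifiers.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is negated,
        # so minimizing the domain loss in the classifiers maximizes it
        # with respect to the features.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal()
print(grl.forward([0.3, -1.2]))
print(grl.backward([0.5, -1.0]))
```

Placing such a layer between the instance-level features and the classifier bank lets a single backward pass train the classifiers and the feature extractor toward opposite goals.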

Since the DCBank module is deployed to align instance-level features, which are much smaller than the image-level features, it adds only a little storage and computation during training. At test time, it can be omitted entirely at no additional cost.

4.2 Crossed Adaptive Weighting

Moreover, the domain classifiers in DCBank can be trained in a more robust manner by incorporating the prediction confidence of the teacher model, so that a soft gate function can be used to weight the domain classifiers. Here, we define two types of gate function: a hard gate, which only activates the domain classifier at the index of the maximum prediction score, and a soft gate, which activates all the domain classifiers according to their confidence scores. Since the prediction of the teacher model might be incorrect, DCBank with the soft gate function is expected to be more robust than with the hard gate function.
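A minimal sketch of the two gate functions and of a gated domain classification loss is given below. It assumes that the soft gate is simply the teacher's class probabilities, that each domain classifier outputs the probability of the source domain, and that the loss is a gated binary cross-entropy; all function names are hypothetical.

```python
import math


def hard_gate(class_probs):
    """One-hot gate: only the classifier of the most confident class is active."""
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    return [1.0 if k == best else 0.0 for k in range(len(class_probs))]


def soft_gate(class_probs):
    """Soft gate: every classifier is active in proportion to its confidence."""
    return list(class_probs)


def gated_domain_loss(gate, domain_scores, is_source):
    """Gated binary cross-entropy over the classifier bank for one instance.

    domain_scores[k] is classifier k's predicted probability that the
    instance feature comes from the source domain.
    """
    eps = 1e-12
    loss = 0.0
    for g, d in zip(gate, domain_scores):
        p_correct = d if is_source else 1.0 - d
        loss += -g * math.log(max(p_correct, eps))
    return loss


teacher_probs = [0.1, 0.7, 0.2]      # illustrative teacher confidences
domain_scores = [0.9, 0.6, 0.5]      # illustrative classifier outputs
print(gated_domain_loss(hard_gate(teacher_probs), domain_scores, is_source=False))
print(gated_domain_loss(soft_gate(teacher_probs), domain_scores, is_source=False))
```

With the hard gate a wrong pseudo label sends the whole gradient to the wrong classifier, whereas the soft gate spreads it according to the teacher's uncertainty.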

Meanwhile, the prediction of DCBank reveals how well the current instance-level feature is aligned. The better the feature alignment, the more consistency should be enforced between the teacher and the student. Since the domain classifier bank plays the role of a discriminator in adversarial training, the entropy of its prediction can be used to weight the consistency regularization. Formally, the entropy of the domain classifier bank is calculated from the output scores of the DCBank, that is, from the output of each domain classifier on each region-level feature.


Therefore, the consistency regularization in Equation 1 can be rewritten so that the teacher-student discrepancy is weighted by this entropy through an element-wise product. For simplicity, we use the same norm to measure both the classification and the bounding box regression discrepancy between the teacher and the student detector. In Figure 3, we illustrate the adaptive weighting mechanism between the consistency regularization and the adversarial objective.
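The entropy weighting can be sketched as follows. The snippet assumes binary (source versus target) outputs from each domain classifier, a squared error as the discrepancy, and a normalization of the entropy by ln 2 so that weights lie in [0, 1]; the normalization and all names are assumptions made for illustration.

```python
import math


def binary_entropy(d):
    """Entropy (in nats) of a binary domain prediction with source probability d."""
    eps = 1e-12
    d = min(max(d, eps), 1.0 - eps)
    return -(d * math.log(d) + (1.0 - d) * math.log(1.0 - d))


def entropy_weights(domain_scores):
    """Per-class weights in [0, 1]; close to 1 when a classifier is confused."""
    return [binary_entropy(d) / math.log(2.0) for d in domain_scores]


def weighted_consistency(weights, teacher_probs, student_probs):
    """Element-wise weighted squared discrepancy, averaged over classes."""
    terms = [
        w * (t - s) ** 2
        for w, t, s in zip(weights, teacher_probs, student_probs)
    ]
    return sum(terms) / len(terms)


# A confused classifier (0.5) gives a weight near 1; a confident one near 0.
weights = entropy_weights([0.5, 0.999999])
print(weights)
print(weighted_consistency(weights, [0.8, 0.2], [0.6, 0.4]))
```

A domain classifier that can no longer tell the domains apart signals well-aligned features, so the corresponding pseudo labels contribute more to the consistency term.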

4.3 Overall Objective

The overall objective consists of three parts: the normal supervised objective of Faster R-CNN, the weighted instance-level consistency regularization from mean teacher, and the adversarial objective from DCBank. Two hyper-parameters balance these parts: one controls the weight of training on the unlabelled data from the target domain, and the other is the trade-off between the output-level alignment from the teacher detector and the class-level alignment from DCBank.

5 Experiments

We evaluate the proposed method on three public domain shift benchmarks: SIM10K [14] to CityScapes [5], CityScapes to Foggy CityScapes [29] and PASCAL VOC [6] to Clipart [13], which represent three scenarios: synthetic to real, normal to foggy weather and photographic to comic, respectively.

5.1 Dataset

SIM10K is a synthetic dataset containing 10,000 training images collected from the driving game Grand Theft Auto V (GTA5), with bounding box annotations only for cars. CityScapes is an urban street dataset whose images are captured by a car-mounted camera. Since these images are annotated at the pixel level for semantic segmentation, following [4, 28], we generate the tightest axis-aligned rectangles from the instance segmentation masks as the bounding box labels. Foggy CityScapes is built on CityScapes by synthetically rendering fog onto the images according to the depth map, with each image rendered at three fog levels. Following [2, 28], the images with the heaviest fog are used in our experiments. PASCAL VOC contains images of 20 categories with bounding box annotations. Following the common evaluation protocol, we use both the training and validation splits of PASCAL VOC 2007 and 2012 for training, which yields about 15k images. Clipart contains 1k comic images sharing the same categories as PASCAL VOC. Following the setting in [28], all images are used for both unsupervised training and testing.
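Deriving the tightest axis-aligned rectangle from an instance mask, as done for the CityScapes labels, can be sketched in a few lines. The mask is assumed to be a non-empty rectangular 2D list of 0/1 values and the box uses inclusive pixel coordinates; the function name `mask_to_bbox` is hypothetical.

```python
def mask_to_bbox(mask):
    """Tightest axis-aligned box around the nonzero pixels of a binary mask.

    mask: 2D list (rows of 0/1 values).
    Returns (x_min, y_min, x_max, y_max) with inclusive coordinates,
    or None if the mask contains no foreground pixel.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    width = len(mask[0])
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1], rows[-1])


instance_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(mask_to_bbox(instance_mask))  # (1, 1, 2, 2)
```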

5.2 Implementation Details

The backbone network of Faster R-CNN is a ResNet-50 [12] with a Feature Pyramid Network (FPN) [18]. In the training stage, each image is resized with its short side in the range (960, 1440) pixels while keeping the aspect ratio unchanged. In particular, images for the student model receive additional augmentation by randomly adjusting the image contrast, saturation and brightness. Note that these data augmentations keep the objects’ categories and bounding boxes unchanged. The proposed model is trained on 2 GPUs with a batch size of 2. We follow the hyper-parameter setting of the base Faster R-CNN to train on images from the source domain. In the target domain, since the labels are unknown, we select the top 512 proposals with the highest confidence scores produced by the RPN. These instance-level features are then aligned in the proposed DCBank module. The moving average weight for updating the teacher model is set to a fixed default value.
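Selecting the most confident proposals for the target domain amounts to a sort and a cut. The sketch below assumes each proposal is a (box, score) pair; the function name `select_top_proposals` is hypothetical, while the default of 512 follows the text above.

```python
def select_top_proposals(proposals, k=512):
    """Keep the k proposals with the highest confidence scores.

    proposals: list of (box, score) pairs, where box is any object
    describing the region and score is the RPN confidence.
    """
    ranked = sorted(proposals, key=lambda item: item[1], reverse=True)
    return ranked[:k]


rpn_output = [("box_a", 0.2), ("box_b", 0.9), ("box_c", 0.5)]
print(select_top_proposals(rpn_output, k=2))
```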

5.3 Declaration for Fair Comparison

We verify our method by considering the following settings:

  • The original Faster R-CNN model trained on the source domain without any adaptation is treated as our baseline, denoted as Faster.

  • To evaluate the effectiveness of DCBank, we replace the DCBank with a single instance-level domain classifier without class-level alignment (the single-classifier variant).

  • To evaluate the effectiveness of the crossed adaptive weighting mechanism, we remove the entropy weighting of the consistency regularization and replace the confidence weighting of the adversarial objective with the hard-label gate function in Equation 4 (the hard-gate variant).

  • The model trained directly on the target domain in a fully supervised manner is denoted as Oracle.

The performance of Faster (source only) and Oracle (target only), both trained in a fully supervised manner, can be regarded as the lower and upper bounds without any domain adaptation strategy.

We also compare MDBank with the current state of the art: (1) DA [4], which uses image-level and instance-level domain classifiers as well as a consistency regularizer, (2) SWDA [28], which adopts strong local and weak global feature alignment as well as a context-vector based regularization, and (3) MTOR [2], which incorporates instance-level, inter-graph and intra-graph consistency based on a relationship graph without domain classifiers. For a fair comparison, we re-implement DA, SWDA and MTOR with region-level consistency using the same ResNet-50 with FPN structure.

5.4 Cross Domain Detection

method G I C AW person rider car truck bus train mcycle bicycle mAP
Faster 30.6 41.5 40.2 6.2 38.3 48.9 7.1 13.8 28.3
DA [4] 39.4 48.1 48.8 31.0 42.9 54.9 7.8 18.1 36.4
SWDA [28] 45.8 49.2 56.2 31.1 47.0 57.5 11.2 21.9 40.0
MTOR [2] 40.4 49.7 57.6 30.1 47.9 58.6 16.9 27.1 41.0
MDBank(single-classifier) 46.1 48.4 54.2 30.8 45.9 56.4 20.9 25.9 41.1
MDBank(hard-gate) 45.8 49.9 57.2 32.9 49.3 59.1 21.0 29.1 43.0
MDBank 44.3 50.0 58.4 34.9 48.7 59.1 26.1 28.7 43.8
Oracle 46.4 54.0 65.7 41.3 54.6 64.8 34.4 30.5 48.9
Table 1: Experimental results on the CityScapes to Foggy CityScapes transfer. The mean average precision (mAP) is evaluated on 8 categories on the Foggy CityScapes validation set. The notations G, I, C and AW indicate global-level (image-level) alignment, instance-level alignment, class-level alignment and adaptive weighting, respectively.

5.4.1 Normal to foggy weather images

The comparison results for the normal to foggy weather transfer (CityScapes to Foggy CityScapes) are summarized in Table 1, which lists the AP of 8 common objects in the urban street scenario and their mean AP (mAP). Our MDBank achieves the best mAP of 43.8%, a margin of 2.8% over the best competing method, MTOR [2]. Since the foggy images in the target domain are rendered directly from the source domain, the domain gap between them is smaller than in the other scenarios. Naturally, the teacher model can provide more convincing predictions on unlabeled images. Thus MDBank with hard pseudo labels performs similarly to MDBank with soft labels (43.0% vs. 43.8% mAP). We also note that MTOR and the variant with a single instance-level domain classifier achieve similar performance (41.0% vs. 41.1% mAP), which indicates that applying a single domain classifier to simply align instance-level features has limited effect for adaptive object detection. In summary, with the help of DCBank, our MDBank surpasses MTOR and the single-classifier variant by around 2.7% mAP. In Figure 5, we illustrate the detection results on Foggy CityScapes.
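The mAP values and the reported margin can be checked directly from the per-class AP values in Table 1. The short script below recomputes them for three rows; the helper name `mean_ap` is hypothetical, and the numbers are copied from the table.

```python
# Per-class AP from Table 1, in the order: person, rider, car, truck,
# bus, train, mcycle, bicycle.
ap = {
    "Faster": [30.6, 41.5, 40.2, 6.2, 38.3, 48.9, 7.1, 13.8],
    "MTOR": [40.4, 49.7, 57.6, 30.1, 47.9, 58.6, 16.9, 27.1],
    "MDBank": [44.3, 50.0, 58.4, 34.9, 48.7, 59.1, 26.1, 28.7],
}


def mean_ap(values):
    """Mean of the per-class AP values, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


for method, values in ap.items():
    print(method, mean_ap(values))

# Margin of MDBank over MTOR, as quoted in the text.
print(round(mean_ap(ap["MDBank"]) - mean_ap(ap["MTOR"]), 1))  # 2.8
```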

method G I C AW car AP on target
Faster 34.9
DA [4] 43.1
SWDA [28] 47.8
MTOR [2] 54.9
MDBank 56.3
Oracle 65.9
Table 2: Experimental results on the SIM10K to CityScapes transfer. The average precision (AP) is evaluated on the car category. The notations G, I, C and AW follow Table 1.

5.4.2 Synthetic to real images

We evaluate the performance of MDBank on the synthetic (SIM10K) to real (CityScapes) transfer. Table 2 shows the average precision (AP) on the car category on the CityScapes validation set. Since there is only one category (car) to be detected, DCBank consists of only two domain classifiers, and MDBank performs close to its single-classifier and hard-gate variants. Owing to the class-level feature alignment by DCBank, our MDBank still outperforms MTOR by 1.4% and achieves the best AP of 56.3%.

5.4.3 Photographic to comic images

In this experiment, we analyze our method on the transfer from real (PASCAL VOC) to artistic (Clipart) images. The evaluation results are shown in Table 3, with the AP of 20 common objects and their mAP. Though the source and target domains are very dissimilar, as illustrated in Figure 5, our MDBank still achieves the best performance of 45.4% mAP, an improvement of about 3% over the best competing result, achieved by MTOR. Note that the single-classifier variant performs even worse than MTOR (41.5% vs. 42.5% mAP), which suggests that simple instance-level alignment that ignores category information may have side effects as the number of categories grows. MDBank surpasses the hard-gate variant by 1.5% (45.4% vs. 43.9% mAP), which verifies the effectiveness of the crossed adaptive weighting strategy. Since all images are used for unsupervised adaptive training [28], the Oracle result is for reference only: it is trained in a fully supervised manner on the same images that are used for evaluation.

method Faster[26] DA[4] SWDA[28] MTOR[2] MDBank(single-classifier) MDBank(hard-gate) MDBank Oracle
aero 18.6 12.7 37.0 39.9 33.5 40.2 42.5 62.5
bcycle 34.6 40.3 67.2 74.5 67.1 76.0 70.4 77.8
bird 17.1 21.4 30.6 22.8 24.8 26.3 37.8 77.0
boat 11.2 16.2 28.3 39.4 42.7 39.4 38.7 60.5
bottle 23.8 35.7 44.6 45.8 51.3 38.5 48.8 61.1
bus 44.3 29.3 65.0 58.8 55.9 69.1 58.0 86.6
car 23.4 29.6 41.9 56.3 56.2 59.1 57.7 81.8
cat 11.3 0.6 13.1 13.0 13.2 9.7 19.5 67.1
chair 35.5 29.6 52.2 56.9 58.0 58.2 52.9 77.9
cow 5.2 39.1 42.7 18.6 31.2 36.9 30.6 76.9
table 22.8 18.4 22.3 36.7 34.9 31.6 36.5 77.4
dog 6.0 10.9 7.2 5.8 13.9 10.8 15.3 75.1
horse 21.2 19.7 26.9 29.4 30.0 32.2 38.0 74.8
bike 40.8 61.5 76.3 79.0 58.9 80.2 85.1 87.2
person 29.0 54.1 53.1 58.7 63.9 65.8 66.3 85.5
plant 35.4 41.2 55.0 64.5 57.1 56.9 57.1 73.2
sheep 0.4 16.9 11.2 11.4 8.9 18.3 17.3 76.6
sofa 18.2 16.5 24.1 19.9 25.6 24.8 17.2 69.0
train 24.9 16.3 48.9 56.3 45.1 41.9 58.5 70.4
tv 22.8 33.7 51.2 62.0 58.6 61.2 59.4 81.3
mAP 22.3 27.2 40.0 42.5 41.5 43.9 45.4 75.0
Table 3: Experimental results on the dissimilar domain transfer from PASCAL VOC to Clipart. The mean average precision (mAP) is evaluated over 20 categories on all 1k images in Clipart. The mAP of Oracle is for reference only, since all the images from the target domain are used for training following the setting in [28]. The table is transposed for better visualization.

Figure 4: The effect of the two trade-off parameters on the CityScapes to Foggy CityScapes transfer.

5.4.4 Analysis of the trade-off hyper-parameters

One hyper-parameter controls the trade-off between what the student detector learns from the source domain and from the target domain, while the other weights the objective of the DCBank module. Figure 4 shows the mAP under different values of each hyper-parameter. We find that MDBank is relatively robust to the trade-off parameters over a wide range, and it reaches up to 44.1% mAP.

5.4.5 Visualization on the target domain

In Figure 5, we illustrate examples of detection results by the proposed MDBank on the target domain. The detection results show that MDBank is relatively robust to both similar and dissimilar domain transfer. In Figure 6, we show the instance-level feature distributions of the proposed MDBank and DA [4]. Note that DA applies instance-level domain alignment without considering categories. Though MDBank and DA achieve similar alignment with respect to the domain labels, as in Figure 6 (a) and (c), our MDBank shows more distinguishable boundaries among instances of different categories, as illustrated in Figure 6 (d). Since this task has many more instance categories than the previous transfer tasks, MDBank outperforms DA by a large margin by applying class-level feature alignment.

Figure 5: Examples of detection results from the proposed MDBank on the target domain. From left to right: PASCAL VOC to Clipart, SIM10K to CityScapes and CityScapes to Foggy CityScapes transfer. For clarity, the confidence score is omitted on images from CityScapes and Foggy CityScapes. Note that ’car’ is the only object to be detected in the SIM10K to CityScapes transfer.
(a) DA/domain
(b) DA/category
(c) MDBank/domain
(d) MDBank/category
Figure 6: Instance-level feature distributions on the PASCAL VOC to Clipart transfer. Features belonging to the first 10 classes in Table 3 are shown for better visualization. Here we compare the proposed MDBank and DA [4]. (a) and (c) show the instance-level feature distribution, with red/blue representing instances from the source/target domain. (b) and (d) show their category labels in different colors. Although the instance-level features are well aligned in both MDBank and DA, as seen in (a) and (c), MDBank achieves more accurate class-level alignment, as seen in (b) and (d).

6 Conclusions

In this paper, we present the MDBank framework, which combines a mean teacher with a domain classifier bank for unsupervised domain adaptive object detection. Our key contribution is the domain classifier bank module, which aligns the instance-level features according to their category labels. To align the unlabelled data from the target domain, a mean teacher paradigm is incorporated to provide robust pseudo labels while enforcing instance-level prediction consistency between the teacher and the student detector. In addition, a crossed weighting mechanism is proposed to adaptively connect the DCBank and the mean teacher and boost the performance of both. Experiments show that MDBank achieves new state-of-the-art results for unsupervised domain adaptive detection on the CityScapes, Foggy CityScapes, SIM10K, PASCAL VOC and Clipart datasets.


  • [1] K. M. Borgwardt, A. Gretton, M. J. Rasch, H. Kriegel, B. Schölkopf, and A. J. Smola (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22 (14), pp. e49–e57. Cited by: §2.2.
  • [2] Q. Cai, Y. Pan, C. Ngo, X. Tian, L. Duan, and T. Yao (2019) Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11457–11466. Cited by: §2.2, §3, §3, §5.1, §5.3, §5.4.1, Table 1, Table 2, Table 3.
  • [3] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §2.1.
  • [4] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3339–3348. Cited by: §1, §1, §2.2, §3, §4.1, Figure 6, §5.1, §5.3, §5.4.5, Table 1, Table 2, Table 3.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §5.
  • [6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §5.
  • [7] G. French, M. Mackiewicz, and M. Fisher (2017) Self-ensembling for domain adaptation. arXiv preprint arXiv:1706.05208. Cited by: §1, §2.2.
  • [8] Y. Ganin and V. Lempitsky (2014) Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495. Cited by: §4.1.
  • [9] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §1.
  • [10] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1.
  • [11] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask r-cnn. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. Cited by: §2.1.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.2.
  • [13] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5001–5009. Cited by: §2.2, §5.
  • [14] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan (2016) Driving in the matrix: can virtual worlds replace human-generated annotations for real world tasks?. arXiv preprint arXiv:1610.01983. Cited by: §5.
  • [15] M. Khodabandeh, A. Vahdat, M. Ranjbar, and W. G. Macready (2019) A robust learning approach to domain adaptive object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 480–490. Cited by: §2.2.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [17] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2017) Light-head r-cnn: in defense of two-stage object detector. arXiv preprint arXiv:1711.07264. Cited by: §2.1.
  • [18] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §5.2.
  • [19] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §2.1, §2.2.
  • [20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2.1.
  • [21] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
  • [22] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2208–2217. Cited by: §2.2.
  • [23] A. Raj, V. P. Namboodiri, and T. Tuytelaars (2015) Subspace alignment based domain adaptation for rcnn detector. arXiv preprint arXiv:1507.05578. Cited by: §1, §1.
  • [24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §2.1.
  • [25] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §2.1.
  • [26] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.1, Figure 2, §3, §3, §3, Table 3.
  • [27] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In European conference on computer vision, pp. 213–226. Cited by: §1.
  • [28] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2019) Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6956–6965. Cited by: §1, §1, §2.2, §3, §4.1, §5.1, §5.3, §5.4.3, Table 1, Table 2, Table 3.
  • [29] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126 (9), pp. 973–992. Cited by: §5.
  • [30] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2018) Generate to adapt: aligning domains using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8503–8512. Cited by: §2.2.
  • [31] B. Sun, J. Feng, and K. Saenko (2016) Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §2.2.
  • [32] B. Sun and K. Saenko (2016) Deep coral: correlation alignment for deep domain adaptation. In European conference on computer vision, pp. 443–450. Cited by: §2.2.
  • [33] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS. Cited by: §1, §2.2, §3.
  • [34] Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9627–9636. Cited by: §2.1.
  • [35] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176. Cited by: §1, §2.2.
  • [36] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474. Cited by: §1, §2.2.
  • [37] X. Wang, L. Li, W. Ye, M. Long, and J. Wang (2019) Transferable attention for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5345–5352. Cited by: §1, §2.2.
  • [38] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo (2017) Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2272–2281. Cited by: §2.2.
  • [39] T. Yao, C. Ngo, and S. Zhu (2012) Predicting domain adaptivity: redo or recycle?. In Proceedings of the 20th ACM international conference on Multimedia, pp. 821–824. Cited by: §1.