Adaptive Object Detection with Dual Multi-Label Prediction

by Zhen Zhao, et al.

In this paper, we propose a novel end-to-end unsupervised deep domain adaptation model for adaptive object detection by exploiting multi-label object recognition as a dual auxiliary task. The model exploits multi-label prediction to reveal the object category information in each image and then uses the prediction results to perform conditional adversarial global feature alignment, such that the multi-modal structure of image features can be tackled to bridge the domain divergence at the global feature level while preserving the discriminability of the features. Moreover, we introduce a prediction consistency regularization mechanism to assist object detection, which uses the multi-label prediction results as auxiliary regularization information to ensure consistent object category discoveries between the object recognition task and the object detection task. Experiments are conducted on several benchmark datasets, and the results show that the proposed model outperforms the state-of-the-art comparison methods.






1 Introduction

The success of deep learning models has led to great advances in many computer vision tasks, including image classification [33, 34, 15], image segmentation [23, 41] and object detection [11, 28, 22, 27]. The smooth deployment of deep models typically assumes a standard supervised learning setting, where a sufficient amount of labeled data is available for model training and the training and test images come from the same data source and distribution. However, in practical applications, the training and test images can come from different domains that exhibit obvious deviations. For example, Figure 1 shows images from domains with different image styles, which present clearly different visual appearances and data distributions. The violation of the i.i.d. sampling principle across training and test data prevents effective deployment of supervised learning techniques, while acquiring new labeled data in each test domain is costly and impractical. To address this problem, unsupervised domain adaptation has recently received increasing attention [9, 37, 24, 3].

Figure 1: (a) and (b) are images from real scenes and virtual scenes respectively. It is obvious that the visual appearances of the images from different domains are very different, even if they contain the same categories of objects.

Unsupervised domain adaptation aims to adapt information from a label-rich source domain to learn prediction models in a target domain that only has unlabeled instances. Although many unsupervised domain adaptation methods have been developed for the simpler image classification and segmentation tasks [9, 24, 3, 40, 35, 36], much less domain adaptation work has been done on the more complex object detection task, which requires recognizing both the objects and their specific locations. The authors of [1] propose a domain adaptive Faster R-CNN model for cross-domain object detection, which employs the adversarial domain adaptation technique [9] to align cross-domain features at both the image level and the instance level to bridge data distribution gaps. This adaptive Faster R-CNN method presents some promising results. However, due to the typical presence of multiple objects in each image, as shown in Figure 1, both the image-level and instance-level feature alignments can be problematic without considering the specific objects contained. The more recent work [30] addresses the problem of global (image-level) feature alignment by incorporating an additional local feature alignment under a strong-weak alignment framework for cross-domain object detection, which effectively improves the performance of the domain adaptive Faster R-CNN. Nevertheless, this work still fails to take the latent object category information into account for cross-domain feature alignment. With noisy backgrounds and various objects, the information carried by the whole image is complex, and the overall features of an image can have complex multimodal structures. To learn an accurate object detector in the target domain, it is important to induce feature representations that minimize the cross-domain feature distribution gaps while preserving the cross-category feature distribution gaps.

In light of the problem analysis above, in this paper we propose a novel end-to-end unsupervised deep domain adaptation model, Multi-label Conditional distribution Alignment and detection Regularization model (MCAR), for multi-object detection, where the images in the target domain are entirely unannotated. The model exploits multi-label prediction as an auxiliary dual task to reveal the object category information in each image and then uses this information as additional input to perform conditional adversarial cross-domain feature alignment. Such a conditional feature alignment is expected to improve the discriminability of the induced features while bridging the cross-domain representation gaps to increase the transferability and domain invariance of features. Moreover, as object recognition is typically easier to solve and can yield higher accuracy than the more complex object detection task, we introduce a consistency regularization mechanism to assist object detection, which uses the multi-label prediction results as auxiliary regularization information for the object detection part to ensure consistent object category discoveries between the object recognition and the object detection tasks.

The contribution of this work can be summarized as follows: (1) This is the first work that exploits multi-label prediction as an auxiliary dual task for the multi-object detection task. (2) We deploy a novel multi-label conditional adversarial cross-domain feature alignment methodology to bridge domain divergence while preserving the discriminability of the features. (3) We introduce a novel prediction consistency regularization mechanism to improve the detection accuracy. (4) We conduct extensive experiments on multiple adaptive multi-object detection tasks by comparing the proposed model with existing methods, and demonstrate effective empirical results for the proposed model.

2 Related Work

Object Detection.

Detection models have benefited from using advanced convolutional neural networks as feature extractors. Many widely used detection methods are two-stage methods based on regions of interest (ROIs) [10, 11, 28]. The RCNN in [10] is the first detection model that deploys ROIs for object detection; it extracts features independently from each region of interest in the image, instead of using the sliding windows and hand-designed features of traditional object detection methods. Later, the authors of [11] proposed the Fast-RCNN detection model, which adopts an ROI pooling operation to share the convolution layers among all ROIs and improve both detection speed and accuracy. The work in [28] made further improvements and proposed Faster-RCNN, which combines a Region Proposal Network (RPN) with Fast-RCNN to replace selective search and further improve detection performance. Faster-RCNN provides a foundation for many subsequent studies [22, 5, 20, 14, 27]. In this work, as in many related unsupervised domain adaptation methods, the widely used two-stage Faster-RCNN is adopted as the backbone detection model.

Unsupervised Domain Adaptation. Unsupervised domain adaptation has attracted a lot of attention in the computer vision research community and has made great progress [9, 29, 24, 19, 6, 32]. The main idea employed in these works is to learn feature representations that align distributions across domains. For example, the work in [9] adopts the principle of generative adversarial networks (GANs) [13] through a gradient reversal layer (GRL) [8] to achieve cross-domain feature alignment. The work in [24] further extends adversarial adaptation into conditional adversarial domain adaptation by taking the classifier prediction into account. The works in [29, 2] use image generation to realize cross-domain feature transformation and align the source and target domains. Moreover, some other works adopt distance metric learning methods, such as asymmetric metric learning [19], maximum mean discrepancy (MMD) minimization [6] and Wasserstein distance minimization [32], to achieve domain alignment. Nevertheless, these studies focus on the simpler image classification and segmentation tasks.

Figure 2: The structure of the proposed MCAR model. Conditional adversarial global feature alignment is conducted through a domain discriminator by using multi-label prediction results as object category input. Meanwhile, multi-label prediction results are also used to provide a prediction consistency regularization mechanism on object recognition over each considered proposed region after the RPN.

Adaptive Object Detection. Recently, domain adaptation for object detection has started drawing attention. The work in [1] proposes an adaptive Faster-RCNN method that uses adversarial gradient reversal to achieve image-level and instance-level feature alignment for adaptive cross-domain object detection. [17] adopts image transformation and exploits pseudo labels to realize weakly supervised cross-domain detection. The work in [18] leverages multi-style image generation between multiple domains to achieve cross-domain object detection. The authors of [30] propose a strong and weak alignment of local and global features to improve cross-domain object detection performance. [43] focuses on relevant areas for selective cross-domain alignment. [16] adopts hierarchical domain feature alignment while adding a scale reduction module and a weighted gradient reversal layer to achieve domain invariance. [38] adopts a multi-level local-global feature adversary to achieve domain adaptation. Nevertheless, all these methods are limited to cross-domain feature alignment and fail to take the latent object category information into account when performing the feature alignment. Our proposed model employs multi-label object recognition as an auxiliary task and uses it to achieve conditional feature alignment and detection regularization.

3 Method

In this section, we present the proposed Multi-label Conditional distribution Alignment and detection Regularization model (MCAR) for cross-domain adaptive object detection. We assume there are two domains from different sources and with different distributions. The source domain is fully annotated for object detection and the target domain is entirely unannotated. Let $D_s = \{(x_i^s, b_i^s, y_i^s)\}_{i=1}^{n_s}$ denote the annotated images from the source domain, where $x_i^s$ denotes the $i$-th image, and $b_i^s$ and $y_i^s$ denote the bounding boxes' coordinates and the category labels of the corresponding objects contained in the image respectively. Let $D_t = \{x_j^t\}_{j=1}^{n_t}$ denote the unannotated images from the target domain. We assume in total $K$ classes of objects are present in the images of both the source and target domains. We aim to train an object detection model by exploiting the available data from both domains such that the model has good detection performance in the target domain.

The main idea of the proposed MCAR model is to exploit multi-label prediction (for multi-object recognition) as an auxiliary task and use it to perform both conditional adversarial cross-domain feature alignment and prediction consistency regularization for the target object detection task. This end-to-end deep learning model adopts the widely used Faster-RCNN as the backbone detection network. Its structure is presented in Figure 2. Following this structure, we present the model in detail below.

3.1 Multi-Label Prediction

The major difference between object recognition and object detection lies in that the former task only needs to recognize the presence of any object category in the given image, while the latter task needs to identify each specific object and its location in the image. The cross-domain divergence in image features that impacts the object recognition task can also consequently degrade the detection performance, since it will affect the region proposal network and the regional local object classification. Therefore we propose to deploy a simpler task of object recognition to help extract suitable image-level features that can bridge the distribution gap between the source and target domains, while being discriminative for recognizing objects.

In particular, we treat the object recognition task as a multi-label prediction problem [39, 12]. It takes the global image-level features $F(x_i^s)$ produced by the feature extraction network $F$ of the Faster-RCNN model as input, and predicts the presence of each object category using $K$ binary classifier networks $C_1, \dots, C_K$. These classifiers can be learned on the annotated images in the source domain, where the global object category label indicator vector $\tilde{y}_i^s \in \{0, 1\}^K$ for the $i$-th image can be gathered from its bounding boxes' labels through a fixed transformation operation function $T$, i.e., $\tilde{y}_i^s = T(y_i^s)$, which simply finds all the object categories existing in $y_i^s$ and marks their presence with indicator value 1. The multi-label classifiers can then be learned by minimizing the following cross-entropy loss:

$$\mathcal{L}_{mul} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{k=1}^{K} \Big[ \tilde{y}_{ik}^s \log p_{ik}^s + (1 - \tilde{y}_{ik}^s) \log\big(1 - p_{ik}^s\big) \Big]$$

where each $k$-th entry $p_{ik}^s$ of the prediction output vector $p_i^s$ is produced from the $k$-th binary classifier:

$$p_{ik}^s = \sigma\big(C_k(F(x_i^s))\big)$$

which indicates the probability of the presence of objects from the $k$-th class, with $\sigma(\cdot)$ denoting the sigmoid function.

The multi-label classifiers work on the global features extracted before the RPN of the Faster-RCNN. For Faster-RCNN based object detection, these global features are used through the RPN to extract region proposals and then perform object classification and bounding box regression on the proposed regions. In the source domain, supervision information such as the bounding boxes and the object labels is provided for training the detector, while in the target domain, detection is based purely on the extracted global features and the detection model parameters (for the RPN, region classifiers and regressors) obtained in the source domain. Hence it is very important to bridge the domain gap at the global feature level. Moreover, image features that lead to good global object recognition performance are also expected to be informative for the local object classification on the proposed regions. Therefore we exploit multi-label prediction for global feature alignment and regional object prediction regularization.
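As a concrete sketch of this multi-label head, the following minimal numpy example builds the indicator vector from an image's bounding-box labels and computes the binary cross-entropy loss. This is our own illustration, not the paper's implementation; the function names and the toy class layout are assumptions.

```python
import numpy as np

def multilabel_targets(box_labels, num_classes):
    """Transformation T: mark every category that appears among an
    image's bounding-box labels in a binary indicator vector."""
    y = np.zeros(num_classes)
    y[np.unique(box_labels)] = 1.0
    return y

def multilabel_bce(logits, y):
    """Cross-entropy loss over K independent sigmoid (binary) classifiers."""
    p = 1.0 / (1.0 + np.exp(-logits))  # per-class presence probabilities
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

# Hypothetical image containing two 'car' boxes (class 2) and one
# 'person' box (class 5), with K = 6 classes:
y = multilabel_targets(np.array([2, 2, 5]), num_classes=6)  # -> [0,0,1,0,0,1]
loss = multilabel_bce(np.array([-3.0, -3.0, 2.0, -3.0, -3.0, 2.0]), y)
```

Note that duplicate boxes of the same category collapse to a single indicator entry, matching the recognition task's "presence only" semantics.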

3.2 Conditional Adversarial Feature Alignment

The popular generative adversarial network (GAN) [13] has shown that two distributions can be aligned by using a discriminator as an adversary to play a minimax two-player game. Following the same principle, a conditional adversary is designed to take label category information into account. It has been suggested in [24, 26] that the cross-covariance of the predicted category information and the global image features can be helpful for avoiding partial alignment and achieving multimodal feature distribution alignment. We propose to integrate the multi-label prediction results together with the global image features extracted by $F$ to perform conditional adversarial feature alignment at the global image level. The key component network introduced is the domain discriminator $D$, which predicts the domain of the input image instance, with label 1 indicating the source domain and 0 indicating the target domain. As shown in Figure 2, the discriminator consists of a convolution filter layer, which reduces the dimension of the input features, and a fully connected layer, which integrates the inputs to perform classification. It takes the features $f_i = F(x_i)$ and the multi-label prediction probability vector $p_i$ as input, and uses a softmax activation function to produce probabilistic prediction output. For the conditional adversarial training, we adopt a focal loss [21, 30], which uses the prediction confidence deficiency score to weight each instance in order to give more weight to hard-to-classify examples. The loss of conditional adversarial training, $\mathcal{L}_{adv}$, is as below:

$$\mathcal{L}_{adv} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \big(1 - D(h_i^s)\big)^{\gamma} \log D(h_i^s) \;-\; \frac{1}{n_t} \sum_{j=1}^{n_t} D(h_j^t)^{\gamma} \log\big(1 - D(h_j^t)\big)$$

where $\gamma$ is a modulation factor that controls how much to focus on the hard-to-classify examples; the global features $f_i$ and the multi-label prediction probability vector $p_i$ are integrated through a multi-linear mapping function $\Phi$ such that $h_i = \Phi(f_i, p_i) = f_i \otimes p_i$. With this adversarial loss, the feature extractor $F$ will be adjusted to try to confuse the domain discriminator $D$, while $D$ aims to maximally separate the two domains.

This multi-label prediction conditioned adversarial feature alignment is expected to bridge the domain distribution gaps while preserving the discriminability for object recognition, which will improve the adaptation of the consequent region proposal, object classification on each proposed region and its location identification in the target domain.
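The conditioning and the focal weighting can be sketched in a few lines of numpy. This is a hedged illustration under our own naming; in practice the discriminator is a network trained adversarially through the GRL, whereas here its outputs are just given as probabilities.

```python
import numpy as np

def multilinear_map(f, p):
    """h = f (outer) p: flattened outer product of the global features f
    and the multi-label probability vector p, the conditioned input that
    is fed to the domain discriminator."""
    return np.outer(f, p).ravel()

def focal_adversarial_loss(d_src, d_tgt, gamma=5.0):
    """Focal-weighted domain classification loss.
    d_src / d_tgt: discriminator outputs (probability of 'source') on
    source and target images, labeled 1 and 0 respectively. The factors
    (1 - d)^gamma and d^gamma up-weight hard-to-classify examples."""
    eps = 1e-12
    loss_src = -np.mean((1.0 - d_src) ** gamma * np.log(d_src + eps))
    loss_tgt = -np.mean(d_tgt ** gamma * np.log(1.0 - d_tgt + eps))
    return loss_src + loss_tgt

# Toy sizes: a 3-dim feature conditioned on 2 class probabilities
h = multilinear_map(np.array([0.2, 0.7, 0.1]), np.array([0.9, 0.1]))  # shape (6,)
loss = focal_adversarial_loss(np.array([0.8, 0.6]), np.array([0.3, 0.4]))
```

With the focal modulation, confidently classified images contribute almost nothing to the loss, so the adversarial pressure concentrates on images whose domain is ambiguous.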

3.3 Category Prediction based Regularization

The detection task involves recognizing both the objects and their locations, which is relatively more difficult than object recognition [7]. The multi-label classifiers we apply can produce more accurate recognition results, since in the detection task region proposal mistakes can accumulate into the object classification on the proposed regions. Based on this observation, we propose a novel category prediction consistency regularization mechanism for object detection by exploiting the multi-label prediction results.

Assume $R$ region proposals are generated through the region proposal network (RPN) for an input image $x_i$. Each proposal will be classified into one of the $K$ object classes using an object classifier, while its location coordinates will be produced using a regressor. The multi-class object classifier produces a length-$K$ prediction vector on each proposal that indicates the probability of the proposed region belonging to each of the $K$ object classes. The object predictions on the total $R$ proposals form a prediction matrix $P_i \in \mathbb{R}^{R \times K}$. We can then compute an overall multi-object prediction probability vector $\hat{p}_i$ by taking the per-class maximum over the $R$ proposals, such that $\hat{p}_{ik} = \max_{r} P_i(r, k)$, and use $\hat{p}_{ik}$ as the prediction probability of the image containing the $k$-th object category. To enforce consistency between the prediction produced by the detector and the prediction produced by the multi-label object recognition, we propose to minimize the divergence between their prediction probability vectors $\hat{p}_i$ and $p_i$ after renormalizing each vector with the softmax function. As KL divergence is an asymmetric measure, we define the consistency regularization loss symmetrically as:

$$\mathcal{L}_{con} = \frac{1}{2} \Big[ D_{KL}\big(\mathrm{softmax}(p_i) \,\|\, \mathrm{softmax}(\hat{p}_i)\big) + D_{KL}\big(\mathrm{softmax}(\hat{p}_i) \,\|\, \mathrm{softmax}(p_i)\big) \Big]$$
With this regularization loss, we expect the multi-label prediction results can assist object detection through unified mutual learning.
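The regularizer above reduces to a per-class max-pooling over proposals followed by a symmetrized KL divergence. The numpy sketch below is our own illustration of that computation (the symmetric form and the names are our assumptions, not the paper's code):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def consistency_loss(P, p_mul):
    """Symmetrized KL divergence between the detector's image-level
    category scores and the multi-label head's predictions.
    P:     (R, K) per-proposal class probabilities from the detection head.
    p_mul: (K,)   multi-label presence probabilities."""
    q = softmax(P.max(axis=0))  # per-class max over the R proposals, renormalized
    p = softmax(p_mul)
    eps = 1e-12
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, q) + kl(q, p))

# Two proposals, two classes: the detector's pooled scores agree with
# the multi-label head, so the loss is ~0.
P = np.array([[0.9, 0.1],
              [0.2, 0.3]])
loss = consistency_loss(P, np.array([0.9, 0.3]))
```

When the two heads disagree on which categories are present, the loss grows, pushing the detector's region-level classifications toward the (typically more reliable) image-level recognition.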

3.4 Overall End-to-End Learning

The detection loss of the base Faster-RCNN model, denoted as $\mathcal{L}_{det}$, is computed on the annotated source domain data under supervised classification and regression. It has two components, the proposal classification loss and the bounding box regression loss. We combine the detection loss, the multi-label prediction loss, the conditional adversarial feature alignment loss, and the prediction consistency regularization loss together for end-to-end deep learning. The total loss can be written as:

$$\mathcal{L} = \mathcal{L}_{det} + \lambda \mathcal{L}_{adv} + \alpha \mathcal{L}_{mul} + \beta \mathcal{L}_{con}$$

where $\lambda$, $\alpha$, and $\beta$ are trade-off parameters that balance the multiple loss terms. We use the SGD optimization algorithm to perform training, while the GRL [8] is adopted to implement the gradient sign flip for the domain discriminator part.
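The GRL is the one non-standard piece of this optimization. Its behavior can be illustrated in plain Python (a conceptual sketch of the mechanism from [8], not framework code; real implementations hook into the autograd engine):

```python
class GradReverse:
    """Gradient reversal layer: acts as the identity in the forward pass
    and multiplies the incoming gradient by -lam in the backward pass, so
    the feature extractor is updated to *maximize* the discriminator's
    loss while the discriminator itself is updated normally."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # identity: features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # flip and scale the gradient

grl = GradReverse(lam=0.5)
out = grl.forward(3.0)       # 3.0 -- unchanged
grad = grl.backward(2.0)     # -1.0 -- reversed and scaled
```

This single sign flip is what turns the joint minimization into the minimax game of the adversarial alignment, without needing a separate optimizer for the adversary.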

Method G I L MC CTX PR bike bird car cat dog person mAP
Source-only 68.8 46.8 37.2 32.7 21.3 60.7 44.6
BDC-Faster [30] 68.6 48.3 47.2 26.5 21.7 60.5 45.5
DA-Faster [1] 75.2 40.6 48.0 31.5 20.6 60.0 46.0
SW-DA [30] 82.3 55.9 46.5 32.7 35.5 66.7 53.3
MCAR (Ours, MC only) 92.5 52.2 43.9 46.5 28.8 62.5 54.4
MCAR (Ours, MC+PR) 87.9 52.1 51.8 41.6 33.8 68.8 56.0
Oracle 83.6 59.4 50.7 43.7 39.5 74.5 58.6
Table 1: Test results of domain adaptation for object detection from PASCAL VOC to Watercolor in terms of mean average precision (%). G, I, L, MC, CTX, and PR indicate global alignment, instance-level alignment, local alignment, multi-label conditional adversary, context-vector based regularization, and prediction based regularization, respectively.
Method MC PR bike bird car cat dog person mAP
Source-only 32.5 12.0 21.1 10.4 12.4 29.9 19.7
DA-Faster 31.1 10.3 15.5 12.4 19.3 39.0 21.2
SW-DA 36.4 21.8 29.8 15.1 23.5 49.6 29.4
MCAR (Ours, MC only) 40.9 22.5 30.3 23.7 24.7 53.6 32.6
MCAR (Ours, MC+PR) 47.9 20.5 37.4 20.6 24.5 50.2 33.5

Table 2: Test results of domain adaptation for object detection from PASCAL VOC to Comic. The definitions of MC and PR are the same as in Table 1.

4 Experiments

We conducted experiments on multiple cross-domain multi-object detection tasks under different adaptation scenarios: (1) domain adaptation from real to virtual image scenes, using cross-domain detection tasks from PASCAL VOC [7] to Watercolor2K [17] and Comic2K [17] respectively; and (2) domain adaptation from normal/clear images to foggy image scenes, using object detection tasks that adapt from Cityscapes [4] to Foggy Cityscapes [31]. In each adaptive object detection task, the images in the source domain are fully annotated and the images in the target domain are entirely unannotated. We report our experimental results and discussion in this section.

4.1 Implementation Details

In the experiments, we followed the setting of [30] by using Faster-RCNN as the backbone detection network, pretraining the model weights on ImageNet, and resizing images so that their shortest side is 600 pixels. We trained for 25 epochs, and set $\lambda$, $\beta$, $\alpha$, and $\gamma$ to 0.5, 0.01, 0.1, and 5 respectively. The momentum is set to 0.9 and the weight decay to 0.0005. For all experiments, we evaluated the different methods using mean average precision (mAP) with an IoU threshold of 0.5. By default, in the multi-label learning, we set the convolution kernel of the shared convolutional layer to 3x3 with 512 channels, and the convolution kernel of the branch convolutional layer to 3x3 with 512 channels. The convolution layer in the conditional adversarial learning has a 3x3 kernel and 512 channels. It is worth mentioning that the methods in [17, 18] perform pixel-level domain adaptation, such as CycleGAN [42]. They require more computational resources and can be considered as data augmentation for many domain adaptation methods. We hence did not consider comparisons in such settings.
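For reference, the IoU test underlying the mAP evaluation at threshold 0.5 is a standard computation; a minimal sketch (our illustration, not code from the paper):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Under the mAP@0.5 protocol, a detection counts as a true positive only
# when its IoU with an unmatched ground-truth box of the same class is >= 0.5.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1 / 7
```

Per-class average precision is then computed from the resulting true/false positive ranking, and mAP averages it over classes.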

Method MC PR person rider car truck bus train motorbike bicycle mAP
Source-only 25.1 32.7 31.0 12.5 23.9 9.1 23.7 29.1 23.4
BDC-Faster [30] 26.4 37.2 42.4 21.2 29.2 12.3 22.6 28.9 27.5
DA-Faster [1] 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 27.6
SC-DA [43] 33.5 38.0 48.5 26.5 39.0 23.3 28.0 33.6 33.8
SW-DA [30] 36.2 35.3 43.5 30.0 29.9 42.3 32.6 24.5 34.3
Dense-DA [38] 33.2 44.2 44.8 28.2 41.8 28.7 30.5 36.5 36.0

MCAR (Ours, MC only) 31.2 42.5 43.8 32.3 41.1 33.0 32.4 36.5 36.6
MCAR (Ours, MC+PR) 32.0 42.1 43.9 31.3 44.1 43.4 37.4 36.6 38.8
Oracle 50.0 36.2 49.7 34.7 33.2 45.9 37.4 35.6 40.3
Table 3: Test results of domain adaptation for object detection from Cityscapes to Foggy Cityscapes in terms of mAP (%). The definitions of MC and PR are the same as in Table 1.

4.2 Domain Adaptation from Real to Virtual Scenes

In this set of experiments, we used the PASCAL VOC [7] dataset as the source domain, and used the Watercolor2k and Comic2k [17] as the target domains. PASCAL VOC contains realistic images, while Watercolor2k and Comic2k contain virtual scene images. There are significant differences between the source and target domains. The training set of PASCAL VOC (Trainval of PASCAL VOC 2007 and PASCAL VOC 2012) includes 20 different object labels and a total of 16,551 images. Watercolor2k and Comic2k contain 6 different classes (’bicycle’, ’bird’, ’car’, ’cat’, ’ Dog’, ’person’), each providing 2K images, and splitting equally into training set and test set. These 6 categories are included in the 20 categories of PASCAL VOC. We used the 1K training set in each target domain for training the domain adaptation model, while evaluating the model and report results with the 1K test set. In this experiment, we used resnet101 [15] as the backbone network of the detection model.

PASCAL VOC to Watercolor. The test detection results yield by adaptation from PASCAL VOC to Watercolor are reported in Table 1. Our proposed MCAR model is compared with the source-only baseline and three state-of-the-art adaptive object detection methods, BDC-Faster [30], DA-Faster [1], and SW-DA [30]. The oracle results, obtained by training on labeled data in the target domain, are also provided as an upperbound reference value. We can see under the same experimental conditions, our proposed method achieves the best results among all comparison results. It outperforms the best comparison result by 2.7%, while only underpeforming the oracle by 2.6%. Comparing to source only, our method achieves a remarkable overall performance improvement of 9.8%. Although SW-DA [30] confirmed the validity of local and global feature alignment and showed a significant performance improvement over other methods, our method surpasses SW-DA by 1.1% even with only multi-label based global feature alignment. Our full approach outperforms SW-DA by 2.7%. The results suggest the proposed multi-label learning based feature alignment and prediction regularization are effective.

PASCAL VOC to Comic. The results of adaptation from PASCAL VOC to Comic are reported in Table 2. Again, the proposed MCAR method achieved the best adaptive detection result. It outperforms the baseline, source-only (trained on source domain data without any adaptation), by 13.8%, and outperforms the best comparison method, SW-DA, by 4.1%, These results again show that our model is very suitable for adaptive multi-object detection.

4.3 Adaptation from Clear to Foggy Scenes

In this set of experiments, we perform adaptive object detection from normal clear images to foggy images. We use the Cityscapes dataset as the source domain. Its images come from 27 different urban scenes, and the annotated bounding boxes are generated from the original pixel-level annotations. We use the Foggy Cityscapes dataset as the target domain. Its images are rendered from Cityscapes using depth information to simulate fog under real road conditions. They contain 8 categories: 'person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle' and 'bicycle'. In this experiment, we used VGG-16 [33] as the backbone of the detection model. We report the test results on the validation set of Foggy Cityscapes.

The results are reported in Table 3. We can see that the proposed MCAR method achieves the best adaptive detection result. It outperforms source-only by 15.4%, and outperforms the best comparison method, Dense-DA [38], by 2.8%. Moreover, it is worth noting that the performance of the proposed approach is very close to the oracle; the oracle result is only 1.5% higher than ours. Due to the very complex road conditions in this task, although the multi-label classifier is more capable of category judgment than the detection model, its accuracy is relatively low and the gap between the two is small, as shown in Table 4. Hence in this experiment, we used the combination of the multi-label category prediction and the detection-level category prediction; that is, we used the combined prediction vector as the label category information for the conditional adversarial feature alignment. This experiment presents and validates a natural variant of the proposed model.

Figure 3: Feature visualization results. (a) and (b) respectively show the feature distributions of the source-only model and our model on the clear (Cityscapes) and foggy (Foggy Cityscapes) scenes. Red indicates the source domain and blue indicates the target domain.
Method Accuracy (Watercolor)
Multi-label classification 91.9
Basic detection classification 89.1

Method Accuracy (Foggy Cityscapes)
Multi-label classification 79.2
Basic detection classification 78.5

Table 4: Object classification accuracy: training on PASCAL VOC and Cityscapes, and testing on Watercolor and Foggy Cityscapes, respectively.
Figure 4: Qualitative results on adaptive detection. The top row presents examples of domain adaptive detection from PASCAL VOC to Watercolor. The bottom row shows examples of adaptive detection from Cityscapes to Foggy Cityscapes. The green box represents the results obtained by the detection models, and the blue box represents the ground-truth annotation.
γ 1 3 5 7 9
mAP 44.0 46.1 54.4 49.1 44.8

λ 0.1 0.25 0.5 0.75 1
mAP 49.1 50.2 54.4 50.1 49.3

Table 5: Parameter sensitivity analysis on the task of adaptation from PASCAL VOC to Watercolor.

4.4 Further Analysis

Classification accuracy of multi-label learning. Our proposed multi-label assisted model produced good adaptive object detection results. Our experiments also suggest the possibility of incorporating the classification of multi-label learning together with the detection-level prediction under the proposed framework. In this experiment, we compare the test classification accuracy of the multi-label classifiers and the detection-model-based classification. We used PASCAL VOC and Cityscapes as training sets to train Faster-RCNN and the multi-label binary classifiers in a fully supervised manner. We then evaluated the test accuracy of the multi-label classifiers and the final category classification results of the basic detection network on Watercolor and Foggy Cityscapes respectively. The results are reported in Table 4. We can see that the accuracy of directly acquiring category information with the simple binary multi-label classifiers is higher than that of the detection model.

Feature visualization. On the task of adaptation from Cityscapes to Foggy Cityscapes, we used t-SNE [25] to compare the distributions of the features induced by our model and by the source-only model (clear to foggy scenes). The results are shown in Figure 3. We can see that with the feature distribution obtained by source-only (Figure 3(a)), the source domain and target domain are obviously separated, which shows the existence of domain divergence. In contrast, our method produces features that can well confuse the domain discriminator. This suggests that our proposed model has the capacity to bridge the domain distribution divergence and induce domain-invariant features.

Parameter sensitivity analysis. We conducted a sensitivity analysis on the two hyperparameters, $\lambda$ and $\gamma$, using the adaptation task from PASCAL VOC to Watercolor. $\lambda$ controls the weight of the adversarial feature alignment, while $\gamma$ controls the degree of focusing on hard-to-classify examples. Other hyperparameters are set to their default values. We conducted the experiment by fixing the value of $\lambda$ to adjust $\gamma$, and then fixing $\gamma$ to adjust $\lambda$. Table 5 presents the results. We can see that as $\gamma$ decreases from its default value 5, the test performance degrades, since the influence of the domain classifier on difficult samples is weakened and the contribution of easy samples is increased. When $\gamma$ is reduced to 1, the result is nearly the same as the basic model, suggesting the domain regularization basically fails to play its role. On the other hand, a very large $\gamma$ is not good either, as the most difficult samples will dominate. For $\lambda$, we find that $\lambda = 0.5$ leads to the best performance. As detection is still the main task, it makes sense to have $\lambda < 1$; when $\lambda = 0$, the model degrades to a basic model without feature alignment. Therefore, an intermediate value is a proper choice.

Qualitative results. Object detection results are well suited to qualitative assessment through visualization, so we present some adaptive detection results on the target domain in Figure 4. The top row of Figure 4 shows the detections of two state-of-the-art adaptive detection methods, DA-Faster and SW-DA, of our MCAR model, and the ground truth on an image from Watercolor. Both DA-Faster and SW-DA produce some false positives and fail to detect the 'dog', while our model correctly detects both the 'person' and the 'dog'. The bottom row of Figure 4 presents the detections of the same methods and the ground truth on an image from Foggy Cityscapes. The cars in the distance are very blurred and difficult to detect due to the fog; DA-Faster and SW-DA fail to find them, while our model detects them successfully.

5 Conclusion

In this paper, we propose an unsupervised cross-domain multi-object detection method. We exploit multi-label object recognition as a dual auxiliary task to reveal the category information of each image from its global features. Cross-domain feature alignment is then conducted by performing conditional adversarial distribution alignment on the combination of the global features and the predicted category information. We also use the idea of mutual learning to improve detection performance, enforcing consistent object category predictions between the multi-label prediction over global features and the object classification over detection region proposals. We conducted experiments on multiple cross-domain multi-object detection datasets, and the results show that the proposed model achieves state-of-the-art performance.


  • [1] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In CVPR, Cited by: §1, §2, Table 1, §4.2, Table 3.
  • [2] J. Choi, T. Kim, and C. Kim (2019) Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. In ICCV, Cited by: §2.
  • [3] S. Cicek and S. Soatto (2019) Unsupervised domain adaptation via regularized conditional alignment. arXiv preprint arXiv:1905.10885. Cited by: §1, §1.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §4.
  • [5] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In NIPS, Cited by: §2.
  • [6] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani (2015) Training generative neural networks via maximum mean discrepancy optimization. In UAI, Cited by: §2.
  • [7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV. Cited by: §3.3, §4.2, §4.
  • [8] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, Cited by: §2, §3.4.
  • [9] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR. Cited by: §1, §1, §2.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, Cited by: §2.
  • [11] R. Girshick (2015) Fast r-cnn. In ICCV, Cited by: §1, §2.
  • [12] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe (2014) Deep convolutional ranking for multilabel image annotation. ICLR. Cited by: §3.1.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §2, §3.2.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §2.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §4.2.
  • [16] Z. He and L. Zhang (2019) Multi-adversarial faster-rcnn for unrestricted object detection. In ICCV, Cited by: §2.
  • [17] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR, Cited by: §2, §4.1, §4.2, §4.
  • [18] T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In CVPR, Cited by: §2, §4.1.
  • [19] B. Kulis, K. Saenko, and T. Darrell (2011) What you saw is not what you get: domain adaptation using asymmetric kernel transforms. In CVPR, Cited by: §2.
  • [20] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §2.
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §3.2.
  • [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, Cited by: §1, §2.
  • [23] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1.
  • [24] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In NIPS, Cited by: §1, §1, §2, §3.2.
  • [25] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. JMLR. Cited by: §4.4.
  • [26] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §3.2.
  • [27] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1, §2.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1, §2.
  • [29] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo (2018) From source to target and back: symmetric bi-directional adaptive gan. In CVPR, Cited by: §2.
  • [30] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2019) Strong-weak distribution alignment for adaptive object detection. In CVPR, Cited by: §1, §2, §3.2, Table 1, §4.1, §4.2, Table 3.
  • [31] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. IJCV. Cited by: §4.
  • [32] J. Shen, Y. Qu, W. Zhang, and Y. Yu (2018) Wasserstein distance guided representation learning for domain adaptation. In AAAI, Cited by: §2.
  • [33] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.3.
  • [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, Cited by: §1.
  • [35] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, Cited by: §1.
  • [36] Y. Tsai, K. Sohn, S. Schulter, and M. Chandraker (2019) Domain adaptation for structured output via discriminative representations. In ICCV, Cited by: §1.
  • [37] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR, Cited by: §1.
  • [38] R. Xie, F. Yu, J. Wang, Y. Wang, and L. Zhang (2019) Multi-level domain adaptive learning for cross-domain detection. In ICCV, Cited by: §2, §4.3, Table 3.
  • [39] M. Zhang and Z. Zhou (2006) Multilabel neural networks with applications to functional genomics and text categorization. TKDE. Cited by: §3.1.
  • [40] Y. Zhang, P. David, and B. Gong (2017) Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV, Cited by: §1.
  • [41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, Cited by: §1.
  • [42] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §4.1.
  • [43] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin (2019) Adapting object detectors via selective cross-domain alignment. In CVPR, Cited by: §2, Table 3.