Object detection is a fundamental problem in computer vision, aiming for precise localization and classification of objects in images. In the past few years, numerous object detection models [ref2, ref3, ref4, ref5, ref6] based on convolutional neural networks (CNNs) [ref1] have substantially improved performance by using large amounts of labeled data. Nevertheless, applying off-the-shelf pre-trained detectors in real-world scenarios inevitably leads to a significant performance drop because of the large domain gap in object appearance, image scale, background, illumination, viewpoint, and image quality. To meet this challenge, researchers have explored domain adaptation [ref15] to transfer a detector learned on a labeled source domain to an unlabeled target domain with different scenarios, a task named domain adaptive object detection (DAOD).
Domain adaptive Faster R-CNN (DAF) [ref8] is the most representative DAOD work; it integrates Faster R-CNN [ref4] with adversarial training. To address the domain shift problem, it aligns both the image-level and instance-level distributions across domains. Subsequently, because it fits the structural characteristics of detection tasks, DAF has rapidly become a successful baseline [ref9, ref10, ref11, ref12, ref13, ref14, ref43, ref44, ref45, ref46]. These methods improve the performance of the detector on the target domain under the idealized prior assumption that the label spaces are identical across domains (i.e., the closed-set setting).
Nonetheless, existing methods overlook the fact that there is no prior knowledge about the target domain categories in real-world scenarios. Hence, as shown in Figure 1, we consider a new realistic setting called Universal Domain Adaptive Object Detection (UniDAOD). For a better illustration, we denote $\mathcal{C}_s$ and $\mathcal{C}_t$ as the label sets of the source and target domains, respectively. According to the relationship between the two label sets, the universal scenarios fall into partial-set ($\mathcal{C}_t \subset \mathcal{C}_s$), open-set ($\mathcal{C}_s \subset \mathcal{C}_t$, or $\mathcal{C}_s \cap \mathcal{C}_t \neq \emptyset$, $\mathcal{C}_s \not\subset \mathcal{C}_t$, $\mathcal{C}_t \not\subset \mathcal{C}_s$), and closed-set ($\mathcal{C}_s = \mathcal{C}_t$) scenarios. Thus, the main task of UniDAOD is to recognize the common classes (i.e., classes shared across domains) and eliminate the domain gap, while simultaneously suppressing the interference of private classes (i.e., classes that exist in only one domain). Besides, Figure 1 shows that the distributions of features extracted from objects at different scales can be very different due to the perspective projection effect, e.g., cars that are far away are usually very small in an image, while nearby ones are relatively larger. Thus, a uniform feature alignment across all scales, as performed by previous DAOD methods, may not be sufficient. Instead, it is more feasible to perform individual alignment at each scale between domains.
Generally, the inherent challenges of UniDAOD come from two aspects. (1) Category shift: the label set of the test data may not be the same as that of the training data, and the private classes may lead to negative transfer due to their absence in the other domain. (2) Diverse scales: previous DAOD methods mainly explored category adaptation while ignoring a crucial challenge caused by the large variance in object scales, which is difficult but important for detection performance, especially for the UniDAOD task.
To overcome these challenges, we propose an end-to-end deep universal domain adaptation framework, US-DAF, namely Universal Scale-Aware Domain Adaptive Faster R-CNN with Multi-Label Learning. Specifically, since adversarial alignment of the features of all classes without separation might hurt their discriminability (the category shift challenge), we propose a Filter Mechanism to suppress the private classes and preserve the common classes during adversarial training. To fill the gap in scale-aware adaptation for cross-domain object detection (the diverse scales challenge), we introduce a new Multi-Label Scale-Aware adapter that performs individual alignment between corresponding scales of the two domains (i.e., aligning small objects to small ones, medium objects to medium ones, and large objects to large ones).
The main contributions of this paper can be summarized as follows: (1) We introduce a more practical Universal Domain Adaptive Object Detection (UniDAOD) protocol, accompanied by a novel Universal Scale-Aware Domain Adaptive Faster R-CNN (US-DAF) framework. (2) To alleviate the impact of negative transfer caused by category shift, we propose a Filter Mechanism to reject the private classes and preserve the common classes during adversarial training in both the image-level and the instance-level alignment. (3) To tackle the problem caused by the large variation of object scales in natural scenes, we propose a new Multi-Label Scale-Aware adapter, which leverages scale information for better feature alignment. (4) Through ablation studies and experiments, we show that US-DAF achieves state-of-the-art performance and provides a potential baseline for this new task.
2 Related Work and Preliminaries
Universal Domain Adaptation: Existing domain adaptation methods for classification [ref16, ref17, ref18, ref19, ref20] generally assume that the source and target domains share an identical label space. However, in real applications, it is not practical to find a source domain having the same label space as the target domain due to the diversity of detection categories. Therefore, Cao et al. [ref21] introduce the Partial Domain Adaptation problem, which assumes that the target label space is a subset of the source label space, and present Partial Adversarial Domain Adaptation (PADA), which down-weights the data of outlier source classes to alleviate negative transfer. Busto et al. [ref22] propose the Open Set Domain Adaptation setting, in which the source and target label spaces intersect. You et al. [ref23] propose the Universal Adaptation Network (UAN), equipped with a novel criterion to quantify the transferability of each sample under the generalized Universal Domain Adaptation setting, which requires no prior knowledge about the label spaces of the two domains. Fu et al. [ref24] propose Calibrated Multiple Uncertainties (CMU), with a novel transferability measure estimated from a mixture of uncertainty quantities, to align target features with source features. However, directly applying these methods to object detection yields unsatisfactory results. The difficulty is that an image in object detection usually contains multiple objects, so the features of an image can have complex multi-modal structures.
Domain Adaptive Object Detection: The domain adaptive object detection (DAOD) task has drawn much attention due to its various applications [ref8, ref9, ref10, ref11, ref12, ref13, ref14]. As a pioneering work, Chen et al. [ref8] propose the domain adaptive Faster R-CNN method (DAF), which achieves image-level and instance-level feature alignment by using adversarial gradient reversal. They also point out that the core issue of DAOD is to bridge the domain gaps at the image level and the instance level. Formally, let $D_i = 0$ denote that the feature of the $i$-th training image is from the source domain and $D_i = 1$ denote that it is from the target domain. For the image-level alignment, let $p_i^{(u,v)}$ denote the output of the image-level domain classifier for the activation located at $(u, v)$ of the feature map; then the image-level alignment loss can be written as:

$$\mathcal{L}_{img} = -\sum_{i,u,v} \left[ D_i \log p_i^{(u,v)} + (1 - D_i) \log\left(1 - p_i^{(u,v)}\right) \right] \tag{1}$$

For the instance-level alignment, let $p_{i,j}$ denote the output of the instance-level domain classifier for the $j$-th region proposal of the $i$-th image; then the instance-level alignment loss is as follows:

$$\mathcal{L}_{ins} = -\sum_{i,j} \left[ D_i \log p_{i,j} + (1 - D_i) \log\left(1 - p_{i,j}\right) \right] \tag{2}$$
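As a rough illustration of the image-level and instance-level alignment losses described above, the following sketch evaluates both as summed binary cross-entropy terms in plain Python, with domain label 0 for source and 1 for target. The function names, the nested-list layout of the domain-classifier outputs, and the clipping constant `eps` are assumptions made for this example; they are not taken from DAF or from any library.

```python
import math


def bce(p, d, eps=1e-7):
    """Binary cross-entropy of a domain prediction p against domain label d.

    d = 0 marks a source sample and d = 1 a target sample.
    eps is an assumed clipping constant that avoids log(0).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(d * math.log(p) + (1 - d) * math.log(1.0 - p))


def image_level_loss(pred_maps, domain_labels):
    """Sum the BCE over every location (u, v) of each image's prediction map."""
    total = 0.0
    for pred_map, d in zip(pred_maps, domain_labels):
        for row in pred_map:
            for p in row:
                total += bce(p, d)
    return total


def instance_level_loss(pred_proposals, domain_labels):
    """Sum the BCE over every region proposal j of each image i."""
    total = 0.0
    for proposals, d in zip(pred_proposals, domain_labels):
        for p in proposals:
            total += bce(p, d)
    return total


# Usage: one source image (label 0) and one target image (label 1),
# each with a 2x2 domain-prediction map and three region proposals.
maps = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
proposals = [[0.2, 0.3, 0.1], [0.9, 0.8, 0.7]]
img_loss = image_level_loss(maps, [0, 1])
ins_loss = instance_level_loss(proposals, [0, 1])
```

When the domain classifier is fully undecided (every prediction equals 0.5), each term contributes log 2, which is the convergence point that adversarial training pushes the feature extractor towards.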
After that, a large number of detection algorithms have emerged to address the image-level and instance-level domain adaptation problems. Specifically, Saito et al. [ref10] utilize strong and weak domain classifiers to align local and global features. He and Zhang [ref9] propose a hierarchical alignment network designed to align features at different scales between the source domain and the target domain. He et al. [ref26] introduce an asymmetric tri-way approach to account for the differences in labeling statistics between domains. Chen et al. [ref27] use CycleGAN as a data augmentation method to generate intermediate-domain images between the source and target domains, making the model easier to align. Zhao et al. [ref13] use multi-label classification as an auxiliary task to regularize the features.
However, most DAOD approaches have overlooked two fundamental yet practical issues: 1) All previous methods rely on the inherent assumption that different domains have an identical label space, which greatly limits their generalization in the wild. 2) They mainly explore category adaptation and ignore the crucial challenge caused by the large variance in object scales. In this paper, we address these two problems from two aspects: 1) Our model considers a universal setting that imposes no prior knowledge on the label sets and proposes a filter mechanism to suppress private classes. 2) Our model employs a multi-label scale-aware adapter, which leverages scale information for better feature alignment to bridge the domain gap caused by the scale shift.
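In UniDAOD, we assume that a source domain $\mathcal{D}_s$ of labeled samples drawn from distribution $p$ and a target domain $\mathcal{D}_t$ of unlabeled samples drawn from distribution $q$ are provided at training. Since the label sets may not be identical, we use $\mathcal{C}_s$ and $\mathcal{C}_t$ to denote the label sets of the source and target domains, respectively. $\mathcal{C} = \mathcal{C}_s \cap \mathcal{C}_t$ is the common label set shared by both domains, while $\bar{\mathcal{C}}_s = \mathcal{C}_s \setminus \mathcal{C}$ and $\bar{\mathcal{C}}_t = \mathcal{C}_t \setminus \mathcal{C}$ are the private label sets of the source and target domains, respectively. Note that the target label set is not accessible at training and is only used for defining the UniDAOD problem. The Jaccard index of the label sets of the two domains, $\xi = |\mathcal{C}_s \cap \mathcal{C}_t| / |\mathcal{C}_s \cup \mathcal{C}_t|$, is used to represent the overlap among classes.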
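To make the label-set definitions concrete, the short sketch below computes the common set, the two private sets, and the Jaccard index for a pair of label sets. The helper name `label_set_stats` is hypothetical. The example uses the 20 PASCAL VOC classes as the source and the 6 Watercolor classes as the target, which corresponds to a partial-set scenario.

```python
def label_set_stats(source_labels, target_labels):
    """Return the common set, both private sets, and the Jaccard index."""
    source, target = set(source_labels), set(target_labels)
    common = source & target
    union = source | target
    return {
        "common": common,
        "source_private": source - common,
        "target_private": target - common,
        # Jaccard index: size of the intersection over size of the union.
        "jaccard": len(common) / len(union) if union else 0.0,
    }


VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]
WATERCOLOR_CLASSES = ["bicycle", "bird", "car", "cat", "dog", "person"]

stats = label_set_stats(VOC_CLASSES, WATERCOLOR_CLASSES)
# 6 common classes, 14 source-private classes, no target-private class,
# so the Jaccard index is 6 / 20 = 0.3.
```

Swapping the two arguments leaves the Jaccard index unchanged but moves the 14 private classes to the target side.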
3.1 Network Structure
To deal with the challenges (i.e., category shift and scale shift) mentioned in Section 1, we propose the Universal Scale-Aware Domain Adaptive Faster R-CNN with Multi-Label Learning (US-DAF) framework, which has two components: (1) a filter mechanism that suppresses the private classes and preserves the common classes at the image level and the instance level; (2) a multi-label scale-aware adapter at the image level and the instance level that tackles the problem brought by the variation of object scales in natural scenes.
Since private-class features might lead to negative transfer during training and hurt the discriminability of the detector, we filter out the private classes and focus on the common classes when adapting an object detector by introducing a Filter Mechanism. To fill the gap in scale-aware adaptation for cross-domain object detection, we perform adaptation according to the bounding box scale. The overall structure of the proposed US-DAF is presented in Figure 2, and Sections 3.2 and 3.3 introduce the design of the filter mechanism and the scale-aware adapter in detail.
3.2 Filter Mechanism
An ideal solution for the category shift of UniDAOD is to let the samples of common categories pass through to further adaptation while suppressing the samples of private categories. If we naively pick any existing DAOD method to solve UniDAOD by aligning the source with the target domain, the private classes will cause negative transfer and degrade the detection performance on the common classes in the target domain. Therefore, we adopt a sample-level Filter Mechanism. For both source and target domains, the samples of common categories are expected to become well aligned, while the samples of private categories are expected to be ignored. Consequently, we need a criterion to identify the common category set and the private category set, and then perform the adversarial domain alignment with this criterion.
Our motivation comes from observing the optimization process with the Gradient Reversal Layer (GRL) [ref17]. Specifically, the objective of the domain discriminator is to predict samples from the source domain as 0 and samples from the target domain as 1. The ideal convergence point of domain adversarial training is that samples with similar categories cannot be easily distinguished, which means that the predictions of the domain discriminator on these samples are around the middle point 0.5. Thus, the discriminator output $d(x)$ for a sample $x$ can be seen as a quantification of the domain similarity of that sample. For a source sample, a larger $d(x)$ means that it is more similar to the target domain; for a target sample, a smaller $d(x)$ means that it is more similar to the source domain. Therefore, we can hypothesize that the expected discriminator output increases in the following order: source-private samples < source-common samples < target-common samples < target-private samples.
Inspired by this, we propose to draw a boundary between common and private points using the predictions of the domain discriminator. We illustrate the idea in Figure 3. Specifically, the distance between the prediction and the middle point, 0.5, is defined as $|d(x) - 0.5|$, where $d(x)$ is the classification output for a sample $x$. We expect the prediction for common-class samples to be closer to the middle point than that for private-class samples. Therefore, we introduce a confidence threshold parameter to identify the common category set and the private category set. Under this formulation, common-class and private-class samples can be separated with the confidence threshold. Note that tuning this parameter for each adaptation setting requires a validation set.
With the above analysis, by combining Filter Mechanism with image-level and instance-level alignments, the sample-level transferability criterion for the image-level domain adaptation (i.e., Eq. 1) and the instance-level domain adaptation (i.e., Eq. 2) can be respectively re-formulated as Eq. 3 and Eq. 4:
The introduction of the confidence threshold allows us to give the final separation loss for differentiating the common-class samples from private-class samples.
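One possible reading of the filter mechanism is sketched below: a sample contributes to the adversarial loss only if its domain prediction lies within a threshold of the middle point 0.5. This is a minimal sketch under stated assumptions, not the exact form of Eq. 3 and Eq. 4. The hard 0/1 weighting, the name `tau` for the confidence threshold, and the clipping constant `eps` are all assumptions introduced for illustration.

```python
import math


def filter_weights(domain_preds, tau):
    """Assumed hard filter: 1.0 if the prediction is within tau of 0.5, else 0.0.

    Samples close to 0.5 are treated as common-class samples and kept;
    samples far from 0.5 are treated as private-class samples and suppressed.
    """
    return [1.0 if abs(p - 0.5) <= tau else 0.0 for p in domain_preds]


def filtered_adversarial_loss(domain_preds, domain_label, tau, eps=1e-7):
    """Binary cross-entropy domain loss summed over the kept samples only.

    domain_label is 0 for a source batch and 1 for a target batch.
    """
    total = 0.0
    for p, w in zip(domain_preds, filter_weights(domain_preds, tau)):
        p = min(max(p, eps), 1.0 - eps)
        bce = -(domain_label * math.log(p) + (1 - domain_label) * math.log(1.0 - p))
        total += w * bce
    return total


# Usage: four target-domain predictions with an assumed threshold of 0.2.
preds = [0.45, 0.05, 0.60, 0.98]
kept = filter_weights(preds, tau=0.2)
loss = filtered_adversarial_loss(preds, domain_label=1, tau=0.2)
```

With this choice, only the first and third predictions are kept, so the loss equals the sum of their two cross-entropy terms. A soft weighting that decays with the distance from 0.5 would be an equally plausible design.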
3.3 Scale-Aware Adaptation with Multi-Label Learning
We design a scale-aware adaptation (SAA) module to leverage scale information for better feature alignment at the image level and the instance level. Our motivation lies in two aspects. First, Chen [ref32] and Lin [ref33] claim that the scale of objects in natural images can vary dramatically, which is an inevitable and non-negligible problem in image segmentation. We can therefore reasonably assume that the large variance in object scales also poses a crucial challenge to cross-domain object detection. Second, as discussed in Section 1, current DAOD models [ref8, ref9, ref10, ref26, ref27, ref13] ignore the importance of scale-aware alignment and apply a uniform feature alignment across all scales, which may not be sufficient (see Figure 4 for an illustration).
Building upon these considerations, we introduce the scale-aware adaptation module to perform alignment between the corresponding scales of the two domains. Intuitively, instance features can be divided into three categories according to their size in pixels: small, medium, and large. The instance size comes at no extra cost for detection datasets and can be easily obtained from the RPN sub-module of Faster R-CNN. It is worth noting that, as illustrated in Figure 2, we attach the scale-aware adaptation to both the instance level and the image level, because the image-level features contain fine-grained information associated with the objects at the instance level.
In particular, we treat the scale-aware domain classification task as a multi-label classification problem [ref34, ref35]. It takes the size of the instance features produced by the RPN of the Faster R-CNN model as an additional label input and combines it with the original domain label. More formally, as shown in Figure 4, each training image is assigned a multi-label, with three encodings for the source domain (one per scale) and three encodings for the target domain (one per scale). The first entry of the encoded multi-label indicates the domain, and the last three entries indicate the scale (small, medium, large) of the objects. The domain-invariant features can then be learned by minimizing the following multi-label cross-entropy loss:
With the above analysis, by combining the filter mechanism (i.e., Eq. 3 and Eq. 4) with SAA, the sample-level transferability criterion for the image-level multi-scale domain adaptation (i.e., Eq. 5) and the instance-level multi-scale domain adaptation (i.e., Eq. 6) can be respectively defined as Eq. 7 and Eq. 8:
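The multi-label encoding described in this section can be sketched as follows for a single instance: the first entry carries the domain and the last three entries carry a one-hot scale. This is an illustration under stated assumptions rather than the exact formulation. The domain bit follows the 0 = source, 1 = target convention used for the domain discriminator, the area boundaries of 32² and 96² pixels are borrowed from the COCO convention and are not taken from this paper, and all function names are hypothetical.

```python
import math

# Assumed scale boundaries in pixel area (COCO convention, not from the paper).
SMALL_MAX_AREA = 32 ** 2
MEDIUM_MAX_AREA = 96 ** 2


def scale_bin(width, height):
    """Return 0 (small), 1 (medium), or 2 (large) for a box of the given size."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return 0
    if area < MEDIUM_MAX_AREA:
        return 1
    return 2


def encode_multilabel(is_target, width, height):
    """Encode [domain, small, medium, large] for one instance."""
    label = [1.0 if is_target else 0.0, 0.0, 0.0, 0.0]
    label[1 + scale_bin(width, height)] = 1.0
    return label


def multilabel_bce(preds, label, eps=1e-7):
    """Sum of independent binary cross-entropies over the four label entries."""
    total = 0.0
    for p, y in zip(preds, label):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


# Usage: a small source instance and a large target instance.
source_small = encode_multilabel(False, 20, 20)    # [0.0, 1.0, 0.0, 0.0]
target_large = encode_multilabel(True, 200, 150)   # [1.0, 0.0, 0.0, 1.0]
loss = multilabel_bce([0.5, 0.5, 0.5, 0.5], target_large)
```

Treating each entry as an independent binary target is what makes this a multi-label problem rather than a six-way classification over (domain, scale) pairs.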
3.4 Overall End-to-End Learning
The overall framework of US-DAF, with a detailed pipeline, is shown in Figure 2. US-DAF contains three loss functions: the detection loss, the image-level domain adversarial loss, and the instance-level domain adversarial loss. The standard detection loss of Faster R-CNN [ref4] is used, i.e., the cross-entropy loss for classification and the SmoothL1 loss for regression (localization). Note that the detection loss is only optimized on the labeled source samples.
The combination of the last two losses forms the proposed universal domain alignment (UniDA) of US-DAF at both the image level and the instance level. By jointly considering Eq. 7 and Eq. 8, the proposed UniDA loss is expressed as:
With the combination of the detection loss and domain alignment loss, the final loss of the proposed US-DAF can be written as:
where the trade-off weight of the domain alignment loss is a hyper-parameter, G denotes the Faster R-CNN object detector, and D denotes the domain classifier. The mini-max adversarial optimization is implemented with the GRL [ref17].
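The role of the GRL in the mini-max optimization can be sketched in a few lines of plain Python: the layer is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass, so the detector is pushed to confuse the domain classifier while the classifier itself is trained normally. The class and function names are hypothetical, and the additive form of the total loss (detection loss plus a weighted sum of the two alignment losses) is an assumption in the style of DAF, not a transcription of Eq. 9 and Eq. 10.

```python
class GradientReversal:
    """Identity in the forward pass, negated gradient in the backward pass."""

    def __init__(self, weight=1.0):
        # weight scales the reversed gradient (an assumed, commonly used option).
        self.weight = weight

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is reversed.
        return [-self.weight * g for g in grad_output]


def total_loss(det_loss, img_loss, ins_loss, trade_off=0.01):
    """Assumed combination: detection loss + trade_off * (image + instance losses)."""
    return det_loss + trade_off * (img_loss + ins_loss)


# Usage: gradients from the domain classifier are flipped before reaching
# the backbone, and the alignment losses are added with a small weight.
grl = GradientReversal(weight=0.5)
features = grl.forward([1.0, 2.0])
reversed_grads = grl.backward([2.0, -4.0])   # [-1.0, 2.0]
loss = total_loss(det_loss=1.0, img_loss=2.0, ins_loss=3.0, trade_off=0.01)
```

In an automatic-differentiation framework the same behaviour would be written as a custom operation with an overridden backward pass, so that a single backward call updates both players.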
To perform a thorough evaluation under a variety of UniDAOD settings, we compare US-DAF with state-of-the-art methods tailored to DAOD settings on several datasets with different scenarios, i.e., open-set, partial-set, and closed-set. We conduct experiments and evaluate our proposed method on benchmark datasets, including Cityscapes [ref36], Foggy Cityscapes [ref37], PASCAL VOC [ref7], Clipart1k and WaterColor [ref40]. Then, we explore how the performance changes with the overlap between the label sets (i.e., the Jaccard index). Code will be available.
4.1 Experimental Setup
Implementation Details. For fair comparison, the backbone network of our proposed US-DAF model is ResNet101 [ref41] pre-trained on ImageNet [ref1] in the experiments. Following the default settings in [ref8], the shorter side of each input image is resized to 600 pixels. We optimize the network with the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.0005. The initial learning rate is set to 0.001 and dropped to 0.0001 after 50k iterations. Training runs for 100k iterations in total. The trade-off parameter in Eq. 10 is set to 0.01 in our implementation. A single batch is composed of two images, one from the source domain and one from the target domain. To evaluate the adaptation performance, we report mean average precision (mAP) with an IoU threshold of 0.5.
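The training schedule and input resizing described above can be summarized in a small sketch. The function names are hypothetical, the step schedule assumes the drop happens exactly at iteration 50,000, and the resize helper assumes that no cap is applied to the longer side, which the text does not specify.

```python
TOTAL_ITERATIONS = 100_000
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
TRADE_OFF = 0.01


def learning_rate(iteration, base_lr=0.001, dropped_lr=0.0001, drop_at=50_000):
    """Step schedule: base_lr for the first 50k iterations, dropped_lr afterwards."""
    return base_lr if iteration < drop_at else dropped_lr


def resize_shorter_side(width, height, target=600):
    """Scale an image so that its shorter side equals target pixels.

    The aspect ratio is preserved; no limit on the longer side is assumed.
    """
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


# Usage: a 2048x1024 Cityscapes image becomes 1200x600.
new_size = resize_shorter_side(2048, 1024)
early_lr = learning_rate(10_000)
late_lr = learning_rate(75_000)
```

Keeping these values in one place makes it easier to check that every compared method is trained under the same settings.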
Compared Methods. We compare the proposed US-DAF with (1) CNN-based object detection: source-only Faster R-CNN [ref4] without any adaptation; (2) traditional domain adaptive object detection methods: Domain Adaptive Faster R-CNN (DAF) [ref8], Multi-adversarial Faster R-CNN (MAF) [ref9], and Hierarchical Transferability Calibration Network (HTCN) [ref27]; (3) partial domain adaptation methods: Partial Adversarial Domain Adaptation (PADA) [ref21]; (4) universal domain adaptation methods: Universal Adaptation Network (UAN) [ref23] and Calibrated Multiple Uncertainties (CMU) [ref24]. Because these methods achieved state-of-the-art performance in their respective tasks, it is valuable to show their performance in the UniDAOD setting. It is worth noting that PADA, UAN, and CMU are domain adaptation methods for image classification, and we apply them to domain adaptive object detection.
4.2 Experimental Results
For different values of the Jaccard index of the label sets, under the open-set, partial-set, and closed-set scenarios, the mean average precision on the common classes is reported in Tables 1 to 6. US-DAF outperforms all the compared methods in terms of mean average precision. These consistent results suggest that US-DAF can overcome the double challenge brought by the category shift and the scale issue between the source and target domains. We make the following observations.
Open-set scenario. As shown in Tables 1 to 3, we use PASCAL VOC [ref7] as the source domain and Clipart1k [ref40] as the target domain, and we select some classes as the common classes or private classes. Specifically, we design three experiments for this scenario (from PASCAL VOC to Clipart1k) with different degrees of label-set overlap.
The experiments show that, regardless of the degree of overlap, our US-DAF achieves state-of-the-art results among all compared methods. The proposed US-DAF clearly outperforms the baseline model DAF [ref8] by +7.1%, +6.8%, and +3.6% in the three settings. Note that our US-DAF also surpasses MAF [ref9] and HTCN [ref27], even though they have a multi-layer alignment structure and additional adaptation modules. In these scenarios, both the source and target domains have their own private classes, which leads to more serious negative transfer. Nevertheless, our model still performs well by using the proposed filter mechanism and scale-aware adaptation.
Furthermore, in open-set settings, especially the difficult task Watercolor [ref30] → PASCAL VOC [ref7] (i.e., Table 4), most existing methods perform similarly to or even worse than Faster R-CNN trained only on source data without any adaptation, indicating that they are prone to negative transfer in open-set settings. We find that DAF [ref8], MAF [ref9], and HTCN [ref27] suffer from negative transfer in most classes and promote adaptation for only a few classes. By comparison, the proposed US-DAF promotes positive transfer for all classes.
Partial-set scenario. We evaluate the partial domain adaptive object detection scenario, in which the target label set is a strict subset of the source label set. The WaterColor [ref40] dataset contains 6 categories in common with PASCAL VOC [ref7]. Therefore, we adopt PASCAL VOC as the source domain and WaterColor as the target domain in the partial domain adaptation.
The results are presented in Table 5. Our US-DAF achieves 55.2% mAP, a remarkable gain of +5.9% over the baseline DAF [ref8]. Furthermore, most existing DAOD methods perform similarly to or even worse than Faster R-CNN trained only on source data without any adaptation, indicating that they are prone to negative transfer in partial-set settings. Note that our US-DAF also surpasses UAN [ref23] and CMU [ref24], even though they avoid negative transfer in most tasks. By comparison, the proposed US-DAF promotes positive transfer for all classes. These consistent results suggest that US-DAF can overcome the double challenge brought by the category shift and the scale issue between the source and target domains.
Closed-set scenario. Existing DAOD methods work under the closed-set domain adaptation setting, where the category sets of the source and target domains are the same. Therefore, we use the samples from the common label set to compare our method with previous methods. As shown in Table 6, we conduct the closed-set experiment from Cityscapes [ref36] to Foggy Cityscapes [ref37] and compare with the two baseline methods [ref8, ref9].
The experimental results show that our proposed US-DAF outperforms the two methods. This indicates that the sample-level filter mechanism of US-DAF does not degrade performance in the closed-set domain adaptation setting, and it demonstrates the effectiveness of our scale-aware adaptation approach in the closed-set scenario.
4.3 Further Empirical Analysis
In this section, we conduct model analysis and discussion to investigate the effect of our US-DAF for the UniDAOD task. An in-depth insight into the proposed models is shown.
Ablation Study. We conduct an ablation study to show the effectiveness of each component (i.e., FM and SAA) by evaluating several variants of US-DAF; the results for all scenarios are reported in the bottom part of Tables 1 to 5. Removing the proposed filter mechanism (FM) lowers the performance accordingly. Take Pascal VOC → Clipart1k (i.e., Table 2) as an example: with FM, the mean average precision is 40.0%, whereas without FM it drops to 38.4%. Similarly, the results in Tables 1 to 5 show that removing the SAA degrades the performance correspondingly. This indicates that both modules contribute to US-DAF.
Negative Transfer. In the practical setting of UniDAOD, most existing methods perform similarly to or even worse than Faster R-CNN without any adaptation, indicating that existing methods are prone to negative transfer in UniDAOD settings. For example, Figure 5 (a) and (b) show the per-class accuracy gain compared to Faster R-CNN on the tasks Pascal VOC → Watercolor and Watercolor → Pascal VOC. We find that DAF, MAF, and HTCN suffer from negative transfer in most classes. Only US-DAF promotes positive transfer for all classes. This suggests that our proposed US-DAF has the capacity to quantify class importance and emphasize the common label set across domains.
Visualization of Feature Distribution. In Figure 6, we use t-SNE [ref42] to compare the distribution of the induced features between our US-DAF and other models in the partial-set (i.e., Pascal VOC to Watercolor) and open-set (i.e., Watercolor to Pascal VOC) scenarios, where different colors stand for different common categories and the black dots stand for the private categories. For DAF and UAN, the features of private classes and several common classes are close or even mixed together, indicating that these methods cannot discriminate known (common) and unknown (private) classes during training. By contrast, our proposed US-DAF produces features that separate the common and private classes well, which benefits from the proposed filter mechanism and multi-label scale-aware adaptation.
Scale-Wise Analysis. To gain further insight into the influence of feature alignment, we conduct a scale-wise analysis, in which we visualize the instance-level features from different object scales and provide a scale-wise quantitative evaluation. As shown in Figure 7, we use Cityscapes → Foggy Cityscapes in this study. The top row shows the domain-wise alignment, and the bottom row shows the scale-wise alignment, where we divide the instances into three sub-categories based on the instance size in pixels: small, medium, and large.
Figure 7 presents the results of the source-only Faster R-CNN, DAF, and DAF+SAA. The source and target features extracted by the source-only Faster R-CNN model can be clearly divided into two parts, and features of different scales are spread across the feature space. DAF performs a uniform domain alignment, which is agnostic to the scale. As a result, the features are aligned between the two domains to some extent. However, the alignment has the side effect of wrongly aligning features across different scales. In contrast, our DAF+SAA takes advantage of the scale information and maintains the scale discriminability when aligning the features, which results in an observably better feature alignment. Further, we report the mAP for each scale and summarize the results in Figure 7. We observe that the proposed modules also yield better quantitative results across scales.
In this paper, we introduce a novel setting that better meets the needs of real-world scenarios, Universal Domain Adaptive Object Detection (UniDAOD), which requires no prior knowledge of the label set of the target domain. To meet the challenge of UniDAOD, we contribute a Universal Scale-Aware Domain Adaptive Faster R-CNN with Multi-Label Learning (US-DAF) framework, which, to the best of our knowledge, is a pioneering work for object detection under both category shift and scale variation in universal scenarios. To overcome the category shift in UniDAOD, we introduce the filter mechanism to reject the private classes and preserve the common classes. Moreover, the scale-aware adapter is proposed with a multi-label learning mechanism to tackle the problem caused by the large variety of scales in natural scenes. Through extensive experiments, we validate the effectiveness of our method by achieving new state-of-the-art performance in various universal domain adaptation scenarios.
This work was partially supported by National Key R&D Program of China (2021YFB3100800), Chongqing Natural Science Fund (cstc2021jcyj-jqX0023), CCF Hikvision Open Fund (CCF-HIKVISION OF 20210002), CAAI-Huawei MindSpore Open Fund, and Beijing Academy of Artificial Intelligence (BAAI).