The object detection task has aroused great interest due to its wide applications. In the past few years, the development of deep neural networks has boosted the performance of object detectors (detection; RCNN; yolo). While these detectors have achieved excellent performance on benchmark datasets (pascal; coco), object detection in the real world still faces challenges from the large variance in viewpoints, object appearance, backgrounds, illumination, image quality, etc. Such domain shifts have been observed to cause significant performance drops (wild). Thus, some works use domain adaptation (transfer_survey) to transfer a detector from a source domain, where sufficient training data is available, to a target domain, where only unlabeled data is available (wild; swda). This technique successfully improves the performance of the detector on the target domain. However, the improvement from domain adaptation in object detection remains relatively mild compared with that in object classification.
The inherent challenges come from three aspects. Data challenge: what to adapt in the adversarial training is unknown. Instance feature adaptation at the object level (Figure 1(a)) might confuse the features of the foreground and the background, since the generated proposals may not be true objects and many true objects might be missed (see the error analysis in Figure 5). Global feature adaptation at the image level (Figure 1(b)) is likely to mix up features of different objects, since each input image in detection contains multiple objects in numerous combinations. Local feature adaptation at the pixel level (Figure 1(c)) can alleviate domain shift when the shift is primarily low-level, yet it struggles when the domains differ at the semantic level. Architecture challenge: while the above adaptation methods encourage domain-invariant features, the discriminability of the features may deteriorate during adaptation (bsp; harmonizing). However, discriminability is very important for the detector, since it needs to locate and classify objects at the same time. While self-training (self-iccv) does not hurt the discriminability of features, it suffers from confirmation bias, especially when the domain shift is large. Loss challenge: previous methods mainly explored category adaptation and ignored regression adaptation, which is difficult but important for detection performance.
To overcome these challenges, we propose a general framework, D-adapt, namely Decoupled Adaptation. Since adversarial alignment directly on the features of the detector might hurt its discriminability (the architecture challenge), we decouple the adversarial adaptation from the training of the detector by introducing a parameter-independent category adaptor (see Figure 1(d)). To fill the blank of regression domain adaptation in cross-domain detection (the loss challenge), we introduce another bounding box adaptor that is decoupled from both the detector and the category adaptor. To tackle the data challenge, we propose to adjust the object-level data distribution for each specific adaptation task. This is easily achieved in D-adapt, where different adaptors can have completely different input data distributions. For example, in the category adaptation step, we encourage the input proposals to have IoU (the Intersection-over-Union between a proposal and the ground-truth instance) close to 0 or 1 to better satisfy the low-density separation assumption, while in the bounding box adaptation step, we restrict the input proposals to a particular IoU range to ease the optimization of the bounding box localization task.
The contributions of this work are summarized as three-fold. (1) We introduce the D-adapt framework for cross-domain object detection, which applies to both two-stage and single-stage detectors. (2) We propose an effective method to adapt the bounding box localization task, which is ignored by existing methods but is crucial for achieving superior final performance. (3) We conduct extensive experiments and validate that our method achieves state-of-the-art performance on four object detection tasks, yielding 17% and 21% relative improvements on Clipart1k and Comic2k, respectively.
2 Related Work
Generic domain adaptation for classification.
Domain adaptation is proposed to overcome the distribution shift across domains. In the classification setting, most domain adaptation methods are based on Moment Matching or Adversarial Adaptation. Moment Matching methods (DDC; DAN) align distributions by minimizing the distribution discrepancy in the feature space. In the same spirit as Generative Adversarial Networks (GAN), Adversarial Adaptation (DANN; CDAN) introduces a domain discriminator to distinguish the source from the target; the feature extractor is then encouraged to fool the discriminator and learn domain-invariant features. However, directly applying these methods to object detection yields unsatisfactory results. The difficulty is that an image in object detection usually contains multiple objects, thus its features can have complex multimodal structures (every-pixel; dual; harmonizing), making image-level feature alignment problematic (dual; every-pixel).
Generic domain adaptation for regression.
Most domain adaptation methods designed for classification do not work well on regression tasks, since the regression space is continuous with no clear decision boundary (RegDA). Some regression-specific algorithms have been proposed, including importance weighting (Yamada2013DomainAF) and learning invariant representations (5640675; NIKZADLANGERODI2020106447). RSD (DAR_ICML_21) defines a geometrical distance for learning transferable representations, and disparity discrepancy (MDD) proposes an upper bound on the distribution distance in regression problems. Yet previous methods are mainly tested on simple tasks, while this paper extends domain adaptation to object localization.
Domain adaptation for object detection.
DA-Faster (wild) performs feature alignment at both the image level and the instance level. SWDA (swda) proposes that strong alignment of local features is more effective than strong alignment of global features. Hsu et al. (every-pixel) carry out center-aware alignment by paying more attention to foreground pixels. HTCN (harmonizing) calibrates the transferability of feature representations hierarchically. Zheng et al. (coarse) propose to extract foreground regions and adopt coarse-to-fine feature adaptation. ATF (triway) introduces an asymmetric tri-way approach to account for the differences in labeling statistics between domains. CRDA (exploring) and MCAR (dual) use multi-label classification as an auxiliary task to regularize the features. However, although the auxiliary task of outputting domain-invariant features to fool a domain discriminator in most aforementioned methods can improve transferability, it also impairs the discriminability of the detector. In contrast, we decouple the adversarial adaptation from the training of the detector, so the adaptors can specialize in transfer between domains, and the detector can focus on improving discriminability while enjoying the transferability brought by the adaptors.
Self-training with pseudo labels.
Pseudo-labeling (self3), which leverages the model itself to obtain labels on unlabeled data, is widely used in self-training. To generate reliable pseudo labels, temporal ensembling (temporal) maintains an exponential moving average prediction for each sample, while the mean teacher (meanteacher) averages model weights at different training iterations to get a teacher model. Deep mutual learning (deepmutual) trains a pool of student models with supervision from each other. FixMatch (fixmatch) uses the model's predictions on weakly-augmented images to generate pseudo-labels for the strongly-augmented ones. Unbiased Teacher (unbiased) introduces the teacher-student paradigm to Semi-Supervised Object Detection (SS-OD). Recent works (progressive; robust; self-iccv) utilize self-training in cross-domain object detection and take the most confident predictions as pseudo labels. MTOR (teacherdetection) uses the mean teacher framework, and UMT (umt) adopts distillation and CycleGAN (CycleGAN) in self-training. However, self-training suffers from confirmation bias (confrimation; curriculum): the performance of the student is limited by that of the teacher. Although pseudo labels are also used in our proposed D-adapt, they are generated from adaptors that have independent parameters and different tasks from the detector, thereby alleviating the confirmation bias caused by the overly tight relationship in self-training.
3 Proposed Method
In supervised object detection, we have a labeled source domain $\mathcal{D}_s$, where each sample consists of an image $x$, bounding box coordinates $b$, and categories $c$. The detector is trained with the detection loss $\mathcal{L}_{det}$, which consists of four losses in Faster RCNN (Faster): the RPN classification loss $\mathcal{L}_{cls}^{rpn}$, the RPN regression loss $\mathcal{L}_{reg}^{rpn}$, the RoI classification loss $\mathcal{L}_{cls}^{roi}$, and the RoI regression loss $\mathcal{L}_{reg}^{roi}$:
$\mathcal{L}_{det} = \mathcal{L}_{cls}^{rpn} + \mathcal{L}_{reg}^{rpn} + \mathcal{L}_{cls}^{roi} + \mathcal{L}_{reg}^{roi}.$
In cross-domain object detection, there exists another unlabeled target domain $\mathcal{D}_t$ that follows a different distribution from $\mathcal{D}_s$. The objective is to improve the performance of the detector on $\mathcal{D}_t$.
3.1 D-adapt Framework
To deal with the architecture challenge mentioned in Section 1, we propose the D-adapt framework, which has three steps: (1) decouple the original cross-domain detection problem into several sub-problems; (2) design an adaptor to solve each sub-problem; (3) coordinate the relationships between the different adaptors and the detector.
Since adaptation might hurt the discriminability of the detector, we decouple the category adaptation from the training of the detector by introducing a parameter-independent category adaptor (see Figure 1(d)). The adversarial adaptation is only performed on the features of the category adaptor, and thus does not hurt the detector's ability to locate objects. To fill the blank of regression domain adaptation in object detection, we need to perform adaptation on the bounding box regression. Yet the feature visualization in Figure 6(c) reveals that features containing both category and location information do not have an obvious cluster structure, and adversarial alignment might hurt their discriminability. Besides, common category adaptation methods are also not effective on regression tasks (RegDA), thus we decouple the category adaptation and the bounding box adaptation to avoid interference between them. Sections 3.2 and 3.3 will introduce the design of the category adaptor and the box adaptor in detail. In this section, we assume that these two adaptors are already obtained.
To coordinate the adaptation on different tasks, we maintain a cascading relationship between the adaptors. In the cascading structure, later adaptors can utilize the information obtained by earlier adaptors for better adaptation, e.g., in the box adaptation step, the category adaptor selects foreground proposals to facilitate the training of the box adaptor. Compared with a multi-task learning relationship, where the weights of different adaptation losses must be balanced carefully, the cascade relationship greatly reduces the difficulty of hyper-parameter selection, since each adaptor has only one adaptation loss. Since the adaptors are specifically designed for cross-domain tasks, their predictions on the target domain can serve as pseudo labels for the detector. On the other hand, the detector generates proposals to train the adaptors, and higher-quality proposals improve the adaptation performance (see Table 5 for details). This enables a self-feedback relationship between the detector and the adaptors.
For a good initialization of this self-feedback loop, we first pre-train the detector on the source domain with $\mathcal{L}_{det}$. Using the pre-trained detector, we can derive two new data distributions: the source proposal distribution and the target proposal distribution. Each proposal consists of a crop of the image (we use uppercase letters to represent a whole image and lowercase letters to represent an object instance), its corresponding bounding box, predicted category, and class confidence. We can annotate each source-domain proposal with a ground-truth bounding box and category label, similar to labeling each RoI in Fast RCNN (Fast), and then use these labels to train the adaptors. In turn, for each target proposal, the adaptors provide a category pseudo label and a box pseudo label to train the RoI heads,
where an indicator function determines whether the pseudo category label is a foreground class. Note that the regression loss is activated only for foreground anchors. After obtaining a better detector by optimizing Equation 2, we can generate higher-quality proposals, which in turn facilitate better category adaptation and bounding box adaptation. This process can iterate multiple times, and the detailed optimization procedure is summarized in Algorithm 1.
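One round of the self-feedback loop described above can be sketched as follows. All class and method names (`propose`, `fit`, `predict`, `fit_pseudo`) are hypothetical stand-ins, not the authors' API; this is a schematic of Algorithm 1, not a faithful implementation.

```python
def d_adapt_round(detector, cat_adaptor, box_adaptor, source, target):
    """One round of the D-adapt self-feedback loop (schematic sketch)."""
    # 1. The current detector generates proposals on both domains.
    props_s = detector.propose(source)   # annotated with ground truth
    props_t = detector.propose(target)   # unlabeled
    # 2. Category adaptation: train on labeled source proposals, then
    #    produce category pseudo labels for the target proposals.
    cat_adaptor.fit(props_s, props_t)
    c_hat = cat_adaptor.predict(props_t)
    # 3. Box adaptation: only foreground proposals (per the category
    #    pseudo labels) take part; produce bounding-box pseudo labels.
    fg = [p for p, c in zip(props_t, c_hat) if c != "background"]
    box_adaptor.fit(props_s, fg)
    b_hat = box_adaptor.predict(fg)
    # 4. Retrain the detector's RoI heads with the pseudo labels; the
    #    improved detector yields higher-quality proposals next round.
    detector.fit_pseudo(props_t, c_hat, b_hat)
    return c_hat, b_hat
```

Each adaptor sees only the data distribution appropriate for its own sub-problem, which is the point of the decoupling.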
Note that our D-adapt framework does not introduce any computational overhead in the inference phase, since the adaptors are independent of the detector and can be removed during detection. Also, D-adapt does not depend on a specific detector, so the detector can be replaced by SSD (ssd), RetinaNet (retina), or other detectors.
3.2 Category Adaptation
D-adapt decouples category adaptation from the training of the detector to avoid hurting its discriminability. The goal of category adaptation is to use labeled source-domain proposals to obtain a relatively accurate classification of the unlabeled target-domain proposals. Although generic domain adaptation methods such as DANN (DANN) can be adopted, this task has its own data challenge: the input data distribution does not satisfy the low-density separation assumption well, i.e., the Intersection-over-Union of a proposal and a foreground instance may be any value between 0 and 1 (see Figure 2(a)), which impedes domain adaptation (RegDA). Recall that in standard object detection, proposals with intermediate IoU are removed to discretize the input space and ease the optimization of classification. Yet this can hardly be used in the domain adaptation problem, since we cannot obtain the ground-truth IoU for target proposals.
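As a concrete reference, the IoU between a proposal box and a ground-truth box (both in `(x1, y1, x2, y2)` form) can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle (empty if boxes are disjoint).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```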
Figure 2: (a) The IoU distribution of the proposals from Foggy Cityscapes: as the confidence threshold increases from 0 to 0.9, the number of undefined proposals (those with intermediate IoU) decreases. (b) Proposals with lower confidence are assigned a lower weight in the adaptation to discretize the feature space in a soft way. (c) The discriminator is trained to separate the source-domain proposals from the target-domain proposals for each class independently, while the feature extractor is encouraged to fool it.
To overcome the data challenge, we use the confidence of each proposal to discretize the input space in a soft way: when a proposal has high confidence of being foreground or background, it receives a higher weight in the adaptation training, and vice versa (see Figure 2(b)). This reduces the participation of proposals that are neither foreground nor background and improves the discreteness of the input space in the sense of probability. We also resample background proposals and add them to the source and target proposal sets to further increase the discreteness. Then the optimization objective of the discriminator is,
where both the feature representation and the category prediction are fed into the domain discriminator (see Figure 2(c)). This encourages features to be aligned in a conditional way (CDAN), and thus avoids most target proposals being aligned to the dominant category of the source domain. The objective of the feature extractor is to separate different categories on the source domain and to learn domain-invariant features that fool the discriminator,
where the first term is the cross-entropy loss on the source domain and a trade-off hyper-parameter balances the source risk and the domain adversarial loss. After obtaining the adapted classifier, we can generate a category pseudo label for each target proposal.
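The confidence-weighted discriminator objective can be sketched as follows. The weight `|2c - 1|` is purely illustrative (the paper's exact weighting is not reproduced here), and `D` stands for a generic discriminator that outputs the probability of a proposal feature coming from the source domain.

```python
import numpy as np

def proposal_weight(conf):
    """Illustrative soft weight: confident foreground (conf -> 1) or confident
    background (conf -> 0) proposals get weight close to 1, while ambiguous
    proposals (conf ~ 0.5) get weight close to 0."""
    return np.abs(2.0 * np.asarray(conf, dtype=float) - 1.0)

def discriminator_loss(feat_s, feat_t, w_s, w_t, D):
    """Weighted binary cross-entropy for the domain discriminator D, which
    outputs the probability that a proposal feature comes from the source."""
    p_s, p_t = D(feat_s), D(feat_t)
    loss_s = -np.sum(w_s * np.log(p_s + 1e-12)) / np.sum(w_s)
    loss_t = -np.sum(w_t * np.log(1.0 - p_t + 1e-12)) / np.sum(w_t)
    return loss_s + loss_t
```

The feature extractor would then be trained to maximize this loss (e.g., via a gradient reversal layer), while low-weight ambiguous proposals barely participate in the alignment.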
3.3 Bounding Box Adaptation
D-adapt decouples bounding box adaptation from category adaptation to avoid interference between them. The objective of box adaptation is to utilize labeled source-domain foreground proposals to obtain bounding box labels for the unlabeled target-domain proposals. Recall that in object detection, the regression loss is activated only for foreground anchors and is disabled otherwise (Fast), thus we only adapt the foreground proposals when training the bounding box regressor. Since the ground-truth labels of target-domain proposals are unknown, we use the category predictions obtained in the category adaptation step.
Following RCNN (RCNN), we adopt a class-specific bounding-box regressor, which predicts the bounding box regression offsets for each of the $K$ foreground classes, indexed by $k$. On the source domain, we have the ground-truth category and bounding box label for each proposal, thus we use the smooth $L_1$ loss to train the regressor,
where the loss is computed between the regression prediction for the ground-truth class and the ground-truth bounding box offsets, which are calculated from the proposal box and the ground-truth box. However, it is hard to obtain a satisfactory regressor on the target domain with this loss alone due to the domain shift.
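The source-domain training signal can be sketched as follows; `preds` holds one row of offsets per foreground class, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber-style) loss used for box regression in Fast/Faster RCNN:
    quadratic for |x| < beta, linear otherwise."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def class_specific_reg_loss(preds, gt_class, gt_offsets):
    """preds: (K, 4) predicted offsets, one row per foreground class.
    The loss is taken only on the row of the ground-truth class."""
    return float(smooth_l1(preds[gt_class] - gt_offsets).sum())
```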
To tackle this loss challenge, we propose an IoU disparity discrepancy method based on recent domain adaptation theory (MDD). As shown in Figure 3(a), we introduce an adversarial regressor to maximize its disparity with the main regressor on the target domain. We adopt the Generalized Intersection over Union (GIoU) (GIoU) instead of the $L_1$ distance to define the disparity between two bounding boxes, because GIoU is always bounded between $-1$ and $1$, while the $L_1$ distance is unbounded and leads to numerical explosion during maximization. Then the optimization objective of the adversarial regressor is
Note that the GIoU loss on the source domain is only defined on the box corresponding to the ground-truth category, and that on the target domain it is only defined on the box associated with the predicted category. Equation 6 guides the adversarial regressor to predict correctly on the source domain while making as many mistakes as possible on the target domain (Figure 3(b)). The feature extractor is then encouraged to output domain-invariant features to avoid such cases,
where a trade-off hyper-parameter balances the source risk and the adversarial loss. After obtaining the adapted regressor, we can generate a box pseudo label for each foreground proposal.
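GIoU itself is straightforward to compute; a minimal implementation for axis-aligned boxes follows. The disparity between the two regressors' outputs would then be, e.g., `1 - giou(box_main, box_adv)`, which stays bounded during maximization.

```python
def giou(box_a, box_b):
    """Generalized IoU (Rezatofighi et al.): IoU minus the fraction of the
    smallest enclosing box not covered by the union. Bounded in (-1, 1]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou_val = inter / union if union > 0 else 0.0
    # Smallest axis-aligned box enclosing both inputs.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou_val - (c_area - union) / c_area
```

Unlike the $L_1$ distance, `1 - giou` cannot be driven to infinity by the adversarial regressor, which keeps the max step of the disparity discrepancy numerically stable.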
4 Experiments

4.1 Datasets

The following six object detection datasets are used: Pascal VOC (pascal), Clipart (progressive), Comic (progressive), Sim10k (sim10k), Cityscapes (cityscapes), and Foggy Cityscapes (foggy). Pascal VOC contains 20 categories of common real-world objects. Clipart contains 1k images and shares its categories with Pascal VOC. Comic2k contains 1k training images and 1k test images, sharing 6 categories with Pascal VOC. Sim10k has 10k images with bounding boxes of the car category, rendered by the gaming engine Grand Theft Auto. Both Cityscapes and Foggy Cityscapes have 2,975 training images and 500 validation images with 8 object categories. Following (swda), we evaluate the domain adaptation performance of different methods on four domain adaptation tasks, VOC-to-Clipart, VOC-to-Comic2k, Sim10k-to-Cityscapes, and Cityscapes-to-FoggyCityscapes, and report the mean average precision (mAP) with an IoU threshold of 0.5.
4.2 Implementation Details
Stage 1: Source-domain pre-training. In the basic experiments, Faster RCNN (Faster) with ResNet-101 (resnet) or VGG-16 (vgg) as the backbone is adopted and pre-trained on the source domain for 12k iterations.
Stage 2: Category adaptation. The category adaptor has the same backbone as the detector but a simple classification head. It is trained using the SGD optimizer with momentum and a fixed batch size for each domain. The discriminator is a three-layer fully connected network following DANN (DANN). The trade-off hyper-parameter is kept fixed for all experiments, and the proposal weight is a piecewise function of the class confidence.
Stage 3: Bounding box adaptation. The box adaptor has the same backbone as the detector but a simple regression head (a two-layer convolutional network). The training hyper-parameters (learning rate, batch size, etc.) are the same as those of the category adaptor, and the trade-off hyper-parameter is kept fixed for all experiments.
Stage 4: Target-domain pseudo-label training. The detector is trained on the target domain with a learning rate that decays exponentially.
The adaptors and the detector are trained in an alternating manner. We perform all experiments on public datasets using a 1080Ti GPU. Code will be available at https://github.com/thuml/Transfer-Learning-Library.
4.3 Comparison with State-of-the-Arts
Adaptation between dissimilar domains.
We first show experiments on dissimilar domains, using the Pascal VOC dataset as the source domain and Clipart as the target domain. Table 1 shows that our proposed method outperforms the state-of-the-art methods in mAP. Figure 4 presents some qualitative results on the target domain. We also compare with Unbiased Teacher (unbiased), the state-of-the-art method in semi-supervised object detection, which generates pseudo labels on the target domain from a teacher model. Due to the large domain shift, the predictions of the teacher detection model are unreliable, thus it does not perform well. In contrast, our method alleviates the confirmation-bias problem by generating pseudo labels from separate models (the adaptors).
We also use Comic2k as the target domain, which has a very different style from Pascal VOC and many small objects (few methods report results on this dataset due to its difficulty). As shown in Table 2, both image-level and instance-level feature adaptation fall into the dilemma between transferability and discriminability and do not work well on this difficult dataset. In contrast, our method effectively solves this problem by decoupling the adversarial adaptation from the training of the detector, and improves mAP by 7.0 points over the state-of-the-art.
Adaptation from synthetic to real images.
We use Sim10k as the source domain and Cityscapes as the target domain. Following (swda), we evaluate on the validation split of Cityscapes and report the AP on car. Table 4 shows that our method surpasses all other methods.
Table 4: AP on Car of different methods on Sim10k-to-Cityscapes.
Adaptation between similar domains.
We perform adaptation from Cityscapes to Foggy Cityscapes and report the results in Table 4 (* denotes methods that utilize CycleGAN to perform source-to-target translation). Note that since the two domains are relatively similar, the performance of adaptation is already close to the oracle results.
4.4 Ablation Studies
In this part, we analyze the performance of both the detector and the adaptors. Denote by $n_{ij}$ the number of proposals of class $i$ predicted as class $j$, by $n_i$ the total number of proposals of class $i$, and by $C$ the number of classes (including the background); then we use the mean class accuracy $\mathrm{acc} = \frac{1}{C}\sum_{i=1}^{C} n_{ii}/n_i$ to measure the overall performance of the category adaptor. We use the intersection-over-union between the predicted bounding boxes and the ground-truth boxes to measure the performance of the bounding box adaptor. All ablations are performed on VOC-to-Clipart, and the number of iterations is kept at 1 for a fair comparison.
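The category-adaptor metric defined above can be computed from a confusion matrix; a minimal sketch, assuming `conf_mat[i, j]` counts proposals of class `i` predicted as class `j`:

```python
import numpy as np

def mean_class_accuracy(conf_mat):
    """Mean per-class accuracy: average of diag(conf_mat) / row sums,
    with the background treated as one of the classes."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    per_class = np.diag(conf_mat) / conf_mat.sum(axis=1)
    return float(per_class.mean())
```

Averaging over classes (rather than pooling all proposals) prevents the abundant background class from dominating the metric.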
Ablation on the category adaptation.
Table 7(a) shows the effectiveness of several specific designs mentioned in Section 3.2. Among them, the weighting mechanism has the greatest impact, indicating the necessity of the low-density separation assumption in adversarial adaptation. To verify this, we assume the ground-truth IoU of each proposal is known and select only proposals with IoU above a certain threshold when training the category adaptor. Table 5 shows that as the IoU threshold of the foreground proposals increases, so does the accuracy of the category adaptor, which confirms the importance of the low-density separation assumption.
Ablation on the bounding box adaptation.
Table 7(b) illustrates that minimizing the disparity discrepancy improves the performance of the box adaptor, and that bounding box adaptation improves the performance of the detector on the target domain. The gain brought by box adaptation is consistent across settings.
Ablation on the training strategy with pseudo-labels.
In Equation 2, losses are only calculated on the regions where the proposals are located, and anchor areas overlapping with the proposals are ignored. Here, we compare this strategy with the common practice in self-training: filter out bounding boxes with low confidence, then label each proposal that overlaps with the remaining boxes. Although the category labels of these bounding boxes are also generated by the category adaptor, the accuracy of the resulting labels is low (see Table 7(c)). In contrast, our strategy is more conservative, and both the accuracy on the proposals and the final mAP of the detector are higher.
Figure 5 gives the error distribution of each model on VOC-to-Clipart following (tide). The main errors in the target domain come from Miss (ground truth regarded as background) and Cls (classified incorrectly). Loc (classified correctly but localized incorrectly) errors are slightly less frequent, but still cannot be ignored, especially after category adaptation, which implies the necessity of box adaptation in object detection. Category adaptation effectively reduces the proportion of Cls errors while increasing that of Loc errors, thus it is reasonable to cascade the box adaptor after the category adaptor.
As shown in Table 7, our method also applies to the one-stage detector RetinaNet (retina), improving the mAP on VOC-to-Clipart.
We visualize by t-SNE (tsne) in Figures 6(a)-6(b) the representations of the task VOC-to-Comic2k (6 classes) learned by the category adaptor without and with adversarial adaptation. The source and target domains are well aligned in the latter, which indicates that it learns domain-invariant features. We also extract box features from the detector (Figures 6(c)-6(d)) and find that they do not have an obvious cluster structure, even on the source domain. The reason is that the features of the detector contain both category information and location information. Thus, adversarial adaptation directly on the detector would hurt its discriminability, while our method achieves better performance through decoupled adaptation.
5 Discussion and Conclusion
Our method achieved considerable improvement on several benchmark datasets for domain adaptation. In actual deployment, the detection performance can be further boosted by employing stronger adaptors without introducing any computational overhead since the adaptors can be removed during inference. It is also possible to extend the D-adapt framework to other detection tasks, e.g., instance segmentation and keypoint detection, by cascading more specially designed adaptors. We hope D-adapt will be useful for the wider application of detection tasks.