Object detection is a fundamental task in computer vision, which aims to identify and localize objects of interest in an image. In the past decade, remarkable progress has been witnessed in object detection, with the advances of large-scale benchmarks [lin2014microsoft] and modern CNN-based detection frameworks, such as Fast/Faster R-CNN [girshick2015fast, ren2015faster]. However, state-of-the-art detectors require massive training images with bounding box annotations. This limits their generalization ability when facing new environments (i.e., the target domain) where the object appearance, background, and even weather conditions significantly differ from the training images (i.e., the source domain). Meanwhile, due to the high cost of box annotations, it is not always feasible to acquire sufficient annotated training images from new environments.
In such situations, unsupervised domain adaptation offers an appealing solution by adapting object detectors from label-rich source domains to unlabeled target domains. Among a large number of methods, a promising manner for domain adaptation is to utilize the domain classifier to measure domain discrepancy, and train the domain classifier and feature extractor in an adversarial way [ganin2015unsupervised, tzeng2017adversarial]. In the literature, adversarial training has been well-studied for domain adaptive image classification [ganin2015unsupervised, ganin2016domain, long2015learning, tzeng2017adversarial], semantic segmentation [hoffman2016fcns, sankaranarayanan2018learning, tsai2018learning] and object detection [chen2018domain, saito2019strong-weak, zhu2019adapting, he2019multi-adversarial].
Among many domain adaptive detection methods, Domain Adaptive (DA) Faster R-CNN [chen2018domain] is the most representative work that integrates Faster R-CNN [ren2015faster] with adversarial training. To address the domain shift problem, it aligns both the image and instance distributions across domains with adversarial training. Recently, DA Faster R-CNN has rapidly evolved into a successful series [saito2019strong-weak, zhu2019adapting, he2019multi-adversarial, hsu2019progressive]. Specifically, Saito et al. [saito2019strong-weak] and Zhu et al. [zhu2019adapting] improved DA Faster R-CNN based on the observation that the plain image-level alignment forces the model to align non-transferable backgrounds, while the object detection task by nature focuses on local regions that may contain objects of interest. Furthermore, although instance-level alignment can match object proposals in both domains, current practices [chen2018domain, he2019multi-adversarial] lack the ability to identify the hard aligned instances among excessive low-value region proposals.
Aiming at these issues, we propose a novel categorical regularization framework, which can assist the Domain Adaptive Faster R-CNN series [chen2018domain, saito2019strong-weak] to focus on aligning the crucial regions and important instances across domains. Thanks to the accurate alignment of such regions and instances, the detection backbone networks can activate objects of interest more accurately in both domains (cf. Figure 1), and thus lead to better adaptive object detection results.
Concretely, our framework consists of two regularization modules, i.e., image-level categorical regularization (ICR), and categorical consistency regularization (CCR) (cf. Figure 2). For image-level categorical regularization, we attach the detection backbone network with an image-level multi-label classifier, and train it with categorical supervisions from the source domain. The classification manner enables the backbone to learn object-level concepts from the holistic images, without being affected by the distribution of non-transferable source backgrounds [zhou2015object, zhou2016learning]. It allows us to implicitly align the crucial regions on both domains at the image level. For categorical consistency regularization, we take into account the consistency between image-level predictions by the attached classifier and instance-level predictions by the detector. We adopt this categorical consistency as a novel regularization factor, and use it to increase the weights of the hard aligned instances in the target domain during instance-level alignment.
The main contributions of this work are three-fold:
We present a novel categorical regularization framework for domain adaptive object detection, which can be applied as a plug-and-play component for the prominent Domain Adaptive Faster R-CNN series. Our framework is cost-free as requiring no further annotations, and also hyperparameter-free for performing on the vanilla detectors.
We design two regularization modules, by exploiting the weak localization ability of classification CNNs and the categorical consistency between image-level and instance-level predictions. They enable us to focus on aligning object-related regions and hard aligned instances that are directly pertinent to object detection.
We conduct extensive experiments of various domain shift scenarios to validate the effectiveness of our categorical regularization framework. Our framework can significantly boost the performance of existing Domain Adaptive Faster R-CNN detectors [chen2018domain, saito2019strong-weak], and produce state-of-the-art results on benchmark datasets.
2 Preliminaries and Related Work
2.1 CNN-based Object Detection
In the past few years, the rise of deep convolutional neural networks led to a sharp paradigm shift of object detection [liu2019deep]. Among a large number of approaches, the two-stage R-CNN series [girshick2014rich, girshick2015fast, ren2015faster, lin2017feature] has become the mainstream detection framework. The pioneering work, i.e., R-CNN [girshick2014rich], extracts region proposals from the image with low-level vision techniques [uijlings2013selective], and applies a network to classify each region of interest (RoI) independently. Fast R-CNN [girshick2015fast] improves R-CNN by sharing convolutional features among RoIs, and thus enables fast training and inference. Faster R-CNN [ren2015faster]
advances the region proposal generation process with a Region Proposal Network (RPN). RPN shares the feature extraction backbone with the detection head, which in essence is a Fast R-CNN [girshick2015fast]. Faster R-CNN is a well-established two-stage detection framework, and is the foundation for many follow-up works [gidaris2015object, dai2016r, lin2017feature]. While single-stage detectors have recently emerged as a popular paradigm [redmon2016you, liu2016ssd, lin2017focal], many top-performing systems still adopt the proven two-stage pipeline [lin2017feature, he2017mask].
Thanks to the flexibility of Faster R-CNN, it has recently been widely adapted for domain adaptive object detection [chen2018domain, saito2019strong-weak, zhu2019adapting, he2019multi-adversarial] with adversarial training [ganin2015unsupervised]. Other approaches, such as self-training [kim2019self-training, roychowdhury2019automatic], are also exploited for domain adaptive object detection in the literature.
2.2 Domain Adaptive Faster R-CNN Series
Domain Adaptive (DA) Faster R-CNN [chen2018domain] is a prominent two-stage object detector for dealing with the challenging domain adaptive object detection problem. It is an intuitive extension of Faster R-CNN [ren2015faster], which aligns both the image and instance distributions by learning domain classifiers in an adversarial manner. For the image-level alignment, the domain classifier is trained on each activation (channel-wise descriptor) from the feature map after the base convolutional layers, while for instance-level alignment, the domain classifier is trained with instance-level RoI features. Furthermore, the consistency between image-level and instance-level domain classifiers is enforced to learn the cross-domain robustness for RPN.
Formally, for a given image indexed by $i$, let $D_i = 0$ denote that it is from the source domain while $D_i = 1$ denote that it is from the target domain. Let $p_i^{(u,v)}$ denote the output of the image-level domain classifier for the activation located at $(u,v)$ of the feature map, then the image-level alignment loss can be written as
$$\mathcal{L}_{img} = -\sum_{i,u,v}\left[D_i \log p_i^{(u,v)} + (1 - D_i)\log\left(1 - p_i^{(u,v)}\right)\right].$$
Let $p_{i,j}$ denote the output of the instance-level domain classifier for the $j$-th region proposal in the $i$-th image, then the instance-level alignment loss is as follows
$$\mathcal{L}_{ins} = -\sum_{i,j}\left[D_i \log p_{i,j} + (1 - D_i)\log\left(1 - p_{i,j}\right)\right].$$
Furthermore, let $\mathcal{L}_{cst}$ denote the consistency loss for image-level and instance-level domain classifiers, and let $\mathcal{L}_{det}$ be the original training loss for Faster R-CNN [ren2015faster]. The overall objective for DA Faster R-CNN can be written as
$$\mathcal{L} = \mathcal{L}_{det} + \lambda\left(\mathcal{L}_{img} + \mathcal{L}_{ins} + \mathcal{L}_{cst}\right),$$
where $\lambda$ is a hyper-parameter to balance the detection loss and the domain adaptation components. The adversarial training for adaptation components is implemented by the gradient reverse layer (GRL) [ganin2015unsupervised], where the sign of gradients is flipped when training the base convolutional layers.
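To make the adversarial mechanism concrete, the following is a minimal numpy sketch of the gradient reverse layer and the binary cross-entropy used by the domain classifiers. The class and function names are illustrative, not from any released code; actual implementations register the GRL as a custom autograd operation in a deep learning framework.

```python
import numpy as np

class GradientReverseLayer:
    """Identity in the forward pass; flips (and scales) gradients in the
    backward pass, so the feature extractor is trained to *maximize* the
    domain classifier's loss while the classifier minimizes it."""

    def __init__(self, lambda_=1.0):
        self.lambda_ = lambda_  # trade-off hyper-parameter

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lambda_ * grad_output  # sign of gradients is flipped


def domain_bce_loss(p, domain_label):
    """Binary cross-entropy for a domain classifier output p in (0, 1),
    where domain_label is 0 for the source domain and 1 for the target."""
    return -(domain_label * np.log(p) + (1 - domain_label) * np.log(1 - p))
```

Summing `domain_bce_loss` over feature-map locations or over region proposals recovers the image-level and instance-level alignment losses above.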
As aforementioned, DA Faster R-CNN [chen2018domain] may fail to align the local regions and instances that are crucial for adaptive detection. Meanwhile, it tends to fit the distribution of non-transferable source backgrounds, as the training process involves a large number of background proposals. Recent works attempted to improve DA Faster R-CNN by replacing the plain image-level alignment model with a weak alignment model [saito2019strong-weak] or a region-level alignment model [zhu2019adapting], and found that the instance-level alignment model is not necessary in the presence of another local alignment model [saito2019strong-weak]. We refer to these methods collectively as the Domain Adaptive Faster R-CNN series.
A high-level diagram of the Domain Adaptive Faster R-CNN series is shown in Figure 2 (a), where we follow the paradigm of DA Faster R-CNN [chen2018domain] but omit the consistency loss $\mathcal{L}_{cst}$, which is not an essential ingredient in our regularization framework. Please note that Figure 2 (a) is a conceptual diagram, and not all components of the Domain Adaptive Faster R-CNN series strictly follow this structure.
2.3 Weak Localization by Classification CNNs
It is widely acknowledged that CNNs trained for single-label image classification tend to produce high responses on the local regions containing the main objects [zeiler2014visualizing, zhou2016learning, zhou2015object]. Analogously, CNNs trained for multi-label classification also have the weak localization ability for the objects associated with image-level categories [wang2016cnn, wei2015hcp].
Taking the Cityscapes [cordts2016cityscapes] dataset as an example, we collect all instance-level labels into an image-level label vector, and train VGG-16 [simonyan2015very] for multi-label image classification. Figure 3 shows the heatmaps for two example images from Cityscapes, where the main objects related to image-level categories such as “car”, “person” and “rider” are weakly localized.
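The weak localization behind such heatmaps can be sketched with class activation mapping [zhou2016learning]. Assuming the classifier is a single linear layer applied after global average pooling (equivalently, a 1×1 convolution), the heatmap for a category is a weighted sum of the feature channels; the function below is a hypothetical numpy illustration, not the paper's actual visualization code.

```python
import numpy as np

def class_activation_map(feature_map, classifier_weights, category):
    """Compute a CAM-style heatmap for one category.

    feature_map: (K, H, W) output of the last convolutional layer.
    classifier_weights: (C, K) weights of the linear classifier applied
        after global average pooling.
    Returns an (H, W) heatmap normalized to [0, 1]; high values weakly
    localize the objects of the given category."""
    w = classifier_weights[category]           # (K,) weights for this class
    heatmap = np.tensordot(w, feature_map, 1)  # weighted sum over channels
    heatmap -= heatmap.min()
    if heatmap.max() > 0:
        heatmap /= heatmap.max()               # normalize for display
    return heatmap
```

Upsampling such a heatmap to the input resolution and overlaying it on the image produces visualizations like those in Figure 3.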
3.1 Framework Overview
The overview of our categorical regularization framework is illustrated in Figure 2. In general, our framework improves the DA Faster R-CNN series detectors [chen2018domain, saito2019strong-weak, he2019multi-adversarial] by exploring categorical regularization from two aspects: image-level categorical regularization (ICR) and categorical consistency regularization (CCR). Note that the ICR module does not depend on the CCR module, and thus it can be individually integrated with DA Faster R-CNN detectors which only perform image-level alignment [saito2019strong-weak].
Our framework enables better alignment of crucial regions and important instances across domains. Consequently, the detection backbone produces more accurate activations on objects of interest of both domains (cf. Figure 1), leading to better adaptive detection performance. Our framework is flexible and generalizable – it does not depend on specific algorithms for either image or instance alignment.
3.2 Image-Level Categorical Regularization
Image-level categorical regularization (ICR) is exploited to obtain the sparse but crucial image regions corresponding to categorical information. We achieve this with a weakly supervised solution, which can learn discriminative features for objects of interest, without being affected by the distribution of non-transferable source backgrounds. While the standard training for Faster R-CNN can learn discriminative features for objects of interest, it tends to fit the source backgrounds due to the large amount of background RoIs sampled for training. Since the patterns of source backgrounds are non-transferable, plain image-level alignment may lead to noisy activations in target domains (cf. Figure 1).
In our proposal, as illustrated in Figure 2 (b), we attach the detection backbone with an image-level multi-label classifier, and train it with supervisions from the source domain. Such categorical supervisions are cost-free for detection datasets, and can be easily acquired by collecting all instance-level categories in an image into an image-level categorical vector.
Given the detection backbone network, we perform global average pooling on the output of the last convolutional layer, and feed the pooled features into a plain multi-label classifier implemented by a 1×1 convolution. We train this image-level classifier with the standard cross-entropy multi-label loss
$$\mathcal{L}_{ICR} = -\sum_{c=1}^{C}\left[y_c \log p_c + (1 - y_c)\log\left(1 - p_c\right)\right],$$
where $C$ is the total number of categories of a detection dataset, $y_c \in \{0, 1\}$ is the ground truth label, and $p_c$ is the predicted one. $y_c = 1$ denotes that there is at least one object of category $c$ appearing in this image, while $y_c = 0$ means there is no object of category $c$ in the image.
The image-level categorical supervisions encourage the detection backbone to learn category-specific features that can activate object-related regions. This allows us to align the crucial regions of both domains with an image-level alignment model (e.g., Equation (2)). Meanwhile, because there is no background supervision involved in the training process of our image-level multi-label classifier, the risk of fitting (even over-fitting) non-transferable source backgrounds is greatly reduced.
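As an illustration of the two ICR ingredients described above, the following Python sketch builds the cost-free image-level label vector from instance annotations and computes the multi-label cross-entropy loss. The function names are ours, not from any released code.

```python
import numpy as np

def image_level_labels(instance_categories, num_classes):
    """Collect the instance-level categories of one image into a binary
    multi-label vector: y[c] = 1 iff at least one object of category c
    is present. Such supervision is free for any detection dataset."""
    y = np.zeros(num_classes)
    for c in instance_categories:
        y[c] = 1.0
    return y


def multilabel_ce_loss(p, y, eps=1e-12):
    """Standard multi-label cross-entropy between predicted probabilities
    p and binary targets y, summed over categories."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Note that no background class appears in the targets, which is exactly why the classifier avoids fitting non-transferable source backgrounds.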
3.3 Categorical Consistency Regularization
We design a categorical consistency regularization (CCR) module to automatically hunt for the hard aligned instances in target domains. Our motivation lies in two aspects. First, current instance alignment models [chen2018domain, he2019multi-adversarial] may be dominated by the excessive low-value background proposals, as they cannot identify the hard foreground instances in the target domain. Second, the attached image-level classifier and the instance-level detection head are complementary, because the former exploits the whole image-level context while the latter enjoys more accurate RoI features.
Building upon the above considerations, we adopt the categorical consistency between the image-level and instance-level predictions as a measure for the hardness of classifying a certain target instance. Intuitively, if the image-level classifier predicts that there is no “person” in a target image while the detection head classifies a certain instance as “person”, this instance should be a hard but informative sample for the current detection model. Therefore, we utilize this consistency as a regularization factor to increase the weight of hard aligned samples in target domains during instance-level alignment.
Specifically, assume that the detection head classifies the $j$-th instance in a target image as category $c$ with confidence $p_j$, and let $\hat{p}_c$ denote the image-level estimation of the probability that this image contains objects of category $c$. We define the following distance function to measure the categorical consistency between the instance-level and image-level predictions as
$$d_j = \exp\left(\left|p_j - \hat{p}_c\right|\right).$$
Here the exponent form characterizes the intuition that while a small disagreement may come from the model’s variance, a large disagreement should be attributed to the hardness in classifying this instance.
We use Equation (5) to weight the instance-level adversarial loss, which in implementation is equivalent to weighting the gradients passed through the gradient reverse layer (GRL) during training. Taking the instance alignment model (i.e., Equation (2)) in DA Faster R-CNN [chen2018domain] as an example, the instance-level alignment loss with CCR can be written as
$$\mathcal{L}_{ins}^{CCR} = -\sum_{i,j} d_{i,j}\left[D_i \log p_{i,j} + (1 - D_i)\log\left(1 - p_{i,j}\right)\right].$$
It is worth noting that we only apply Equation (5) to weight foreground instances from the target domain, according to the predictions of the detection head. We keep the weights for source instances and the background instances from the target domain unchanged (i.e., $d_{i,j} = 1$), as the former have supervision signals from the source domain, while the latter are not as important as foreground proposals.
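The CCR weighting rule described above can be sketched as follows. The exponent form is our reconstruction of the distance function from the surrounding text, and the function name is illustrative.

```python
import numpy as np

def ccr_weight(instance_conf, image_level_prob, is_target_foreground):
    """Consistency-based weight for instance-level alignment.

    instance_conf: detection head's confidence that a proposal belongs
        to its predicted category c.
    image_level_prob: attached classifier's probability that the image
        contains any object of category c.
    Target foreground instances with a large image/instance disagreement
    receive exponentially larger weights; source instances and target
    background instances keep weight 1, as described in the text."""
    if not is_target_foreground:
        return 1.0
    return float(np.exp(np.abs(instance_conf - image_level_prob)))
```

Conveniently, a perfectly consistent target instance also gets weight exp(0) = 1, so the weighting only amplifies hard aligned instances and never suppresses easy ones.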
3.4 Integration with DA Faster R-CNN Series
In this work, we take DA Faster R-CNN [chen2018domain] and the state-of-the-art strong-weak aligned Faster R-CNN [saito2019strong-weak] as our baseline detectors. In the following, we refer to them as “DA-Faster” and “SW-Faster” for simplicity. In fact, other Domain Adaptive Faster R-CNN detectors [he2019multi-adversarial, zhu2019adapting] may also be compatible with our framework with minor modifications.
Integration with DA-Faster.
Integrating our framework with DA-Faster [chen2018domain] is straightforward. We attach an image-level multi-label classifier to the backbone, by adding a global average pooling layer and a 1×1 convolution layer. Furthermore, we use our CCR to weight the gradients passed through the gradient reverse layer (GRL) for instance-level alignment. The modified overall objective of DA-Faster with our regularization framework can be written as
$$\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{ICR} + \lambda\left(\mathcal{L}_{img} + \mathcal{L}_{ins}^{CCR}\right),$$
where $\lambda$ is the trade-off hyper-parameter set as in [chen2018domain], and our method does not introduce additional hyper-parameters.
Integration with SW-Faster.
SW-Faster [saito2019strong-weak] improves the strong image-level alignment model of DA-Faster with a weak global alignment model, and replaces the instance-level alignment model with a strong local alignment model. Since our categorical regularization framework is independent of the specific alignment algorithms, our ICR module can be directly integrated into SW-Faster. Furthermore, we add an instance-level alignment model, the same as that of DA-Faster, into the pipeline of SW-Faster during training. This allows us to apply our CCR module to further improve SW-Faster. The modified overall objective for SW-Faster with our regularization framework can be written as
$$\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{ICR} + \lambda\left(\mathcal{L}_{global} + \mathcal{L}_{local} + \mathcal{L}_{ins}^{CCR}\right),$$
where $\lambda$ is the trade-off hyper-parameter, and $\mathcal{L}_{global}$ and $\mathcal{L}_{local}$ denote the global alignment loss and local alignment loss in [saito2019strong-weak], respectively.
4.1 Empirical Setup
| Method | person | rider | car | truck | bus | train | mcycle | bicycle | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Faster R-CNN (Source) | 24.4 | 30.5 | 32.6 | 10.8 | 25.4 | 9.1 | 15.2 | 28.3 | 22.0 |
| Faster R-CNN (Oracle) | 36.2 | 47.7 | 53.0 | 34.7 | 51.9 | 41.0 | 36.8 | 37.8 | 42.4 |
| Method | person | rider | car | truck | bus | train | mcycle | bicycle | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Faster R-CNN (Source) | 26.9 | 22.1 | 44.7 | 17.4 | 16.7 | - | 17.1 | 18.8 | 23.4 |
| Faster R-CNN (Oracle) | 35.3 | 33.2 | 53.9 | 46.3 | 46.7 | - | 25.6 | 29.3 | 38.6 |
Datasets.
Five public datasets are utilized in our experiments, including Cityscapes [cordts2016cityscapes], Foggy Cityscapes [sakaridis2018semantic], BDD100k [yu2018bdd100k], PASCAL VOC [everingham2010pascal], and Clipart1k [inoue2018cross-domain].
Cityscapes [cordts2016cityscapes] focuses on capturing high variability of outdoor street scenes in common weather conditions from different cities. It contains 2,975 training images and 500 validation images with dense pixel-level labels. We transform the instance segmentation annotations into bounding boxes for our experiments.
Foggy Cityscapes [sakaridis2018semantic] is built upon the images in the Cityscapes dataset [cordts2016cityscapes]. This dataset simulates the foggy weather using depth maps provided in Cityscapes with three levels of foggy weather, and thus is suitable to conduct weather adaptation experiments.
BDD100k [yu2018bdd100k] consists of 100k images, with 70k training images and 10k validation images annotated with bounding boxes. We extract a subset of BDD100k with images labeled as daytime, including 36,728 training and 5,258 validation images. We use this subset for scene adaptation experiments.
PASCAL VOC [everingham2010pascal] is a real-world dataset containing 20 categories of common objects with bounding box annotations. Following [saito2019strong-weak], we employ PASCAL VOC 2007 and 2012 training and validation images (16,551 images in total) for experiments.
Clipart1k [inoue2018cross-domain] contains 1k clipart images, which share the same instance categories as PASCAL VOC but exhibit a large domain shift. We follow the practice in [saito2019strong-weak], and use all images of Clipart1k for both training (without labels) and testing.
Baselines and Comparison Methods.
We consider DA-Faster [chen2018domain] and the state-of-the-art SW-Faster [saito2019strong-weak] as our baseline methods, and re-implement them for fair comparisons. Our re-implementations achieve comparable or even better accuracies compared to the original papers. When comparing with other state-of-the-art methods, we report the results from original papers. Furthermore, we also train Faster R-CNN [ren2015faster] only using source images, as well as directly using annotated target images. We refer to models of these two settings as “Faster R-CNN (Source)” and “Faster R-CNN (Oracle)”, respectively.
Following the default settings in [chen2018domain, saito2019strong-weak], all training and test images are resized such that their shorter side has a fixed length. By default, the backbone models are initialized using pre-trained weights of VGG-16 [simonyan2015very]
on ImageNet, but for the dissimilar domain adaptation experiments from PASCAL VOC [everingham2010pascal] to Clipart1k [inoue2018cross-domain], we follow the practices in [saito2019strong-weak] and use ResNet-101 [he2016deep] as the detection backbone. We fine-tune the network for 50k iterations with an initial learning rate, which is then reduced for another 20k iterations. Each batch is composed of two images, one from the source domain and the other from the target domain. Momentum and weight decay are set to fixed values for VGG-16 based detectors, while a different weight decay is used for ResNet-101 based detectors. In all experiments, we employ RoIAlign [he2017mask] for RoI feature extraction.
4.2 Comparison Results
Weather Adaptation.
In real-world scenarios, object detectors may be applied under different weather conditions. We study the weather adaptation from clear weather to a foggy environment, using the Cityscapes training set and the Foggy Cityscapes validation set as the source domain and the target domain, respectively.
Table 1 shows the comparison results. Our categorical regularization framework can consistently boost the performance of DA-Faster and SW-Faster detectors, with 1.4% and 2.6% mAP improvements, respectively. In particular, our CCR module can greatly improve the detection results for some difficult categories such as “train”. It clearly verifies the importance of increasing the weight of hard foreground instances in target domains for instance-level alignment. It is worth noting that our categorical regularization framework helps to reduce the performance gap between the domain adaptive detector and oracle detector trained with annotated target images to about 5% mAP.
| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Faster R-CNN (Source) | 21.9 | 42.2 | 22.9 | 19.0 | 30.8 | 43.1 | 28.9 | 10.7 | 27.4 | 18.1 | 13.5 | 10.3 | 25.0 | 50.7 | 39.0 | 37.4 | 6.9 | 18.1 | 39.2 | 34.9 | 27.0 |
| Kim et al. [kim2019self-training] | 28.0 | 64.5 | 23.9 | 19.0 | 21.9 | 64.3 | 43.5 | 16.4 | 42.2 | 25.9 | 30.5 | 7.9 | 25.5 | 67.6 | 54.5 | 36.4 | 10.3 | 31.2 | 57.4 | 43.5 | 35.7 |
Scene Adaptation.
Scene layout changes frequently occur in real-life applications of object detection, e.g., autonomous driving from one city to another. To study the effectiveness of our regularization framework for scene adaptation, we choose the Cityscapes [cordts2016cityscapes] training set as the source domain and the daytime subset of BDD100k [yu2018bdd100k] as the target domain. Since the Cityscapes dataset contains only daytime data, this setting isolates the city scene as the adaptation factor. We report the detection results on seven common categories of both datasets.
As shown in Table 2, we observe a significant performance gap between the domain adaptive detectors and the oracle detector, which suggests that scene layout shift is a challenging factor that hinders the performance of domain adaptive detection. Even under this difficult setting, our categorical regularization framework can also improve DA-Faster and SW-Faster by 1.3% and 1.6%, respectively. Similar to the observations on weather adaptation experiments, our CCR module can significantly improve the detection results of some difficult objects such as “truck”.
Dissimilar Domain Adaptation.
Both weather adaptation and scene adaptation can be considered as adaptation between similar domains. We further present experiments on the dissimilar domain adaptation from real images to artistic images. We utilize PASCAL VOC [everingham2010pascal] as the real source domain and Clipart1k [inoue2018cross-domain] as the target domain. Clipart1k contains 1k clipart images in total, which have the same 20 categories as PASCAL VOC. Following [saito2019strong-weak], all images in Clipart1k are used for both training (without labels) and testing, and thus there is no oracle detector for this dataset.
As shown in Table 3, for dissimilar domain adaptation, our regularization framework also achieves considerable improvements over the baseline DA-Faster and SW-Faster, by 2.0% and 1.5% mAP, respectively. Furthermore, our methods also outperform the recent state-of-the-art one-stage adaptive detector [kim2019self-training] that employs self-training for domain adaptation.
4.3 Visualization and Analyses
In Figure 4, we show some detection examples from three target datasets, i.e., Foggy Cityscapes [sakaridis2018semantic], BDD100k [yu2018bdd100k] and Clipart1k [inoue2018cross-domain]. Compared to the baseline SW-Faster [saito2019strong-weak] method, our SW-Faster-ICR-CCR method produces more accurate detection results under complex environments and large domain shifts.
We visualize the image and instance features learned for dissimilar domain adaptation (from PASCAL VOC [everingham2010pascal] to Clipart1k [inoue2018cross-domain]) using t-SNE [maaten2008visualizing]. For this experiment, we randomly sample 100 ground truth instances for each category, 50 from the source domain and 50 from the target domain. For categories that have fewer than 50 instances in a certain domain, we sample all instances in that domain and the same number of instances from the other domain. The images containing these instances are sampled for image-level visualization. The image features are extracted by applying global average pooling on the output of the detection backbone network, while the instance features are extracted by RoIAlign.
As shown in Figure 5, the blue points represent source samples and the red ones represent target samples. We also show three pairs of instances from different domains, and zoom in to the local regions of the most poorly matched instances. Dissimilar instance pairs of the same category from different domains stay closer in the feature space of our method. Even for the most poorly matched region, our method still achieves better alignment than the baseline SW-Faster method [saito2019strong-weak]. Furthermore, thanks to the accurate instance-level alignment, our image-level alignment is also better than that of the baseline method.
Besides the visualization, we also calculate a quantitative metric for domain distance, where each domain is represented by its object instances. For this experiment, we use the same instance samples as in the feature visualization experiment. Specifically, we adopt the Earth Mover's Distance (EMD) [rubner2000earth] as the metric for measuring domain distance. With this metric, the domain distances computed for SW-Faster [saito2019strong-weak], SW-Faster-ICR and SW-Faster-ICR-CCR decrease in that order.
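As a sketch of how such a domain distance could be computed, the following hypothetical Python snippet evaluates the EMD between two equal-size, uniformly weighted sets of instance features. In this special case the EMD equals the mean cost of the optimal one-to-one matching, found here by brute force over permutations; practical code on hundreds of instances would use an optimal transport or assignment solver instead.

```python
import itertools
import numpy as np

def earth_movers_distance(source_feats, target_feats):
    """EMD between two equal-size sets of feature vectors with uniform
    weights, i.e., the mean Euclidean cost of the best one-to-one
    matching. Brute force is only feasible for tiny sets."""
    n = len(source_feats)
    # Pairwise Euclidean costs between source and target features.
    cost = np.linalg.norm(
        source_feats[:, None, :] - target_feats[None, :, :], axis=-1)
    # Search all one-to-one matchings for the minimum total cost.
    best = min(
        sum(cost[i, p[i]] for i in range(n))
        for p in itertools.permutations(range(n)))
    return best / n
```

A smaller value indicates that source and target instances are better mixed in the feature space.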
The consistency between domain distance and model accuracy verifies the motivation of our work. That is, domain adaptive object detection relies heavily on aligning the crucial local regions and important instances on both domains. Our regularization framework assists the DA Faster R-CNN series to achieve this goal.
In this work, we presented a categorical regularization framework built upon the Domain Adaptive Faster R-CNN series for improving adaptive detection performance. Specifically, we exploited the weak localization ability of multi-label classification CNNs and the categorical consistency between image-level and instance-level predictions, which allows us to focus on aligning object-related local regions and hard aligned instances. In experiments, our framework significantly boosted the performance of existing Domain Adaptive Faster R-CNN detectors and produced state-of-the-art results on public benchmark datasets. Visualization and analyses further validate the effectiveness of our method. In the future, we will investigate how to apply our regularization framework to adaptive detectors beyond the Domain Adaptive Faster R-CNN series.