Progressive Domain Adaptation for Object Detection
Recent deep learning methods for object detection rely on a large amount of bounding box annotations. Collecting these annotations is laborious and costly, yet supervised models do not generalize well when testing on images from a different distribution. Domain adaptation provides a solution by adapting existing labels to the target testing data. However, a large gap between domains could make adaptation a challenging task, which leads to unstable training processes and sub-optimal results. In this paper, we propose to bridge the domain gap with an intermediate domain and progressively solve easier adaptation subtasks. This intermediate domain is constructed by translating the source images to mimic the ones in the target domain. To tackle the domain-shift problem, we adopt adversarial learning to align distributions at the feature level. In addition, a weighted task loss is applied to deal with unbalanced image quality in the intermediate domain. Experimental results show that our method performs favorably against the state-of-the-art method in terms of the performance on the target domain.
Object detection is an important computer vision task aiming to localize and classify objects in images. Recent advances in neural networks have brought significant improvement to the performance of object detection [9, 24, 21, 22, 23, 17]. However, such deep models usually require a large-scale annotated dataset for supervised learning and do not generalize well when the training and testing domains are different. For instance, domains can differ in scenes, weather, lighting conditions and camera settings. Such domain discrepancy, or domain-shift, can cause unfavorable model generalization issues. Although using additional training data from the target domain can improve the performance, collecting annotations is usually time-consuming and labor-intensive.
Unsupervised domain adaptation methods address the domain-shift problem without using ground truth labels in the target domain. Given the source domain annotations, the objective is to align source and target distributions in an unsupervised manner, so that the model can generalize to the target data without annotation effort. Numerous methods are developed in the context of image classification [32, 18, 19, 28, 10, 31, 7, 2], while fewer efforts have been made on more complicated tasks such as semantic segmentation [13, 29] and object detection [11, 3, 15]. Such domain adaptation tasks are quite challenging as there usually exists a significant gap between source and target domains.
In this paper, we aim to ease the effort of aligning different domains. Inspired by prior work which addresses the domain-shift problem via aligning intermediate feature representations, we utilize an intermediate domain that lies between source and target, and hence avoid direct mapping across two distributions with a significant gap. Specifically, the source images are first transformed by an image-to-image translation network to have a similar appearance to the target ones. We refer to the domain containing synthetic target images as the intermediate domain. We then construct an intermediate feature space by aligning the source and intermediate distributions, which is an easier task than aligning to the final targets. Once this intermediate domain is aligned, we use it as a bridge to further connect to the target domain. As a result, via the proposed progressive adaptation through the intermediate domain, the original alignment between source and target domains is decomposed into two subtasks that both solve an easier problem with a smaller domain gap.
During the alignment process, since the intermediate space is constructed in an unsupervised manner, one potential issue is that each synthetic target image may contribute unequally based on the quality of the translation. To reduce the outlier impact of the low-quality translated images, we propose a weighted version in our adaptation method, where the weight is determined based on the distance to the target distribution. That is, an image closer to the target domain should be considered a more important sample. In practice, we obtain the distance from the discriminator in the image translation model and incorporate it into the detection framework as a weight in the task loss.
We evaluate our method on various adaptation scenarios using numerous datasets, including KITTI, Cityscapes, Foggy Cityscapes and BDD100k. We conduct experiments on multiple real-world domain discrepancy cases, such as weather changes, camera differences and the adaptation to a large-scale dataset. With the proposed progressive adaptation, we show that our method performs favorably against the state-of-the-art algorithm in terms of accuracy in the target domain. The main contributions of the work are summarized as follows: 1) we introduce an intermediate domain in the proposed adaptation framework to achieve progressive feature alignment for object detection, 2) we develop a weighted task loss during domain alignment based on the importance of the samples in the intermediate domain, and 3) we conduct extensive adaptation experiments under various object detection scenarios and achieve state-of-the-art performance.
State-of-the-art object detection methods are predominantly based on deep convolutional neural networks (CNNs). These methods can be categorized into region proposal-based and single-shot detectors, depending on the network forwarding pipelines. Region proposal-based methods [9, 24] perform prediction on a variable set of candidate regions. Fast R-CNN applies selective search to obtain region proposals, while Faster R-CNN proposes to learn a Region Proposal Network (RPN) to accelerate the proposal generation process. To further reduce the computational need of proposal generation, single-shot approaches [21, 22, 23, 17] employ a fixed set of predefined anchor boxes as proposals and directly predict the category and offsets for each anchor box. Although these methods achieve state-of-the-art performance, such success hinges on a substantial amount of labeled training data, which requires a high labor cost. Also, these methods can overfit to the training domain, which makes it difficult for them to generalize to many real-world scenarios. As a result, the vision community has recently started showing great interest in applying domain adaptation techniques to object detection.
Domain adaptation techniques aim to tackle domain-shift between the source and target domains with unlabeled or weakly labeled images in the target domain. In recent years, adversarial learning has played a critical role in domain adaptation methods. Since the emergence of the Domain Adversarial Neural Network (DANN), numerous works [2, 31, 3] have been proposed to utilize adversarial learning for the feature distribution alignment between two domains. Furthermore, several methods attempt to perform alignment in the pixel space, based on unpaired image-to-image translation approaches. For image classification, PixelDA synthesizes additional images in the target domain by learning a one-to-many mapping. For semantic segmentation, CyCADA and AugGAN both design a CycleGAN-like network to transform images from the source domain to the target one. The transformed images are then treated as simulated training images for the target domain with the same label mapped from the source domain. Instead of performing alignment in the feature/pixel space, Tsai et al. [29, 30] adopt adversarial learning in the structured output space for solving domain adaptation on semantic segmentation.
To address domain adaptation for object detection in a weakly-supervised manner, LSDA finetunes a fully-supervised classification model for object detection with limited bounding box resources. Alternatively, Naoto et al. train the network with synthetic data and finetune it with pseudo-labels in the target domain. In an unsupervised domain adaptation setting, Chen et al. propose to close the domain gap on both the image level and the instance level via adversarial learning. To emphasize matching local features, Zhu et al. mine discriminative regions for alignment, while Saito et al. focus on aligning local receptive fields at low-level features along with weak alignment on global regions. On the other hand, Kim et al. utilize an image translation network to generate multiple domains and use a multi-domain discriminator to adapt all domains simultaneously, but this method does not consider the distance between the generated domains and the final target.
In this work, we observe that simply applying image translation without knowing the distance between each generated sample and the target domain may result in less effective adaptation. To handle this issue, we first introduce an intermediate domain to reduce the effort of mapping two significantly different distributions and then adopt a two-stage alignment strategy with sample weights to account for the sample quality.
We propose to decompose the domain adaptation problem into two smaller subtasks, bridged by a synthetic domain sitting in between the source and target distributions. Taking advantage of this synthetic domain, we adopt a progressive adaptation strategy which closes the gap gradually through the intermediate domain. We denote the source, synthetic, and target domains as $\mathcal{S}$, $\mathcal{F}$, and $\mathcal{T}$, respectively. The conventional adaptation from a labeled domain $\mathcal{S}$ to the unlabeled domain $\mathcal{T}$ is denoted as $\mathcal{S} \rightarrow \mathcal{T}$, while the proposed adaptation subtasks are expressed as $\mathcal{S} \rightarrow \mathcal{F}$ and $\mathcal{F} \rightarrow \mathcal{T}$. An overview of our progressive adaptation framework is shown in Figure 2. We discuss the details of the proposed adaptation network and progressive learning in the following sections.
In order to align distributions in the feature space, we propose a deep model which consists of two components: a detection network and a discriminator network for feature alignment via adversarial learning.
We adopt the Faster R-CNN framework for the object detection task, where the detector has a base encoder network $E$ to extract image features. Given an image $I$, the feature map $E(I)$ is extracted and then fed into two branches: the Region Proposal Network (RPN) and the Region of Interest (ROI) classifier. We refer to these branches as the detector, which is shown in Figure 2. To train the detection network, the loss function $\mathcal{L}_{det}$ is defined as:
$$\mathcal{L}_{det}(I) = \mathcal{L}_{rpn} + \mathcal{L}_{cls} + \mathcal{L}_{reg}, \tag{1}$$
where $\mathcal{L}_{rpn}$, $\mathcal{L}_{cls}$, and $\mathcal{L}_{reg}$ are the loss functions for the RPN, classifier and bounding box regression, respectively. We omit the details of the RPN and ROI classifier here as we focus on solving the domain-shift problem. The readers are encouraged to refer to the original Faster R-CNN paper for further details.
To align the distributions across two domains, we append a domain discriminator $D$ after the encoder $E$. The main objective of this branch is to discriminate whether the feature $E(I)$ is from the source or the target domain. Through this discriminator, the probability of each pixel belonging to the target domain is obtained as $P = D(E(I)) \in \mathbb{R}^{H \times W}$. We then apply a binary cross-entropy loss to $P$ based on the domain label $d$ of the input image, where images from the source distribution are given the label $d = 0$ and the target images receive the label $d = 1$. The discriminator loss function can be formulated as:
$$\mathcal{L}_{disc}(E(I)) = -\sum_{h,w} \left[ d \log P^{(h,w)} + (1 - d) \log\left(1 - P^{(h,w)}\right) \right]. \tag{2}$$
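The pixel-wise binary cross-entropy described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `discriminator_loss`, the nested-list representation of the probability map, and the clamping constant `eps` are all hypothetical and not taken from the released code.

```python
import math

def discriminator_loss(prob_map, domain_label, eps=1e-12):
    """Sum of per-pixel binary cross-entropy over an H x W probability map.

    prob_map:     nested list; each entry is the predicted probability that
                  the pixel belongs to the target domain.
    domain_label: 0 for a source image, 1 for a target image.
    eps:          clamping constant (an assumption, for numerical safety).
    """
    total = 0.0
    for row in prob_map:
        for p in row:
            p = min(max(p, eps), 1.0 - eps)
            total -= domain_label * math.log(p) + (1 - domain_label) * math.log(1.0 - p)
    return total

# A discriminator that is undecided (0.5 everywhere) on a 2 x 2 map.
undecided = [[0.5, 0.5], [0.5, 0.5]]
loss_source = discriminator_loss(undecided, domain_label=0)
loss_target = discriminator_loss(undecided, domain_label=1)
```

With an undecided discriminator both losses equal 4 log 2, one log 2 per pixel; a confident and correct prediction drives the loss toward zero.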
Adversarial learning is achieved using the Gradient Reverse Layer (GRL) to learn the domain-invariant feature $E(I)$. The GRL is placed between the discriminator and the detection network, only affecting the gradient computation in the backward pass. During backpropagation, the GRL negates the gradients that flow through it. As a result, the encoder $E$ receives gradients that force it to update in the opposite direction, which maximizes the discriminator loss. This allows $E$ to produce features that fool the discriminator $D$, while $D$ tries to distinguish the domain of the features. For the adaptation task $\mathcal{S} \rightarrow \mathcal{T}$, given source images $I_{\mathcal{S}}$ and target images $I_{\mathcal{T}}$, the overall min-max loss function of the adaptive detection model is defined as follows:
$$\min_{E} \max_{D} \mathcal{L}(I_{\mathcal{S}}, I_{\mathcal{T}}) = \mathcal{L}_{det}(I_{\mathcal{S}}) - \lambda_{disc} \left[ \mathcal{L}_{disc}(E(I_{\mathcal{S}})) + \mathcal{L}_{disc}(E(I_{\mathcal{T}})) \right], \tag{3}$$
where $\lambda_{disc}$ is a weight applied to the discriminator loss to balance it against the detection loss.
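The effect of gradient reversal can be illustrated with a toy scalar model in plain Python. This is a hedged sketch, not the detection network: the one-weight "encoder" `w_enc`, the one-weight "discriminator" `w_disc`, and all numeric values are hypothetical, and the gradients are derived by hand for a sigmoid followed by binary cross-entropy.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, d):
    return -(d * math.log(p) + (1 - d) * math.log(1.0 - p))

def grl_backward(grad, lam):
    """Gradient reversal: identity in the forward pass, -lam * grad in the backward pass."""
    return -lam * grad

def gradients(w_enc, w_disc, x, d, lam):
    feat = w_enc * x                  # toy encoder output
    p = sigmoid(w_disc * feat)        # toy discriminator: probability of "target"
    dz = p - d                        # d(loss)/d(logit) for sigmoid + BCE
    grad_disc = dz * feat             # the discriminator sees the ordinary gradient
    grad_feat = grl_backward(dz * w_disc, lam)  # reversed before reaching the encoder
    grad_enc = grad_feat * x
    return grad_enc, grad_disc

def disc_loss(w_enc, w_disc, x, d):
    return bce(sigmoid(w_disc * w_enc * x), d)

w_enc, w_disc, x, d, lam, lr = 1.0, 0.5, 2.0, 1, 0.1, 0.1
g_enc, g_disc = gradients(w_enc, w_disc, x, d, lam)
loss_before = disc_loss(w_enc, w_disc, x, d)
loss_after_enc_step = disc_loss(w_enc - lr * g_enc, w_disc, x, d)
loss_after_disc_step = disc_loss(w_enc, w_disc - lr * g_disc, x, d)
```

A descent step on the discriminator weight lowers the discriminator loss, while the same kind of step on the encoder weight raises it, because the encoder only ever sees the reversed gradient.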
Aligning feature distributions between two distant domains is challenging, and hence we introduce an intermediate feature space to make the adaptation task easier. That is, instead of directly solving the gap between the source and the target domains, we progressively perform adaptation to the target domain bridged by the intermediate domain.
The intermediate domain is constructed from the source domain images to synthesize the target distributions on the pixel-level. We apply an image-to-image translation network, CycleGAN, to learn a function that maps the source domain images to the target ones, and vice versa. Since ground truth labels are only available in the source domain, we only consider the translation from source images to the target domain (i.e., synthetic target images) after training CycleGAN.
Synthetic target images have been utilized to assist with domain adaptation tasks [1, 14, 15] as additionally augmented target training data. Different from these approaches, we define this set of synthetic images as an individual domain $\mathcal{F}$ to connect the labeled domain $\mathcal{S}$ with the unlabeled domain $\mathcal{T}$ via adversarial learning. One motivation behind this is that the source domain $\mathcal{S}$ and $\mathcal{F}$ share the same image content and diverge only in visual appearance, while $\mathcal{F}$ and the target domain $\mathcal{T}$ are different in image details but have similar distributions on the pixel-level. Consequently, this synthetic domain "sits" in between the source and target domains and thus can help reduce the adaptation difficulty of a large domain gap between $\mathcal{S}$ and $\mathcal{T}$. Figure 3 is one example of feature space visualization using the KITTI and Cityscapes datasets. This figure shows a distribution plot obtained by mapping the encoder features $E(I)$ to a low-dimensional 2-D space via t-SNE. The plot demonstrates that in the feature space, the synthetic domain (blue) is located in between the KITTI (red) and Cityscapes (green) distributions.
Our domain adaptation network involves obtaining knowledge from a labeled source domain $\mathcal{S}$ and then mapping that knowledge to an unlabeled target domain $\mathcal{T}$ by aligning the two distributions, solving the adaptation task $\mathcal{S} \rightarrow \mathcal{T}$, i.e., via (3) in this paper. To take advantage of the intermediate feature space during alignment, our algorithm decomposes the problem into two stages: $\mathcal{S} \rightarrow \mathcal{F}$ and $\mathcal{F} \rightarrow \mathcal{T}$, as shown in Figure 2 a) and b). At the first stage, we use $\mathcal{S}$ as the labeled domain, adapting to $\mathcal{F}$ without labels. Due to the underlying similarity between $\mathcal{S}$ and $\mathcal{F}$ in image contents, the network focuses on aligning the feature distributions with respect to the appearance difference on the pixel-level. After aligning pixel discrepancies between $\mathcal{S}$ and $\mathcal{F}$, we take $\mathcal{F}$ as the source domain for supervision and adapt to $\mathcal{T}$ as stage two of the proposed method. During this step, the model can take advantage of the appearance-invariant features from the first step and focus on adapting the object and context distributions. In summary, the proposed progressive learning separates the adaptation task into two subtasks and pays more attention to individual discrepancies during each adaptation stage.
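The two-stage procedure can be summarized as a small schedule in Python. This is a minimal sketch of the control flow only; the `Stage` dataclass, `progressive_schedule`, `run`, and the stand-in trainer `record_stage` are hypothetical names, and an actual trainer would run the adversarial detection training for each stage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    labeled_domain: str       # supplies images and boxes for the detection loss
    unlabeled_domain: str     # supplies images only, for the discriminator loss
    weighted_task_loss: bool  # whether the detection loss is scaled per image

def progressive_schedule():
    return [
        Stage("stage 1", labeled_domain="source", unlabeled_domain="synthetic",
              weighted_task_loss=False),
        Stage("stage 2", labeled_domain="synthetic", unlabeled_domain="target",
              weighted_task_loss=True),
    ]

def run(schedule, train_stage, model):
    """Train stage by stage, carrying the model from one stage into the next."""
    for stage in schedule:
        model = train_stage(model, stage)
    return model

# Stand-in trainer that only records which adaptation subtask was visited.
def record_stage(model, stage):
    return model + [f"{stage.labeled_domain}->{stage.unlabeled_domain}"]

history = run(progressive_schedule(), record_stage, model=[])
```

The point of the design is that the model leaving stage one is the starting point of stage two, so the second alignment begins from appearance-invariant features rather than from the source-only detector.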
We observe that the quality of synthetic images differs in a wide range. For instance, some images fail to preserve details of objects or contain artifacts when translated, and these failure cases may have a larger distance to the target distribution (see Figure 4 for an example).
This phenomenon can also be visualized in the feature space in Figure 3, where some blue dots are far away from both the source and target domains.
As a result, when performing supervised detection learning on $\mathcal{F}$ during $\mathcal{F} \rightarrow \mathcal{T}$, these defects may confuse our detection model, leading to false feature alignment across domains. To alleviate this problem, we propose an importance weighting strategy for synthetic samples based on their distances to the target distribution. Specifically, synthetic outliers that are further away from the target distribution will receive less attention than the ones that are closer to the target domain. We obtain the weights by taking the predicted output scores from the target domain discriminator $D_{cycle}$ of CycleGAN. This discriminator is trained to differentiate between the source and target images with respect to the target distribution, in which the optimal discriminator is obtained with:
$$D^{*}_{cycle}(I) = \frac{p_{\mathcal{T}}(I)}{p_{\mathcal{S}}(I) + p_{\mathcal{T}}(I)}, \tag{4}$$
where $I$ is the synthetic target image generated via CycleGAN, and $p_{\mathcal{S}}(I)$ and $p_{\mathcal{T}}(I)$ are the probability of $I$ belonging to the source and the target domain, respectively. Here, a higher score of $D^{*}_{cycle}(I)$ represents a closer distribution to the target domain, thus providing a higher weight. On the other hand, lower quality images which are further away from the target domain will be treated as outliers and receive a lower weight. For each image $I$, the importance weight is defined as:
$$w(I) = \begin{cases} D_{cycle}(I), & \text{if } I \in \mathcal{F}, \\ 1, & \text{otherwise}. \end{cases} \tag{5}$$
We then apply this weight to the detection loss function in (1) when learning from synthetic images with labels during the second stage. Thus, the final weighted objective function given images $I_{\mathcal{F}}$ and $I_{\mathcal{T}}$ is re-formulated based on (3) as:
$$\min_{E} \max_{D} \mathcal{L}(I_{\mathcal{F}}, I_{\mathcal{T}}) = w(I_{\mathcal{F}}) \, \mathcal{L}_{det}(I_{\mathcal{F}}) - \lambda_{disc} \left[ \mathcal{L}_{disc}(E(I_{\mathcal{F}})) + \mathcal{L}_{disc}(E(I_{\mathcal{T}})) \right]. \tag{6}$$
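The weighting of the task loss can be sketched as follows. The sketch assumes that non-synthetic images keep a weight of 1 and that the CycleGAN discriminator score is already available as a number in [0, 1]; the function names and the example scores and losses are hypothetical.

```python
def importance_weight(d_cycle_score, is_synthetic):
    """Weight for one training image.

    d_cycle_score: score of the CycleGAN target-domain discriminator in [0, 1];
                   higher means the image looks closer to the target domain.
    is_synthetic:  True for translated (synthetic target) images.
    Assumption: images that are not synthetic keep a weight of 1.
    """
    if not 0.0 <= d_cycle_score <= 1.0:
        raise ValueError("discriminator score must lie in [0, 1]")
    return d_cycle_score if is_synthetic else 1.0

def weighted_detection_loss(det_loss, weight):
    """Detection (task) loss of one image, scaled by its importance weight."""
    return weight * det_loss

# Hypothetical per-image discriminator scores and detection losses.
scores = [0.95, 0.90, 0.30]    # the last image is a poor translation
det_losses = [1.2, 1.0, 2.5]
weighted = [
    weighted_detection_loss(loss, importance_weight(score, is_synthetic=True))
    for score, loss in zip(scores, det_losses)
]
```

In this toy batch the poorly translated image has the largest raw detection loss but contributes the least after weighting, which is the intended damping of outliers.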
In this section, we validate our method by evaluating the performance in three real-world scenarios that result in different domain discrepancies: 1) cross-camera adaptation, 2) weather adaptation, and 3) adaptation to a large-scale dataset. Figure 5 shows examples of the detection results from the three tasks before and after applying our domain adaptation method.
For each adaptation scenario, we show a baseline Faster R-CNN result trained on the source data without applying domain adaptation, and a supervised model trained fully on the target domain data (oracle) to illustrate the existing gap between domains. Then we train the proposed model on the selected source and target domain to demonstrate the effectiveness of the proposed method. We also conduct an ablation study to analyze the effectiveness of individual proposed components. More results will be available in the supplementary material. All the source code and trained models will be made available to the public at https://github.com/kevinhkhsu/DA_detection.
In our experiments, we adopt VGG16 as the backbone for the Faster R-CNN detection network, following the setting of prior work. We design the discriminator network using 4 convolution layers with filters of size 3 × 3. The first 3 convolution layers have 64 channels, each followed by a leaky ReLU with the negative slope set to 0.2. The final domain classification layer has 1 channel that outputs the binary label prediction. Our synthetic domain is generated by training CycleGAN on the source and target domain images.
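To make the size of this discriminator concrete, the layer layout can be written down and its parameters counted in plain Python. This is a sketch under assumptions: the input width of 512 channels (typical of the last VGG16 convolution block) and the presence of bias terms are not stated in the text, and the helper names are hypothetical.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=True):
    """Number of learnable parameters of one kernel x kernel convolution."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

def discriminator_layers(in_channels):
    """(in_channels, out_channels, activation) for each 3 x 3 convolution."""
    return [
        (in_channels, 64, "leaky_relu(0.2)"),
        (64, 64, "leaky_relu(0.2)"),
        (64, 64, "leaky_relu(0.2)"),
        (64, 1, None),  # 1-channel domain prediction
    ]

# 512 input channels is an assumption, not a value given in the text.
layers = discriminator_layers(in_channels=512)
total_params = sum(conv_params(i, o) for i, o, _ in layers)
```

Under these assumptions the discriminator has roughly 0.37 million parameters, most of them in the first layer, so it adds little overhead compared with the VGG16 backbone.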
Before applying the proposed adaptation method, we pre-train the detection network using source domain images, starting from ImageNet pre-trained weights. When training the adaptation model, we use all available annotations in the source domain including the training and validation set. We optimize the network using Stochastic Gradient Descent (SGD) with a learning rate of 0.001, weight decay of 0.0005 and momentum of 0.9. We choose the discriminator loss weight $\lambda_{disc}$ based on a validation set to balance the discriminator loss with the detection loss. Batch size is 1 during training. The proposed method is implemented with PyTorch and the networks are trained using one GTX 1080 Ti GPU with 12 GB memory.
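For reference, one parameter update with these hyperparameters can be written out by hand. The sketch follows the common convention in which weight decay is added to the gradient before the momentum buffer is updated; whether the training code uses exactly this variant is an assumption, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(weight, grad, velocity, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD update for a single scalar parameter.

    Assumed convention: weight decay is folded into the gradient, the
    momentum buffer accumulates it, and the weight moves against the buffer.
    """
    grad = grad + weight_decay * weight
    velocity = momentum * velocity + grad
    weight = weight - lr * velocity
    return weight, velocity

# First step from weight 1.0 with a raw gradient of 0.5 and an empty buffer.
w1, v1 = sgd_step(1.0, grad=0.5, velocity=0.0)
# Second step with the same raw gradient: the buffer now carries momentum.
w2, v2 = sgd_step(w1, grad=0.5, velocity=v1)
```

With a batch size of 1, each such update is driven by a single image, which is one reason a small learning rate and momentum are helpful for stability.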
The KITTI dataset contains images taken while driving in cities, highways, and rural areas. There are a total of 7,481 images in the training set. The dataset is only used as the source domain in the proposed experiments, and we utilize the full training set.
The Cityscapes dataset is a collection of images with city street scenarios. It includes instance segmentation annotations, which we transform into bounding boxes for our experiments. It contains 2,975 training images and 500 validation images. We use Cityscapes with the KITTI dataset in Section 4.3 to evaluate the cross-camera adaptation and compare our results with the state-of-the-art method.
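Turning an instance mask into a bounding box amounts to taking the tightest rectangle around the mask pixels. The following is a minimal sketch, assuming a binary mask stored as a nested list and inclusive pixel coordinates; the function name `mask_to_bbox` and the conventions are illustrative, not the preprocessing code used for the experiments.

```python
def mask_to_bbox(mask):
    """Tight box (x_min, y_min, x_max, y_max) around the nonzero pixels.

    mask: nested list of 0/1 values for one object instance, indexed as
          mask[y][x]. Coordinates are inclusive pixel indices.
    Returns None for an empty mask.
    """
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

# A small L-shaped instance inside a 4 x 5 image.
instance = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_bbox(instance)
```

Each instance in an image would be processed separately, yielding one box per object together with the instance's class label.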
As its name suggests, the Foggy Cityscapes dataset is built upon the images in the Cityscapes dataset. This dataset simulates foggy weather at three levels of fog density using the depth maps provided in Cityscapes. The simulation process can be found in the original paper. Section 4.4 shows the experiments conducted on this simulated dataset for cross-weather adaptation.
The BDD100k dataset consists of 100k images which are split into training, validation, and testing sets. There are 70k training images and 10k validation images with available annotations. This dataset includes several attributes of interest: there are 6 types of weather, 6 different scenes, 3 categories for the time of day and 10 object categories with bounding box annotations. In our experiment, we extract a subset of BDD100k with images labeled as daytime. It includes 36,728 training and 5,258 validation images. We use this subset to demonstrate the adaptation from a smaller dataset, Cityscapes, to a large-scale dataset using the proposed method in Section 4.5.
Different datasets exhibit distinct characteristics such as scenes, objects, and viewpoint. In addition, the underlying camera settings and mechanisms can also lead to critical differences in visual appearance as well as image quality. These discrepancies are where the domain-shift takes place. In this experiment, we show the adaptation between images taken from different cameras and with distinctive content differences. The KITTI and Cityscapes datasets are used as source and target, respectively, to conduct the cross-camera adaptation experiment. During training, all data in the KITTI training set and the raw training images from the Cityscapes dataset are used, and the model is evaluated on the Cityscapes validation set. In Table 1, we show experimental results evaluated on the car class in terms of the average precision (AP). Compared to the state-of-the-art method that learns to adapt in the feature space, our baseline denoted as "Ours (w/o synthetic)" matches their performance using our own implementation.
|KITTI → Cityscapes|
|Method|AP (car)|
|FRCNN in the wild|38.5|
|Ours (w/o synthetic)|38.2|
|Ours (synthetic augment)|40.6|
|Cityscapes → Foggy Cityscapes|
|Method|person|rider|car|truck|bus|train|mcycle|bicycle|mAP|
|FRCNN in the wild|25.0|31.0|40.5|22.1|35.3|20.2|20.0|27.1|27.6|
|Diversify & Match|30.8|40.5|44.3|27.2|38.4|34.5|28.4|32.2|34.6|
|Strong-Weak Align|29.9|42.3|43.5|24.5|36.2|32.6|30.0|35.3|34.3|
|Selective Align|33.5|38.0|48.5|26.5|39.0|23.3|28.0|33.6|33.8|
|Ours (w/o synthetic)|30.2|37.9|46.1|14.7|26.9|7.0|20.8|31.5|26.9|
|Ours (synthetic augment)|36.6|45.3|55.0|24.2|43.9|18.5|28.4|37.1|36.1|
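The last value in each row of the table above is consistent with the mean of the eight per-class values that precede it, which can be checked directly. The snippet below is a small worked check on two rows; the helper name `mean_ap` is hypothetical.

```python
def mean_ap(per_class_ap):
    """Mean average precision: the unweighted mean of the per-class APs."""
    return sum(per_class_ap) / len(per_class_ap)

# Per-class values copied from the two "Ours" rows of the table above.
without_synthetic = [30.2, 37.9, 46.1, 14.7, 26.9, 7.0, 20.8, 31.5]
synthetic_augment = [36.6, 45.3, 55.0, 24.2, 43.9, 18.5, 28.4, 37.1]

map_without = mean_ap(without_synthetic)
map_augment = mean_ap(synthetic_augment)
gain = map_augment - map_without
```

Rounded to one decimal place, the two means reproduce the reported 26.9 and 36.1, a difference of about 9.2 points from adding synthetic images to the source set.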
In order to validate our method, we also conduct ablation studies using several settings. First, we demonstrate the benefit of utilizing information from the synthetic domain. When we directly augment synthetic data in the training set and include them in the source domain to perform feature-level adaptation, denoted as "Ours (synthetic augment)", there is a 2.1% performance gain compared to FRCNN in the wild. In the proposed method, by adopting our progressive training scheme with the importance weights, we show that our model further improves the AP by 5.4%. In addition, we present the advantage of our weighted task loss in balancing the uneven quality of synthetic images. In Table 2, we show the analysis for using different fixed weights and our importance weighting method. Our method dynamically determines the weight of each image based on the distance from the target distribution (in this case, the averaged weight obtained from the discriminator is around 0.9). Compared to the one without using any weight (i.e., weight is equal to 1), our importance weight improves the AP by 1.7% and performs better than others that use fixed weights. Overall, we show that our model can reduce the domain-shift problem caused by the camera along with other content differences across two distinct datasets and achieves state-of-the-art performance.
Under real-world scenarios, supervised object detection models may be applied in weather conditions of which they do not have sufficient knowledge. However, it is difficult to obtain a large number of annotations in every weather condition for the models to learn. This section studies the weather adaptation from clear weather to a foggy environment. The Cityscapes dataset and the Foggy Cityscapes dataset are used as the source domain and the target domain, respectively.
Table 3 shows that our method reduces the domain gap across weather conditions and performs favorably against the state-of-the-art methods [3, 25, 37, 16]. When synthetic images are introduced during our progressive adaptation, there is a 10% improvement in mAP compared to the baseline method. We note that the target Foggy Cityscapes dataset fundamentally contains the same images as the source Cityscapes dataset, but with synthesized fog. Thus, the synthetic target domain obtained via image translation is already closely distributed to the target domain and inherits informative labels for the network to learn. Given such information learned from the synthetic domain, both our method and the synthetic augmented one climb close to the oracle result. Although the synthetic domain lies close to the target distribution, we show in the results that our progressive training can still assist the adaptation process, improving performance and at the same time generalizing well to different categories. To sum up, this experiment not only demonstrates the adaptation to a foggy weather condition but also shows the capability of using synthetic images to facilitate the distribution alignment process.
|Cityscapes → BDD100k daytime|
|Method|bike|bus|car|motor|person|rider|light|sign|train|truck|mAP|
|Ours (w/o synthetic)|20.4|20.2|49.2|16.6|32.1|27.8|11.9|14.9|0|19.2|21.2|
|Ours (synthetic augment)|23.1|25.3|51.9|15.7|36.0|31.6|12.7|20.8|0|20.2|23.7|
Digital cameras have developed quickly over the years and collecting a large number of images is not a difficult task in the modern world. However, labeling the collected images is a major issue when it comes to building a dataset for supervised learning methods. In this experiment, we examine the adaptation from a relatively smaller dataset to a large unlabeled domain containing distinct attributes. We show that our method can harvest more from existing resources and adapt them to complicated environments. To this end, we use the Cityscapes and BDD100k datasets as the source and target domains, respectively. We choose a subset of the BDD100k dataset annotated as daytime to be our target domain and consider the city scene as the adaptation factor, since there only exists daytime data in the Cityscapes dataset.
From the baseline and oracle results shown in Table 4, we can observe the difficulty and the significant performance gap between the source and target domains. Without using the synthetic data, the network has a harder time adapting to a much more diverse dataset, with only a 0.4% improvement after directly aligning the source and target domains at the feature level. When synthetic data is introduced to the source training set, the model learns to generalize better to the target domain and increases the performance by 2.5%. Finally, our method progressively adapts to the target domain by utilizing the intermediate feature space and receives a 3.1% gain in mAP compared to the baseline method. We show in this experiment that our progressive adaptation can extract more from the available knowledge and generalize better to a diverse environment, which is a critical issue in real-world applications. Qualitative results are shown in Figure 5 and more results are provided in the supplementary material.
[Figure 5: Examples of detection results from the three adaptation tasks. Columns: Before Adaptation, After Adaptation, Ground Truth.]
In this paper, we propose a progressive adaptation method that bridges the domain gap using an intermediate domain, decomposing a more difficult task into two easier subtasks with a smaller gap. We obtain the intermediate domain by transforming the source images to target ones. Using this domain, our method progressively solves the adaptation subtasks by first adapting from source to the intermediate domain and then finally to the target domain. In addition, we introduce a weighted loss during stage two of our method to balance different image qualities in the intermediate domain. Experimental results show that our method performs favorably against the state-of-the-art method and can further reduce the domain discrepancy under various scenarios, such as the cross-camera case, weather condition, and adaptation to a large-scale dataset.
This work is supported in part by the NSF CAREER Grant #1149783, gifts from Adobe, Verisk, and NEC.
The Cityscapes dataset for semantic urban scene understanding. In CVPR.
Deep transfer learning with joint adaptation networks. In ICML.